Last updated 23rd August, 2019
Objective
OMNI is part of the OVH Studio solution. It is a specialised alerting system, based on metrics. With it, you will be able to send notifications (SMS, emails, etc.) when anomalies occur within your metrics.
This guide will show you how to set up alerts for your OVH Metrics solution.
Requirements
- an OVH Metrics service
- an OVH Studio account
Instructions
To build an alerting project, you have to write a flow definition, describing how incidents are handled. An incident is any event within your metrics that shouldn't occur. For example: too many requests per second on a web server.
The four entities
Four entities define your alerting system, each of which is represented by a YAML file:
- Endpoints describe how to access your data, including the access URL and authentication. You can use them to access metrics, logs or SQL tables.
- Plans are lists of actions that define how to deal with an incident.
- Drones are scripts that will run on a dataset and look for anomalies.
- Alerts are central entities that link an endpoint, a drone and plans.
How the entities are linked
The alert entity creates a link between a drone, an endpoint and several plans. In an alert, you have to specify which drone should be used. Each drone is global to an alert. The drone used in an alert has to work on a dataset, so you must give the drone access to data with an endpoint. Finally, you will trigger plans if your drone detects anomalies on a dataset. You can trigger as many plans as you want, each with a different data filter and drone parameters.
Endpoint definition
An endpoint describes access to a data source, including authentication.
Endpoint definition in a YAML file:
Key | Type | Required | Description |
---|---|---|---|
name | string | | Endpoint name |
type | string | | Data type of the data source (metrics) |
description | string | | A custom description |
endpoint | string | | URL to access your data |
token | string | | Authentication token (read only) |
Endpoint example:
name: gra1
type: metrics
description: Metrics cluster
endpoint: https://warp10.gra1.metrics.ovh.net
token: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
Plan definition
A plan defines how to deal with an incident: who to contact, how to do so, and what to say. A plan is a list of steps, each step representing an escalation level, with the first step as the trigger. Finally, a step is a list of actions. All actions in a step are performed at the same time.
Steps represent the escalation process, i.e. do step A, then step B, then step C, and so on, until the incident is resolved. Each step can also perform several kinds of actions, such as an API call, an SMS or an email.
Plan definition in a YAML file:
Key | Type | Required? | Description |
---|---|---|---|
name | string | | The plan's name |
aggregations | aggregation[] | | A list of aggregations to compute |
steps | step[] | | A list of steps |
Aggregation definition:
Key | Type | Required? | Description |
---|---|---|---|
group | string | | A metric label or attribute name |
by | string | | A metric label or attribute name |
threshold | number (0-1) | | If more than (threshold) percent of the matching series are in alert mode, then notify the (group) value instead of each series |
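
The threshold rule can be illustrated with a small Python sketch. It is not OMNI code: the function name `collapse` and the data shape (a selector string paired with a label dictionary) are assumptions made for the example, and it follows only the literal wording of the table. The `by` key is not modelled, because the table does not say how it is applied.

```python
from collections import Counter


def collapse(alerting, matching, group, threshold):
    """Replace per-series notifications with one per `group` value when the
    fraction of alerting series for that value exceeds `threshold`.

    `alerting` and `matching` are lists of (selector, labels) pairs.
    This data shape is an assumption made for the sketch.
    """
    # Count every matching series per group value (for example per host).
    total = Counter(labels[group] for _, labels in matching)

    # Collect the selectors of the alerting series per group value.
    firing = {}
    for selector, labels in alerting:
        firing.setdefault(labels[group], []).append(selector)

    notifications = []
    for value, selectors in firing.items():
        if len(selectors) / total[value] > threshold:
            # Enough of the group is in alert: notify the group value once.
            notifications.append(f"{group}={value}")
        else:
            # Otherwise notify each alerting series individually.
            notifications.extend(selectors)
    return notifications
```

With a threshold of 0.8, a host whose series are all in alert is reported once, while a host with only half of its series in alert still produces one notification per series.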
Step definition:
Key | Type | Required? | Description |
---|---|---|---|
name | string | | The step's name |
retry | number (>0) | | Number of times a step must be performed before proceeding to the next one (with a retry of 0, the step is called once) |
duration | ISO8601 | | The lifetime of the step: all retries are performed within this period and, when it is over, the next step starts |
actions | action[] | | A list of actions to execute |
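
To make the role of `duration` concrete, here is a minimal Python sketch that computes when each step of a plan would start, relative to the trigger. It assumes that only the time part of ISO 8601 durations (such as PT30M) is used, and that a step without a duration does not delay the next one. Both are assumptions, and the helper names are hypothetical. How retries are spread inside a step is not described in this guide, so it is not modelled.

```python
import re
from datetime import timedelta

# Time-only subset of ISO 8601 durations, e.g. PT30M or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text):
    """Parse a PT#H#M#S duration into a timedelta."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def step_offsets(steps):
    """Return (step name, offset from the trigger) for each step, in order."""
    offsets = []
    elapsed = timedelta()
    for step in steps:
        offsets.append((step["name"], elapsed))
        if "duration" in step:
            # The next step starts once this step's lifetime is over.
            elapsed += parse_duration(step["duration"])
    return offsets
```

For a plan with step A lasting PT30M followed by step B, the sketch places A at the trigger and B thirty minutes later.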
Action definition:
Key | Type | Required? | Description |
---|---|---|---|
where | string | | A contact provider (direct, omni, oncall) |
who | string | | If where is 'direct', the value should be an email address or a phone number. Otherwise, it should be a collaborator name |
how | string | | A contact method (sms, email, push) |
when | string | | An active period during which this action will be performed ("08:00/18:00") |
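
The `when` period can be read as a start and end time separated by a slash. The Python sketch below shows one way to test whether an action is active at a given time. The function name is hypothetical, and three behaviours are assumptions rather than documented OMNI rules: an action without `when` is always active, the end time is exclusive, and a period whose start is later than its end wraps past midnight.

```python
from datetime import time


def is_active(when, now):
    """Return True if `now` (a datetime.time) falls inside the `when` period."""
    if when is None:
        # Assumption: an action without `when` is always active.
        return True
    start_text, end_text = when.split("/")
    start = time.fromisoformat(start_text)
    end = time.fromisoformat(end_text)
    if start <= end:
        # Assumption: start is inclusive, end is exclusive.
        return start <= now < end
    # Assumption: a period such as "22:00/06:00" wraps past midnight.
    return now >= start or now < end
```

For example, `is_active("08:00/18:00", time(9, 30))` is true, while the same period evaluated at 18:00 is false under these assumptions.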
Example:
name: apocalypse
aggregations:
  - group: host
    by: rack
    threshold: 0.8
  - group: rack
    by: datacenter
steps:
  - name: A
    retry: 2
    duration: PT30M
    actions:
      - where: omni
        who: paul.bocuse
        how: sms
        when: "08:00/18:00"
      - where: oncall
        who: run-team
        how: email
  - name: B
    actions:
      - where: direct
        who: critical-incident@mycorp.net
        how: email
Drone definition
A drone is a script run with custom parameters, which returns incidents from a subset of metrics. You don't necessarily have to write drones, as basic ones are available in the registry. Life drones, for example, check that a metric has been pushed recently, and range drones check that your metric's value is within a defined range.
If you wish to write your own, the script must be a WarpScript™, which takes and returns a specific structure.
Drone definition in a YAML file:
Key | Type | Required? | Description |
---|---|---|---|
name | string | | Name used in the registry |
type | string | | Data type on which this drone can work (metrics, in this context) |
version | version | | Version used in the drone registry |
description | string | | A short description of the drone and its behaviour (can be a file path) |
public | boolean | | Set to 'true' to publish your drone in the registry |
lang | string | | Script language used (ws for WarpScript™) |
params | map(parameter_name) -> parameter_config | | A map of drone parameters with their configurations |
script | string | | Can be a path to a file (./mydrone.ws) or the literal script |
A drone can have required or optional parameters, which you can define.
Parameter configuration:
Key | Type | Required? | Description |
---|---|---|---|
type | string | | Can be string, number, date, period... |
required | boolean | | Whether your drone requires this parameter to function |
description | string | | This parameter is used to... |
default | - | | This depends on the parameter type |
Drone example:
name: range
type: metrics
version: 1.0.0
description: return an anomaly if time series values are not in the MIN-MAX range
public: true
lang: ws
script: ./range.ws
params:
  min:
    type: number
    required: true
    description: lower bound of the range
  max:
    type: number
    required: true
    description: upper bound of the range
  window:
    type: duration
    description: fetched time window
    default: 1h
Writing your own drone script
If the registry doesn't have a drone to suit your specific requirements, you can write your own drone scripts using WarpScript™. For more on this, refer to the official documentation or the Warp 10™ tour.
Your script will require the following variables:
- $token The Read token you must use to authenticate
- $selector Metric name (selector)
- $labels Labels (part of the metric selector)
- $now Current timestamp
Alert parameters are added on top of your WarpScript™. Just prefix each parameter name with $ to use it. Optional parameters that are not set remain undefined, so you will have to check whether they exist.
OMNI expects your script to have only one entry in the stack at the end. This entry must be an array of incidents, each of which must have a unique name (like 'series selector'), and can have a reason and details.
WarpScript™ returns:
[
  { 'name' 'os.mem{host=A}' 'reason' 'MAX reached' 'details' '85% of memory used is too high' }
  { 'name' 'os.mem{host=D}' 'reason' 'MAX reached' 'details' '93% of memory used is too high' }
]
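
The constraints on the returned structure can be expressed as a short check. The Python sketch below treats the result as a list of dictionaries, which is an assumed JSON-like view of the WarpScript™ stack entry, and `incident_names` is a hypothetical helper, not part of OMNI. It only enforces what is stated above: a single list, with a unique name for each incident.

```python
def incident_names(result):
    """Check a drone result and return the incident names in order."""
    if not isinstance(result, list):
        raise ValueError("expected a single list of incidents")
    names = []
    for incident in result:
        # Every incident needs a non-empty name; reason and details are optional.
        if not isinstance(incident, dict) or not incident.get("name"):
            raise ValueError("every incident needs a 'name'")
        if incident["name"] in names:
            raise ValueError(f"duplicate incident name: {incident['name']}")
        names.append(incident["name"])
    return names
```

Running such a check against the output of a drone before publishing it can catch duplicate names early, for instance when two branches of a script report the same series.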
Range drone explanation
[ $token $selector $labels $now $window ] FETCH // Fetch data with defined parameters
<% 'average' DEFINED %> // Check if optional parameter 'average' is defined
<%
  [ SWAP bucketizer.mean $now $average 0 ] BUCKETIZE FILLPREVIOUS FILLNEXT // Perform a downsampling with mean method
%>
IFT
[ ] SWAP // Declare an empty array of incidents
<%
  <% $isCounter true == %> // If the metric is flagged as a continuous, growing counter
  <%
    [ SWAP mapper.rate 1 0 $gts VALUES SIZE 1 - -1 * ] MAP // Apply a rate to the metrics
  %>
  IFT
  'gts' STORE // Store the current metric in the loop
  <% $gts VALUES SIZE 1 >= %> // Metric must have at least one value
  <%
    <% 'max' DEFINED %> // If optional MAX parameter is defined
    <%
      $gts [ SWAP bucketizer.max $now 0 1 ] BUCKETIZE // Get the maximum of the metric values
      VALUES 0 GET 0 GET 'v' STORE
      <% 'precision' DEFINED %> // Round the value
      <%
        $v 10 $precision ** TODOUBLE *
        ROUND
        10 $precision ** TODOUBLE /
        'v' STORE
      %>
      IFT
      <% $v $max > %> // If the value is higher than MAX
      // Add a new incident in the list
      <% { 'name' $gts TOSELECTOR 'reason' 'MAX' 'details' $v '>' $max ' ' 3 JOIN } 1 ->LIST APPEND %>
      IFT
    %>
    IFT
    <% 'min' DEFINED %> // If optional MIN parameter is defined
    <%
      $gts [ SWAP bucketizer.min $now 0 1 ] BUCKETIZE // Get the minimal value of the metric
      VALUES 0 GET 0 GET 'v' STORE
      <% 'precision' DEFINED %> // Round the value
      <%
        $v 10 $precision ** TODOUBLE *
        ROUND
        10 $precision ** TODOUBLE /
        'v' STORE
      %>
      IFT
      <% $v $min < %> // If the value is lower than the MIN
      // Append a new incident in the list
      <% { 'name' $gts TOSELECTOR 'reason' 'MIN' 'details' $v '<' $min ' ' 3 JOIN } 1 ->LIST APPEND %>
      IFT
    %>
    IFT
  %> IFT
%> FOREACH // Iterate over all metrics
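
For readers less familiar with WarpScript™, the core of the range check can be paraphrased in Python. This is only an illustrative sketch of the comparison logic: it leaves out the fetch, the downsampling and the counter rate, the function name `check_range` is hypothetical, and Python's `round` may treat exact halves differently from the WarpScript™ ROUND function.

```python
def check_range(selector, values, minimum=None, maximum=None, precision=None):
    """Return the incidents for one series whose values leave the range."""
    incidents = []
    if not values:
        # A series needs at least one value to be checked.
        return incidents

    def rounded(value):
        # Same structure as the script: multiply, round, divide.
        if precision is None:
            return value
        factor = 10.0 ** precision
        return round(value * factor) / factor

    if maximum is not None:
        v = rounded(max(values))
        if v > maximum:
            incidents.append(
                {"name": selector, "reason": "MAX", "details": f"{v} > {maximum}"}
            )
    if minimum is not None:
        v = rounded(min(values))
        if v < minimum:
            incidents.append(
                {"name": selector, "reason": "MIN", "details": f"{v} < {minimum}"}
            )
    return incidents
```

As in the script, the bounds are optional: a bound that is not provided is simply not checked.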
Alert definition
An alert is the link between a drone, an endpoint, and plans.
Alert definition in a YAML file:
Key | Type | Required? | Description |
---|---|---|---|
name | string | | Alert name |
message | string | | Content sent with the notification, explaining what happened |
drone | string | | Drone name used for this alert |
endpoint | string | | Endpoint name used for this alert |
schedule | string | | Drone scan frequency (e.g. 1h, 1m, 15m) |
params | map(parameter_name)->parameter_value | | Drone parameters at alert scope |
notify | map(plan_name)->drone_config[] | | A map of the plans to trigger, with their configurations |
Drone configuration:
Key | Type | Required? | Description |
---|---|---|---|
selector | string | | Metrics selector ("~os.cpu |
params | map(parameter_name)->parameter_value | | Selector-specific parameters (override the alert's) |
Alert example:
name: Low memory
message: |
  A host is critically low on memory. Low memory conditions are dangerous, because they could cause the OOM killer to activate.
  The OOM killer is unpredictable, and can kill many of the processes that are essential to the proper functioning of the system.
  You should investigate what is causing the box to run low on memory.
  Some actionable steps you can take to resolve this alert are:
  * Fix memory leaks in the applications running on the box
  * Spin up your service on instances with larger memory sizes
drone: range
endpoint: gra1
schedule: 1m
params:
  window: 5 m
  average: 1 m
  precision: 2
notify:
  apocalypse:
    - selector: 'os.mem{}'
      params:
        max: 80
    - selector: 'os.mem{host~xtrem.*}'
      params:
        max: 95
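
Since selector-specific parameters override those set at alert scope, the parameters a drone receives for each selector can be pictured as a dictionary merge. The Python sketch below uses hypothetical values and a hypothetical helper name, and assumes a plain key-by-key override. OMNI may resolve parameters differently.

```python
def effective_params(alert_params, drone_configs):
    """Merge alert-level params with each selector's own params."""
    merged = {}
    for config in drone_configs:
        # Selector-specific values win over alert-level values.
        merged[config["selector"]] = {**alert_params, **config.get("params", {})}
    return merged


# Hypothetical values, chosen so that one key is overridden.
alert_params = {"window": "5 m", "max": 80}
drone_configs = [
    {"selector": "os.mem{}"},
    {"selector": "os.mem{host~xtrem.*}", "params": {"max": 95}},
]
resolved = effective_params(alert_params, drone_configs)
```

Here the first selector keeps the alert-level `max` of 80, while the second one runs with its own `max` of 95 and inherits `window` from the alert.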
Package your alerting
Since each entity is a YAML file, you can create a directory named after your alerting project. This directory contains four subdirectories, named after the four entities: drones, plans, endpoints and alerts.
You must define one entity per file in the corresponding directory. Each file must have a .yaml or .yml extension.
Each entity must also be named. You can set a name key in the file, otherwise the file name will be taken as the entity name.
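
The layout and naming rules above can be checked locally before pushing. The following Python sketch lists the entity files of a project and applies the file-name fallback for entity names. The helper names are hypothetical, and the sketch does not parse the YAML content, so the declared name has to be passed in by the caller.

```python
from pathlib import Path

ENTITY_DIRS = ("drones", "plans", "endpoints", "alerts")
EXTENSIONS = (".yaml", ".yml")


def list_entity_files(project_root):
    """Map each entity directory to its sorted .yaml/.yml files."""
    root = Path(project_root)
    found = {}
    for kind in ENTITY_DIRS:
        directory = root / kind
        if directory.is_dir():
            found[kind] = sorted(
                p for p in directory.iterdir() if p.suffix in EXTENSIONS
            )
        else:
            # A missing directory is reported as having no entities.
            found[kind] = []
    return found


def entity_name(path, declared_name=None):
    """Use the `name` key when set, otherwise fall back to the file name."""
    return declared_name or Path(path).stem
```

For example, `entity_name("alerts/low-memory.yaml")` gives `low-memory`, whereas a file that declares `name: Low memory` keeps that declared name.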
An alerting project is a git project. You can initialise your project directory with the command below:
$ git init
Project lifecycle
Link your project
An alerting project is started with a git repository. When you have a ready-to-use alerting definition in a git project, it's time to link your repository to OMNI.
In the OVH Studio's OMNI panel, you have two choices, depending on your git platform provider (GitHub, GitLab, Bitbucket, etc.).
You can choose either a GitHub integration, or a manual link (additional provider integrations are planned).
GitHub provider
Authorising OVH Studio on GitHub through OAuth is the easiest way to link repositories.
The OAuth authorisation can be done via user settings.
After a successful OAuth authorisation, you will see a list of your GitHub repositories in OMNI. You just have to click on the link button to link one. The following actions will be performed:
- add an OMNI public SSH key (allowed to clone the repo)
- set a hook on push events (i.e. notify OMNI when a new version is committed)
You don't have anything else to do.
Manual link
A manual link can be done in three steps: configuration, security and hook.
The first step will ask for the name and description, which are free fields.
The git clone URL must be a valid URL, from which OMNI is allowed to clone your project, and must follow this pattern:
ssh://GIT_USER@DOMAIN.TLD/REPOSITORY.git
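
Before submitting the form, the clone URL can be checked against this pattern. The regular expression below is a loose approximation written for this guide, not the validation OMNI actually applies, and the example URLs are hypothetical.

```python
import re

# Approximation of ssh://GIT_USER@DOMAIN.TLD/REPOSITORY.git
CLONE_URL = re.compile(r"^ssh://[^@/\s]+@[^/\s]+/\S+\.git$")


def is_valid_clone_url(url):
    """Return True if the URL looks like an SSH clone URL ending in .git."""
    return CLONE_URL.match(url) is not None
```

A URL such as `ssh://git@github.com/acme/alerting.git` passes this check, while an HTTPS clone URL does not.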
The default behaviour is to deploy your alerting project when a commit is pushed on the master branch, but you can also define a custom branch name.
During the security step, you can copy the OMNI public SSH key, which will be used to clone your repository, and add it to your git platform provider's settings.
Finally, during the hook step, you will see the hook URL your provider must call when a new commit is pushed. In most cases, this is set in a 'Hooks' configuration panel.
Push modifications
Your repository is now linked, and OMNI is ready for your next pushed commit. Several actions will then take place. For each commit, a build step is started, and you will see the result in the builds panel. Some verifications are performed during the build (e.g. 'Is the drone used in the alert defined?', 'Can we parse this file?' or 'Does this secret exist?'). If the build is successful, you will be able to deploy it.
If the commit is made on your configured branch (master by default) and the build step is successful, then it can be deployed.
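
A verification such as 'Is the drone used in the alert defined?' can be reproduced locally to catch broken references before pushing. The Python sketch below works on already-parsed entities and is only an approximation of such a check. The function name and the error messages are hypothetical, and the real OMNI build may verify more than this.

```python
def check_alert(alert, drones, endpoints, plans):
    """Return a list of reference errors for one alert (empty if none)."""
    errors = []
    if alert["drone"] not in drones:
        errors.append(f"unknown drone: {alert['drone']}")
    if alert["endpoint"] not in endpoints:
        errors.append(f"unknown endpoint: {alert['endpoint']}")
    # The keys of `notify` are plan names.
    for plan_name in alert.get("notify", {}):
        if plan_name not in plans:
            errors.append(f"unknown plan: {plan_name}")
    return errors


# Hypothetical project content: the 'pager' plan is not defined.
alert = {
    "name": "Low memory",
    "drone": "range",
    "endpoint": "gra1",
    "notify": {"apocalypse": [], "pager": []},
}
problems = check_alert(alert, {"range"}, {"gra1"}, {"apocalypse"})
```

In this example `problems` contains a single entry for the undefined `pager` plan. A misspelt plan name would be reported in the same way.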
Set up your team's notifications
On OVH Studio, you can set up email and/or phone notifications. You can also enable the push notification system on your laptop/smartphone.
Go further
- Documentation: Guides
- Visualize your data: https://grafana.metrics.ovh.net/login
- Community hub: https://community.ovh.com
- Create an account: Try it free!