[role="xpack"]
[[ml-configuring-alerts]]
= Generating alerts for {anomaly-jobs}

beta::[]

{kib} {alert-features} include support for {ml} rules, which run scheduled
checks for anomalies in one or more {anomaly-jobs} or check the health of the
job with certain conditions. If the conditions of the rule are met, an alert is
created and the associated action is triggered. For example, you can create a
rule to check an {anomaly-job} every fifteen minutes for critical anomalies and
to notify you in an email. To learn more about {kib} {alert-features}, refer to
{kibana-ref}/alerting-getting-started.html#alerting-getting-started[Alerting].

The following {ml} rules are available:

{anomaly-detect-cap} alert::
Checks if the {anomaly-job} results contain anomalies that match the rule
conditions.

{anomaly-jobs-cap} health::
Monitors job health and alerts if an operational issue occurred that may
prevent the job from detecting anomalies.

TIP: If you have created rules for specific {anomaly-jobs} and you want to
monitor whether these jobs work as expected, {anomaly-jobs} health rules are
ideal for this purpose.
[[creating-ml-rules]]
== Creating a rule

In *{stack-manage-app} > {rules-ui}*, you can create both types of {ml} rules:

[role="screenshot"]
image::images/ml-rule.png["Creating a new machine learning rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

In the *{ml-app}* app, you can create only {anomaly-detect} alert rules; create
them from the {anomaly-job} wizard after you start the job or from the
{anomaly-job} list.
[[creating-anomaly-alert-rules]]
=== {anomaly-detect-cap} alert

When you create an {anomaly-detect} alert rule, you must select the job that
the rule applies to.

You must also select a type of {ml} result. In particular, you can create rules
based on bucket, record, or influencer results.

[role="screenshot"]
image::images/ml-anomaly-alert-severity.png["Selecting result type, severity, and test interval",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

For each rule, you can configure the `anomaly_score` that triggers the action.
The `anomaly_score` indicates the significance of a given anomaly compared to
previous anomalies. The default severity threshold is 75, which means every
anomaly with an `anomaly_score` of 75 or higher triggers the associated action.
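For illustration, the severity check can be sketched as follows. The record
structure below is hypothetical and simplified; the actual evaluation happens
inside the {kib} rule executor.

```python
# Minimal sketch of the severity threshold described above: an anomaly
# triggers the action only when its anomaly_score meets the threshold.
# The dictionaries below are illustrative, not Kibana's internal format.

def anomalies_at_or_above(anomalies, severity=75):
    """Return anomalies whose score meets the configured severity threshold."""
    return [a for a in anomalies if a["anomaly_score"] >= severity]

results = [
    {"timestamp": "2023-05-01T00:00:00Z", "anomaly_score": 91.2},
    {"timestamp": "2023-05-01T00:15:00Z", "anomaly_score": 40.7},
    {"timestamp": "2023-05-01T00:30:00Z", "anomaly_score": 75.0},
]

# With the default threshold of 75, the first and third anomalies qualify.
triggering = anomalies_at_or_above(results)
```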
You can select whether you want to include interim results. Interim results are
created by the {anomaly-job} before a bucket is finalized. These results might
disappear after the bucket is fully processed. Include interim results if you
want to be notified earlier about a potential anomaly even if it might be a
false positive. If you want to get notified only about anomalies of fully
processed buckets, do not include interim results.

You can also configure advanced settings. _Lookback interval_ sets an interval
that is used to query previous anomalies during each condition check. By
default, its value is derived from the bucket span of the job and the query
delay of the {dfeed}. Setting the lookback interval lower than the default
value is not recommended, as it might result in missed anomalies. _Number of
latest buckets_ sets how many buckets to check to obtain the highest anomaly
from all the anomalies that are found during the _Lookback interval_. An alert
is created based on the anomaly with the highest anomaly score from the most
anomalous bucket.
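The advanced settings can be sketched like this. The exact formula {kib} uses
to derive the default lookback is not reproduced here; the helper below is an
assumption for illustration only.

```python
# Sketch of the advanced settings described above (illustrative only).

def default_lookback_seconds(bucket_span_s, query_delay_s, buckets=2):
    # Assumption: look back over a couple of bucket spans plus the
    # datafeed's query delay. Kibana computes its own default.
    return buckets * bucket_span_s + query_delay_s

def top_anomaly(bucket_results, latest_n=1):
    """Return the highest-scoring anomaly among the latest N buckets."""
    latest = sorted(bucket_results, key=lambda b: b["timestamp"])[-latest_n:]
    return max(latest, key=lambda b: b["anomaly_score"])

buckets = [
    {"timestamp": 1000, "anomaly_score": 82.5},
    {"timestamp": 2000, "anomaly_score": 95.1},
    {"timestamp": 3000, "anomaly_score": 61.0},
]
```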
You can also test the configured conditions against your existing data and check
the sample results by providing a valid interval for your data. The generated
preview contains the number of alerts that would have been created during the
relative time range you defined.

TIP: You must also provide a _check interval_ that defines how often to
evaluate the rule conditions. It is recommended to select an interval that is
close to the bucket span of the job.

As the last step in the rule creation process, <<defining-actions,define its actions>>.
[[creating-anomaly-jobs-health-rules]]
=== {anomaly-jobs-cap} health

When you create an {anomaly-jobs} health rule, you must select the job or group
that the rule applies to. If you assign more jobs to the group, they are
included the next time the rule conditions are checked.

You can also use a special character (`*`) to apply the rule to all your jobs.
Jobs created after the rule are automatically included. You can exclude jobs
that are not critically important by using the _Exclude_ field.
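The job selection behavior can be sketched as follows; `fnmatch` stands in for
whatever pattern matching {kib} performs internally, and the job IDs are made
up for the example.

```python
from fnmatch import fnmatch

# Sketch of the selection described above: `*` matches all jobs, and
# jobs listed in the Exclude field are filtered out afterwards.

def select_jobs(all_jobs, include="*", exclude=()):
    selected = [j for j in all_jobs if fnmatch(j, include)]
    return [j for j in selected if j not in set(exclude)]

jobs = ["sales-anomalies", "latency-anomalies", "test-job"]
covered = select_jobs(jobs, "*", exclude=["test-job"])
```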
Enable the health check types that you want to apply. All checks are enabled by
default. At least one check needs to be enabled to create the rule. The
following health checks are available:

_Datafeed is not started_::
Notifies if the corresponding {dfeed} of the job is not started but the job is
in an opened state. The notification message recommends the necessary
actions to solve the error.

_Model memory limit reached_::
Notifies if the model memory status of the job reaches the soft or hard model
memory limit. Optimize your job by following
<<detector-configuration,these guidelines>> or consider
<<set-model-memory-limit,amending the model memory limit>>.

_Data delay has occurred_::
Notifies when the job missed some data. You can define the threshold for the
number of missing documents you get alerted on by setting
_Number of documents_. You can control the lookback interval for checking
delayed data with _Time interval_. Refer to the
<<ml-delayed-data-detection>> page to see what to do about delayed data.

_Errors in job messages_::
Notifies when the job messages contain error messages. Review the
notification; it contains the error messages, the corresponding job IDs, and
recommendations on how to fix the issue. This check looks for job errors
that occur after the rule is created; it does not look at historic behavior.
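As a simple illustration of the first check, the condition fires when a job is
opened but its {dfeed} is not running. The state names match the ones listed
in the action variables later on this page; the function itself is a sketch,
not {kib}'s implementation.

```python
# Sketch of the "Datafeed is not started" condition described above.

def datafeed_not_started(job_state, datafeed_state):
    """True when the job is opened but its datafeed is not running."""
    return job_state == "opened" and datafeed_state in ("stopped", "stopping")
```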
[role="screenshot"]
image::images/ml-health-check-config.png["Selecting health checkers",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

TIP: You must also provide a _check interval_ that defines how often to
evaluate the rule conditions. It is recommended to select an interval that is
close to the bucket span of the job.

As the last step in the rule creation process, define its actions.

[[defining-actions]]
== Defining actions

//tag::define-actions[]
You can add one or more actions to your rule to generate notifications when its
conditions are met and when they are no longer met.

Each action uses a connector, which stores connection information for a {kib}
service or supported third-party integration, depending on where you want to
send the notifications. For example, you can use a Slack connector to send a
message to a channel. Or you can use an index connector that writes a JSON
object to a specific index. For details about creating connectors, refer to
{kibana-ref}/action-types.html[Connectors].

You must set the action frequency, which involves choosing how often to run
the action (for example, at each check interval, only when the alert status
changes, or at a custom action interval). Each rule type also has a list of
valid action groups and you must choose one of these groups (for example, the
action runs when the issue is detected or when it is recovered).

TIP: If you choose a custom action interval, it cannot be shorter than the
rule's check interval.

//end::define-actions[]
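The constraint in the tip above can be sketched as a simple validation.
Intervals are modeled here as plain seconds; {kib} itself accepts duration
strings such as `15m`.

```python
# Sketch of the rule: a custom action interval may not be shorter than
# the rule's check interval.

def validate_action_interval(check_interval_s, action_interval_s):
    if action_interval_s < check_interval_s:
        raise ValueError(
            "custom action interval cannot be shorter than the check interval"
        )
    return action_interval_s
```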
It's also possible to customize the notification messages for each action. There
is a set of variables that you can include in the message depending on the rule
type; refer to <<action-variables>>.
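Action messages reference these variables with Mustache-style `{{...}}`
placeholders. The toy renderer below only illustrates the substitution idea;
{kib}'s own Mustache support is far more complete.

```python
import re

# Toy renderer for Mustache-style placeholders such as {{context.message}}.
# Illustration only; not Kibana's templating engine.

def render(template, variables):
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )

msg = render(
    "Anomaly detected: {{context.message}} (score {{context.score}})",
    {"context.message": "High severity anomaly", "context.score": 95},
)
```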
[role="screenshot"]
image::images/ml-anomaly-alert-messages.png["Customizing your message",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

After you save the configurations, the rule appears in the
*{stack-manage-app} > {rules-ui}* list; you can check its status and see the
overview of its configuration information.

When an alert occurs, its name always matches the job ID of the
{anomaly-job} that triggered it. If necessary, you can snooze rules to prevent
them from generating actions. For more details, refer to
{kibana-ref}/create-and-manage-rules.html#controlling-rules[Snooze and disable rules].
[[action-variables]]
== Action variables

The following variables are specific to the {ml} rule types. An asterisk (`*`)
marks the variables that you can use in actions related to recovered alerts.

[[anomaly-alert-action-variables]]
=== {anomaly-detect-cap} alert action variables

Every {anomaly-detect} alert has the following action variables:

`context`.`anomalyExplorerUrl` ^*^::
URL to open in the Anomaly Explorer.

`context`.`isInterim`::
Indicates if top hits contain interim results.

`context`.`jobIds` ^*^::
List of job IDs that triggered the alert.

`context`.`message` ^*^::
A preconstructed message for the alert.

`context`.`score`::
Anomaly score at the time of the notification action.

`context`.`timestamp`::
The bucket timestamp of the anomaly.

`context`.`timestampIso8601`::
The bucket timestamp of the anomaly in ISO8601 format.

`context`.`topInfluencers`::
The list of top influencers.
+
.Properties of `context.topInfluencers`
[%collapsible%open]
====
`influencer_field_name`:::
The field name of the influencer.

`influencer_field_value`:::
The entity that influenced, contributed to, or was to blame for the anomaly.

`score`:::
The influencer score. A normalized score between 0-100 which shows the
influencer's overall contribution to the anomalies.
====

`context`.`topRecords`::
The list of top records.
+
.Properties of `context.topRecords`
[%collapsible%open]
====
`actual`:::
The actual value for the bucket.

`by_field_value`:::
The value of the by field.

`field_name`:::
Certain functions require a field to operate on, for example, `sum()`. For those
functions, this value is the name of the field to be analyzed.

`function`:::
The function in which the anomaly occurs, as specified in the detector
configuration. For example, `max`.

`over_field_name`:::
The field used to split the data.

`partition_field_value`:::
The field used to segment the analysis.

`score`:::
A normalized score between 0-100, which is based on the probability of the
anomalousness of this record.

`typical`:::
The typical value for the bucket, according to analytical modeling.
====
[[anomaly-jobs-health-action-variables]]
=== {anomaly-jobs-cap} health action variables

Every health check has two main variables: `context.message` and
`context.results`. The properties of `context.results` may vary based on the
type of check. You can find the possible properties for all the checks below.

==== _Datafeed is not started_

`context.message` ^*^::
A preconstructed message for the alert.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`datafeed_id` ^*^:::
The {dfeed} identifier.

`datafeed_state` ^*^:::
The state of the {dfeed}. It can be `starting`, `started`,
`stopping`, or `stopped`.

`job_id` ^*^:::
The job identifier.

`job_state` ^*^:::
The state of the job. It can be `opening`, `opened`, `closing`,
`closed`, or `failed`.
====
==== _Model memory limit reached_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`job_id` ^*^:::
The job identifier.

`memory_status` ^*^:::
The status of the mathematical model. It can have one of the following values:
+
* `soft_limit`: The model used more than 60% of the configured memory limit and
older unused models will be pruned to free up space. In categorization jobs, no
further category examples will be stored.
* `hard_limit`: The model used more space than the configured memory limit. As a
result, not all incoming data was processed.
+
The `memory_status` is `ok` for recovered alerts.

`model_bytes` ^*^:::
The number of bytes of memory used by the models.

`model_bytes_exceeded` ^*^:::
The number of bytes over the high limit for memory usage at the last allocation
failure.

`model_bytes_memory_limit` ^*^:::
The upper limit for model memory usage.

`log_time` ^*^:::
The timestamp of the model size statistics according to server time. Time
formatting is based on the {kib} settings.

`peak_model_bytes` ^*^:::
The peak number of bytes of memory ever used by the model.
====
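The relationship between these values and `memory_status` can be sketched as
follows, using the thresholds described on this page (soft limit at 60% of the
configured limit). This is an illustration, not the {es} implementation.

```python
# Sketch of how memory_status relates to model size and the configured
# model memory limit, per the value descriptions above.

def memory_status(model_bytes, memory_limit_bytes):
    if model_bytes > memory_limit_bytes:
        return "hard_limit"
    if model_bytes > 0.6 * memory_limit_bytes:
        return "soft_limit"
    return "ok"
```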
==== _Data delay has occurred_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
For recovered alerts, `context.results` is either empty (when there is no
delayed data) or the same as for an active alert (when the number of missing
documents is less than the _Number of documents_ threshold set by the user).
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`annotation` ^*^:::
The annotation corresponding to the data delay in the job.

`end_timestamp` ^*^:::
Timestamp of the latest finalized buckets with missing documents. Time
formatting is based on the {kib} settings.

`job_id` ^*^:::
The job identifier.

`missed_docs_count` ^*^:::
The number of missed documents.
====
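The active/recovered behavior described above reduces to a threshold
comparison, sketched here for illustration:

```python
# Sketch of the data-delay condition: the alert stays active while the
# number of missed documents meets the configured "Number of documents"
# threshold, and recovers once it falls below it.

def data_delay_active(missed_docs_count, threshold):
    return missed_docs_count >= threshold
```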
==== _Errors in job messages_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`timestamp`:::
The timestamp of the error message.
`job_id`:::
The job identifier.

`message`:::
The error message.

`node_name`:::
The name of the node that runs the job.
====