[[ml-configuring-alerts]]
= Generating alerts for {anomaly-jobs}
:frontmatter-description: Create {anomaly-detect} alert and {anomaly-jobs} health rules.
:frontmatter-tags-products: [ml, alerting]
:frontmatter-tags-content-type: [how-to]
:frontmatter-tags-user-goals: [configure]

{kib} {alert-features} include support for {ml} rules, which run scheduled
checks for anomalies in one or more {anomaly-jobs} or check the health of the
jobs against certain conditions. If the conditions of the rule are met, an
alert is created and the associated action is triggered. For example, you can
create a rule to check an {anomaly-job} every fifteen minutes for critical
anomalies and to notify you by email. To learn more about {kib}
{alert-features}, refer to
{kibana-ref}/alerting-getting-started.html#alerting-getting-started[Alerting].

The following {ml} rules are available:
{anomaly-detect-cap} alert::
Checks if the {anomaly-job} results contain anomalies that match the rule
conditions.

{anomaly-jobs-cap} health::
Monitors job health and alerts if an operational issue occurred that may
prevent the job from detecting anomalies.

TIP: If you have created rules for specific {anomaly-jobs} and you want to
monitor whether these jobs work as expected, {anomaly-jobs} health rules are
ideal for this purpose.
In *{stack-manage-app} > {rules-ui}*, you can create both types of {ml} rules:

[role="screenshot"]
image::images/ml-rule.png["Creating a new machine learning rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

In the *{ml-app}* app, you can create only {anomaly-detect} alert rules; create
them from the {anomaly-job} wizard after you start the job or from the
{anomaly-job} list.
[[creating-anomaly-alert-rules]]
== {anomaly-detect-cap} alert rules

When you create an {anomaly-detect} alert rule, you must select the job that
the rule applies to.

You must also select a type of {ml} result. In particular, you can create rules
based on bucket, record, or influencer results.

[role="screenshot"]
image::images/ml-anomaly-alert-severity.png["Selecting result type, severity, and test interval",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.
For each rule, you can configure the `anomaly_score` that triggers the action.
The `anomaly_score` indicates the significance of a given anomaly compared to
previous anomalies. The default severity threshold is 75, which means every
anomaly with an `anomaly_score` of 75 or higher triggers the associated action.

You can select whether you want to include interim results. Interim results are
created by the {anomaly-job} before a bucket is finalized. These results might
disappear after the bucket is fully processed. Include interim results if you
want to be notified earlier about a potential anomaly even if it might be a
false positive. If you want to be notified only about anomalies in fully
processed buckets, do not include interim results.
You can also configure advanced settings. _Lookback interval_ sets an interval
that is used to query previous anomalies during each condition check. Its value
is derived from the bucket span of the job and the query delay of the {dfeed}
by default. It is not recommended to set the lookback interval lower than the
default value, as it might result in missed anomalies. _Number of latest
buckets_ sets how many buckets to check to obtain the highest anomaly from all
the anomalies that are found during the _Lookback interval_. An alert is
created based on the anomaly with the highest anomaly score from the most
anomalous bucket.
You can also test the configured conditions against your existing data and
check the sample results by providing a valid interval for your data. The
generated preview contains the number of potentially created alerts during the
relative time range you defined.

TIP: You must also provide a _check interval_ that defines how often to
evaluate the rule conditions. It is recommended to select an interval that is
close to the bucket span of the job.

As the last step in the rule creation process, define its
<<ml-configuring-alert-actions,actions>>.
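
If you prefer to script this setup instead of using the rule UI, you can create
the same rule through the {kib} alerting API. The following Python sketch is
illustrative only: the endpoint and `kbn-xsrf` header are standard for the
{kib} API, but the credentials, job ID, rule type ID, and the parameter names
inside `params` are assumptions that may differ between versions, so verify
them against your deployment before use.

[source,python]
----
import requests

KIBANA_URL = "https://localhost:5601"  # hypothetical Kibana endpoint
AUTH = ("elastic", "changeme")         # hypothetical credentials

# Rule body covering the settings described above: check interval, severity
# threshold, result type, interim results, and the advanced lookback/bucket
# options. The rule type ID and the keys inside "params" are assumptions.
rule = {
    "name": "my-anomaly-alert",
    "rule_type_id": "xpack.ml.anomaly_detection_alert",  # assumed rule type ID
    "consumer": "alerts",
    "schedule": {"interval": "15m"},  # check interval, close to the bucket span
    "params": {
        "jobSelection": {"jobIds": ["my-anomaly-job"]},  # hypothetical job ID
        "resultType": "bucket",       # bucket, record, or influencer
        "severity": 75,               # anomaly_score threshold
        "includeInterim": False,      # only fully processed buckets
        "lookbackInterval": None,     # keep the default derived value
        "topNBuckets": 1,             # "Number of latest buckets"
    },
    "actions": [],                    # see the Actions section below
}

response = requests.post(
    f"{KIBANA_URL}/api/alerting/rule",
    json=rule,
    auth=AUTH,
    headers={"kbn-xsrf": "true"},     # required by the Kibana API
)
response.raise_for_status()
print("Created rule:", response.json()["id"])
----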
[[creating-anomaly-jobs-health-rules]]
== {anomaly-jobs-cap} health rules

When you create an {anomaly-jobs} health rule, you must select the job or group
that the rule applies to. If you assign more jobs to the group, they are
included the next time the rule conditions are checked.

You can also use a special character (`*`) to apply the rule to all your jobs.
Jobs created after the rule are automatically included. You can exclude jobs
that are not critically important by using the _Exclude_ field.
Enable the health check types that you want to apply. All checks are enabled by
default. At least one check needs to be enabled to create the rule. The
following health checks are available:

_Datafeed is not started_::
Notifies if the corresponding {dfeed} of the job is not started but the job is
in an opened state. The notification message recommends the necessary actions
to solve the error.

_Model memory limit reached_::
Notifies if the model memory status of the job reaches the soft or hard model
memory limit. Optimize your job by following
<<detector-configuration,these guidelines>> or consider
<<set-model-memory-limit,amending the model memory limit>>.

_Data delay has occurred_::
Notifies when the job missed some data. You can define the threshold for the
number of missing documents that you get alerted on by setting
_Number of documents_. You can control the lookback interval for checking
delayed data with _Time interval_. Refer to the
<<ml-delayed-data-detection>> page to see what to do about delayed data.

_Errors in job messages_::
Notifies when the job messages contain error messages. Review the
notification; it contains the error messages, the corresponding job IDs, and
recommendations on how to fix the issue. This check looks for job errors
that occur after the rule is created; it does not look at historic behavior.
  98. [role="screenshot"]
  99. image::images/ml-health-check-config.png["Selecting health checkers",500]
  100. // NOTE: This is an autogenerated screenshot. Do not edit it directly.
  101. TIP: You must also provide a _check interval_ that defines how often to
  102. evaluate the rule conditions. It is recommended to select an interval that is
  103. close to the bucket span of the job.
  104. As the last step in the rule creation process, define its actions.
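
As with {anomaly-detect} alert rules, a health rule can also be created through
the {kib} alerting API. The sketch below only illustrates how the selections
described above could map to a request body; the rule type ID, the `params`
keys, the credentials, and the job IDs are assumptions to verify against your
{kib} version.

[source,python]
----
import requests

KIBANA_URL = "https://localhost:5601"  # hypothetical Kibana endpoint
AUTH = ("elastic", "changeme")         # hypothetical credentials

# Health rule that watches all jobs ("*") except one non-critical job and
# enables the four health checks. Key names are assumptions based on the UI.
rule = {
    "name": "ml-jobs-health",
    "rule_type_id": "xpack.ml.anomaly_detection_jobs_health",  # assumed rule type ID
    "consumer": "alerts",
    "schedule": {"interval": "1h"},
    "params": {
        "includeJobs": {"jobIds": ["*"]},         # current and future jobs
        "excludeJobs": {"jobIds": ["test-job"]},  # the _Exclude_ field (hypothetical job)
        "testsConfig": {
            "datafeed": {"enabled": True},        # Datafeed is not started
            "mml": {"enabled": True},             # Model memory limit reached
            "delayedData": {"enabled": True, "docsCount": 1},  # Data delay has occurred
            "errorMessages": {"enabled": True},   # Errors in job messages
        },
    },
    "actions": [],
}

requests.post(
    f"{KIBANA_URL}/api/alerting/rule",
    json=rule,
    auth=AUTH,
    headers={"kbn-xsrf": "true"},
).raise_for_status()
----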
[[ml-configuring-alert-actions]]
== Actions

You can optionally send notifications when the rule conditions are met and when
they are no longer met. In particular, these rules support:

* alert summaries
* actions that run when the anomaly score matches the conditions (for {anomaly-detect} alert rules)
* actions that run when an issue is detected (for {anomaly-jobs} health rules)
* recovery actions that run when the conditions are no longer met

Each action uses a connector, which stores connection information for a {kib}
service or supported third-party integration, depending on where you want to
send the notifications. For example, you can use a Slack connector to send a
message to a channel. Or you can use an index connector that writes a JSON
object to a specific index. For details about creating connectors, refer to
{kibana-ref}/action-types.html[Connectors].
After you select a connector, you must set the action frequency. You can choose
to create a summary of alerts on each check interval or on a custom interval.
For example, send Slack notifications that summarize the new, ongoing, and
recovered alerts:

[role="screenshot"]
image::images/ml-anomaly-alert-action-summary.png["Adding an alert summary action to the rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

TIP: If you choose a custom action interval, it cannot be shorter than the
rule's check interval.
Alternatively, you can set the action frequency such that actions run for each
alert. Choose how often the action runs (at each check interval, only when the
alert status changes, or at a custom action interval). For {anomaly-detect}
alert rules, you must also choose whether the action runs when the anomaly
score matches the condition or when the alert recovers:

[role="screenshot"]
image::images/ml-anomaly-alert-action-score-matched.png["Adding an action for each alert in the rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

In {anomaly-jobs} health rules, choose whether the action runs when the issue
is detected or when it is recovered:

[role="screenshot"]
image::images/ml-health-check-action.png["Adding an action for each alert in the rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.
You can further refine the rule by specifying that actions run only when they
match a KQL query or when an alert occurs within a specific time frame.

There is a set of variables that you can use to customize the notification
messages for each action. Click the icon above the message text box to get the
list of variables, or refer to <<action-variables>>. For example:

[role="screenshot"]
image::images/ml-anomaly-alert-messages.png["Customizing your message",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.
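
As a sketch, a notification message for an {anomaly-detect} alert rule might
combine several of the variables listed in <<anomaly-alert-action-variables>>.
The snippet below shows only the message text; it would be passed as the
`message` parameter of a connector action such as the one sketched above.

[source,python]
----
# A message template built from the documented action variables. Kibana
# substitutes the {{ }} placeholders when the action runs.
message = (
    "Anomaly detected by job(s) {{context.jobIds}} "
    "at {{context.timestampIso8601}} with score {{context.score}}.\n"
    "Open in Anomaly Explorer: {{context.anomalyExplorerUrl}}\n\n"
    "{{context.message}}"
)
----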
After you save the configurations, the rule appears in the
*{stack-manage-app} > {rules-ui}* list; you can check its status and see the
overview of its configuration information.

When an alert occurs for an {anomaly-detect} alert rule, its name is always the
same as the job ID of the associated {anomaly-job} that triggered it. You can
review how the alerts that occurred correlate with the {anomaly-detect} results
in the **Anomaly explorer** by using the **Anomaly timeline** swimlane and the
**Alerts** panel.

If necessary, you can snooze rules to prevent them from generating actions. For
more details, refer to
{kibana-ref}/create-and-manage-rules.html#controlling-rules[Snooze and disable rules].
[[action-variables]]
== Action variables

The following variables are specific to the {ml} rule types. An asterisk (`*`)
marks the variables that you can use in actions related to recovered alerts.
You can also specify {kibana-ref}/rule-action-variables.html[variables common to all rules].

[[anomaly-alert-action-variables]]
=== {anomaly-detect-cap} alert action variables

Every {anomaly-detect} alert has the following action variables:
`context`.`anomalyExplorerUrl` ^*^::
URL to open in the Anomaly Explorer.

`context`.`isInterim`::
Indicates if top hits contain interim results.

`context`.`jobIds` ^*^::
List of job IDs that triggered the alert.

`context`.`message` ^*^::
A preconstructed message for the alert.

`context`.`score`::
Anomaly score at the time of the notification action.

`context`.`timestamp`::
The bucket timestamp of the anomaly.

`context`.`timestampIso8601`::
The bucket timestamp of the anomaly in ISO8601 format.

`context`.`topInfluencers`::
The list of top influencers.
+
.Properties of `context.topInfluencers`
[%collapsible%open]
====
`influencer_field_name`:::
The field name of the influencer.

`influencer_field_value`:::
The entity that influenced, contributed to, or was to blame for the anomaly.

`score`:::
The influencer score. A normalized score between 0-100, which shows the
influencer's overall contribution to the anomalies.
====

`context`.`topRecords`::
The list of top records.
+
.Properties of `context.topRecords`
[%collapsible%open]
====
`actual`:::
The actual value for the bucket.

`by_field_value`:::
The value of the by field.

`field_name`:::
Certain functions require a field to operate on, for example, `sum()`. For
those functions, this value is the name of the field to be analyzed.

`function`:::
The function in which the anomaly occurs, as specified in the detector
configuration. For example, `max`.

`over_field_name`:::
The field used to split the data.

`partition_field_value`:::
The field used to segment the analysis.

`score`:::
A normalized score between 0-100, which is based on the probability of the
anomalousness of this record.

`typical`:::
The typical value for the bucket, according to analytical modeling.
====
[[anomaly-jobs-health-action-variables]]
=== {anomaly-jobs-cap} health action variables

Every health check has two main variables: `context.message` and
`context.results`. The properties of `context.results` may vary based on the
type of check. You can find the possible properties for all the checks below.
==== _Datafeed is not started_

`context.message` ^*^::
A preconstructed message for the alert.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`datafeed_id` ^*^:::
The {dfeed} identifier.

`datafeed_state` ^*^:::
The state of the {dfeed}. It can be `starting`, `started`, `stopping`, or
`stopped`.

`job_id` ^*^:::
The job identifier.

`job_state` ^*^:::
The state of the job. It can be `opening`, `opened`, `closing`, `closed`, or
`failed`.
====
==== _Model memory limit reached_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`job_id` ^*^:::
The job identifier.

`memory_status` ^*^:::
The status of the mathematical model. It can have one of the following values:
+
* `soft_limit`: The model used more than 60% of the configured memory limit and
older unused models will be pruned to free up space. In categorization jobs, no
further category examples will be stored.
* `hard_limit`: The model used more space than the configured memory limit. As
a result, not all incoming data was processed.
+
The `memory_status` is `ok` for recovered alerts.

`model_bytes` ^*^:::
The number of bytes of memory used by the models.

`model_bytes_exceeded` ^*^:::
The number of bytes over the high limit for memory usage at the last allocation
failure.

`model_bytes_memory_limit` ^*^:::
The upper limit for model memory usage.

`log_time` ^*^:::
The timestamp of the model size statistics according to server time. Time
formatting is based on the {kib} settings.

`peak_model_bytes` ^*^:::
The peak number of bytes of memory ever used by the model.
====
==== _Data delay has occurred_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
For recovered alerts, `context.results` is either empty (when there is no
delayed data) or the same as for an active alert (when the number of missing
documents is less than the _Number of documents_ threshold set by the user).
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`annotation` ^*^:::
The annotation corresponding to the data delay in the job.

`end_timestamp` ^*^:::
Timestamp of the latest finalized buckets with missing documents. Time
formatting is based on the {kib} settings.

`job_id` ^*^:::
The job identifier.

`missed_docs_count` ^*^:::
The number of missed documents.
====
==== _Errors in job messages_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`timestamp`:::
Timestamp of the error message.

`job_id`:::
The job identifier.

`message`:::
The error message.

`node_name`:::
The name of the node that runs the job.
====