[role="xpack"]
[[ml-configuring-alerts]]
= Generating alerts for {anomaly-jobs}

beta::[]

{kib} {alert-features} include support for {ml} rules, which run scheduled
checks for anomalies in one or more {anomaly-jobs} or check the health of the
job with certain conditions. If the conditions of the rule are met, an alert is
created and the associated action is triggered. For example, you can create a
rule to check an {anomaly-job} every fifteen minutes for critical anomalies and
to notify you in an email. To learn more about {kib} {alert-features}, refer to
{kibana-ref}/alerting-getting-started.html#alerting-getting-started[Alerting].

The following {ml} rules are available:

{anomaly-detect-cap} alert::
Checks if the {anomaly-job} results contain anomalies that match the rule
conditions.

{anomaly-jobs-cap} health::
Monitors job health and alerts if an operational issue occurred that may
prevent the job from detecting anomalies.

TIP: If you have created rules for specific {anomaly-jobs} and you want to
monitor whether these jobs work as expected, {anomaly-jobs} health rules are
ideal for this purpose.

[[creating-ml-rules]]
== Creating a rule

You can create {ml} rules in the {anomaly-job} wizard after you start the job,
from the job list, or under **{stack-manage-app} > {alerts-ui}**.

On the *Create rule* window, give a name to the rule and optionally provide
tags. Specify the time interval for the rule to check detected anomalies or job
health changes. It is recommended to select an interval that is close to the
bucket span of the job. You can also select a notification option with the
_Notify_ selector. An alert remains active as long as the configured conditions
are met during the check interval. When there is no matching condition in the
next interval, the `Recovered` action group is invoked and the status of the
alert changes to `OK`. For more details, refer to the documentation of
{kibana-ref}/create-and-manage-rules.html#defining-rules-general-details[general rule details].

Select the rule type you want to create under the {ml} section and continue to
configure it depending on whether it is an
<<creating-anomaly-alert-rules, {anomaly-detect} alert>> or an
<<creating-anomaly-jobs-health-rules, {anomaly-job} health>> rule.

[role="screenshot"]
image::images/ml-rule.jpg["Creating a new machine learning rule"]
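
The general settings above can also be supplied programmatically through the
{kib} create rule API (`POST <kibana>/api/alerting/rule`). The following Python
sketch builds such a request body; the rule type ID and parameter names are
assumptions that mirror the UI labels, so verify them against your {kib}
version before use.

```python
import json

# Sketch: request body for creating an anomaly detection alert rule via the
# Kibana create rule API. The rule_type_id and params field names are
# assumptions mirroring the UI options; verify against your Kibana version.
def build_rule_request(name, job_id, interval="15m", severity=75, tags=None):
    """Build the JSON body for an anomaly detection alert rule."""
    return {
        "name": name,
        "tags": tags or [],
        "rule_type_id": "xpack.ml.anomaly_detection_alert",
        "consumer": "alerts",
        # Check interval: ideally close to the bucket span of the job.
        "schedule": {"interval": interval},
        "params": {
            "jobSelection": {"jobIds": [job_id]},
            "severity": severity,          # anomaly_score threshold
            "resultType": "bucket",
            "includeInterim": False,
        },
        "actions": [],                     # connectors are attached here
    }

body = build_rule_request("cpu-critical-anomalies", "cpu-usage-job")
print(json.dumps(body, indent=2))
```

The body would then be sent with an HTTP client of your choice, together with
the usual {kib} authentication headers.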

[[creating-anomaly-alert-rules]]
=== {anomaly-detect-cap} alert

Select the job that the rule applies to.

You must select a type of {ml} result. In particular, you can create rules based
on bucket, record, or influencer results.

[role="screenshot"]
image::images/ml-anomaly-alert-severity.jpg["Selecting result type, severity, and test interval"]

For each rule, you can configure the `anomaly_score` that triggers the action.
The `anomaly_score` indicates the significance of a given anomaly compared to
previous anomalies. The default severity threshold is 75, which means every
anomaly with an `anomaly_score` of 75 or higher triggers the associated action.

You can select whether you want to include interim results. Interim results are
created by the {anomaly-job} before a bucket is finalized. These results might
disappear after the bucket is fully processed. Include interim results if you
want to be notified earlier about a potential anomaly even if it might be a
false positive. If you want to get notified only about anomalies of fully
processed buckets, do not include interim results.

You can also configure advanced settings. _Lookback interval_ sets an interval
that is used to query previous anomalies during each condition check. By
default, its value is derived from the bucket span of the job and the query
delay of the {dfeed}. It is not recommended to set the lookback interval lower
than the default value as it might result in missed anomalies. _Number of
latest buckets_ sets how many buckets to check to obtain the highest anomaly
from all the anomalies that are found during the _Lookback interval_. An alert
is created based on the anomaly with the highest anomaly score from the most
anomalous bucket.

You can also test the configured conditions against your existing data and check
the sample results by providing a valid interval for your data. The generated
preview contains the number of potentially created alerts during the relative
time range you defined.
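
Put together, the condition settings above form the rule's parameters object.
The sketch below expresses them in Python; the parameter names (`severity`,
`includeInterim`, `lookbackInterval`, `topNBuckets`) are assumptions that
mirror the UI labels, so check them against your {kib} version.

```python
# Sketch: the anomaly alert condition settings as a params object. The
# parameter names are assumptions mirroring the UI labels described above.
def anomaly_alert_params(severity=75, include_interim=False,
                         lookback_interval=None, top_n_buckets=1):
    """Build the params object for an anomaly detection alert rule."""
    params = {
        "resultType": "bucket",            # bucket, record, or influencer
        "severity": severity,              # anomaly_score threshold (default 75)
        "includeInterim": include_interim, # notify on not-yet-final buckets
        "topNBuckets": top_n_buckets,      # number of latest buckets to check
    }
    # Leave lookbackInterval unset to keep the recommended default, which is
    # derived from the job's bucket span and the datafeed's query delay.
    if lookback_interval is not None:
        params["lookbackInterval"] = lookback_interval
    return params

print(anomaly_alert_params(severity=90, include_interim=True))
```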

As the last step in the rule creation process,
<<defining-actions, define the actions>> that occur when the conditions are met.

[[creating-anomaly-jobs-health-rules]]
=== {anomaly-jobs-cap} health

Select the job or group that the rule applies to. If you assign more jobs to the
group, they are included the next time the rule conditions are checked.

You can also use a special character (`*`) to apply the rule to all your jobs.
Jobs created after the rule are automatically included. You can exclude jobs
that are not critically important by using the _Exclude_ field.

Enable the health check types that you want to apply. All checks are enabled by
default. At least one check needs to be enabled to create the rule. The
following health checks are available:

_Datafeed is not started_::
Notifies if the corresponding {dfeed} of the job is not started but the job is
in an opened state. The notification message recommends the necessary actions
to solve the error.

_Model memory limit reached_::
Notifies if the model memory status of the job reaches the soft or hard model
memory limit. Optimize your job by following
<<detector-configuration, these guidelines>> or consider
<<set-model-memory-limit, amending the model memory limit>>.

_Data delay has occurred_::
Notifies when the job missed some data. You can define the threshold for the
amount of missing documents you get alerted on by setting
_Number of documents_. You can control the lookback interval for checking
delayed data with _Time interval_. Refer to the
<<ml-delayed-data-detection>> page to see what to do about delayed data.

_Errors in job messages_::
Notifies when the job messages contain error messages. Review the notification;
it contains the error messages, the corresponding job IDs, and recommendations
on how to fix the issue. This check looks for job errors that occur after the
rule is created; it does not look at historic behavior.

[role="screenshot"]
image::images/ml-health-check-config.jpg["Selecting health checkers"]
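
As a sketch, the job selection and health check toggles above can be expressed
as a parameters object like the following. The field and check names here
(`includeJobs`, `excludeJobs`, `testsConfig`, and the individual check keys)
are assumptions, so verify them against your {kib} version.

```python
# Sketch: params for an anomaly jobs health rule. Check names and field
# names are assumptions mirroring the UI options described above.
def health_rule_params(include=("*",), exclude=(), enabled_checks=None):
    """Build the params object for an anomaly jobs health rule."""
    all_checks = ["datafeed", "mml", "delayedData", "errorMessages"]
    enabled = set(all_checks if enabled_checks is None else enabled_checks)
    # The UI requires at least one health check to be enabled.
    if not enabled:
        raise ValueError("At least one health check must be enabled.")
    return {
        # "*" applies the rule to all jobs, including jobs created later.
        "includeJobs": {"jobIds": list(include)},
        "excludeJobs": {"jobIds": list(exclude)} if exclude else None,
        "testsConfig": {
            name: {"enabled": name in enabled} for name in all_checks
        },
    }

print(health_rule_params(exclude=("test-job",)))
```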

As the last step in the rule creation process,
<<defining-actions, define the actions>> that occur when the conditions are met.

[[defining-actions]]
== Defining actions

Connect your rule to actions that use supported built-in integrations by
selecting a connector type. Connectors are {kib} services or third-party
integrations that perform an action when the rule conditions are met.

[role="screenshot"]
image::images/ml-anomaly-alert-actions.jpg["Selecting connector type"]

For example, you can choose _Slack_ as a connector type and configure it to send
a message to a channel you selected. You can also create an index connector that
writes the JSON object you configure to a specific index. It's also possible to
customize the notification messages. A list of variables is available to include
in the message, such as job ID, anomaly score, time, top influencers, {dfeed}
ID, and memory status, depending on the selected rule type. Refer to
<<action-variables>> to see the full list of available variables by rule type.

[role="screenshot"]
image::images/ml-anomaly-alert-messages.jpg["Customizing your message"]
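
Notification messages reference the action variables with Mustache-style
`{{context.*}}` placeholders. The sketch below builds a Slack action entry with
a customized message; the action group ID is an assumption, and the connector
ID is a hypothetical placeholder for a connector you have already configured.

```python
# Sketch: an action entry with a customized notification message. The
# {{context.*}} placeholders are action variables (see the list below);
# the "anomaly_score_match" group ID is an assumption.
def slack_action(connector_id):
    """Build an action entry that posts a customized message."""
    message = (
        "Anomaly alert for job(s) {{context.jobIds}}: "
        "score {{context.score}} at {{context.timestampIso8601}}. "
        "Details: {{context.anomalyExplorerUrl}}"
    )
    return {
        "id": connector_id,              # ID of a preconfigured connector
        "group": "anomaly_score_match",  # assumed action group ID
        "params": {"message": message},
    }

print(slack_action("my-slack-connector")["params"]["message"])
```

The placeholders are resolved by {kib} when the action runs, so the message
template itself is stored verbatim in the rule.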

After you save the configurations, the rule appears in the *{alerts-ui}* list
where you can check its status and see the overview of its configuration
information.

The name of an alert is always the same as the job ID of the associated
{anomaly-job} that triggered it. You can mute the notifications for a particular
{anomaly-job} on the page of the rule that lists the individual alerts. You can
open it via *{alerts-ui}* by selecting the rule name.

[[action-variables]]
== Action variables

You can add different variables to your action. The following variables are
specific to the {ml} rule types.

[[anomaly-alert-action-variables]]
=== {anomaly-detect-cap} alert action variables

Every {anomaly-detect} alert has the following action variables:

`context`.`anomalyExplorerUrl`::
URL to open in the Anomaly Explorer.

`context`.`isInterim`::
Indicates if top hits contain interim results.

`context`.`jobIds`::
List of job IDs that triggered the alert.

`context`.`message`::
A preconstructed message for the alert.

`context`.`score`::
Anomaly score at the time of the notification action.

`context`.`timestamp`::
The bucket timestamp of the anomaly.

`context`.`timestampIso8601`::
The bucket timestamp of the anomaly in ISO8601 format.

`context`.`topInfluencers`::
The list of top influencers.
+
.Properties of `context.topInfluencers`
[%collapsible%open]
====
`influencer_field_name`:::
The field name of the influencer.

`influencer_field_value`:::
The entity that influenced, contributed to, or was to blame for the anomaly.

`score`:::
The influencer score. A normalized score between 0-100, which shows the
influencer's overall contribution to the anomalies.
====

`context`.`topRecords`::
The list of top records.
+
.Properties of `context.topRecords`
[%collapsible%open]
====
`by_field_value`:::
The value of the by field.

`field_name`:::
Certain functions require a field to operate on, for example, `sum()`. For those
functions, this value is the name of the field to be analyzed.

`function`:::
The function in which the anomaly occurs, as specified in the detector
configuration. For example, `max`.

`over_field_name`:::
The field used to split the data.

`partition_field_value`:::
The field used to segment the analysis.

`score`:::
A normalized score between 0-100, which is based on the probability of the
anomalousness of this record.
====

[[anomaly-jobs-health-action-variables]]
=== {anomaly-jobs-cap} health action variables

Every health check has two main variables: `context.message` and
`context.results`. The properties of `context.results` may vary based on the
type of check. You can find the possible properties for all the checks below.

==== _Datafeed is not started_

`context.message`::
A preconstructed message for the alert.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`datafeed_id`:::
The {dfeed} identifier.

`datafeed_state`:::
The state of the {dfeed}. It can be `starting`, `started`, `stopping`, or
`stopped`.

`job_id`:::
The job identifier.

`job_state`:::
The state of the job. It can be `opening`, `opened`, `closing`, `closed`, or
`failed`.
====

==== _Model memory limit reached_

`context.message`::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`job_id`:::
The job identifier.

`memory_status`:::
The status of the mathematical model. It can have one of the following values:
+
* `soft_limit`: The model used more than 60% of the configured memory limit and
older unused models will be pruned to free up space. In categorization jobs, no
further category examples will be stored.
* `hard_limit`: The model used more space than the configured memory limit. As a
result, not all incoming data was processed.

`model_bytes`:::
The number of bytes of memory used by the models.

`model_bytes_exceeded`:::
The number of bytes over the high limit for memory usage at the last allocation
failure.

`model_bytes_memory_limit`:::
The upper limit for model memory usage.

`log_time`:::
The timestamp of the model size statistics according to server time. Time
formatting is based on the {kib} settings.

`peak_model_bytes`:::
The peak number of bytes of memory ever used by the model.
====
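
When this check fires with a `hard_limit` status, one remedy is to raise the
limit through the {es} update {anomaly-jobs} API
(`POST _ml/anomaly_detectors/<job_id>/_update`). The sketch below only builds
the request body; whether the job must be closed before the update depends on
your {es} version, so check the API reference first.

```python
import json

# Sketch: request body for amending a job's model memory limit via the
# Elasticsearch update anomaly detection jobs API. Only the body is built
# here; send it to POST _ml/anomaly_detectors/<job_id>/_update.
def build_memory_limit_update(new_limit="1024mb"):
    """Build the request body that raises the model memory limit."""
    return {"analysis_limits": {"model_memory_limit": new_limit}}

print(json.dumps(build_memory_limit_update("2048mb")))
```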

==== _Data delay has occurred_

`context.message`::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`annotation`:::
The annotation corresponding to the data delay in the job.

`end_timestamp`:::
Timestamp of the latest finalized buckets with missing documents. Time
formatting is based on the {kib} settings.

`job_id`:::
The job identifier.

`missed_docs_count`:::
The number of missed documents.
====

==== _Errors in job messages_

`context.message`::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`timestamp`:::
Timestamp of the error message. Time formatting is based on the {kib} settings.

`job_id`:::
The job identifier.

`message`:::
The error message.

`node_name`:::
The name of the node that runs the job.
====