[role="xpack"]
[testenv="platinum"]
[[ml-configuring-categories]]
= Detecting anomalous categories of data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. It works best on
machine-written messages and application output that typically consist of
repeated elements. For example, it works well on logs that contain a finite set
of possible messages:

//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"@timestamp":1549596476000,
"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
"type":"logs"}
----------------------------------
//NOTCONSOLE
Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data: it would create so many
categories that they could not be handled effectively. Categorization is _not_
natural language processing (NLP).

When you create a categorization {anomaly-job}, the {ml} model learns what
volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.

In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:
[role="screenshot"]
image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]

[%collapsible]
.API example
====
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name": "message",<1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"<2>
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> This field is used to derive categories.
<2> The categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
====
You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]

For this type of job, the results contain extra information for each anomaly:
the name of the category (for example, `mlcategory 2`) and examples of the
messages in that category. You can use these details to investigate occurrences
of unusually high message counts.
If you use the advanced {anomaly-job} wizard in {kib} or the
{ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
configuration options. For example, the optional `categorization_examples_limit`
property specifies the maximum number of examples that are stored in memory and
in the results data store for each category. The default value is `4`. Note that
this setting does not affect the categorization itself; it affects only the list
of visible examples. If you increase this value, more examples are available,
but you must have more storage available. If you set this value to `0`, no
examples are stored.
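
As a sketch, you could raise the limit when you create a job like the one shown
earlier. The `categorization_examples_limit` property belongs in the
`analysis_limits` object; the job name and value here are illustrative:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_examples
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "analysis_limits" : {
    "categorization_examples_limit": 10 <1>
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Store up to ten example messages per category instead of the default four.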
Another advanced option is the `categorization_filters` property, which can
contain an array of regular expressions. If a categorization field value matches
a regular expression, the matched portion of the field is not taken into
consideration when defining categories. The categorization filters are applied
in the order they are listed in the job configuration, which enables you to
disregard multiple sections of the categorization field value. For the example
log message shown earlier, you might create a filter like
`[ "\\[statement:.*\\]"]` to exclude the SQL statement from the categorization
algorithm.
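
For illustration, that filter could be added to the job from the earlier API
example as follows (the job name is illustrative):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_filtered
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[statement:.*\\]" ], <1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Text that matches this regular expression is ignored when categories are
defined, so SQL statements embedded in the messages do not split otherwise
identical messages into separate categories.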
[discrete]
[[ml-per-partition-categorization]]
== Per-partition categorization

If you enable per-partition categorization, categories are determined
independently for each partition. For example, if your data includes messages
from multiple types of logs from different applications, you can use a field
like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
`partition_field_name` and categorize the messages for each type of log
separately.

If your job has multiple detectors, every detector that uses the `mlcategory`
keyword must also define a `partition_field_name`. You must use the same
`partition_field_name` value in all of these detectors. Otherwise, when you
create or update a job and enable per-partition categorization, it fails.
When per-partition categorization is enabled, you can also take advantage of the
`stop_on_warn` configuration option. If the categorization status for a
partition changes to `warn`, the data in that partition does not categorize well
and continuing to analyze it can cause a lot of unnecessary resource usage. When
you set `stop_on_warn` to `true`, the job stops analyzing these problematic
partitions. You can thus avoid an ongoing performance cost for partitions that
are unsuitable for categorization.
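
A minimal sketch of such a job, assuming `event.dataset` carries the log type
(the job name is illustrative):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_by_dataset
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "per_partition_categorization": {
      "enabled": true, <1>
      "stop_on_warn": true <2>
    },
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset"
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Categories are determined independently for each value of `event.dataset`.
<2> Partitions whose categorization status changes to `warn` are no longer
analyzed.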
[discrete]
[[ml-configuring-analyzer]]
== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
can also change the tokenization rules by customizing the way the categorization
field values are interpreted:

[role="screenshot"]
image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]
The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
see the
{ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].
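
For illustration, a `categorization_analyzer` that strips the SQL statement with
a `pattern_replace` character filter could look like this sketch (the job name
is illustrative):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_analyzer
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_analyzer" : {
      "char_filter" : [
        { "type" : "pattern_replace", "pattern" : "\\[statement:.*\\]" } <1>
      ],
      "tokenizer" : "ml_classic"
    },
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> The matched text is removed before tokenization, like the
`categorization_filters` option.

Note that because parts of the `categorization_analyzer` are specified here,
omitted sub-properties such as the default stopword token filter are not
applied.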
If you use the default categorization analyzer in {kib} or omit the
`categorization_analyzer` property from the API, the following default values
are used:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
The key difference between the default `categorization_analyzer` and this
example analyzer is speed: the `ml_classic` tokenizer is several times faster.
The one difference in behavior is that this custom analyzer does not include
accented letters in tokens, whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day and
month words in the stop token filter to the appropriate words in your language.
If you are categorizing messages in a language where words are not separated by
spaces, you must also use a different tokenizer in order to get sensible
categorization results.
It is important to be aware that analyzing machine-generated log messages for
categorization is a little different from tokenizing for search. Features that
work well for search, such as stemming, synonym substitution, and lowercasing,
are likely to make the results of categorization worse. However, for drill-down
from {ml} results to work correctly, the tokens that the categorization analyzer
produces must be similar to those produced by the search analyzer. If they are
sufficiently similar, when you search for the tokens that the categorization
analyzer produces, you find the original document that the categorization field
value came from.