
[role="xpack"]
[[ml-configuring-categories]]
= Detecting anomalous categories of data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. It works best on
machine-written messages and application output that typically consist of
repeated elements. For example, it works well on logs that contain a finite set
of possible messages:
//Obtained from it_ops_new_app_logs.json

[source,js]
----------------------------------
{"@timestamp":1549596476000,
"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
"type":"logs"}
----------------------------------
//NOTCONSOLE
Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data. It would create so many
categories that they couldn't be handled effectively. Categorization is _not_
natural language processing (NLP).
When you create a categorization {anomaly-job}, the {ml} model learns what
volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:

[role="screenshot"]
image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]
[%collapsible]
.API example
====
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name": "message", <1>
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory" <2>
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> This field is used to derive categories.
<2> The categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
====
You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]

For this type of job, the results contain extra information for each anomaly:
the name of the category (for example, `mlcategory 2`) and examples of the
messages in that category. You can use these details to investigate occurrences
of unusually high message counts.
If you use the advanced {anomaly-job} wizard in {kib} or the
{ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
configuration options. For example, the optional `categorization_examples_limit`
property specifies the maximum number of examples that are stored in memory and
in the results data store for each category. The default value is `4`. Note that
this setting does not affect the categorization; it affects only the list of
visible examples. If you increase this value, more examples are available, but
you must have more storage available. If you set this value to `0`, no examples
are stored.
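As a sketch of where this property lives, the following job keeps up to 10
examples per category. Note that `categorization_examples_limit` belongs in
`analysis_limits`, not `analysis_config`; the job name here is illustrative:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_examples
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "analysis_limits" : {
    "categorization_examples_limit": 10 <1>
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Stores up to 10 examples per category instead of the default 4.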
Another advanced option is the `categorization_filters` property, which can
contain an array of regular expressions. If a categorization field value matches
the regular expression, the portion of the field that is matched is not taken
into consideration when defining categories. The categorization filters are
applied in the order they are listed in the job configuration, which enables you
to disregard multiple sections of the categorization field value. In this
example, you might create a filter like `["\\[statement:.*\\]"]` to remove the
SQL statement from the categorization algorithm.
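A minimal sketch of how that filter could be added to the earlier job
configuration (the job name and description are illustrative):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_filtered
{
  "description" : "IT ops application logs, SQL statements excluded",
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[statement:.*\\]" ], <1>
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> The portion of each field value that matches this regular expression is
ignored when categories are defined.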
[discrete]
[[ml-per-partition-categorization]]
== Per-partition categorization

If you enable per-partition categorization, categories are determined
independently for each partition. For example, if your data includes messages
from multiple types of logs from different applications, you can use a field
like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
`partition_field_name` and categorize the messages for each type of log
separately.
If your job has multiple detectors, every detector that uses the `mlcategory`
keyword must also define a `partition_field_name`. You must use the same
`partition_field_name` value in all of these detectors. Otherwise, when you
create or update a job and enable per-partition categorization, it fails.

When per-partition categorization is enabled, you can also take advantage of a
`stop_on_warn` configuration option. If the categorization status for a
partition changes to `warn`, it doesn't categorize well and can cause a lot of
unnecessary resource usage. When you set `stop_on_warn` to `true`, the job stops
analyzing these problematic partitions. You can thus avoid an ongoing
performance cost for partitions that are unsuitable for categorization.
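Putting these options together, the following sketch categorizes messages
separately per `event.dataset` and stops analyzing partitions whose
categorization status becomes `warn` (the job name is illustrative):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_logs_per_partition
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "per_partition_categorization": {
      "enabled": true, <1>
      "stop_on_warn": true <2>
    },
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset" <3>
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Categories are determined independently for each partition.
<2> Partitions whose categorization status changes to `warn` are no longer
analyzed.
<3> Every detector that uses `mlcategory` must use the same
`partition_field_name`.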
[discrete]
[[ml-configuring-analyzer]]
== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
can also change the tokenization rules by customizing the way the categorization
field values are interpreted:
[role="screenshot"]
image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
see the
{ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].

If you use the default categorization analyzer in {kib} or omit the
`categorization_analyzer` property from the API, the following default values
are used:
[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "char_filter" : [
        "first_line_with_letters"
      ],
      "tokenizer" : "ml_standard",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_standard` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer": {
      "char_filter" : [
        "first_line_with_letters" <1>
      ],
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_./]+" <2>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits": {
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field": "time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Only consider the first line of the message with letters for categorization purposes.
<2> Tokens basically consist of hyphens, digits, letters, underscores, dots, and slashes.
<3> By default, categorization ignores tokens that begin with a digit.
<4> By default, categorization also ignores tokens that are hexadecimal numbers.
<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
<6> Underscores, hyphens, and dots are also removed from the end of tokens.
The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_standard` tokenizer is several times
faster. The `ml_standard` tokenizer also tries to preserve URLs, Windows paths,
and email addresses as single tokens. Another difference in behavior is that
this custom analyzer does not include accented letters in tokens whereas the
`ml_standard` tokenizer does, although that could be fixed by using more complex
regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language. If you
are categorizing messages in a language where words are not separated by spaces,
you must also use a different tokenizer in order to get sensible categorization
results.
It is important to be aware that analyzing machine-generated log messages for
categorization is a little different from tokenizing for search. Features that
work well for search, such as stemming, synonym substitution, and lowercasing,
are likely to make the results of categorization worse. However, in order for
drill down from {ml} results to work correctly, the tokens that the
categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, when you search for the tokens that
the categorization analyzer produces, you find the original document that the
categorization field value came from.