[role="xpack"]
[testenv="platinum"]
[[ml-configuring-categories]]
= Detecting anomalous categories of data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. It works best on
machine-written messages and application output that typically consist of
repeated elements. For example, it works well on logs that contain a finite set
of possible messages:

//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"@timestamp":1549596476000,
"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
"type":"logs"}
----------------------------------
//NOTCONSOLE

Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data: it would create so many
categories that they could not be handled effectively. Categorization is _not_
natural language processing (NLP).

When you create a categorization {anomaly-job}, the {ml} model learns what
volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.

In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:

[role="screenshot"]
image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]

[%collapsible]
.API example
====
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name": "message", <1>
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory" <2>
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> This field is used to derive categories.
<2> The categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
====

You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]

For this type of job, the results contain extra information for each anomaly:
the name of the category (for example, `mlcategory 2`) and examples of the
messages in that category. You can use these details to investigate occurrences
of unusually high message counts.

If you use the advanced {anomaly-job} wizard in {kib} or the
{ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
configuration options. For example, the optional `categorization_examples_limit`
property specifies the maximum number of examples that are stored in memory and
in the results data store for each category. The default value is `4`. Note that
this setting does not affect the categorization; it just affects the list of
visible examples. If you increase this value, more examples are available, but
you must have more storage available. If you set this value to `0`, no examples
are stored.

Another advanced option is the `categorization_filters` property, which can
contain an array of regular expressions. If a categorization field value matches
a regular expression, the matched portion of the field is not taken into
consideration when defining categories. The categorization filters are applied
in the order they are listed in the job configuration, which enables you to
disregard multiple sections of the categorization field value. In this example,
you might create a filter like `[ "\\[statement:.*\\]"]` to remove the SQL
statement from the categorization algorithm.

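The two options can be combined in a single job configuration. The following is
only a sketch: the job ID and the example limit of `10` are illustrative, not
part of the original example.

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_filtered
{
  "description" : "IT ops application logs, SQL statements filtered out",
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[statement:.*\\]" ], <1>
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "analysis_limits": {
    "categorization_examples_limit": 10 <2>
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> The matched portion (the SQL statement) is ignored when categories are defined.
<2> Stores up to ten examples per category instead of the default four.
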
[discrete]
[[ml-configuring-analyzer]]
== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
can also change the tokenization rules by customizing the way the categorization
field values are interpreted:

[role="screenshot"]
image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
see the
{ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].

If you use the default categorization analyzer in {kib} or omit the
`categorization_analyzer` property from the API, the following default values
are used:

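For example, a `categorization_analyzer` that uses such a character filter might
look like the following sketch, which reuses the SQL-statement pattern from the
`categorization_filters` example earlier:

[source,console]
----------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "char_filter" : [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer" : "ml_classic"
    },
    "categorization_field_name": "message",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : { }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Removes the SQL statement from the field value before tokenization.
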
[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer": {
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits": {
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field": "time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Tokens basically consist of hyphens, digits, letters, underscores, and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day and
month words in the stop token filter to the appropriate words in your language.
If you are categorizing messages in a language where words are not separated by
spaces, you must also use a different tokenizer to get sensible categorization
results.

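For example, for German-language logs you might replace the English stop words
with their German equivalents. This is only a sketch; the list below is
illustrative and not exhaustive (it omits abbreviated forms, for instance):

[source,console]
----------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag", "Sonntag",
          "Januar", "Februar", "März", "April", "Mai", "Juni", "Juli", "August", "September", "Oktober", "November", "Dezember",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : { }
}
----------------------------------
// TEST[skip:needs-licence]
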
It is important to be aware that analyzing machine-generated log messages for
categorization is a little different from tokenizing for search. Features that
work well for search, such as stemming, synonym substitution, and lowercasing,
are likely to make the results of categorization worse. However, in order for
drill down from {ml} results to work correctly, the tokens that the
categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, when you search for the tokens that
the categorization analyzer produces, you find the original document that the
categorization field value came from.
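
One way to inspect the tokens that an analyzer definition produces is the
{ref}/indices-analyze.html[analyze API], which accepts an inline tokenizer
definition. This sketch reuses the custom tokenizer from the example above with
a fragment of the earlier log message; the output shows the tokens that
categorization would work with, so you can compare them with the tokens your
search analyzer produces:

[source,console]
----------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "[^-0-9A-Za-z_.]+"
  },
  "text": "SELECT id, customer_id, name FROM customers"
}
----------------------------------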