
[role="xpack"]
[[ml-configuring-categories]]
=== Detecting anomalous categories of data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. It works best on
machine-written messages and application output that typically consist of
repeated elements. For example, it works well on logs that contain a finite set
of possible messages:

//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"@timestamp":1549596476000,
"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
"type":"logs"}
----------------------------------
//NOTCONSOLE

Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example, email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data: it would create so many
categories that they could not be handled effectively. Categorization is _not_
natural language processing (NLP).

When you create a categorization {anomaly-job}, the {ml} model learns what
volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
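
As a sketch, a job that uses the rare function instead of count would flag
messages whose category seldom occurs. The job ID here is illustrative, not
from this documentation:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_rare_categories
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors": [{
      "function": "rare",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
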

In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:

[role="screenshot"]
image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]

[%collapsible]
.API example
====
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name": "message",<1>
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"<2>
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> This field is used to derive categories.
<2> The categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
====

You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]

For this type of job, the results contain extra information for each anomaly:
the name of the category (for example, `mlcategory 2`) and examples of the
messages in that category. You can use these details to investigate occurrences
of unusually high message counts.
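
You can also retrieve these anomaly records through the API. As a sketch, the
following request fetches the records for the job created earlier, sorted by
anomaly score:

[source,console]
----------------------------------
GET _ml/anomaly_detectors/it_ops_app_logs/results/records
{
  "sort": "record_score",
  "desc": true
}
----------------------------------
// TEST[skip:needs-licence]
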

If you use the advanced {anomaly-job} wizard in {kib} or the
{ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
configuration options. For example, the optional `categorization_examples_limit`
property specifies the maximum number of examples that are stored in memory and
in the results data store for each category. The default value is `4`. Note that
this setting does not affect the categorization itself; it affects only the list
of visible examples. If you increase this value, more examples are available,
but you must have more storage available. If you set this value to `0`, no
examples are stored.
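
As a sketch, the limit is set in the `analysis_limits` section of the job
configuration. The job ID and limit value here are illustrative:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_examples
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "analysis_limits": {
    "categorization_examples_limit": 10
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
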

Another advanced option is the `categorization_filters` property, which can
contain an array of regular expressions. If a categorization field value matches
a regular expression, the matched portion of the field is not taken into
consideration when the categories are defined. The categorization filters are
applied in the order they are listed in the job configuration, which enables you
to disregard multiple sections of the categorization field value. In this
example, you might create a filter like `["\\[statement:.*\\]"]` to exclude the
SQL statement from the categorization algorithm.
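
As a sketch, that filter could be added to the job configuration like this (the
job ID is illustrative):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_filtered
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[statement:.*\\]" ],
    "bucket_span": "30m",
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
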

[discrete]
[[ml-configuring-analyzer]]
==== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
can also change the tokenization rules by customizing the way the categorization
field values are interpreted:

[role="screenshot"]
image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
see the
{ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].
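
As a sketch, a `categorization_analyzer` that strips the SQL statement with a
`pattern_replace` character filter, mirroring the `categorization_filters`
option described earlier, might look like this (the job ID is illustrative):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_custom_analyzer
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "char_filter": [{
        "type": "pattern_replace",
        "pattern": "\\[statement:.*\\]"
      }],
      "tokenizer": "ml_classic"
    },
    "bucket_span": "30m",
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
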

If you use the default categorization analyzer in {kib} or omit the
`categorization_analyzer` property from the API, the following default values
are used:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer": {
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits": {
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field": "time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Tokens basically consist of hyphens, digits, letters, underscores, and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is performance: the `ml_classic` tokenizer is several times
faster. The one difference in behavior is that this custom analyzer does not
include accented letters in tokens, whereas the `ml_classic` tokenizer does,
although that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day and
month words in the stop token filter to the appropriate words in your language.
If you are categorizing messages in a language where words are not separated by
spaces, you must also use a different tokenizer in order to get sensible
categorization results.
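
As a sketch, for German-language logs you might replace the stopword list with
German day and month names. This is an illustration of the idea, not a
configuration from this documentation:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag", "Sonntag",
          "Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
          "August", "September", "Oktober", "November", "Dezember"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
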

It is important to be aware that analyzing machine-generated log messages for
categorization is a little different from tokenizing for search. Features that
work well for search, such as stemming, synonym substitution, and lowercasing,
are likely to make the results of categorization worse. However, for drill-down
from {ml} results to work correctly, the tokens that the categorization
analyzer produces must be similar to those produced by the search analyzer. If
they are sufficiently similar, when you search for the tokens that the
categorization analyzer produces, you find the original document that the
categorization field value came from.
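
As a sketch of that drill-down, if a category example contains the token
`UnableToExecuteStatementException`, a standard match query on the
categorization field should find the source document (the index name is
illustrative):

[source,console]
----------------------------------
GET it-ops-logs/_search
{
  "query": {
    "match": {
      "message": "UnableToExecuteStatementException"
    }
  }
}
----------------------------------
// TEST[skip:needs-licence]
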