[role="xpack"]
[[ml-configuring-categories]]
=== Categorizing data

Categorization is a {ml} process that considers a tokenization of a field,
clusters similar data together, and classifies them into categories. However,
categorization doesn't work equally well on all data types. It works best on
machine-written messages and application outputs, typically on data that
consists of repeated elements, for example, log messages for the purpose of
system troubleshooting. Log categorization groups unstructured log messages
into categories; you can then use {anomaly-detect} to model and identify rare
or unusual counts of log message categories.

Categorization is tuned to work best on data like log messages by taking token
order into account, not considering synonyms, and including stop words in its
analysis. Complete sentences in human communication or literary text (for
example emails, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Because categorization is tuned for machine
data, it gives poor results on such human-generated data. For example, the
categorization job would create so many categories that they could not be
handled effectively. Categorization is _not_ natural language processing (NLP).

[float]
[[ml-categorization-log-messages]]
==== Categorizing log messages

Application log events are often unstructured and contain variable data. For
example:
//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"time":1454516381000,"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]","type":"logs"}
----------------------------------
//NOTCONSOLE
You can use {ml} to observe the static parts of the message, cluster similar
messages together, and classify them into message categories.

The {ml} model learns what volume and pattern is normal for each category over
time. You can then detect anomalies and surface rare events or unusual types of
messages by using count or rare functions. For example:
//Obtained from it_ops_new_app_logs.sh
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message", <1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory", <2>
      "detector_description": "Unusual message counts"
    }],
    "categorization_filters":[ "\\[statement:.*\\]"]
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> The `categorization_field_name` property indicates which field will be
categorized.
<2> The resulting categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
The optional `categorization_examples_limit` property specifies the maximum
number of examples that are stored in memory and in the results data store for
each category. The default value is `4`. Note that this setting does not affect
the categorization; it just affects the list of visible examples. If you
increase this value, more examples are available, but you must have more
storage available. If you set this value to `0`, no examples are stored.
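
For instance, if you do not need stored examples at all, you could set the
limit to `0`. The following fragment is only a sketch of the relevant
`analysis_limits` object, not a complete job configuration:

[source,js]
----------------------------------
"analysis_limits": {
  "categorization_examples_limit": 0
}
----------------------------------
//NOTCONSOLE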
The optional `categorization_filters` property can contain an array of regular
expressions. If a categorization field value matches the regular expression, the
portion of the field that is matched is not taken into consideration when
defining categories. The categorization filters are applied in the order they
are listed in the job configuration, which allows you to disregard multiple
sections of the categorization field value. In this example, we have decided that
we do not want the detailed SQL to be considered in the message categorization.
This particular categorization filter removes the SQL statement from the
categorization algorithm.
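
For example, a job that also needs to disregard a variable session identifier
could list two filters, which are applied in order. This fragment is a sketch;
the second pattern is hypothetical and is shown only to illustrate filtering
multiple sections of the field value:

[source,js]
----------------------------------
"categorization_filters": [
  "\\[statement:.*\\]",
  "session_id=[0-9a-f]+"
]
----------------------------------
//NOTCONSOLE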
If your data is stored in {es}, you can create an advanced {anomaly-job} with
these same properties:

[role="screenshot"]
image::images/ml-category-advanced.jpg["Advanced job configuration options related to categorization"]

NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.
[float]
[[ml-configuring-analyzer]]
===== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.

You can, however, change the tokenization rules by customizing the way the
categorization field values are interpreted. For example:
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property
value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
language.
The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in {es} analyzer or a combination of zero or more character filters,
a tokenizer, and zero or more token filters. If you omit the
`categorization_analyzer`, the following default values are used:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.
If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language. If you
are categorizing messages in a language where words are not separated by spaces,
you must use a different tokenizer as well in order to get sensible
categorization results.
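
For example, for German-language logs you might replace the English stop words
with their German equivalents. The following fragment is only a sketch of such a
stop token filter, listing the full day and month names:

[source,js]
----------------------------------
{ "type" : "stop", "stopwords": [
  "Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag", "Sonntag",
  "Januar", "Februar", "März", "April", "Mai", "Juni", "Juli", "August", "September", "Oktober", "November", "Dezember",
  "GMT", "UTC"
] }
----------------------------------
//NOTCONSOLE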
It is important to be aware that analyzing for categorization of machine
generated log messages is a little different from tokenizing for search.
Features that work well for search, such as stemming, synonym substitution, and
lowercasing are likely to make the results of categorization worse. However, in
order for drill down from {ml} results to work correctly, the tokens that the
categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, when you search for the tokens that
the categorization analyzer produces then you find the original document that
the categorization field value came from.

NOTE: To add the `categorization_analyzer` property in {kib}, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of the
API examples above.
[float]
[[ml-viewing-categories]]
===== Viewing categorization results

After you open the job and start the {dfeed} or supply data to the job, you can
view the categorization results in {kib}. For example:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]

For this type of job, the **Anomaly Explorer** contains extra information for
each anomaly: the name of the category (for example, `mlcategory 11`) and
examples of the messages in that category. In this case, you can use these
details to investigate occurrences of unusually high message counts for specific
message categories.
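
You can also retrieve the category definitions and their examples outside {kib}
by using the get categories API. For example, for the job from the first example
(the `page` size shown here is purely illustrative):

[source,console]
----------------------------------
GET _ml/anomaly_detectors/it_ops_new_logs/results/categories
{
  "page": { "size": 5 }
}
----------------------------------
// TEST[skip:needs-licence]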