
[role="xpack"]
[[ml-configuring-categories]]
=== Categorizing log messages

Application log events are often unstructured and contain variable data. For
example:

//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"time":1454516381000,"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]","type":"logs"}
----------------------------------
//NOTCONSOLE

You can use {ml} to observe the static parts of the message, cluster similar
messages together, and classify them into message categories.

The {ml} model learns what volume and pattern is normal for each category over
time. You can then detect anomalies and surface rare events or unusual types of
messages by using count or rare functions. For example:
//Obtained from it_ops_new_app_logs.sh
[source,js]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message", <1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory", <2>
      "detector_description": "Unusual message counts"
    }],
    "categorization_filters":[ "\\[statement:.*\\]"]
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
// TEST[skip:needs-licence]
<1> The `categorization_field_name` property indicates which field is
categorized.
<2> The resulting categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.

The optional `categorization_examples_limit` property specifies the maximum
number of examples that are stored in memory and in the results data store for
each category. The default value is `4`. Note that this setting does not affect
the categorization; it affects only the list of visible examples. If you
increase this value, more examples are available, but you must have more
storage available. If you set this value to `0`, no examples are stored.
The optional `categorization_filters` property can contain an array of regular
expressions. If a categorization field value matches a regular expression, the
matched portion of the field is not taken into consideration when defining
categories. The categorization filters are applied in the order they are listed
in the job configuration, which allows you to disregard multiple sections of
the categorization field value. In this example, we do not want the detailed
SQL statement to influence the message categorization, so the filter removes it
before the categories are defined.
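To make the effect of a categorization filter concrete, here is a rough Python
sketch (an illustration only, not the actual {ml} implementation). The
`apply_filters` helper is hypothetical; it simply strips each regex match so
that only the static text remains:

```python
import re

# Illustration only: a categorization filter removes the matched portion of
# the field value before the remaining static text defines the category.
categorization_filters = [r"\[statement:.*\]"]

def apply_filters(message, filters):
    """Strip every filter match from the message, in the order listed."""
    for pattern in filters:
        message = re.sub(pattern, "", message)
    return message

raw = ('org.jdbi.v2.exceptions.UnableToExecuteStatementException: '
       'com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled '
       'due to timeout or client request '
       '[statement:"SELECT id, customer_id, name FROM customers"]')

# The variable [statement:...] section is gone; only the static text remains.
print(apply_filters(raw, categorization_filters))
```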
If your data is stored in {es}, you can create an advanced {anomaly-job} with
these same properties:

[role="screenshot"]
image::images/ml-category-advanced.jpg["Advanced job configuration options related to categorization"]

NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.
[float]
[[ml-configuring-analyzer]]
==== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in <<ml-limitations>>.

You can, however, change the tokenization rules by customizing the way the
categorization field values are interpreted. For example:
[source,js]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
// TEST[skip:needs-licence]
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property
value.
<3> By default, English day or month words are filtered from log messages
before categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
language.
The optional `categorization_analyzer` property allows even greater
customization of how categorization interprets the categorization field value.
It can refer to a built-in {es} analyzer or a combination of zero or more
character filters, a tokenizer, and zero or more token filters.
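Those three stages always compose in a fixed order: character filters first,
then the tokenizer, then the token filters. The following Python sketch
illustrates that ordering under the assumption of simple regex-based
components; the `analyze` helper and its inputs are hypothetical, not part of
the {ml} APIs:

```python
import re

def analyze(text, char_filters, tokenizer_pattern, token_filters):
    # 1. Character filters transform the raw field value first.
    for pattern in char_filters:
        text = re.sub(pattern, "", text)
    # 2. The tokenizer then splits the filtered text into tokens.
    tokens = [t for t in re.split(tokenizer_pattern, text) if t]
    # 3. Token filters finally modify or drop individual tokens.
    for token_filter in token_filters:
        tokens = [t for t in map(token_filter, tokens) if t]
    return tokens

# Hypothetical components, loosely mirroring the example above:
char_filters = [r"\[statement:.*\]"]            # zero or more char filters
tokenizer_pattern = r"[^-0-9A-Za-z_.]+"         # one tokenizer
stopwords = {"GMT", "UTC"}
token_filters = [lambda t: "" if t in stopwords else t]  # zero or more filters

print(analyze('shutdown at 10:00 GMT [statement:"SELECT 1"]',
              char_filters, tokenizer_pattern, token_filters))
# → ['shutdown', 'at', '10', '00']
```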
The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
[source,js]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
// TEST[skip:needs-licence]
<1> Tokens consist of hyphens, digits, letters, underscores, and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
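The five callouts above can be approximated in plain Python, as a sketch of how
the tokenizer and pattern-replace filters interact (illustration only; the
`tokenize` helper is hypothetical and the stopword set is abbreviated):

```python
import re

# Abbreviated stopword set; the real filter also lists full day and month names.
STOPWORDS = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "GMT", "UTC"}

def tokenize(message):
    # <1> Split on anything other than hyphens, digits, letters, underscores, dots.
    tokens = re.split(r"[^-0-9A-Za-z_.]+", message)
    result = []
    for token in tokens:
        token = re.sub(r"^[0-9].*", "", token)          # <2> drop tokens that begin with a digit
        token = re.sub(r"^[-0-9A-Fa-f.]+$", "", token)  # <3> drop hexadecimal-looking tokens
        token = re.sub(r"^[^0-9A-Za-z]+", "", token)    # <4> trim leading _, -, .
        token = re.sub(r"[^0-9A-Za-z]+$", "", token)    # <5> trim trailing _, -, .
        if token and token not in STOPWORDS:            # stop filter (also drops "")
            result.append(token)
    return result

print(tokenize("Sat 2023-01-07 deadbeef _service.start_ failed"))
# → ['service.start', 'failed']
```

Note how the timestamp, the hexadecimal ID, and the day abbreviation all
disappear, leaving only the static parts of the message for categorization.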
The key difference between the default `categorization_analyzer` and this
example analyzer is performance: the `ml_classic` tokenizer is several times
faster. The one behavioral difference is that this custom analyzer does not
include accented letters in tokens, whereas the `ml_classic` tokenizer does;
that could be fixed by using more complex regular expressions.

For more information about the `categorization_analyzer` property, see
{ref}/ml-job-resource.html#ml-categorizationanalyzer[Categorization analyzer].

NOTE: To add the `categorization_analyzer` property in {kib}, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of the
API examples above.
[float]
[[ml-viewing-categories]]
==== Viewing categorization results

After you open the job and start the {dfeed} or supply data to the job, you can
view the categorization results in {kib}. For example:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]

For this type of job, the **Anomaly Explorer** contains extra information for
each anomaly: the name of the category (for example, `mlcategory 11`) and
examples of the messages in that category. In this case, you can use these
details to investigate occurrences of unusually high message counts for
specific message categories.