[role="xpack"]
[[ml-configuring-categories]]
=== Categorizing log messages

Application log events are often unstructured and contain variable data. For
example:

//Obtained from it_ops_new_app_logs.json

[source,js]
----------------------------------
{"time":1454516381000,"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]","type":"logs"}
----------------------------------
//NOTCONSOLE

You can use {ml} to observe the static parts of the message, cluster similar
messages together, and classify them into message categories.

The {ml} model learns what volume and pattern is normal for each category over
time. You can then detect anomalies and surface rare events or unusual types of
messages by using count or rare functions.
For example:

//Obtained from it_ops_new_app_logs.sh

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message", <1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory", <2>
      "detector_description": "Unusual message counts"
    }],
    "categorization_filters":[ "\\[statement:.*\\]"]
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
// TEST[skip:needs-licence]

<1> The `categorization_field_name` property indicates which field will be
categorized.
<2> The resulting categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.

The optional `categorization_examples_limit` property specifies the
maximum number of examples that are stored in memory and in the results data
store for each category. The default value is `4`. Note that this setting does
not affect the categorization; it just affects the list of visible examples. If
you increase this value, more examples are available, but you must have more
storage available. If you set this value to `0`, no examples are stored.

The optional `categorization_filters` property can contain an array of regular
expressions. If a categorization field value matches the regular expression, the
portion of the field that is matched is not taken into consideration when
defining categories. The categorization filters are applied in the order they
are listed in the job configuration, which allows you to disregard multiple
sections of the categorization field value.
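To build intuition for how such a filter behaves, here is a minimal Python sketch (an analogy, not the actual {ml} implementation) that strips the matched portion of a message, leaving only the static text that categorization would consider:

```python
import re

# Stand-in for the job's categorization_filters setting. Matched portions
# are excluded when categories are defined; here we simply remove them to
# show which text the categorizer would still see.
categorization_filters = [r"\[statement:.*\]"]

def apply_filters(message, filters):
    """Remove every filter match from the message, in listed order."""
    for pattern in filters:
        message = re.sub(pattern, "", message)
    return message

msg = ('org.jdbi.v2.exceptions.UnableToExecuteStatementException: '
       'com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled '
       'due to timeout or client request '
       '[statement:"SELECT id, customer_id, name FROM customers"]')

print(apply_filters(msg, categorization_filters))
# The variable SQL detail is gone; the static exception text remains.
```

Because each filter is applied in turn, listing several patterns lets you mask several variable sections of the same message.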
In this example, we have decided that
we do not want the detailed SQL to be considered in the message categorization.
This particular categorization filter excludes the SQL statement from
consideration by the categorization algorithm.

If your data is stored in {es}, you can create an advanced job with these same
properties:

[role="screenshot"]
image::images/ml-category-advanced.jpg["Advanced job configuration options related to categorization"]

NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.

[float]
[[ml-configuring-analyzer]]
==== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in <<ml-limitations>>.

You can, however, change the tokenization rules by customizing the way the
categorization field values are interpreted.
For example:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
// TEST[skip:needs-licence]

<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property value.
<3> By default, English day or month words are filtered from log messages before
categorization.
If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
language.

The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in {es} analyzer or a combination of zero or more character filters,
a tokenizer, and zero or more token filters.

The `ml_classic` tokenizer and the day and month stopword filter are more or less
equivalent to the following analyzer, which is defined using only built-in {es}
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
"categorization_examples_limit": 5  },  "data_description" : {    "time_field":"time",    "time_format": "epoch_ms"  }}----------------------------------//CONSOLE// TEST[skip:needs-licence]<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.<2> By default, categorization ignores tokens that begin with a digit.<3> By default, categorization also ignores tokens that are hexadecimal numbers.<4> Underscores, hyphens, and dots are removed from the beginning of tokens.<5> Underscores, hyphens, and dots are also removed from the end of tokens.The key difference between the default `categorization_analyzer` and this exampleanalyzer is that using the `ml_classic` tokenizer is several times faster. Thedifference in behavior is that this custom analyzer does not include accentedletters in tokens whereas the `ml_classic` tokenizer does, although that couldbe fixed by using more complex regular expressions.For more information about the `categorization_analyzer` property, see{ref}/ml-job-resource.html#ml-categorizationanalyzer[Categorization Analyzer].NOTE: To add the `categorization_analyzer` property in {kib}, you must use the**Edit JSON** tab and copy the `categorization_analyzer` object from one of theAPI examples above.[float][[ml-viewing-categories]]==== Viewing categorization resultsAfter you open the job and start the {dfeed} or supply data to the job, you canview the categorization results in {kib}. For example:[role="screenshot"]image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]For this type of job, the **Anomaly Explorer** contains extra information foreach anomaly: the name of the category (for example, `mlcategory 11`) andexamples of the messages in that category. In this case, you can use thesedetails to investigate occurrences of unusually high message counts for specificmessage categories.