[role="xpack"]
[testenv="platinum"]
[[ml-configuring-categories]]
=== Detecting anomalous categories of data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. It works best on
machine-written messages and application output that typically consist of
repeated elements. For example, it works well on logs that contain a finite set
of possible messages:

//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"@timestamp":1549596476000,"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]","type":"logs"}
----------------------------------
//NOTCONSOLE

Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data: it would create so many
categories that they couldn't be handled effectively.
Categorization is _not_ natural language processing (NLP).

When you create a categorization {anomaly-job}, the {ml} model learns what
volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.

In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:

[role="screenshot"]
image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]

[%collapsible]
.API example
====
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name": "message", <1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory" <2>
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> This field is used to derive categories.
<2> The categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
====

You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]

For this type of job, the results contain extra information for each anomaly:
the name of the category (for example, `mlcategory 2`) and examples of the
messages in that category.
You can use these details to investigate occurrences of unusually high message
counts.

If you use the advanced {anomaly-job} wizard in {kib} or the
{ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
configuration options. For example, the optional `categorization_examples_limit`
property specifies the maximum number of examples that are stored in memory and
in the results data store for each category. The default value is `4`. Note that
this setting does not affect the categorization itself; it affects only the list
of visible examples. If you increase this value, more examples are available,
but you must have more storage available. If you set this value to `0`, no
examples are stored.

Another advanced option is the `categorization_filters` property, which can
contain an array of regular expressions. If a categorization field value matches
a regular expression, the matched portion of the field is not taken into
consideration when defining categories. The categorization filters are applied
in the order they are listed in the job configuration, which enables you to
disregard multiple sections of the categorization field value. In this example,
you might create a filter like `[ "\\[statement:.*\\]"]` to remove the SQL
statement from the categorization algorithm.

[discrete]
[[ml-configuring-analyzer]]
==== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies.
You can also change the tokenization rules by customizing the way the
categorization field values are interpreted:

[role="screenshot"]
image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
see the
{ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].

If you use the default categorization analyzer in {kib} or omit the
`categorization_analyzer` property from the API, the following default values
are used:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following
analyzer, which is defined using only built-in {es}
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the
end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language. If you
are categorizing messages in a language where words are not separated by spaces,
you must use a different tokenizer as well in order to get sensible
categorization results.

It is important to be aware that analyzing machine-generated log messages for
categorization is a little different from tokenizing for search. Features that
work well for search, such as stemming, synonym substitution, and lowercasing,
are likely to make the results of categorization worse. However, in order for
drill-down from {ml} results to work correctly, the tokens that the
categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, searching for the tokens that the
categorization analyzer produces finds the original document that the
categorization field value came from.
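
When you experiment with a custom categorization analyzer, it can help to
preview the tokens it would produce before you create the job. The following
sketch uses the {ref}/indices-analyze.html[analyze API] with the
`simple_pattern_split` tokenizer from the example above; the sample `text`
value is hypothetical:

[source,console]
----------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "[^-0-9A-Za-z_.]+"
  },
  "text": "org.jdbi.v2.exceptions.UnableToExecuteStatementException: Statement cancelled"
}
----------------------------------

Because the colon and the spaces match the split pattern, the response contains
tokens such as `org.jdbi.v2.exceptions.UnableToExecuteStatementException`,
`Statement`, and `cancelled`. You can add the `filter` array from the job
configuration to the same request to see the effect of the token filters.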
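
For comparison, here is a sketch of a job that uses the `categorization_filters`
option described earlier instead of a custom analyzer, applying the
SQL-statement filter from the same example; the job name is hypothetical:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_filtered
{
  "description" : "IT ops application logs, SQL statements ignored",
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[statement:.*\\]" ],
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------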