123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239 |
- [role="xpack"]
- [testenv="platinum"]
- [[ml-configuring-categories]]
- = Detecting anomalous categories of data
- Categorization is a {ml} process that tokenizes a text field, clusters similar
- data together, and classifies it into categories. It works best on
- machine-written messages and application output that typically consist of
- repeated elements. For example, it works well on logs that contain a finite set
- of possible messages:
- //Obtained from it_ops_new_app_logs.json
- [source,js]
- ----------------------------------
- {"@timestamp":1549596476000,
- "message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
- "type":"logs"}
- ----------------------------------
- //NOTCONSOLE
- Categorization is tuned to work best on data like log messages by taking token
- order into account, including stop words, and not considering synonyms in its
- analysis. Complete sentences in human communication or literary text (for
- example email, wiki pages, prose, or other human-generated content) can be
- extremely diverse in structure. Since categorization is tuned for machine data,
- it gives poor results for human-generated data. It would create so many
- categories that they couldn't be handled effectively. Categorization is _not_
- natural language processing (NLP).
- When you create a categorization {anomaly-job}, the {ml} model learns what
- volume and pattern is normal for each category over time. You can then detect
- anomalies and surface rare events or unusual types of messages by using
- <<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
- In {kib}, there is a categorization wizard to help you create this type of
- {anomaly-job}. For example, the following job generates categories from the
- contents of the `message` field and uses the count function to determine when
- certain categories are occurring at anomalous rates:
- [role="screenshot"]
- image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]
- [%collapsible]
- .API example
- ====
- [source,console]
- ----------------------------------
- PUT _ml/anomaly_detectors/it_ops_app_logs
- {
- "description" : "IT ops application logs",
- "analysis_config" : {
- "categorization_field_name": "message",<1>
- "bucket_span":"30m",
- "detectors" :[{
- "function":"count",
- "by_field_name": "mlcategory"<2>
- }]
- },
- "data_description" : {
- "time_field":"@timestamp"
- }
- }
- ----------------------------------
- // TEST[skip:needs-licence]
- <1> This field is used to derive categories.
- <2> The categories are used in a detector by setting `by_field_name`,
- `over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
- do not specify this keyword in one of those properties, the API request fails.
- ====
- You can use the **Anomaly Explorer** in {kib} to view the analysis results:
- [role="screenshot"]
- image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]
- For this type of job, the results contain extra information for each anomaly:
- the name of the category (for example, `mlcategory 2`) and examples of the
- messages in that category. You can use these details to investigate occurrences
- of unusually high message counts.
- If you use the advanced {anomaly-job} wizard in {kib} or the
- {ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
- configuration options. For example, the optional `categorization_examples_limit`
- property specifies the maximum number of examples that are stored in memory and
- in the results data store for each category. The default value is `4`. Note that
- this setting does not affect the categorization; it just affects the list of
- visible examples. If you increase this value, more examples are available, but
- you must have more storage available. If you set this value to `0`, no examples
- are stored.
- Another advanced option is the `categorization_filters` property, which can
- contain an array of regular expressions. If a categorization field value matches
- the regular expression, the portion of the field that is matched is not taken
- into consideration when defining categories. The categorization filters are
- applied in the order they are listed in the job configuration, which enables you
- to disregard multiple sections of the categorization field value. In this
- example, you might create a filter like `[ "\\[statement:.*\\]"]` to remove the
- SQL statement from the categorization algorithm.
- [discrete]
- [[ml-configuring-analyzer]]
- == Customizing the categorization analyzer
- Categorization uses English dictionary words to identify log message categories.
- By default, it also uses English tokenization rules. For this reason, if you use
- the default categorization analyzer, only English language log messages are
- supported, as described in the <<ml-limitations>>.
- If you use the categorization wizard in {kib}, you can see which categorization
- analyzer it uses and highlighted examples of the tokens that it identifies. You
- can also change the tokenization rules by customizing the way the categorization
- field values are interpreted:
- [role="screenshot"]
- image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]
- The categorization analyzer can refer to a built-in {es} analyzer or a
- combination of zero or more character filters, a tokenizer, and zero or more
- token filters. In this example, adding a
- {ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
- achieves exactly the same behavior as the `categorization_filters` job
- configuration option described earlier. For more details about these properties,
- see the
- {ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].
- If you use the default categorization analyzer in {kib} or omit the
- `categorization_analyzer` property from the API, the following default values
- are used:
- [source,console]
- --------------------------------------------------
- POST _ml/anomaly_detectors/_validate
- {
- "analysis_config" : {
- "categorization_analyzer" : {
- "tokenizer" : "ml_classic",
- "filter" : [
- { "type" : "stop", "stopwords": [
- "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
- "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
- "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
- "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
- "GMT", "UTC"
- ] }
- ]
- },
- "categorization_field_name": "message",
- "detectors" :[{
- "function":"count",
- "by_field_name": "mlcategory"
- }]
- },
- "data_description" : {
- }
- }
- --------------------------------------------------
- If you specify any part of the `categorization_analyzer`, however, any omitted
- sub-properties are _not_ set to default values.
- The `ml_classic` tokenizer and the day and month stopword filter are more or
- less equivalent to the following analyzer, which is defined using only built-in
- {es} {ref}/analysis-tokenizers.html[tokenizers] and
- {ref}/analysis-tokenfilters.html[token filters]:
- [source,console]
- ----------------------------------
- PUT _ml/anomaly_detectors/it_ops_new_logs3
- {
- "description" : "IT Ops Application Logs",
- "analysis_config" : {
- "categorization_field_name": "message",
- "bucket_span":"30m",
- "detectors" :[{
- "function":"count",
- "by_field_name": "mlcategory",
- "detector_description": "Unusual message counts"
- }],
- "categorization_analyzer":{
- "tokenizer": {
- "type" : "simple_pattern_split",
- "pattern" : "[^-0-9A-Za-z_.]+" <1>
- },
- "filter": [
- { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
- { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
- { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
- { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
- { "type" : "stop", "stopwords": [
- "",
- "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
- "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
- "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
- "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
- "GMT", "UTC"
- ] }
- ]
- }
- },
- "analysis_limits":{
- "categorization_examples_limit": 5
- },
- "data_description" : {
- "time_field":"time",
- "time_format": "epoch_ms"
- }
- }
- ----------------------------------
- // TEST[skip:needs-licence]
- <1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
- <2> By default, categorization ignores tokens that begin with a digit.
- <3> By default, categorization also ignores tokens that are hexadecimal numbers.
- <4> Underscores, hyphens, and dots are removed from the beginning of tokens.
- <5> Underscores, hyphens, and dots are also removed from the end of tokens.
- The key difference between the default `categorization_analyzer` and this
- example analyzer is that using the `ml_classic` tokenizer is several times
- faster. The difference in behavior is that this custom analyzer does not include
- accented letters in tokens whereas the `ml_classic` tokenizer does, although
- that could be fixed by using more complex regular expressions.
- If you are categorizing non-English messages in a language where words are
- separated by spaces, you might get better results if you change the day or month
- words in the stop token filter to the appropriate words in your language. If you
- are categorizing messages in a language where words are not separated by spaces,
- you must use a different tokenizer as well in order to get sensible
- categorization results.
- It is important to be aware that analyzing for categorization of machine
- generated log messages is a little different from tokenizing for search.
- Features that work well for search, such as stemming, synonym substitution, and
- lowercasing are likely to make the results of categorization worse. However, in
- order for drill down from {ml} results to work correctly, the tokens that the
- categorization analyzer produces must be similar to those produced by the search
- analyzer. If they are sufficiently similar, when you search for the tokens that
- the categorization analyzer produces then you find the original document that
- the categorization field value came from.
|