@@ -21,8 +21,8 @@ of possible messages:
Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
-example email, wiki pages, prose, or other human-generated content) can be
-extremely diverse in structure. Since categorization is tuned for machine data,
+example email, wiki pages, prose, or other human-generated content) can be
+extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data. It would create so many
categories that they couldn't be handled effectively. Categorization is _not_
natural language processing (NLP).
@@ -32,7 +32,7 @@ volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.

-In {kib}, there is a categorization wizard to help you create this type of
+In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:
@@ -69,7 +69,7 @@ do not specify this keyword in one of those properties, the API request fails.
====


-You can use the **Anomaly Explorer** in {kib} to view the analysis results:
+You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]
@@ -105,7 +105,7 @@ SQL statement from the categorization algorithm.
If you enable per-partition categorization, categories are determined
independently for each partition. For example, if your data includes messages
from multiple types of logs from different applications, you can use a field
-like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
+like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
`partition_field_name` and categorize the messages for each type of log
separately.

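As a rough sketch of what such a per-partition job could look like (this block is illustrative and not part of the change above; the `message`, `event.dataset`, and `@timestamp` field names are assumptions):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/per_partition_categorization_example
{
  "analysis_config" : {
    "bucket_span" : "15m",
    "categorization_field_name" : "message", <1>
    "per_partition_categorization" : {
      "enabled" : true
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory",
      "partition_field_name" : "event.dataset", <2>
      "detector_description" : "Unusual message counts per log type"
    }]
  },
  "data_description" : {
    "time_field" : "@timestamp" <3>
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> Assumed name of the field that holds the raw log message.
<2> Assumed partition field; each `event.dataset` value gets its own set of categories.
<3> Assumed time field for the example data.
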
@@ -116,7 +116,7 @@ create or update a job and enable per-partition categorization, it fails.

When per-partition categorization is enabled, you can also take advantage of a
`stop_on_warn` configuration option. If the categorization status for a
-partition changes to `warn`, it doesn't categorize well and can cause a lot of
+partition changes to `warn`, it doesn't categorize well and can cause a lot of
unnecessary resource usage. When you set `stop_on_warn` to `true`, the job stops
analyzing these problematic partitions. You can thus avoid an ongoing
performance cost for partitions that are unsuitable for categorization.
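Continuing the assumed sketch above, `stop_on_warn` is just an extra flag inside the same `per_partition_categorization` object:

[source,js]
----------------------------------
"per_partition_categorization" : {
  "enabled" : true,
  "stop_on_warn" : true
}
----------------------------------
// NOTCONSOLE
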
@@ -128,7 +128,7 @@ performance cost for partitions that are unsuitable for categorization.
Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
-supported, as described in the <<ml-limitations>>.
+supported, as described in the <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
@@ -140,7 +140,7 @@ image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in K

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
-token filters. In this example, adding a
+token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
@@ -157,7 +157,10 @@ POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
-      "tokenizer" : "ml_classic",
+      "char_filter" : [
+        "first_non_blank_line"
+      ],
+      "tokenizer" : "ml_standard",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -182,8 +185,8 @@ POST _ml/anomaly_detectors/_validate
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

-The `ml_classic` tokenizer and the day and month stopword filter are more or
-less equivalent to the following analyzer, which is defined using only built-in
+The `ml_standard` tokenizer and the day and month stopword filter are more or
+less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

@@ -201,15 +204,18 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
+      "char_filter" : [
+        "first_non_blank_line" <1>
+      ],
      "tokenizer": {
        "type" : "simple_pattern_split",
-        "pattern" : "[^-0-9A-Za-z_.]+" <1>
+        "pattern" : "[^-0-9A-Za-z_./]+" <2>
      },
      "filter": [
-        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
-        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
-        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
-        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
+        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
+        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
+        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
+        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -232,17 +238,20 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
----------------------------------
// TEST[skip:needs-licence]

-<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
-<2> By default, categorization ignores tokens that begin with a digit.
-<3> By default, categorization also ignores tokens that are hexadecimal numbers.
-<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
-<5> Underscores, hyphens, and dots are also removed from the end of tokens.
+<1> Only consider the first non-blank line of the message for categorization purposes.
+<2> Tokens basically consist of hyphens, digits, letters, underscores, dots and slashes.
+<3> By default, categorization ignores tokens that begin with a digit.
+<4> By default, categorization also ignores tokens that are hexadecimal numbers.
+<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
+<6> Underscores, hyphens, and dots are also removed from the end of tokens.

-The key difference between the default `categorization_analyzer` and this
-example analyzer is that using the `ml_classic` tokenizer is several times
-faster. The difference in behavior is that this custom analyzer does not include
-accented letters in tokens whereas the `ml_classic` tokenizer does, although
-that could be fixed by using more complex regular expressions.
+The key difference between the default `categorization_analyzer` and this
+example analyzer is that using the `ml_standard` tokenizer is several times
+faster. The `ml_standard` tokenizer also tries to preserve URLs, Windows paths
+and email addresses as single tokens. Another difference in behavior is that
+this custom analyzer does not include accented letters in tokens whereas the
+`ml_standard` tokenizer does, although that could be fixed by using more complex
+regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month