[[search-aggregations-bucket-categorize-text-aggregation]]
=== Categorize text aggregation
++++
<titleabbrev>Categorize text</titleabbrev>
++++

A multi-bucket aggregation that groups semi-structured text into buckets. Each `text` field is re-analyzed
using a custom analyzer. The resulting tokens are then categorized, creating buckets of similarly formatted
text values. This aggregation works best with machine-generated text like system logs. Only the first 100 analyzed
tokens are used to categorize the text.

NOTE: If you have considerable memory allocated to your JVM but are receiving circuit breaker exceptions from this
aggregation, you may be attempting to categorize text that is poorly formatted for categorization. Consider
adding `categorization_filters` or running under <<search-aggregations-bucket-sampler-aggregation,sampler>>,
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>>, or
<<search-aggregations-random-sampler-aggregation,random sampler>> to explore the created categories.

NOTE: The algorithm used for categorization was completely changed in version 8.3.0. As a result this aggregation
will not work in a mixed version cluster where some nodes are on version 8.3.0 or higher and others are
on a version older than 8.3.0. Upgrade all nodes in your cluster to the same version if you experience
an error related to this change.
[[bucket-categorize-text-agg-syntax]]
==== Parameters

`categorization_analyzer`::
(Optional, object or string)
The categorization analyzer specifies how the text is analyzed and tokenized before
being categorized. The syntax is very similar to that used to define the `analyzer` in the
<<indices-analyze,Analyze endpoint>>. This
property cannot be used at the same time as `categorization_filters`.
+
The `categorization_analyzer` field can be specified either as a string or as an
object. If it is a string it must refer to a
<<analysis-analyzers,built-in analyzer>> or one added by another plugin. If it
is an object it has the following properties (a sketch of the object form follows
this parameter list):
+
.Properties of `categorization_analyzer`
[%collapsible%open]
=====
`char_filter`::::
(array of strings or objects)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=char-filter]

`tokenizer`::::
(string or object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]

`filter`::::
(array of strings or objects)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
=====
`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters.

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_matched_tokens`::
(Optional, integer)
This parameter does nothing now, but is permitted for compatibility with the original
pre-8.3.0 implementation.

`max_unique_tokens`::
(Optional, integer)
This parameter does nothing now, but is permitted for compatibility with the original
pre-8.3.0 implementation.

`min_doc_count`::
(Optional, integer)
The minimum number of documents in a bucket for it to be returned in the results.

`shard_min_doc_count`::
(Optional, integer)
The minimum number of documents in a bucket for it to be returned from the shard before
merging.

`shard_size`::
(Optional, integer)
The number of categorization buckets to return from each shard before merging
all the results.

`similarity_threshold`::
(Optional, integer, default: `70`)
The minimum percentage of token weight that must match for text to be added to the
category bucket. Must be between 1 and 100. Larger values create narrower
categories and increase memory usage.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.
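
For illustration, the object form of `categorization_analyzer` might look like the
following sketch. This is not one of the canonical examples: it reuses the
`log-messages` index and `message` field from the examples below, and combines a
`pattern_replace` character filter (equivalent to a `categorization_filters` entry)
with the `ml_standard` tokenizer and a `lowercase` token filter:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_analyzer": {
          "char_filter": [
            {
              "type": "pattern_replace",
              "pattern": "\\w+\\_\\d{3}" <1>
            }
          ],
          "tokenizer": "ml_standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
--------------------------------------------------
<1> Matching sequences are removed before tokenization, equivalent to
`"categorization_filters": ["\\w+\\_\\d{3}"]`.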
[[bucket-categorize-text-agg-response]]
==== Response body

`key`::
(string)
Consists of the tokens (extracted by the `categorization_analyzer`)
that are common to all values of the input field included in the category.

`doc_count`::
(integer)
Number of documents matching the category.

`max_matching_length`::
(integer)
Categories from short messages containing few tokens may also match
categories containing many tokens derived from much longer messages.
`max_matching_length` is an indication of the maximum length of messages
that should be considered to belong to the category. When searching for
messages that match the category, any messages longer than
`max_matching_length` should be excluded. Use this field to prevent a
search for members of a category of short messages from matching much longer
ones.

`regex`::
(string)
A regular expression that will match all values of the input field included
in the category. It is possible that the `regex` does not incorporate every
term in `key`, if ordering varies between the values included in the
category. However, in simple cases the `regex` will be the ordered terms
concatenated into a regular expression that allows for arbitrary sections
in between them. It is not recommended to use the `regex` as the primary
mechanism for searching for the original documents that were categorized,
because searching with a regular expression is very slow. Instead, use the
terms in the `key` field to search for matching documents, as sketched below;
a terms search can use the inverted index and hence be much faster. However,
there may be situations where it is useful to use the `regex` field to test
whether a small set of messages that have not been indexed match the category,
or to confirm that the terms in the `key` occur in the correct order in all the
matched documents.
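
For example, one possible way to find the members of the `Node shutting down`
category from the examples below (the index and field names are those used there)
is a `match` query on the `key` terms with `operator` set to `and`:

[source,console]
--------------------------------------------------
GET log-messages/_search
{
  "query": {
    "match": {
      "message": {
        "query": "Node shutting down", <1>
        "operator": "and" <2>
      }
    }
  }
}
--------------------------------------------------
<1> The terms from the category's `key`.
<2> Require every `key` term to match.

Messages longer than the category's `max_matching_length` can still match such a
query and should be filtered out separately.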
==== Basic use

WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
using the aggregation as a child of either the <<search-aggregations-bucket-sampler-aggregation,sampler>> or
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>> aggregation.
This will typically improve speed and memory use; a sampler-based sketch follows the
first example below.

Example:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message"
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]

Response:
[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "categories" : {
      "buckets" : [
        {
          "doc_count" : 3,
          "key" : "Node shutting down",
          "regex" : ".*?Node.+?shutting.+?down.*?",
          "max_matching_length" : 49
        },
        {
          "doc_count" : 1,
          "key" : "Node starting up",
          "regex" : ".*?Node.+?starting.+?up.*?",
          "max_matching_length" : 47
        },
        {
          "doc_count" : 1,
          "key" : "User foo_325 logging on",
          "regex" : ".*?User.+?foo_325.+?logging.+?on.*?",
          "max_matching_length" : 52
        },
        {
          "doc_count" : 1,
          "key" : "User foo_864 logged off",
          "regex" : ".*?User.+?foo_864.+?logged.+?off.*?",
          "max_matching_length" : 52
        }
      ]
    }
  }
}
--------------------------------------------------
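
As the warning above suggests, the same request can be wrapped in a sampler
aggregation so that only the top-scoring documents on each shard are re-analyzed.
A minimal sketch, with an illustrative `shard_size`:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 100 <1>
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message"
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> Re-analyze at most 100 top-scoring documents per shard.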
Here is an example using `categorization_filters`:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_filters": ["\\w+\\_\\d{3}"] <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]

<1> The filters to apply to the analyzed tokens. These filter
out tokens like `bar_123`.

Note how the `foo_<number>` tokens are not part of the
category results:
[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "categories" : {
      "buckets" : [
        {
          "doc_count" : 3,
          "key" : "Node shutting down",
          "regex" : ".*?Node.+?shutting.+?down.*?",
          "max_matching_length" : 49
        },
        {
          "doc_count" : 1,
          "key" : "Node starting up",
          "regex" : ".*?Node.+?starting.+?up.*?",
          "max_matching_length" : 47
        },
        {
          "doc_count" : 1,
          "key" : "User logged off",
          "regex" : ".*?User.+?logged.+?off.*?",
          "max_matching_length" : 52
        },
        {
          "doc_count" : 1,
          "key" : "User logging on",
          "regex" : ".*?User.+?logging.+?on.*?",
          "max_matching_length" : 52
        }
      ]
    }
  }
}
--------------------------------------------------
Here is an example that combines `categorization_filters` with a custom `similarity_threshold`.
The default analyzer uses the `ml_standard` tokenizer, which is similar to a whitespace tokenizer
but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
also uses the `first_line_with_letters` character filter, so that only the first meaningful line
of multi-line messages is considered.
However, a token may be a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage, as fewer
tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
happen.)
[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_filters": ["\\w+\\_\\d{3}"], <1>
        "similarity_threshold": 11 <2>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]

<1> The filters to apply to the analyzed tokens. These filter
out tokens like `bar_123`.
<2> Require 11% of token weight to match before adding a message to an
existing category rather than creating a new one.

The resulting categories are now very broad, merging the log groups.
(A `similarity_threshold` of 11% is generally too low. Settings over
50% are usually better.)
[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "categories" : {
      "buckets" : [
        {
          "doc_count" : 4,
          "key" : "Node",
          "regex" : ".*?Node.*?",
          "max_matching_length" : 49
        },
        {
          "doc_count" : 2,
          "key" : "User",
          "regex" : ".*?User.*?",
          "max_matching_length" : 52
        }
      ]
    }
  }
}
--------------------------------------------------
This aggregation can have both sub-aggregations and itself be a sub-aggregation. This allows gathering the top daily categories and
a top sample document for each, as shown below.

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "time",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message",
            "categorization_filters": ["\\w+\\_\\d{3}"]
          },
          "aggs": {
            "hit": {
              "top_hits": {
                "size": 1,
                "sort": ["time"],
                "_source": "message"
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]
[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "daily" : {
      "buckets" : [
        {
          "key_as_string" : "2016-02-07T00:00:00.000Z",
          "key" : 1454803200000,
          "doc_count" : 3,
          "categories" : {
            "buckets" : [
              {
                "doc_count" : 2,
                "key" : "Node shutting down",
                "regex" : ".*?Node.+?shutting.+?down.*?",
                "max_matching_length" : 49,
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 2,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "1",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-07T00:00:00+0000 Node 3 shutting down"
                        },
                        "sort" : [
                          1454803260000
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "doc_count" : 1,
                "key" : "Node starting up",
                "regex" : ".*?Node.+?starting.+?up.*?",
                "max_matching_length" : 47,
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "2",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-07T00:00:00+0000 Node 5 starting up"
                        },
                        "sort" : [
                          1454803320000
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },
        {
          "key_as_string" : "2016-02-08T00:00:00.000Z",
          "key" : 1454889600000,
          "doc_count" : 3,
          "categories" : {
            "buckets" : [
              {
                "doc_count" : 1,
                "key" : "Node shutting down",
                "regex" : ".*?Node.+?shutting.+?down.*?",
                "max_matching_length" : 49,
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "4",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-08T00:00:00+0000 Node 5 shutting down"
                        },
                        "sort" : [
                          1454889660000
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "doc_count" : 1,
                "key" : "User logged off",
                "regex" : ".*?User.+?logged.+?off.*?",
                "max_matching_length" : 52,
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "6",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-08T00:00:00+0000 User foo_864 logged off"
                        },
                        "sort" : [
                          1454889840000
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "doc_count" : 1,
                "key" : "User logging on",
                "regex" : ".*?User.+?logging.+?on.*?",
                "max_matching_length" : 52,
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "5",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-08T00:00:00+0000 User foo_325 logging on"
                        },
                        "sort" : [
                          1454889720000
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
--------------------------------------------------
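
Finally, as the warning in the Basic use section notes, large result sets are best
categorized through <<async-search,async search>>. A minimal sketch of the basic
request submitted asynchronously (the `wait_for_completion_timeout` value is
illustrative):

[source,console]
--------------------------------------------------
POST log-messages/_async_search?wait_for_completion_timeout=200ms
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message"
      }
    }
  }
}
--------------------------------------------------

If the aggregation has not completed within the timeout, the response returns an ID
that can be used to retrieve the results later.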