[ML] Improve categorize_text docs (#90765)

Adds more detail about the meaning of the results
fields of the `categorize_text` aggregation, and
advice about how to use these fields when searching
for messages that match the categories.

Followup to #90723
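
As an illustration of the search advice this commit documents, here is a minimal Python sketch of how a client might use a `categorize_text` bucket. It is not part of the commit. The helper names, the `message` field, and the sample bucket values are hypothetical. The sketch also assumes that `max_matching_length` can be compared against the character length of a message, and that the `regex` value uses syntax Python's `re` module accepts.

```python
import re


def build_category_query(bucket, field):
    """Build a match query requiring every term of the category key.

    Assumption: the searched field's analyzer yields tokens comparable to
    those produced by the categorization_analyzer.
    """
    return {
        "query": {
            "match": {
                field: {"query": bucket["key"], "operator": "and"}
            }
        }
    }


def belongs_to_category(message, bucket, check_regex=False):
    """Client-side check of one message against a category bucket."""
    # Exclude messages longer than max_matching_length (assumed here to be
    # a character count).
    if len(message) > bucket["max_matching_length"]:
        return False
    # Optional slow path: confirm the terms occur in the expected order.
    if check_regex:
        return re.fullmatch(bucket["regex"], message, flags=re.DOTALL) is not None
    return True


# Hypothetical bucket, shaped like the response fields described in this commit.
bucket = {
    "key": "Node shutting down",
    "doc_count": 1,
    "max_matching_length": 49,
    "regex": ".*?Node.+?shutting.+?down.*?",
}

query = build_category_query(bucket, "message")
```

The term-based query does the bulk of the work against the inverted index. The length check and the optional regex check are applied afterwards, to a small number of candidate messages.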
David Roberts 3 years ago
parent
commit
be006e2eee

+ 1 - 1
docs/reference/aggregations/bucket/adjacency-matrix-aggregation.asciidoc

@@ -134,7 +134,7 @@ Separator used to concatenate filter names. Defaults to `&`.
 Filters for the bucket. If the bucket uses multiple filters, filter names are
 concatenated using a `separator`.
 
-`document_count`::
+`doc_count`::
 (integer)
 Number of documents matching the bucket's filters.
 

+ 40 - 0
docs/reference/aggregations/bucket/categorize-text-aggregation.asciidoc

@@ -106,6 +106,46 @@ Larger values will increase memory usage and create narrower categories.
 (Optional, integer, default: `10`)
 The number of buckets to return.
 
+[[bucket-categorize-text-agg-response]]
+==== Response body
+
+`key`::
+(string)
+Consists of the tokens (extracted by the `categorization_analyzer`)
+that are common to all values of the input field included in the category.
+
+`doc_count`::
+(integer)
+Number of documents matching the category.
+
+`max_matching_length`::
+(integer)
+A category derived from short messages contains few tokens, so a search
+for those tokens may also match much longer messages that belong to
+categories containing many tokens. `max_matching_length` indicates the
+maximum length of messages that should be considered to belong to the
+category. When searching for messages that match the category, exclude
+any message longer than `max_matching_length`. This prevents a search
+for members of a category of short messages from matching much longer
+ones.
+
+`regex`::
+(string)
+A regular expression that matches all values of the input field included
+in the category. If the order of terms varies between the values included
+in the category, the `regex` may not incorporate every term in `key`.
+In simple cases, however, the `regex` is the ordered terms concatenated
+into a regular expression that allows arbitrary text between them.
+Using the `regex` as the primary mechanism for searching for the original
+documents that were categorized is not recommended, because searching
+with a regular expression is very slow. Instead, use the terms in the
+`key` field to search for matching documents: a terms search can use the
+inverted index and is therefore much faster. The `regex` field can still
+be useful in some situations, such as testing whether a small set of
+messages that have not been indexed match the category, or confirming
+that the terms in the `key` occur in the correct order in all the
+matched documents.
+
 ==== Basic use
 
 WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be