[DOCS] Add guidance for mapping unstructured content (#69079)

James Rodewig 4 years ago
parent
commit
f1e911d13d

+ 3 - 2
docs/reference/mapping/types/keyword.asciidoc

@@ -11,8 +11,9 @@ The keyword family includes the following field types:
 addresses, hostnames, status codes, zip codes, or tags. 
 * <<constant-keyword-field-type,`constant_keyword`>> for keyword fields that always contain
 the same value.
-* <<wildcard-field-type,`wildcard`>>, which optimizes log lines and similar keyword values
-for grep-like <<query-dsl-wildcard-query,wildcard queries>>.
+* <<wildcard-field-type,`wildcard`>> for unstructured machine-generated content.
+The `wildcard` type is optimized for fields with large values or high
+cardinality.
 
 Keyword fields are often used in <<sort-search-results,sorting>>,
 <<search-aggregations,aggregations>>, and <<term-level-queries,term-level

+ 4 - 0
docs/reference/mapping/types/text.asciidoc

@@ -13,6 +13,10 @@ used for sorting and seldom used for aggregations (although the
 <<search-aggregations-bucket-significanttext-aggregation,significant text aggregation>>
 is a notable exception).
 
+`text` fields are best suited for unstructured but human-readable content. If
+you need to index unstructured machine-generated content, see
+<<mapping-unstructured-content>>.
+
 If you need to index structured content such as email addresses, hostnames, status
 codes, or tags, it is likely that you should rather use a <<keyword,`keyword`>> field.
 

+ 64 - 6
docs/reference/mapping/types/wildcard.asciidoc

@@ -4,16 +4,74 @@
 [[wildcard-field-type]]
 === Wildcard field type
 
-A `wildcard` field stores values optimised for wildcard grep-like queries.
-Wildcard queries are possible on other field types but suffer from constraints:
-
-* `text` fields limit matching of any wildcard expressions to individual tokens rather than the original whole value held in a field
-* `keyword` fields are untokenized but slow at performing wildcard queries (especially patterns with leading wildcards).
+The `wildcard` field type is a specialized keyword field for unstructured
+machine-generated content you plan to search using grep-like 
+<<query-dsl-wildcard-query,`wildcard`>> and <<query-dsl-regexp-query,`regexp`>>
+queries. The `wildcard` type is optimized for fields with large values or high
+cardinality.
+
+[[mapping-unstructured-content]]
+.Mapping unstructured content
+****
+You can map a field containing unstructured content to either a `text` or
+keyword family field. The best field type depends on the nature of the content
+and how you plan to search the field.
+
+Use the `text` field type if:
+
+* The content is human-readable, such as an email body or product description.
+* You plan to search the field for individual words or phrases, such as `the
+brown fox jumped`, using <<full-text-queries,full text queries>>. {es}
+<<analysis,analyzes>> `text` fields to return the most relevant results for
+these queries.
+
+Use a keyword family field type if:
+
+* The content is machine-generated, such as a log message or HTTP request
+information.
+* You plan to search the field for exact full values, such as `org.foo.bar`, or
+partial character sequences, such as `org.foo.*`, using
+<<term-level-queries,term-level queries>>.
+
+**Choosing a keyword family field type**
+
+If you choose a keyword family field type, you can map the field as a `keyword`
+or `wildcard` field depending on the cardinality and size of the field's values.
+Use the `wildcard` type if you plan to regularly search the field using a
+<<query-dsl-wildcard-query,`wildcard`>> or <<query-dsl-regexp-query,`regexp`>>
+query and meet one of the following criteria:
+
+* The field contains more than a million unique values. +
+AND +
+You plan to regularly search the field using a pattern with leading wildcards,
+such as `*foo` or `*baz`.
+
+* The field contains values larger than 32KB. +
+AND +
+You plan to regularly search the field using any wildcard pattern.
+
+Otherwise, use the `keyword` field type for faster searches, faster indexing,
+and lower storage costs. For an in-depth comparison and decision flowchart, see
+our
+https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field[related
+blog post].
+
+**Switching from a `text` field to a keyword field**
+
+If you previously used a `text` field to index unstructured machine-generated
+content, you can <<update-mapping,reindex to update the mapping>> to a `keyword`
+or `wildcard` field. We also recommend you update your application or workflow
+to replace any word-based <<full-text-queries,full text queries>> on the field
+with equivalent <<term-level-queries,term-level queries>>.
+****
 
 Internally, the `wildcard` field indexes the whole field value using ngrams and stores the full string.
 The index acts as a rough filter that narrows the set of candidate values, which are then verified by retrieving and checking the full values.
 This field is especially well suited to run grep-like queries on log lines. Storage costs are typically lower than those of `keyword`
-fields but search speeds for exact matches on full terms are slower.
+fields but search speeds for exact matches on full terms are slower. If the
+field values share many prefixes, such as URLs for the same website, storage
+costs for a `wildcard` field may be higher than for an equivalent `keyword` field.
+
 
 You index and search a wildcard field as follows: