
Updated fielddata docs to make it easier for users with old mappings

Clinton Gormley 9 years ago
parent
commit
05271d58ca
1 changed files with 106 additions and 41 deletions
+ 106 - 41
docs/reference/mapping/params/fielddata.asciidoc

@@ -2,42 +2,105 @@
 === `fielddata`
 
 Most fields are <<mapping-index,indexed>> by default, which makes them
-searchable. The inverted index allows queries to look up the search term in
-unique sorted list of terms, and from that immediately have access to the list
-of documents that contain the term.
+searchable. Sorting, aggregations, and accessing field values in scripts,
+however, require a different access pattern from search.
 
-Sorting, aggregations, and access to field values in scripts requires a
-different data access pattern.  Instead of lookup up the term and finding
-documents, we need to be able to look up the document and find the terms that
-it has in a field.
+Search needs to answer the question _"Which documents contain this term?"_,
+while sorting and aggregations need to answer a different question: _"What is
+the value of this field for **this** document?"_.
 
-Most fields can use index-time, on-disk <<doc-values,`doc_values`>> to support
-this type of data access pattern, but `text` fields do not support `doc_values`.
+Most fields can use index-time, on-disk <<doc-values,`doc_values`>> for this
+data access pattern, but <<text,`text`>> fields do not support `doc_values`.
 
-Instead, `text` strings use a query-time data structure called
+Instead, `text` fields use a query-time *in-memory* data structure called
 `fielddata`.  This data structure is built on demand the first time that a
-field is used for aggregations, sorting, or is accessed in a script.  It is built
-by reading the entire inverted index for each segment from disk, inverting the
-term ↔︎ document relationship, and storing the result in memory, in the
-JVM heap.
-
-Loading fielddata is an expensive process so it is disabled by default. Also,
-when enabled, once it has been loaded, it remains in memory for the lifetime of
-the segment.
-
-[WARNING]
-.Fielddata can fill up your heap space
-==============================================================================
-Fielddata can consume a lot of heap space, especially when loading high
-cardinality `text` fields.  Most of the time, it doesn't make sense
-to sort or aggregate on `text` fields (with the notable exception
-of the
-<<search-aggregations-bucket-significantterms-aggregation,`significant_terms`>>
-aggregation).  Always think about whether a <<keyword,`keyword`>> field (which can
-use `doc_values`) would be  a better fit for your use case.
-==============================================================================
-
-TIP: The `fielddata.*` settings must have the same settings for fields of the
+field is used for aggregations, sorting, or in a script.  It is built by
+reading the entire inverted index for each segment from disk, inverting the
+term ↔︎ document relationship, and storing the result in memory, in the JVM
+heap.
+
+==== Fielddata is disabled on `text` fields by default
+
+Fielddata can consume a *lot* of heap space, especially when loading high
+cardinality `text` fields.  Once fielddata has been loaded into the heap, it
+remains there for the lifetime of the segment. Also, loading fielddata is an
+expensive process which can cause users to experience latency hits.  This is
+why fielddata is disabled by default.
+
+If you try to sort, aggregate, or access values from a script on a `text`
+field, you will see this exception:
+
+[quote]
+--
+Fielddata is disabled on text fields by default.  Set `fielddata=true` on
+[`your_field_name`] in order to load  fielddata in memory by uninverting the
+inverted index. Note that this can however use significant memory.
+--
+
+[[before-enabling-fielddata]]
+==== Before enabling fielddata
+
+Before you enable fielddata, consider why you are using a `text` field for
+aggregations, sorting, or in a script.  It usually doesn't make sense to do
+so.
+
+A text field is analyzed before indexing so that a value like
+`New York` can be found by searching for `new` or for `york`.  A `terms`
+aggregation on this field will return a `new` bucket and a `york` bucket, when
+you probably want a single bucket called `New York`.
+
+Instead, you should have a `text` field for full text searches, and an
+unanalyzed <<keyword,`keyword`>> field with <<doc-values,`doc_values`>>
+enabled for aggregations, as follows:
+
+[source,js]
+---------------------------------
+PUT my_index
+{
+  "mappings": {
+    "my_type": {
+      "properties": {
+        "my_field": { <1>
+          "type": "text",
+          "fields": {
+            "keyword": { <2>
+              "type": "keyword"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+---------------------------------
+// CONSOLE
+<1> Use the `my_field` field for searches.
+<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
+
+==== Enabling fielddata on `text` fields
+
+You can enable fielddata on an existing `text` field using the
+<<indices-put-mapping,PUT mapping API>> as follows:
+
+[source,js]
+-----------------------------------
+PUT my_index/_mapping/my_type
+{
+  "properties": {
+    "my_field": { <1>
+      "type":     "text",
+      "fielddata": true
+    }
+  }
+}
+-----------------------------------
+// CONSOLE
+// TEST[continued]
+
+<1> The mapping that you specify for `my_field` should consist of the existing
+    mapping for that field, plus the `fielddata` parameter.
+
+TIP: The `fielddata.*` parameter must have the same settings for fields of the
 same name in the same index.  Its value can be updated on existing fields
 using the <<indices-put-mapping,PUT mapping API>>.
 
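The hunk above describes fielddata as the result of inverting the term ↔︎ document relationship, and notes that a `terms` aggregation on an analyzed `text` field returns a `new` bucket and a `york` bucket rather than `New York`. The following is a minimal Python sketch of that idea, not Elasticsearch's implementation. The `analyze`, `build_inverted_index`, `uninvert`, and `terms_buckets` functions are hypothetical names, and the analyzer is assumed to simply lowercase and split on whitespace.

```python
from collections import Counter, defaultdict


def analyze(text):
    # Assumed stand-in for a text analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    # term -> sorted list of doc ids ("Which documents contain this term?").
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def uninvert(inverted_index):
    # doc id -> list of terms ("What is the value of this field for this document?").
    fielddata = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            fielddata[doc_id].append(term)
    return dict(fielddata)


def terms_buckets(values_per_doc):
    # Count documents per value, as a simplified terms aggregation would.
    counts = Counter()
    for values in values_per_doc.values():
        counts.update(set(values))
    return dict(counts)


docs = {1: "New York", 2: "New Jersey", 3: "New York"}
inverted = build_inverted_index(docs)
fielddata = uninvert(inverted)

# Aggregating on the analyzed terms splits "New York" into separate buckets.
text_buckets = terms_buckets(fielddata)
# Aggregating on the unanalyzed value keeps one bucket per original string.
keyword_buckets = terms_buckets({doc_id: [text] for doc_id, text in docs.items()})
```

In this sketch `text_buckets` contains `new`, `york`, and `jersey`, while `keyword_buckets` contains `New York` and `New Jersey`, which is the behaviour the "Before enabling fielddata" section warns about.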
@@ -49,12 +112,13 @@ using the <<indices-put-mapping,PUT mapping API>>.
 Global ordinals is a data-structure on top of fielddata and doc values, that
 maintains an incremental numbering for each unique term in a lexicographic
 order. Each term has a unique number and the number of term 'A' is lower than
-the number of term 'B'. Global ordinals are only supported on string fields.
+the number of term 'B'. Global ordinals are only supported on <<text,`text`>>
+and <<keyword,`keyword`>> fields.
 
-Fielddata and doc values also have ordinals, which is a unique numbering for all terms
-in a particular segment and field. Global ordinals just build on top of this,
-by providing a mapping between the segment ordinals and the global ordinals,
-the latter being unique across the entire shard.
+Fielddata and doc values also have ordinals, which is a unique numbering for
+all terms in a particular segment and field. Global ordinals just build on top
+of this, by providing a mapping between the segment ordinals and the global
+ordinals, the latter being unique across the entire shard.
 
 Global ordinals are used for features that use segment ordinals, such as
 sorting and the terms aggregation, to improve the execution time. A terms
@@ -68,10 +132,11 @@ which is different than for field data for a specific field which is tied to a
 single segment. For this reason global ordinals need to be entirely rebuilt
 whenever a new segment becomes visible.
 
-The loading time of global ordinals depends on the number of terms in a field, but in general
-it is low, since it source field data has already been loaded. The memory overhead of global
-ordinals is a small because it is very efficiently compressed. Eager loading of global ordinals
-can move the loading time from the first search request, to the refresh itself.
+The loading time of global ordinals depends on the number of terms in a field,
+but in general it is low, since its source field data has already been loaded.
+The memory overhead of global ordinals is small because it is very
+efficiently compressed. Eager loading of global ordinals can move the loading
+time from the first search request to the refresh itself.
 
 *****************************************
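
The global ordinals text in this diff describes a mapping from per-segment ordinals to ordinals that are unique across the whole shard. Below is a rough Python sketch of that mapping, under the assumption that each segment is represented as a sorted list of its unique terms and that a term's position in that list is its segment ordinal. `build_global_ordinals` is a hypothetical helper, and Python's default string ordering stands in for the lexicographic order mentioned in the docs.

```python
def build_global_ordinals(segment_terms):
    # segment_terms: one sorted list of unique terms per segment; a term's
    # index in its list is treated as its segment ordinal (an assumption).
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    # For each segment: list position = segment ordinal, value = global ordinal.
    mappings = [[global_ord[term] for term in terms] for terms in segment_terms]
    return global_terms, mappings


segments = [["apple", "cherry"], ["banana", "cherry", "date"]]
global_terms, seg_to_global = build_global_ordinals(segments)
```

Because the mapping depends on the terms of every segment in the shard, adding a segment changes `global_terms` and forces all per-segment mappings to be recomputed, which matches the statement that global ordinals are rebuilt whenever a new segment becomes visible.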