- [[fielddata]]
- === `fielddata`
- Most fields are <<mapping-index,indexed>> by default, which makes them
- searchable. Sorting, aggregations, and accessing field values in scripts,
- however, require a different access pattern from search.
- Search needs to answer the question _"Which documents contain this term?"_,
- while sorting and aggregations need to answer a different question: _"What is
- the value of this field for **this** document?"_.
- Most fields can use index-time, on-disk <<doc-values,`doc_values`>> for this
- data access pattern, but <<text,`text`>> fields do not support `doc_values`.
- Instead, `text` fields use a query-time *in-memory* data structure called
- `fielddata`. This data structure is built on demand the first time that a
- field is used for aggregations, sorting, or in a script. It is built by
- reading the entire inverted index for each segment from disk, inverting the
- term ↔︎ document relationship, and storing the result in memory, in the JVM
- heap.
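The inversion described above can be illustrated with a minimal Python sketch. It is a conceptual model only, not how Elasticsearch or Lucene store the data: the `uninvert` function and the toy `inverted_index` dictionary are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def uninvert(inverted_index):
    """Turn a term -> doc IDs mapping into a doc ID -> terms mapping."""
    fielddata = defaultdict(list)
    # Walk every term of the (toy) inverted index, as fielddata loading
    # has to read the whole index of a segment.
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            fielddata[doc_id].append(term)
    return dict(fielddata)


# Toy inverted index for one segment: answers "which documents contain this term?"
inverted_index = {
    "new": [1, 2],
    "york": [1],
    "jersey": [2],
}

# Uninverted view: answers "what are the values of this field for this document?"
fielddata = uninvert(inverted_index)
print(fielddata)
```

Because every term of every segment has to be visited and the result kept in memory, the cost grows with the number of distinct terms in the field.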
- [[fielddata-disabled-text-fields]]
- ==== Fielddata is disabled on `text` fields by default
- Fielddata can consume a *lot* of heap space, especially when loading
- high-cardinality `text` fields. Once fielddata has been loaded into the heap, it
- remains there for the lifetime of the segment. Loading fielddata is also an
- expensive process that can cause noticeable latency spikes for users. This is
- why fielddata is disabled by default.
- If you try to sort, aggregate, or access values from a script on a `text`
- field, you will see this exception:
- [literal]
- Fielddata is disabled on text fields by default. Set `fielddata=true` on
- [`your_field_name`] in order to load fielddata in memory by uninverting the
- inverted index. Note that this can however use significant memory.
- [[before-enabling-fielddata]]
- ==== Before enabling fielddata
- Before you enable fielddata, consider why you are using a `text` field for
- aggregations, sorting, or in a script. It usually doesn't make sense to do
- so.
- A `text` field is analyzed before indexing so that a value like
- `New York` can be found by searching for `new` or for `york`. A `terms`
- aggregation on this field will return a `new` bucket and a `york` bucket, when
- you probably want a single bucket called `New York`.
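The difference in buckets can be sketched in a few lines of Python. This is an illustration under stated assumptions: `analyze` is a hypothetical stand-in that only lowercases and splits on whitespace, which is far simpler than a real analyzer, and the sample `values` are invented.

```python
from collections import Counter


def analyze(value):
    # Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    return value.lower().split()


# One field value per document (invented sample data).
values = ["New York", "New York", "New Jersey"]

# Bucketing on the analyzed terms: each document counts once per distinct term.
text_buckets = Counter(term for value in values for term in set(analyze(value)))

# Bucketing on the unanalyzed value: each document counts once per whole value.
keyword_buckets = Counter(values)

print(text_buckets)
print(keyword_buckets)
```

The analyzed buckets are `new`, `york`, and `jersey`, while the unanalyzed buckets are the whole values `New York` and `New Jersey`.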
- Instead, you should have a `text` field for full text searches, and an
- unanalyzed <<keyword,`keyword`>> field with <<doc-values,`doc_values`>>
- enabled for aggregations, as follows:
- [source,console]
- ---------------------------------
- PUT my_index
- {
-   "mappings": {
-     "properties": {
-       "my_field": { <1>
-         "type": "text",
-         "fields": {
-           "keyword": { <2>
-             "type": "keyword"
-           }
-         }
-       }
-     }
-   }
- }
- ---------------------------------
- <1> Use the `my_field` field for searches.
- <2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
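As a usage sketch, the following Python snippet builds a search request body that runs a `terms` aggregation on the `my_field.keyword` sub-field. The aggregation name `my_field_values` is a hypothetical label chosen for this example, and the snippet only prints the JSON; it assumes you would send it to the `my_index/_search` endpoint with a client of your choice.

```python
import json

# Request body for a search on my_index that aggregates on the keyword sub-field.
search_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        # "my_field_values" is a hypothetical aggregation name.
        "my_field_values": {
            "terms": {"field": "my_field.keyword"}
        }
    },
}

print(json.dumps(search_body, indent=2))
```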
- [[enable-fielddata-text-fields]]
- ==== Enabling fielddata on `text` fields
- You can enable fielddata on an existing `text` field using the
- <<indices-put-mapping,PUT mapping API>> as follows:
- [source,console]
- -----------------------------------
- PUT my_index/_mapping
- {
-   "properties": {
-     "my_field": { <1>
-       "type": "text",
-       "fielddata": true
-     }
-   }
- }
- -----------------------------------
- // TEST[continued]
- <1> The mapping that you specify for `my_field` should consist of the existing
- mapping for that field, plus the `fielddata` parameter.
- [[field-data-filtering]]
- ==== `fielddata_frequency_filter`
- Fielddata filtering can be used to reduce the number of terms loaded into
- memory, and thus reduce memory usage. Terms can be filtered by _frequency_:
- The frequency filter allows you to load only terms whose document frequency falls
- between a `min` and `max` value, which can be expressed as an absolute
- number (when the number is bigger than 1.0) or as a percentage
- (e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
- *per segment*. Percentages are based on the number of docs which have a
- value for the field, as opposed to all docs in the segment.
- Small segments can be excluded completely by specifying the minimum
- number of docs that the segment should contain with `min_segment_size`:
- [source,console]
- --------------------------------------------------
- PUT my_index
- {
-   "mappings": {
-     "properties": {
-       "tag": {
-         "type": "text",
-         "fielddata": true,
-         "fielddata_frequency_filter": {
-           "min": 0.001,
-           "max": 0.1,
-           "min_segment_size": 500
-         }
-       }
-     }
-   }
- }
- --------------------------------------------------
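The per-segment logic of the frequency filter can be sketched in Python as follows. This is a simplified model, not the actual implementation, and it rests on several assumptions: both bounds are treated as inclusive, no rounding is applied to percentage thresholds, and a segment smaller than `min_segment_size` is assumed to be left unfiltered. The functions `resolve_threshold` and `filter_terms` and the sample term frequencies are hypothetical.

```python
def resolve_threshold(value, docs_with_value):
    """Interpret a min/max setting as an absolute count or a fraction."""
    # Values bigger than 1.0 are absolute document counts; other values are
    # fractions of the docs in this segment that have a value for the field.
    return value if value > 1.0 else value * docs_with_value


def filter_terms(doc_freqs, segment_size, docs_with_value,
                 min_freq, max_freq, min_segment_size):
    """Return the terms of one segment that would be loaded."""
    # Assumption: segments below min_segment_size are not filtered at all.
    if segment_size < min_segment_size:
        return set(doc_freqs)
    low = resolve_threshold(min_freq, docs_with_value)
    high = resolve_threshold(max_freq, docs_with_value)
    # Assumption: both bounds are inclusive.
    return {term for term, df in doc_freqs.items() if low <= df <= high}


# Same settings as in the mapping above.
settings = {"min_freq": 0.001, "max_freq": 0.1, "min_segment_size": 500}

# Invented document frequencies for a segment of 10,000 docs.
large_segment = {"the": 9500, "kibana": 400, "typo": 2}
loaded_large = filter_terms(large_segment, 10000, 10000, **settings)

# A segment of 100 docs is below min_segment_size.
small_segment = {"the": 90, "typo": 1}
loaded_small = filter_terms(small_segment, 100, 100, **settings)

print(loaded_large, loaded_small)
```

With these settings, a term in the large segment is kept only if it appears in between 10 and 1,000 documents, so the very common `the` and the very rare `typo` are dropped.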