|
[[fielddata]]
=== `fielddata`

Most fields are <<mapping-index,indexed>> by default, which makes them
searchable. The inverted index allows queries to look up the search term in
a unique sorted list of terms, and from that immediately have access to the list
of documents that contain the term.

Sorting, aggregations, and access to field values in scripts require a
different data access pattern.  Instead of looking up the term and finding
documents, we need to be able to look up the document and find the terms that
it has in a field.

Most fields can use index-time, on-disk <<doc-values,`doc_values`>> to support
this type of data access pattern, but `analyzed` string fields do not support
`doc_values`.

Instead, `analyzed` strings use a query-time data structure called
`fielddata`.  This data structure is built on demand the first time that a
field is used for aggregations, sorting, or is accessed in a script.
It is built by reading the entire inverted index for each segment from disk,
inverting the term ↔ document relationship, and storing the result in memory,
in the JVM heap.

Loading fielddata is an expensive process so, once it has been loaded, it
remains in memory for the lifetime of the segment.

[WARNING]
.Fielddata can fill up your heap space
==============================================================================
Fielddata can consume a lot of heap space, especially when loading high
cardinality `analyzed` string fields.  Most of the time, it doesn't make sense
to sort or aggregate on `analyzed` string fields (with the notable exception
of the
<<search-aggregations-bucket-significantterms-aggregation,`significant_terms`>>
aggregation).  Always think about whether a `not_analyzed` field (which can
use `doc_values`) would be a better fit for your use case.
==============================================================================

TIP: The `fielddata.*` settings must have the same values for fields of the
same name in the same index.  Their values can be updated on existing fields
using the <<indices-put-mapping,PUT mapping API>>.

[[fielddata-format]]
==== `fielddata.format`

For `analyzed` string fields, the fielddata `format` controls whether
fielddata should be enabled or not.  It accepts: `disabled` and `paged_bytes`
(enabled, which is the default).
To disable fielddata loading, you can use the following mapping:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "fielddata": {
            "format": "disabled" <1>
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
<1> The `text` field cannot be used for sorting, aggregations, or in scripts.

.Fielddata and other datatypes
[NOTE]
==================================================
Historically, other field datatypes also used fielddata, but this has been replaced
by index-time, disk-based <<doc-values,`doc_values`>>.
==================================================

[[fielddata-loading]]
==== `fielddata.loading`

This per-field setting controls when fielddata is loaded into memory. It
accepts three options:

[horizontal]
`lazy`::

    Fielddata is only loaded into memory when it is needed. (default)

`eager`::

    Fielddata is loaded into memory before a new search segment becomes
    visible to search.  This can reduce the latency that a user may experience
    if their search request has to trigger lazy loading from a big segment.

`eager_global_ordinals`::

    Loading fielddata into memory is only part of the work that is required.
    After loading the fielddata for each segment, Elasticsearch builds the
    <<global-ordinals>> data structure to make a list of all unique terms
    across all the segments in a shard.  By default, global ordinals are built
    lazily.  If the field has a very high cardinality, global ordinals may
    take some time to build, in which case you can use eager loading instead.

[[global-ordinals]]
.Global ordinals
*****************************************

Global ordinals are a data structure on top of fielddata and doc values that
maintains an incremental numbering for each unique term in lexicographic
order.
Each term has a unique number and the number of term 'A' is lower than
the number of term 'B'. Global ordinals are only supported on string fields.

Fielddata and doc values also have ordinals, which are a unique numbering for all terms
in a particular segment and field. Global ordinals just build on top of this,
by providing a mapping between the segment ordinals and the global ordinals,
the latter being unique across the entire shard.

Global ordinals are used for features that use segment ordinals, such as
sorting and the terms aggregation, to improve the execution time. A terms
aggregation relies purely on global ordinals to perform the aggregation at the
shard level, then converts global ordinals to the real term only for the final
reduce phase, which combines results from different shards.

Global ordinals for a specified field are tied to _all the segments of a
shard_, while fielddata and doc values ordinals are tied to a single segment.
For this reason global ordinals need to be entirely rebuilt
whenever a new segment becomes visible.

The loading time of global ordinals depends on the number of terms in a field, but in general
it is low, since its source field data has already been loaded. The memory overhead of global
ordinals is small because they are very efficiently compressed. Eager loading of global ordinals
can move the loading time from the first search request to the refresh itself.

*****************************************

[[field-data-filtering]]
==== `fielddata.filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage.
Terms can be filtered by _frequency_ or
by _regular expression_, or a combination of the two:

Filtering by frequency::
+
--

The frequency filter allows you to only load terms whose term frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number (when the number is bigger than 1.0) or as a percentage
(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "string",
          "fielddata": {
            "filter": {
              "frequency": {
                "min": 0.001,
                "max": 0.1,
                "min_segment_size": 500
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
--

Filtering by regex::
+
--

Terms can also be filtered by regular expression: only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value.
For instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "whitespace",
          "fielddata": {
            "filter": {
              "regex": {
                "pattern": "^#.*"
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
--

These filters can be updated on an existing field mapping and will take
effect the next time the fielddata for a segment is loaded. Use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.
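
As a rough sketch of that workflow, the two requests below first change the
filter on the `tweet` field from the regex example and then clear the
fielddata cache, so that fielddata is rebuilt with the new filter the next
time it is needed. The index, type, and field names are reused from the
example above. The narrower pattern is purely illustrative, and the
`fielddata=true` parameter is an assumption that should be checked against
the Clear Cache API documentation for your version:

[source,js]
--------------------------------------------------
PUT my_index/_mapping/my_type
{
  "properties": {
    "tweet": {
      "type": "string",
      "analyzer": "whitespace",
      "fielddata": {
        "filter": {
          "regex": {
            "pattern": "^#[a-z]+" <1>
          }
        }
      }
    }
  }
}

POST my_index/_cache/clear?fielddata=true
--------------------------------------------------
<1> Hypothetical replacement pattern that only loads hashtags made of lowercase letters.

Restating the `type` and `analyzer` alongside the new filter is a cautious
choice here, intended to keep the updated mapping consistent with the
existing definition of the field.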
 |