- [[index-modules-fielddata]]
- == Field data
- The field data cache is used mainly when sorting on or computing aggregations
- on a field. It loads all the field values into memory in order to provide fast
- document-based access to those values. The field data cache can be
- expensive to build for a field, so it's recommended to have enough memory
- to allocate it, and to keep it loaded.
- The amount of memory used for the field
- data cache can be controlled using `indices.fielddata.cache.size`. Note:
- reloading field data that does not fit into your cache is expensive
- and performs poorly.
- [cols="<,<",options="header",]
- |=======================================================================
- |Setting |Description
- |`indices.fielddata.cache.size` |The max size of the field data cache,
- e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
- to unbounded.
- |`indices.fielddata.cache.expire` |A time based setting that expires
- field data after a certain time of inactivity. Defaults to `-1`. For
- example, can be set to `5m` for a 5 minute expiry.
- |=======================================================================
- [float]
- [[circuit-breaker]]
- === Circuit Breaker
- coming[1.4.0,Prior to 1.4.0 there was only a single circuit breaker for fielddata]
- Elasticsearch contains multiple circuit breakers used to prevent operations from
- causing an OutOfMemoryError. Each breaker specifies a limit for how much memory
- it can use. Additionally, there is a parent-level breaker that specifies the
- total amount of memory that can be used across all breakers.
- The parent-level breaker can be configured with the following setting:
- `indices.breaker.total.limit`::
- Starting limit for the overall parent breaker. Defaults to 70% of JVM heap.
- All circuit breaker settings can be changed dynamically using the cluster update
- settings API.
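- As a sketch of such a dynamic change, the request body below could be sent to
- the cluster update settings endpoint (`PUT /_cluster/settings`). The `persistent`
- scope and the `60%` value are illustrative assumptions, not recommendations.
- [source,js]
- --------------------------------------------------
- {
-     "persistent": {
-         "indices.breaker.total.limit": "60%"
-     }
- }
- --------------------------------------------------
- Using `transient` instead of `persistent` would apply the change only until the
- next full cluster restart.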
- [float]
- [[fielddata-circuit-breaker]]
- ==== Field data circuit breaker
- The field data circuit breaker allows Elasticsearch to estimate the amount of
- memory a field will require when loaded. It can then prevent the
- field data loading by raising an exception. By default the limit is configured
- to 60% of the maximum JVM heap. It can be configured with the following
- parameters:
- `indices.breaker.fielddata.limit`::
- Limit for the fielddata breaker. Defaults to 60% of JVM heap.
- `indices.breaker.fielddata.overhead`::
- A constant that all field data estimations are multiplied by to determine a
- final estimation. Defaults to 1.03.
- `indices.fielddata.breaker.limit`::
- deprecated[1.4.0,Replaced by `indices.breaker.fielddata.limit`]
- `indices.fielddata.breaker.overhead`::
- deprecated[1.4.0,Replaced by `indices.breaker.fielddata.overhead`]
- [float]
- [[request-circuit-breaker]]
- ==== Request circuit breaker
- coming[1.4.0]
- The request circuit breaker allows Elasticsearch to prevent per-request data
- structures (for example, memory used for calculating aggregations during a
- request) from exceeding a certain amount of memory.
- `indices.breaker.request.limit`::
- Limit for the request breaker. Defaults to 40% of JVM heap.
- `indices.breaker.request.overhead`::
- A constant that all request estimations are multiplied by to determine a
- final estimation. Defaults to 1.
- [float]
- [[fielddata-monitoring]]
- === Monitoring field data
- You can monitor memory usage for field data as well as the field data circuit
- breaker using the
- <<cluster-nodes-stats,Nodes Stats API>>.
- [[fielddata-formats]]
- == Field data formats
- The field data format controls how field data should be stored.
- Depending on the field type, there might be several field data formats
- available. In particular, string and numeric types support the `doc_values`
- format, which computes the field data data structures at indexing
- time and stores them on disk. Although it makes the index larger and may
- be slightly slower, this implementation is more near-realtime-friendly
- and requires much less memory from the JVM than other implementations.
- Here is an example of how to configure the `tag` field to use the `fst` field
- data format.
- [source,js]
- --------------------------------------------------
- {
-     "tag": {
-         "type": "string",
-         "fielddata": {
-             "format": "fst"
-         }
-     }
- }
- --------------------------------------------------
- It is possible to change the field data format (and the field data settings
- in general) on a live index by using the update mapping API. When doing so,
- field data which had already been loaded for existing segments will remain
- alive while new segments will use the new field data configuration. Thanks to
- the background merging process, all segments will eventually use the new
- field data format.
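- As a minimal sketch of such a live change, the body below could be sent to the
- update mapping endpoint, e.g. `PUT /my_index/_mapping/my_type`. The index name
- `my_index` and the type name `my_type` are hypothetical placeholders; the `tag`
- field is reused from the example above.
- [source,js]
- --------------------------------------------------
- {
-     "my_type": {
-         "properties": {
-             "tag": {
-                 "type": "string",
-                 "fielddata": {
-                     "format": "paged_bytes"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------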
- [float]
- ==== String field data types
- `paged_bytes` (default)::
- Stores unique terms sequentially in a large buffer and maps documents to
- the indices of the terms they contain in this large buffer.
- `fst`::
- Stores terms in an FST. Slower to build than `paged_bytes` but can help lower
- memory usage if many terms share common prefixes and/or suffixes.
- `doc_values`::
- Computes and stores field data data-structures on disk at indexing time.
- Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
- `not_analyzed`) and doesn't support filtering.
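- For illustration, a non-analyzed string field could select the `doc_values`
- format as sketched below. The field name `tag` is only an example, and
- `not_analyzed` is chosen here to satisfy the restriction described above.
- [source,js]
- --------------------------------------------------
- {
-     "tag": {
-         "type": "string",
-         "index": "not_analyzed",
-         "fielddata": {
-             "format": "doc_values"
-         }
-     }
- }
- --------------------------------------------------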
- [float]
- ==== Numeric field data types
- `array` (default)::
- Stores field values in memory using arrays.
- `doc_values`::
- Computes and stores field data data-structures on disk at indexing time.
- Doesn't support filtering.
- [float]
- ==== Geo point field data types
- `array` (default)::
- Stores latitudes and longitudes in arrays.
- `doc_values`::
- Computes and stores field data data-structures on disk at indexing time.
- [float]
- ==== Global ordinals
- added[1.2.0]
- Global ordinals are a data structure on top of field data that maintains an
- incremental numbering for all the terms in field data, in lexicographic order.
- Each term has a unique number, and if term 'A' sorts before term 'B' then the
- number of term 'A' is lower than the number of term 'B'. Global ordinals are
- only supported on string fields.
- Field data on strings also has ordinals, which are a unique numbering for all terms
- in a particular segment and field. Global ordinals build on top of this
- by providing a mapping between the segment ordinals and the global ordinals,
- the latter being unique across the entire shard.
- Global ordinals can improve the execution time of search features that already use
- segment ordinals, such as the terms aggregator. These search features often
- need to merge per-segment ordinal results into a cross-segment terms result. With
- global ordinals, this mapping happens at field data load time instead of during each
- query execution. Search features then only need to resolve the actual
- term when building the (shard) response. During execution, the unique numbering
- that global ordinals provide is sufficient, so the actual terms are not needed.
- Global ordinals for a specified field are tied to all the segments of a shard (Lucene index),
- whereas field data for a specific field is tied to a single segment.
- For this reason global ordinals need to be rebuilt in their entirety once new segments
- become visible. This one-time cost would be incurred anyway without global ordinals, but
- then it would be incurred for each search execution instead!
- The loading time of global ordinals depends on the number of terms in a field, but in general
- it is low, since the source field data has already been loaded. The memory overhead of global
- ordinals is small because they are very efficiently compressed. Eager loading of global ordinals
- can move the loading time from the first search request to the refresh itself.
- [float]
- === Fielddata loading
- By default, field data is loaded lazily, i.e. the first time that a query that
- requires it is executed. However, this can make the first requests that
- follow a merge operation quite slow since fielddata loading is a heavy
- operation.
- It is possible to force field data to be loaded and cached eagerly through the
- `loading` setting of fielddata:
- [source,js]
- --------------------------------------------------
- {
-     "category": {
-         "type": "string",
-         "fielddata": {
-             "loading": "eager"
-         }
-     }
- }
- --------------------------------------------------
- Global ordinals can also be eagerly loaded:
- [source,js]
- --------------------------------------------------
- {
-     "category": {
-         "type": "string",
-         "fielddata": {
-             "loading": "eager_global_ordinals"
-         }
-     }
- }
- --------------------------------------------------
- With the above setting both field data and global ordinals for a specific field
- are eagerly loaded.
- [float]
- ==== Disabling field data loading
- Field data can take a lot of RAM, so it makes sense to disable field data
- loading on the fields that don't need field data, for example those that are
- used for full-text search only. In order to disable field data loading, just
- change the field data format to `disabled`. When disabled, all requests that
- try to load field data, e.g. when they include aggregations and/or sorting,
- will return an error.
- [source,js]
- --------------------------------------------------
- {
-     "text": {
-         "type": "string",
-         "fielddata": {
-             "format": "disabled"
-         }
-     }
- }
- --------------------------------------------------
- The `disabled` format is supported by all field types.
- [float]
- [[field-data-filtering]]
- === Filtering fielddata
- It is possible to control which field values are loaded into memory,
- which is particularly useful for string fields. When specifying the
- <<mapping-core-types,mapping>> for a field, you
- can also specify a fielddata filter.
- Fielddata filters can be changed using the
- <<indices-put-mapping,PUT mapping>>
- API. After changing the filters, use the
- <<indices-clearcache,Clear Cache>> API
- to reload the fielddata using the new filters.
- [float]
- ==== Filtering by frequency
- The frequency filter allows you to only load terms whose frequency falls
- between a `min` and `max` value, which can be expressed as an absolute
- number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
- *per segment*. Percentages are based on the number of docs which have a
- value for the field, as opposed to all docs in the segment.
- Small segments can be excluded completely by specifying the minimum
- number of docs that the segment should contain with `min_segment_size`:
- [source,js]
- --------------------------------------------------
- {
-     "tag": {
-         "type": "string",
-         "fielddata": {
-             "filter": {
-                 "frequency": {
-                     "min": 0.001,
-                     "max": 0.1,
-                     "min_segment_size": 500
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
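- The same filter can also be sketched with absolute counts rather than
- percentages. This assumes that values greater than `1` are interpreted as
- absolute document counts; the numbers `2` and `1000` are purely illustrative.
- [source,js]
- --------------------------------------------------
- {
-     "tag": {
-         "type": "string",
-         "fielddata": {
-             "filter": {
-                 "frequency": {
-                     "min": 2,
-                     "max": 1000,
-                     "min_segment_size": 500
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------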
- [float]
- ==== Filtering by regex
- Terms can also be filtered by regular expression: only values which
- match the regular expression are loaded. Note: the regular expression is
- applied to each term in the field, not to the whole field value. For
- instance, to only load hashtags from a tweet, we can use a regular
- expression which matches terms beginning with `#`:
- [source,js]
- --------------------------------------------------
- {
-     "tweet": {
-         "type": "string",
-         "analyzer": "whitespace",
-         "fielddata": {
-             "filter": {
-                 "regex": {
-                     "pattern": "^#.*"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
- [float]
- ==== Combining filters
- The `frequency` and `regex` filters can be combined:
- [source,js]
- --------------------------------------------------
- {
-     "tweet": {
-         "type": "string",
-         "analyzer": "whitespace",
-         "fielddata": {
-             "filter": {
-                 "regex": {
-                     "pattern": "^#.*"
-                 },
-                 "frequency": {
-                     "min": 0.001,
-                     "max": 0.1,
-                     "min_segment_size": 500
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------