@@ -12,28 +12,28 @@ documents, we need to be able to look up the document and find the terms that
 it has in a field.

 Most fields can use index-time, on-disk <<doc-values,`doc_values`>> to support
-this type of data access pattern, but `analyzed` string fields do not support
-`doc_values`.
+this type of data access pattern, but `text` fields do not support `doc_values`.

-Instead, `analyzed` strings use a query-time data structure called
+Instead, `text` strings use a query-time data structure called
 `fielddata`. This data structure is built on demand the first time that a
 field is used for aggregations, sorting, or is accessed in a script. It is built
 by reading the entire inverted index for each segment from disk, inverting the
 term ↔︎ document relationship, and storing the result in memory, in the
 JVM heap.

-Loading fielddata is an expensive process so, once it has been loaded, it
-remains in memory for the lifetime of the segment.
+Loading fielddata is an expensive process so it is disabled by default. Also,
+when enabled, once it has been loaded, it remains in memory for the lifetime of
+the segment.

 [WARNING]
 .Fielddata can fill up your heap space
 ==============================================================================
 Fielddata can consume a lot of heap space, especially when loading high
-cardinality `analyzed` string fields. Most of the time, it doesn't make sense
-to sort or aggregate on `analyzed` string fields (with the notable exception
+cardinality `text` fields. Most of the time, it doesn't make sense
+to sort or aggregate on `text` fields (with the notable exception
 of the
 <<search-aggregations-bucket-significantterms-aggregation,`significant_terms`>>
-aggregation). Always think about whether a `not_analyzed` field (which can
+aggregation). Always think about whether a <<keyword,`keyword`>> field (which can
 use `doc_values`) would be a better fit for your use case.
 ==============================================================================

@@ -42,71 +42,6 @@ same name in the same index. Its value can be updated on existing fields
 using the <<indices-put-mapping,PUT mapping API>>.


-[[fielddata-format]]
-==== `fielddata.format`
-
-For `analyzed` string fields, the fielddata `format` controls whether
-fielddata should be enabled or not. It accepts: `disabled` and `paged_bytes`
-(enabled, which is the default). To disable fielddata loading, you can use
-the following mapping:
-
-[source,js]
---------------------------------------------------
-PUT my_index
-{
-  "mappings": {
-    "my_type": {
-      "properties": {
-        "text": {
-          "type": "string",
-          "fielddata": {
-            "format": "disabled" <1>
-          }
-        }
-      }
-    }
-  }
-}
---------------------------------------------------
-// AUTOSENSE
-<1> The `text` field cannot be used for sorting, aggregations, or in scripts.
-
-.Fielddata and other datatypes
-[NOTE]
-==================================================
-
-Historically, other field datatypes also used fielddata, but this has been replaced
-by index-time, disk-based <<doc-values,`doc_values`>>.
-
-==================================================
-
-
-[[fielddata-loading]]
-==== `fielddata.loading`
-
-This per-field setting controls when fielddata is loaded into memory. It
-accepts three options:
-
-[horizontal]
-`lazy`::
-
-  Fielddata is only loaded into memory when it is needed. (default)
-
-`eager`::
-
-  Fielddata is loaded into memory before a new search segment becomes
-  visible to search. This can reduce the latency that a user may experience
-  if their search request has to trigger lazy loading from a big segment.
-
-`eager_global_ordinals`::
-
-  Loading fielddata into memory is only part of the work that is required.
-  After loading the fielddata for each segment, Elasticsearch builds the
-  <<global-ordinals>> data structure to make a list of all unique terms
-  across all the segments in a shard. By default, global ordinals are built
-  lazily. If the field has a very high cardinality, global ordinals may
-  take some time to build, in which case you can use eager loading instead.
-
 [[global-ordinals]]
 .Global ordinals
 *****************************************
@@ -141,15 +76,10 @@ can move the loading time from the first search request, to the refresh itself.
 *****************************************

 [[field-data-filtering]]
-==== `fielddata.filter`
+==== `fielddata_frequency_filter`

 Fielddata filtering can be used to reduce the number of terms loaded into
-memory, and thus reduce memory usage. Terms can be filtered by _frequency_ or
-by _regular expression_, or a combination of the two:
-
-Filtering by frequency::
-+
---
+memory, and thus reduce memory usage. Terms can be filtered by _frequency_:

 The frequency filter allows you to only load terms whose term frequency falls
 between a `min` and `max` value, which can be expressed an absolute
@@ -169,7 +99,7 @@ PUT my_index
     "my_type": {
       "properties": {
         "tag": {
-          "type": "string",
+          "type": "text",
           "fielddata": {
             "filter": {
               "frequency": {
@@ -186,44 +116,3 @@ PUT my_index
 }
 --------------------------------------------------
 // AUTOSENSE
---
-
-Filtering by regex::
-+
---
-Terms can also be filtered by regular expression - only values which
-match the regular expression are loaded. Note: the regular expression is
-applied to each term in the field, not to the whole field value. For
-instance, to only load hashtags from a tweet, we can use a regular
-expression which matches terms beginning with `#`:
-
-[source,js]
---------------------------------------------------
-PUT my_index
-{
-  "mappings": {
-    "my_type": {
-      "properties": {
-        "tweet": {
-          "type": "string",
-          "analyzer": "whitespace",
-          "fielddata": {
-            "filter": {
-              "regex": {
-                "pattern": "^#.*"
-              }
-            }
-          }
-        }
-      }
-    }
-  }
-}
---------------------------------------------------
-// AUTOSENSE
---
-
-These filters can be updated on an existing field mapping and will take
-effect the next time the fielddata for a segment is loaded. Use the
-<<indices-clearcache,Clear Cache>> API
-to reload the fielddata using the new filters.
-
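
Note for readers of this change: the prose in the hunks above describes fielddata as the result of "inverting the term ↔︎ document relationship" per segment, optionally loading only terms whose frequency falls between a `min` and a `max`. The following is a minimal Python sketch of that idea, not Elasticsearch code. The function `build_fielddata`, its parameters, and the toy `segment` are hypothetical names chosen for illustration, and the sketch assumes the frequency is the fraction of documents in the segment that contain the term, with inclusive bounds.

```python
from collections import defaultdict


def build_fielddata(inverted_index, num_docs, min_freq=0.0, max_freq=1.0):
    """Uninvert one segment: term -> doc ids becomes doc id -> terms.

    Hypothetical helper, not an Elasticsearch API. A term is kept only if
    the fraction of documents containing it lies in [min_freq, max_freq].
    """
    fielddata = defaultdict(list)
    for term in sorted(inverted_index):
        doc_ids = inverted_index[term]
        frequency = len(doc_ids) / num_docs
        if frequency < min_freq or frequency > max_freq:
            continue  # filtered terms are never loaded into memory
        for doc_id in doc_ids:
            fielddata[doc_id].append(term)
    return dict(fielddata)


# Toy inverted index for a segment holding four documents (ids 1 to 4).
segment = {
    "brown": [1, 2],
    "fox": [1],
    "quick": [1, 2, 3],
    "the": [1, 2, 3, 4],
}

unfiltered = build_fielddata(segment, num_docs=4)
filtered = build_fielddata(segment, num_docs=4, min_freq=0.5, max_freq=0.75)
```

In this toy run the filter drops the rare term `fox` and the ubiquitous term `the`, so document 4 ends up with no loaded terms at all. That is the trade-off the frequency filter makes: less heap used, at the cost of those terms being invisible to sorting and aggregations.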