[[fielddata]]
=== `fielddata`

Most fields are <<mapping-index,indexed>> by default, which makes them
searchable. The inverted index allows queries to look up the search term in
a unique sorted list of terms, and from that immediately have access to the
list of documents that contain the term.

Sorting, aggregations, and access to field values in scripts require a
different data access pattern. Instead of looking up the term and finding
documents, we need to be able to look up the document and find the terms that
it has in a field.

Most fields can use index-time, on-disk <<doc-values,`doc_values`>> to support
this type of data access pattern, but `analyzed` string fields do not support
`doc_values`.

Instead, `analyzed` strings use a query-time data structure called
`fielddata`. This data structure is built on demand the first time that a
field is used for aggregations or sorting, or is accessed in a script. It is
built by reading the entire inverted index for each segment from disk,
inverting the term ↔ document relationship, and storing the result in memory,
in the JVM heap.

Loading fielddata is an expensive process so, once it has been loaded, it
remains in memory for the lifetime of the segment.

[WARNING]
.Fielddata can fill up your heap space
==============================================================================
Fielddata can consume a lot of heap space, especially when loading high
cardinality `analyzed` string fields. Most of the time, it doesn't make sense
to sort or aggregate on `analyzed` string fields (with the notable exception
of the
<<search-aggregations-bucket-significantterms-aggregation,`significant_terms`>>
aggregation). Always think about whether a `not_analyzed` field (which can
use `doc_values`) would be a better fit for your use case.
==============================================================================

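
One way to act on this advice is a multi-field. The mapping below is only a
sketch: it assumes the same 2.x-style `string` mappings used by the other
examples on this page, and the names `city` and `raw` are placeholders chosen
for illustration.

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": { <1>
          "type": "string",
          "fields": {
            "raw": { <2>
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
<1> The `analyzed` `city` field would be used for full text search.
<2> The `not_analyzed` `city.raw` sub-field can use `doc_values`, so it would be the one to sort or aggregate on.
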
TIP: The `fielddata.*` settings must be the same for fields of the
same name in the same index. Their values can be updated on existing fields
using the <<indices-put-mapping,PUT mapping API>>.

[[fielddata-format]]
==== `fielddata.format`

For `analyzed` string fields, the fielddata `format` controls whether
fielddata should be enabled or not. It accepts: `disabled` and `paged_bytes`
(enabled, which is the default). To disable fielddata loading, you can use
the following mapping:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "fielddata": {
            "format": "disabled" <1>
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
<1> The `text` field cannot be used for sorting, aggregations, or in scripts.

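
Because `fielddata.*` settings can be updated on existing fields, the same
change can presumably be applied to an index that already exists. The request
below is a sketch that assumes `my_index`, `my_type`, and the `text` field
were created earlier, and that the 2.x-style PUT mapping endpoint is
available:

[source,js]
--------------------------------------------------
PUT my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type": "string",
      "fielddata": {
        "format": "disabled"
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
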
.Fielddata and other datatypes
[NOTE]
==================================================
Historically, other field datatypes also used fielddata, but this has been replaced
by index-time, disk-based <<doc-values,`doc_values`>>.
==================================================

[[fielddata-loading]]
==== `fielddata.loading`

This per-field setting controls when fielddata is loaded into memory. It
accepts three options:

[horizontal]
`lazy`::
    Fielddata is only loaded into memory when it is needed. (default)

`eager`::
    Fielddata is loaded into memory before a new search segment becomes
    visible to search. This can reduce the latency that a user may experience
    if their search request has to trigger lazy loading from a big segment.

`eager_global_ordinals`::
    Loading fielddata into memory is only part of the work that is required.
    After loading the fielddata for each segment, Elasticsearch builds the
    <<global-ordinals>> data structure to make a list of all unique terms
    across all the segments in a shard. By default, global ordinals are built
    lazily. If the field has a very high cardinality, global ordinals may
    take some time to build, in which case you can use eager loading instead.

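
As a sketch of how one of these options could be set, the mapping below
enables eager loading of global ordinals. It assumes that `loading` sits
under the `fielddata` object in the same way as `format` and `filter` do in
the other examples on this page; the field name `tag` is a placeholder.

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "string",
          "fielddata": {
            "loading": "eager_global_ordinals"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
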
[[global-ordinals]]
.Global ordinals
*****************************************

Global ordinals is a data structure on top of fielddata and doc values that
maintains an incremental numbering for each unique term in lexicographic
order. Each term has a unique number, and the number of term 'A' is lower than
the number of term 'B'. Global ordinals are only supported on string fields.

Fielddata and doc values also have ordinals, which are a unique numbering for
all terms in a particular segment and field. Global ordinals just build on top
of this, by providing a mapping between the segment ordinals and the global
ordinals, the latter being unique across the entire shard.

Global ordinals are used for features that use segment ordinals, such as
sorting and the terms aggregation, to improve the execution time. A terms
aggregation relies purely on global ordinals to perform the aggregation at the
shard level, then converts global ordinals to the real term only for the final
reduce phase, which combines results from different shards.

Global ordinals for a specified field are tied to _all the segments of a
shard_, while fielddata and doc values ordinals are tied to a single segment.
For this reason, global ordinals need to be entirely rebuilt whenever a new
segment becomes visible.

The loading time of global ordinals depends on the number of terms in a field,
but in general it is low, since the source fielddata has already been loaded.
The memory overhead of global ordinals is small because they are very
efficiently compressed. Eager loading of global ordinals can move the loading
time from the first search request to the refresh itself.

*****************************************

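
To make this concrete, the request below is the kind of search that would
rely on global ordinals at the shard level. It is a sketch only: it assumes an
index `my_index` with a string field named `tag`, and the aggregation name
`popular_tags` is a placeholder.

[source,js]
--------------------------------------------------
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "popular_tags": {
      "terms": {
        "field": "tag"
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE

If global ordinals for `tag` are built lazily, the first request of this kind
after a refresh would pay the cost of building them.
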
[[field-data-filtering]]
==== `fielddata.filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage. Terms can be filtered by _frequency_ or
by _regular expression_, or a combination of the two:

Filtering by frequency::
+
--

The frequency filter allows you to only load terms whose term frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number (when the number is bigger than 1.0) or as a percentage
(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "string",
          "fielddata": {
            "filter": {
              "frequency": {
                "min": 0.001,
                "max": 0.1,
                "min_segment_size": 500
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
--

Filtering by regex::
+
--

Terms can also be filtered by regular expression - only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "whitespace",
          "fielddata": {
            "filter": {
              "regex": {
                "pattern": "^#.*"
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
--

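
Since the two filters can be combined, a mapping along the following lines
should apply both at once. This sketch simply merges the two examples above
and assumes that the `regex` and `frequency` objects may sit side by side
under `filter`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "whitespace",
          "fielddata": {
            "filter": {
              "regex": {
                "pattern": "^#.*"
              },
              "frequency": {
                "min": 0.001,
                "max": 0.1,
                "min_segment_size": 500
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
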
These filters can be updated on an existing field mapping and will take
effect the next time the fielddata for a segment is loaded. Use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.
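
A minimal sketch of such a request, assuming the index is named `my_index`
and that the Clear Cache API accepts a `fielddata` query-string parameter to
restrict clearing to the fielddata cache:

[source,js]
--------------------------------------------------
POST my_index/_cache/clear?fielddata=true
--------------------------------------------------
// AUTOSENSE

The cleared fielddata would then be rebuilt, with the new filters applied,
the next time a request needs it.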