[[fielddata]]
=== `fielddata`

Most fields are <<mapping-index,indexed>> by default, which makes them
searchable. Sorting, aggregations, and accessing field values in scripts,
however, require a different access pattern from search.

Search needs to answer the question _"Which documents contain this term?"_,
while sorting and aggregations need to answer a different question: _"What is
the value of this field for **this** document?"_.

Most fields can use index-time, on-disk <<doc-values,`doc_values`>> for this
data access pattern, but <<text,`text`>> fields do not support `doc_values`.

Instead, `text` fields use a query-time *in-memory* data structure called
`fielddata`. This data structure is built on demand the first time that a
field is used for aggregations, sorting, or in a script. It is built by
reading the entire inverted index for each segment from disk, inverting the
term ↔︎ document relationship, and storing the result in memory, in the JVM
heap.
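
The inversion step can be pictured with a small Python sketch. This is only an
illustration of the idea, not how Elasticsearch or Lucene implements it: the
function name `uninvert` and the toy `segment_index` data are assumptions made
up for this example.

```python
from collections import defaultdict


def uninvert(inverted_index):
    """Turn a term -> doc IDs mapping into a doc ID -> terms mapping.

    Hypothetical helper: it mimics the *idea* of building fielddata for one
    segment, not the real on-heap representation.
    """
    doc_to_terms = defaultdict(list)
    # Visit terms in sorted order so each document's term list is sorted too.
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            doc_to_terms[doc_id].append(term)
    return dict(doc_to_terms)


# Toy inverted index for a single segment (made-up data).
segment_index = {"new": [1, 2], "york": [1], "jersey": [2]}

# Search asks: which documents contain "new"?
docs_with_new = segment_index["new"]

# Sorting and aggregations ask: which terms does document 1 have?
fielddata = uninvert(segment_index)
terms_of_doc_1 = fielddata[1]
```

Note that the whole inverted index has to be read to answer the per-document
question for even a single document, which is why the structure is expensive
to build.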

==== Fielddata is disabled on `text` fields by default

Fielddata can consume a *lot* of heap space, especially when loading high
cardinality `text` fields. Once fielddata has been loaded into the heap, it
remains there for the lifetime of the segment. Also, loading fielddata is an
expensive process which can cause users to experience latency hits. This is
why fielddata is disabled by default.

If you try to sort, aggregate, or access values from a script on a `text`
field, you will see this exception:

[quote]
--
Fielddata is disabled on text fields by default. Set `fielddata=true` on
[`your_field_name`] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.
--

[[before-enabling-fielddata]]
==== Before enabling fielddata

Before you enable fielddata, consider why you are using a `text` field for
aggregations, sorting, or in a script. It usually doesn't make sense to do
so.

A `text` field is analyzed before indexing so that a value like
`New York` can be found by searching for `new` or for `york`. A `terms`
aggregation on this field will return a `new` bucket and a `york` bucket, when
you probably want a single bucket called `New York`.

Instead, you should have a `text` field for full text searches, and an
unanalyzed <<keyword,`keyword`>> field with <<doc-values,`doc_values`>>
enabled for aggregations, as follows:

[source,js]
---------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { <1>
          "type": "text",
          "fields": {
            "keyword": { <2>
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
---------------------------------
// CONSOLE
<1> Use the `my_field` field for searches.
<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
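
The difference between bucketing on the analyzed field and on the `keyword`
sub-field can be sketched in Python. The `toy_analyze` function below is a
deliberately simplified stand-in (lowercase, split on whitespace) and not the
real analyzer; `terms_buckets` and the sample values are likewise assumptions
made up for illustration.

```python
from collections import Counter


def toy_analyze(value):
    """Simplified stand-in for text analysis: lowercase, split on whitespace."""
    return value.lower().split()


def terms_buckets(values, analyzed):
    """Count, per term, how many documents contain it (a toy `terms` aggregation)."""
    counts = Counter()
    for value in values:
        # An analyzed field contributes one entry per distinct token;
        # an unanalyzed field contributes the whole value as a single term.
        terms = set(toy_analyze(value)) if analyzed else {value}
        counts.update(terms)
    return dict(counts)


# Made-up field values, one per document.
cities = ["New York", "New York", "New Orleans"]

text_buckets = terms_buckets(cities, analyzed=True)      # like `my_field`
keyword_buckets = terms_buckets(cities, analyzed=False)  # like `my_field.keyword`
```

Here `text_buckets` contains separate `new`, `york`, and `orleans` buckets,
while `keyword_buckets` keeps `New York` and `New Orleans` intact.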

==== Enabling fielddata on `text` fields

You can enable fielddata on an existing `text` field using the
<<indices-put-mapping,PUT mapping API>> as follows:

[source,js]
-----------------------------------
PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field": { <1>
      "type": "text",
      "fielddata": true
    }
  }
}
-----------------------------------
// CONSOLE
// TEST[continued]
<1> The mapping that you specify for `my_field` should consist of the existing
    mapping for that field, plus the `fielddata` parameter.

TIP: The `fielddata.*` parameters must have the same settings for fields of the
same name in the same index. Their values can be updated on existing fields
using the <<indices-put-mapping,PUT mapping API>>.

[[global-ordinals]]
.Global ordinals
*****************************************

Global ordinals are a data structure built on top of fielddata and doc values
that maintains an incremental numbering for each unique term in lexicographic
order. Each term has a unique number, and the number of term 'A' is lower than
the number of term 'B'. Global ordinals are only supported on <<text,`text`>>
and <<keyword,`keyword`>> fields.

Fielddata and doc values also have ordinals, which are a unique numbering for
all terms in a particular segment and field. Global ordinals build on top
of this by providing a mapping between the segment ordinals and the global
ordinals, the latter being unique across the entire shard.

Global ordinals are used for features that use segment ordinals, such as
sorting and the terms aggregation, to improve the execution time. A terms
aggregation relies purely on global ordinals to perform the aggregation at the
shard level, then converts global ordinals to the real term only for the final
reduce phase, which combines results from different shards.

Global ordinals for a specified field are tied to _all the segments of a
shard_, while fielddata and doc values ordinals are tied to a single segment.
For this reason, global ordinals need to be entirely rebuilt whenever a new
segment becomes visible.

The loading time of global ordinals depends on the number of terms in a field,
but in general it is low, since the source field data has already been loaded.
The memory overhead of global ordinals is small because they are very
efficiently compressed. Eager loading of global ordinals can move the loading
time from the first search request to the refresh itself.

*****************************************
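
The relationship between segment ordinals and global ordinals can be sketched
as follows. This is a toy model only: `build_global_ordinals` and the sample
segments are assumptions for illustration, and the real structure is
compressed rather than stored as plain Python lists.

```python
def build_global_ordinals(segment_terms):
    """Map each segment's ordinals to shard-wide ordinals.

    segment_terms: one sorted list of unique terms per segment; a term's
    position in its list is its segment ordinal.
    Returns (global_terms, segment_to_global), where
    segment_to_global[s][segment_ord] is the global ordinal.
    """
    # Global ordinals number every unique term in the shard in sorted order.
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    segment_to_global = [
        [global_ord[term] for term in terms] for terms in segment_terms
    ]
    return global_terms, segment_to_global


# Two made-up segments of the same shard.
segments = [["apple", "cherry"], ["banana", "cherry"]]
global_terms, segment_to_global = build_global_ordinals(segments)

# "cherry" is ordinal 1 in both segments but has a single global ordinal,
# so shard-level counting can work on integers and resolve the term later.
cherry_global = segment_to_global[0][1]
```

Adding a third segment to `segments` changes the sorted term list, so the
whole mapping has to be recomputed, which mirrors why a new segment forces a
rebuild.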

[[field-data-filtering]]
==== `fielddata_frequency_filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage. Terms can be filtered by _frequency_:

The frequency filter allows you to only load terms whose document frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number (when the number is bigger than 1.0) or as a percentage
(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
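
The `min` and `max` semantics described above can be sketched in Python for a
single segment. This is a hedged illustration, not the actual implementation:
the function names and sample frequencies are made up, the bounds are assumed
to be inclusive, any rounding of percentage bounds is not modeled, and
`min_segment_size` is left out entirely.

```python
def resolve_bound(bound, docs_with_value):
    """Treat values above 1.0 as absolute counts, others as a fraction
    of the documents that have a value for the field."""
    return bound if bound > 1.0 else bound * docs_with_value


def filter_terms(doc_freqs, docs_with_value, min_freq, max_freq):
    """Return the terms whose document frequency lies between the bounds.

    Assumption: both bounds are inclusive.
    """
    low = resolve_bound(min_freq, docs_with_value)
    high = resolve_bound(max_freq, docs_with_value)
    return sorted(term for term, df in doc_freqs.items() if low <= df <= high)


# Made-up per-segment document frequencies; 1000 docs have a value.
doc_freqs = {"the": 900, "elasticsearch": 50, "typo": 1}

# 1% to 10% of 1000 docs: keep terms found in 10 to 100 documents.
kept_by_percentage = filter_terms(doc_freqs, 1000, min_freq=0.01, max_freq=0.1)

# The same window expressed as absolute document counts.
kept_by_count = filter_terms(doc_freqs, 1000, min_freq=10, max_freq=100)
```

In both calls the very common term `the` and the very rare term `typo` are
dropped, which is the intended effect: neither tends to be useful in an
aggregation, and skipping them saves heap.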