[[index-modules-fielddata]]
== Field data

The field data cache is used mainly when sorting on or faceting on a
field. It loads all the field values into memory in order to provide fast
document-based access to those values. The field data cache can be
expensive to build for a field, so it's recommended to have enough memory
to allocate it, and to keep it loaded.

The amount of memory used for the field data cache can be controlled using
`indices.fielddata.cache.size`. Note: reloading field data which does not
fit into your cache is expensive and performs poorly.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The maximum size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.

|`indices.fielddata.cache.expire` |A time-based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5 minute expiry.
|=======================================================================

=== Field data formats

The field data format controls how field data should be stored.

Depending on the field type, there might be several field data formats
available. In particular, string and numeric types support the `doc_values`
format, which computes the field data structures at indexing time and stores
them on disk. Although it makes the index larger and may be slightly slower,
this implementation is more near-realtime-friendly and requires much less
JVM memory than the other implementations.

Here is an example of how to configure the `tag` field to use the `fst` field
data format.

[source,js]
--------------------------------------------------
{
    tag: {
        type:      "string",
        fielddata: {
            format: "fst"
        }
    }
}
--------------------------------------------------

It is possible to change the field data format (and the field data settings
in general) on a live index by using the update mapping API. When doing so,
field data which has already been loaded for existing segments remains
alive, while new segments use the new field data configuration. Thanks to
the background merging process, all segments will eventually use the new
field data format.

[float]
==== String field data types

`paged_bytes` (default)::
    Stores unique terms sequentially in a large buffer and maps documents to
    the indices of the terms they contain in this large buffer.

`fst`::
    Stores terms in an FST. Slower to build than `paged_bytes` but can help
    lower memory usage if many terms share common prefixes and/or suffixes.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Lowers memory usage but only works on non-analyzed strings (`index`: `no`
    or `not_analyzed`) and doesn't support filtering.
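
As a sketch of the constraint described for `doc_values` above, a string field
that uses this format would also have to be non-analyzed. The field name
`status` is hypothetical and only serves as an illustration:

[source,js]
--------------------------------------------------
{
    status: {
        type:      "string",
        index:     "not_analyzed",
        fielddata: {
            format: "doc_values"
        }
    }
}
--------------------------------------------------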

[float]
==== Numeric field data types

`array` (default)::
    Stores field values in memory using arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Doesn't support filtering.

[float]
==== Geo point field data types

`array` (default)::
    Stores latitudes and longitudes in arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.

[float]
=== Fielddata loading

By default, field data is loaded lazily, i.e. the first time that a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow, since fielddata loading is a heavy
operation.

It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:

[source,js]
--------------------------------------------------
{
    category: {
        type:      "string",
        fielddata: {
            loading: "eager"
        }
    }
}
--------------------------------------------------

[float]
==== Disabling field data loading

Field data can take a lot of RAM, so it makes sense to disable field data
loading on the fields that don't need field data, for example those that are
used for full-text search only. To disable field data loading, change the
field data format to `disabled`. When disabled, any request that tries to
load field data, e.g. one that includes aggregations and/or sorting, returns
an error.

[source,js]
--------------------------------------------------
{
    text: {
        type:      "string",
        fielddata: {
            format: "disabled"
        }
    }
}
--------------------------------------------------

The `disabled` format is supported by all field types.

[float]
[[field-data-filtering]]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency

The frequency filter allows you to only load terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    tag: {
        type:      "string",
        fielddata: {
            filter: {
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}
--------------------------------------------------
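
The per-segment behaviour of the frequency filter can be sketched in a few
lines of JavaScript. This is an illustration only, not how Elasticsearch
implements it: the `segment` structure and the `termsToLoad` function are
hypothetical, and two details are assumptions (values below `1` are treated
as fractions while values of `1` or more are treated as absolute counts, and
both bounds are inclusive).

[source,js]
--------------------------------------------------
// Illustrative sketch only; not Elasticsearch's implementation.
// `segment` is a hypothetical structure:
//   docCount:      total number of docs in the segment
//   docsWithField: number of docs that have a value for the field
//   termDocFreqs:  { term: number of docs in the segment containing it }
function termsToLoad(segment, filter) {
    // Segments smaller than min_segment_size are excluded completely.
    if (filter.min_segment_size !== undefined &&
            segment.docCount < filter.min_segment_size) {
        return [];
    }
    // Assumption: values below 1 are fractions of docsWithField,
    // values of 1 or more are absolute document counts.
    const toCount = (v) => (v < 1 ? v * segment.docsWithField : v);
    const min = filter.min === undefined ? 0 : toCount(filter.min);
    const max = filter.max === undefined ? Infinity : toCount(filter.max);

    return Object.keys(segment.termDocFreqs).filter((term) => {
        const freq = segment.termDocFreqs[term];
        return freq >= min && freq <= max; // inclusive bounds: an assumption
    });
}

const segment = {
    docCount:      2500,
    docsWithField: 2000,
    termDocFreqs:  { typo: 1, elasticsearch: 150, the: 1900 }
};
const filter = { min: 0.001, max: 0.1, min_segment_size: 500 };

// min becomes 2 docs and max becomes 200 docs for this segment.
console.log(termsToLoad(segment, filter)); // [ 'elasticsearch' ]
--------------------------------------------------

Because the thresholds are derived from each segment's own document counts,
the same term may be loaded from one segment and skipped in another.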

[float]
==== Filtering by regex

Terms can also be filtered by regular expression: only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern: "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------
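
The difference between matching each term and matching the whole field value
can be illustrated with a short JavaScript sketch. It is not Elasticsearch
code: `whitespaceTerms` is a rough, hypothetical stand-in for the `whitespace`
analyzer, and JavaScript's `RegExp` stands in for the regex engine that
Elasticsearch uses, whose syntax may differ.

[source,js]
--------------------------------------------------
// Illustrative sketch only; not Elasticsearch's implementation.
function whitespaceTerms(text) {
    // Rough stand-in for the `whitespace` analyzer: split on whitespace only.
    return text.split(/\s+/).filter((t) => t.length > 0);
}

function termsMatching(text, pattern) {
    const regex = new RegExp(pattern);
    // The pattern is tested against each term, not the whole field value.
    return whitespaceTerms(text).filter((term) => regex.test(term));
}

const tweet = "Loving the new release #elasticsearch #search";

console.log(termsMatching(tweet, "^#.*")); // [ '#elasticsearch', '#search' ]

// Tested against the whole value, the same pattern matches nothing,
// because the tweet itself does not start with `#`.
console.log(new RegExp("^#.*").test(tweet)); // false
--------------------------------------------------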

[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern: "^#.*"
                },
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[field-data-circuit-breaker]]
=== Field data circuit breaker

The field data circuit breaker allows Elasticsearch to estimate the amount of
memory a field will require to be loaded into memory. It can then prevent the
field data loading by raising an exception. By default it is configured with
no limit (-1 bytes), but is automatically set to `indices.fielddata.cache.size`
if set. It can be configured with the following parameters:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.breaker.limit` |Maximum size of estimated field data
to allow loading. Defaults to 80% of the maximum JVM heap.

|`indices.fielddata.breaker.overhead` |A constant that all field data
estimations are multiplied with to determine a final estimation. Defaults to
1.03
|=======================================================================

Both the `indices.fielddata.breaker.limit` and
`indices.fielddata.breaker.overhead` can be changed dynamically using the
cluster update settings API.
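
How the two settings interact can be sketched as follows. This is a minimal
illustration under stated assumptions, not Elasticsearch's implementation:
the function `checkFielddataLoad` and all byte figures are hypothetical, and
the sketch assumes that the breaker compares the limit against the field data
already loaded plus the new estimate.

[source,js]
--------------------------------------------------
// Illustrative sketch only; not Elasticsearch's implementation.
function checkFielddataLoad(estimatedBytes, currentBytes, limitBytes, overhead) {
    // Every estimate is multiplied by the overhead constant.
    const finalEstimate = estimatedBytes * overhead;
    // Assumption: the breaker trips when the already loaded field data plus
    // the new estimate would exceed the limit. A limit of -1 means no limit.
    if (limitBytes !== -1 && currentBytes + finalEstimate > limitBytes) {
        throw new Error("field data circuit breaker tripped: " +
            Math.round(currentBytes + finalEstimate) + " bytes needed, " +
            "limit is " + limitBytes + " bytes");
    }
    return finalEstimate;
}

const MB = 1024 * 1024;
const limit = 800 * MB; // hypothetical limit

// 100 MB estimated, 103 MB after the 1.03 overhead: allowed.
console.log(checkFielddataLoad(100 * MB, 0, limit, 1.03) <= limit); // true

// 780 MB estimated, 803.4 MB after the overhead: rejected.
try {
    checkFielddataLoad(780 * MB, 0, limit, 1.03);
} catch (e) {
    console.log(e.message);
}
--------------------------------------------------

Note that the overhead alone can push an estimate over the limit, as in the
second call, even though the raw estimate is below it.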

[float]
[[field-data-monitoring]]
=== Monitoring field data

You can monitor memory usage for field data as well as the field data circuit
breaker using the
<<cluster-nodes-stats,Nodes Stats API>>.