[[index-modules-fielddata]]
== Field data

The field data cache is used mainly when sorting on or computing aggregations
on a field. It loads all the field values into memory in order to provide fast
document-based access to those values. The field data cache can be
expensive to build for a field, so it's recommended to have enough memory
to allocate it, and to keep it loaded.

The amount of memory used for the field
data cache can be controlled using `indices.fielddata.cache.size`. Note:
reloading field data which does not fit into your cache will be expensive
and perform poorly.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The max size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.

|`indices.fielddata.cache.expire` |A time-based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5 minute expiry.
|=======================================================================

[float]
[[circuit-breaker]]
=== Circuit Breaker

coming[1.4.0,Prior to 1.4.0 there was only a single circuit breaker for fielddata]

Elasticsearch contains multiple circuit breakers used to prevent operations from
causing an OutOfMemoryError. Each breaker specifies a limit for how much memory
it can use. Additionally, there is a parent-level breaker that specifies the
total amount of memory that can be used across all breakers.

The parent-level breaker can be configured with the following setting:

`indices.breaker.total.limit`::
    Starting limit for overall parent breaker, defaults to 70% of JVM heap.

All circuit breaker settings can be changed dynamically using the cluster update
settings API.
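
As a minimal sketch of a dynamic change, the request body below could be sent
to the cluster update settings API to lower the fielddata breaker limit. The
`50%` value is only an illustrative assumption, not a recommendation, and
`transient` (rather than `persistent`) is chosen here so that the change does
not survive a full cluster restart.

[source,js]
--------------------------------------------------
{
    "transient": {
        "indices.breaker.fielddata.limit": "50%"
    }
}
--------------------------------------------------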

[float]
[[fielddata-circuit-breaker]]
==== Field data circuit breaker

The field data circuit breaker allows Elasticsearch to estimate the amount of
memory a field will require to be loaded into memory. It can then prevent the
field data loading by raising an exception. By default the limit is configured
to 60% of the maximum JVM heap. It can be configured with the following
parameters:

`indices.breaker.fielddata.limit`::
    Limit for fielddata breaker, defaults to 60% of JVM heap.

`indices.breaker.fielddata.overhead`::
    A constant that all field data estimations are multiplied with to determine a
    final estimation. Defaults to 1.03.

`indices.fielddata.breaker.limit`::
    deprecated[1.4.0,Replaced by `indices.breaker.fielddata.limit`]

`indices.fielddata.breaker.overhead`::
    deprecated[1.4.0,Replaced by `indices.breaker.fielddata.overhead`]

[float]
[[request-circuit-breaker]]
==== Request circuit breaker

coming[1.4.0]

The request circuit breaker allows Elasticsearch to prevent per-request data
structures (for example, memory used for calculating aggregations during a
request) from exceeding a certain amount of memory.

`indices.breaker.request.limit`::
    Limit for request breaker, defaults to 40% of JVM heap.

`indices.breaker.request.overhead`::
    A constant that all request estimations are multiplied with to determine a
    final estimation. Defaults to 1.

[float]
[[fielddata-monitoring]]
=== Monitoring field data

You can monitor memory usage for field data as well as the field data circuit
breaker using the
<<cluster-nodes-stats,Nodes Stats API>>.

[[fielddata-formats]]
== Field data formats

The field data format controls how field data should be stored.

Depending on the field type, there might be several field data types
available. In particular, string and numeric types support the `doc_values`
format which allows for computing the field data data-structures at indexing
time and storing them on disk. Although it will make the index larger and may
be slightly slower, this implementation will be more near-realtime-friendly
and will require much less memory from the JVM than other implementations.

Here is an example of how to configure the `tag` field to use the `fst` field
data format.

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "format": "fst"
        }
    }
}
--------------------------------------------------

It is possible to change the field data format (and the field data settings
in general) on a live index by using the update mapping API. When doing so,
field data which had already been loaded for existing segments will remain
alive while new segments will use the new field data configuration. Thanks to
the background merging process, all segments will eventually use the new
field data format.
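
As a rough sketch of such a live change, the body below could be sent to the
update mapping API to switch the `tag` field from the example above back to
the default `paged_bytes` format. The index and type the request is addressed
to are not shown and would be your own. The field definition is assumed to be
repeated in full, with only the `fielddata` settings changed.

[source,js]
--------------------------------------------------
{
    "properties": {
        "tag": {
            "type": "string",
            "fielddata": {
                "format": "paged_bytes"
            }
        }
    }
}
--------------------------------------------------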

[float]
==== String field data types

`paged_bytes` (default)::
    Stores unique terms sequentially in a large buffer and maps documents to
    the indices of the terms they contain in this large buffer.

`fst`::
    Stores terms in an FST. Slower to build than `paged_bytes` but can help lower
    memory usage if many terms share common prefixes and/or suffixes.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
    `not_analyzed`) and doesn't support filtering.
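
To illustrate the `doc_values` restriction, here is a sketch of a string field
that satisfies it. The field name `tag` is reused from the earlier example.
The important part is that `index` is set to `not_analyzed` alongside the
`doc_values` format.

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "index": "not_analyzed",
        "fielddata": {
            "format": "doc_values"
        }
    }
}
--------------------------------------------------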

[float]
==== Numeric field data types

`array` (default)::
    Stores field values in memory using arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Doesn't support filtering.

[float]
==== Geo point field data types

`array` (default)::
    Stores latitudes and longitudes in arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.

[float]
==== Global ordinals

added[1.2.0]

Global ordinals are a data structure on top of field data that maintains an
incremental numbering for all the terms in field data in lexicographic order.
Each term has a unique number, and the number of term 'A' is lower than the number
of term 'B'. Global ordinals are only supported on string fields.

Field data on string fields also has ordinals, which are a unique numbering for all terms
in a particular segment and field. Global ordinals build on top of this
by providing a mapping between the segment ordinals and the global ordinals,
the latter being unique across the entire shard.

Global ordinals can improve the execution time of search features that already use
segment ordinals, such as the terms aggregator. Often these search features
need to merge the segment ordinal results into a cross-segment terms result. With
global ordinals this mapping happens during field data load time instead of during each
query execution. Search features then only need to resolve the actual
term when building the (shard) response. During execution there is no need
to use the actual terms at all: the unique numbering that global ordinals provide is
sufficient, which improves the execution time.

Global ordinals for a specified field are tied to all the segments of a shard (Lucene index),
whereas field data for a specific field is tied to a single segment.
For this reason global ordinals need to be rebuilt in their entirety once new segments
become visible. This one-time cost would happen anyway without global ordinals, but
then it would happen for each search execution instead!

The loading time of global ordinals depends on the number of terms in a field, but in general
it is low, since the source field data has already been loaded. The memory overhead of global
ordinals is small because they are very efficiently compressed. Eager loading of global ordinals
can move the loading time from the first search request to the refresh itself.

[float]
=== Fielddata loading

By default, field data is loaded lazily, i.e. the first time that a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow since fielddata loading is a heavy
operation.

It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager"
        }
    }
}
--------------------------------------------------

Global ordinals can also be eagerly loaded:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager_global_ordinals"
        }
    }
}
--------------------------------------------------

With the above setting both field data and global ordinals for a specific field
are eagerly loaded.

[float]
==== Disabling field data loading

Field data can take a lot of RAM, so it makes sense to disable field data
loading on the fields that don't need field data, for example those that are
used for full-text search only. In order to disable field data loading, just
change the field data format to `disabled`. When disabled, all requests that
try to load field data, e.g. when they include aggregations and/or sorting,
will return an error.

[source,js]
--------------------------------------------------
{
    "text": {
        "type": "string",
        "fielddata": {
            "format": "disabled"
        }
    }
}
--------------------------------------------------

The `disabled` format is supported by all field types.

[float]
[[field-data-filtering]]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency:

The frequency filter allows you to only load terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "filter": {
                "frequency": {
                    "min":              0.001,
                    "max":              0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Filtering by regex

Terms can also be filtered by regular expression: only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                },
                "frequency": {
                    "min":              0.001,
                    "max":              0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------
  286. }
  287. }
  288. --------------------------------------------------