[[index-modules-fielddata]]
== Field data

The field data cache is used mainly when sorting or faceting on a
field. It loads all of the field's values into memory to provide fast,
document-based access to those values. The field data cache can be
expensive to build for a field, so it is recommended to have enough memory
to allocate it and to keep it loaded.

The amount of memory used for the field data cache can be controlled
using `indices.fielddata.cache.size`. Note: reloading field data that
does not fit into your cache is expensive and performs poorly.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The max size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.

|`indices.fielddata.cache.expire` |A time-based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5 minute expiry.
|=======================================================================
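
As a rough illustration of how a value such as `30%` or `12GB` translates
into a byte budget, here is a sketch in JavaScript. The `resolveCacheSize`
helper, its unit table, and the 1024-based multipliers are assumptions made
for illustration; this is not how Elasticsearch itself parses the setting.

[source,js]
--------------------------------------------------
// Illustrative only: resolves a cache-size setting to a byte count.
// The helper name and the supported units are hypothetical.
function resolveCacheSize(setting, heapBytes) {
    const units = { b: 1, kb: 1024, mb: 1024 ** 2, gb: 1024 ** 3 };
    const value = String(setting).trim().toLowerCase();

    // Percentages are taken relative to the node heap size.
    if (value.endsWith("%")) {
        return Math.floor(heapBytes * parseFloat(value) / 100);
    }

    // Otherwise expect an absolute value with a unit suffix, e.g. "12gb".
    const match = /^(\d+(?:\.\d+)?)(b|kb|mb|gb)$/.exec(value);
    if (!match) {
        throw new Error("unrecognized size: " + setting);
    }
    return Math.floor(parseFloat(match[1]) * units[match[2]]);
}
--------------------------------------------------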
=== Field data formats

Depending on the field type, several field data formats may be
available. In particular, string and numeric types support the `doc_values`
format, which computes the field data data-structures at indexing
time and stores them on disk. Although this makes the index larger and may
be slightly slower, it is more near-realtime-friendly and requires much
less memory from the JVM than the other implementations.

The format is chosen in the field mapping. For example, the following
mapping selects the `fst` format for a string field:

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "format": "fst"
        }
    }
}
--------------------------------------------------
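
The `doc_values` format is selected the same way. The mapping below is a
minimal sketch, assuming a hypothetical string field named `tag` that is
mapped as `not_analyzed`; whether this exact form is accepted may depend on
the version you are running.

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "index": "not_analyzed",
        "fielddata": {
            "format": "doc_values"
        }
    }
}
--------------------------------------------------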
[float]
==== String field data types

`paged_bytes` (default)::
    Stores unique terms sequentially in a large buffer and maps documents to
    the indices of the terms they contain in this large buffer.

`fst`::
    Stores terms in an FST. Slower to build than `paged_bytes` but can help lower
    memory usage if many terms share common prefixes and/or suffixes.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
    `not_analyzed`) and doesn't support filtering.

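
The idea behind `paged_bytes`, storing each unique term once and letting
documents refer to terms by index, can be sketched as follows. This is a
simplified illustration, not the actual implementation: the
`buildTermOrdinals` function is hypothetical, and a sorted array stands in
for the large buffer.

[source,js]
--------------------------------------------------
// Illustrative only: a simplified in-memory layout.
// `docs` is an array of documents, each an array of its terms.
function buildTermOrdinals(docs) {
    // Store every unique term exactly once, in sorted order.
    const terms = Array.from(new Set(docs.flat())).sort();

    // Look up the position (ordinal) of a term in the shared term list.
    const ordinalOf = new Map(terms.map((term, i) => [term, i]));

    // Each document keeps only the ordinals of the terms it contains.
    const docOrdinals = docs.map(docTerms =>
        docTerms.map(term => ordinalOf.get(term))
    );

    return { terms, docOrdinals };
}
--------------------------------------------------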
[float]
==== Numeric field data types

`array` (default)::
    Stores field values in memory using arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Doesn't support filtering.

[float]
==== Geo point field data types

`array` (default)::
    Stores latitudes and longitudes in arrays.

[float]
=== Fielddata loading

By default, field data is loaded lazily, the first time a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow, since fielddata loading is a heavy
operation.

It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager"
        }
    }
}
--------------------------------------------------
[float]
[[field-data-filtering]]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency

The frequency filter allows you to only load terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "filter": {
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------
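
To make the `min` and `max` thresholds concrete, the sketch below applies
them to the terms of a single segment. It is a hedged illustration only: the
`filterTermsByFrequency` function is hypothetical, and it assumes that
values below `1` are fractions of the docs that have a value for the field,
that larger values are absolute counts, and that both bounds are inclusive.
It does not model `min_segment_size`.

[source,js]
--------------------------------------------------
// Illustrative only: frequency filtering for one segment.
// `docFreqs` maps each term to the number of docs containing it;
// `docsWithValue` is the number of docs with a value for the field.
function filterTermsByFrequency(docFreqs, docsWithValue, min, max) {
    // Assumption: values below 1 are fractions, others are absolute counts.
    const toCount = value => (value < 1 ? value * docsWithValue : value);
    const minCount = toCount(min);
    const maxCount = toCount(max);

    // Keep only terms whose document frequency lies within the bounds.
    return Object.keys(docFreqs).filter(
        term => docFreqs[term] >= minCount && docFreqs[term] <= maxCount
    );
}
--------------------------------------------------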
[float]
==== Filtering by regex

Terms can also be filtered by regular expression: only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------
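
The effect of applying the pattern per term, rather than to the whole field
value, can be sketched in JavaScript. This is an approximation under stated
assumptions: the `loadableTerms` function is hypothetical, splitting on
whitespace stands in for the `whitespace` analyzer, and JavaScript's
`RegExp` stands in for the regular expression engine used by the filter.

[source,js]
--------------------------------------------------
// Illustrative only: which terms of a field value would pass the filter.
function loadableTerms(text, pattern) {
    const regex = new RegExp(pattern);

    // Split into terms first, then test each term on its own.
    return text
        .split(/\s+/)
        .filter(term => term.length > 0 && regex.test(term));
}

// Only the hashtag terms match "^#.*".
const hashtags = loadableTerms(
    "Indexing tweets with #elasticsearch and #lucene",
    "^#.*"
);
--------------------------------------------------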
[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                },
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------
[float]
[[field-data-monitoring]]
=== Monitoring field data

You can monitor memory usage for field data using the
<<cluster-nodes-stats,Nodes Stats API>>.