
[[fielddata-formats]]
== Fielddata formats

The field data format controls how field data is stored.

Depending on the field type, several field data formats might be
available. In particular, string, geo-point and numeric types support the
`doc_values` format, which computes the field data data-structures at indexing
time and stores them on disk. Although it makes the index larger and may
be slightly slower, this implementation is more near-realtime-friendly
and requires much less memory from the JVM than other implementations.

Here is an example of how to configure the `tag` field to use the
`paged_bytes` field data format.

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "format": "paged_bytes"
        }
    }
}
--------------------------------------------------

It is possible to change the field data format (and the field data settings
in general) on a live index by using the update mapping API.

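As a rough illustration of such an update, the Python sketch below only builds the request path and body; it does not send anything. The index name `my_index`, the type name `my_type` and the helper `fielddata_update` are hypothetical, and the `/{index}/_mapping/{type}` path shape with a type-wrapped body is an assumption about the update mapping endpoint that should be checked against the API reference for your version.

```python
import json

# Hypothetical index and type names, used only for illustration.
INDEX = "my_index"
DOC_TYPE = "my_type"


def fielddata_update(field, field_type, fielddata):
    """Build a (path, body) pair for an update-mapping request.

    The path shape and the type-wrapped body are assumptions; nothing
    is sent over the network here.
    """
    path = "/{}/_mapping/{}".format(INDEX, DOC_TYPE)
    body = {
        DOC_TYPE: {
            "properties": {
                field: {"type": field_type, "fielddata": fielddata}
            }
        }
    }
    return path, body


path, body = fielddata_update("tag", "string", {"format": "paged_bytes"})
print("PUT", path)
print(json.dumps(body, indent=2))
```

The printed body can be sent with any HTTP client; only the `fielddata` object changes between the different settings described on this page.
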
[float]
=== String field data types

`paged_bytes` (default on analyzed string fields)::
    Stores unique terms sequentially in a large buffer and maps documents to
    the indices of the terms they contain in this large buffer.

`doc_values` (default when index is set to `not_analyzed`)::
    Computes and stores field data data-structures on disk at indexing time.
    Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
    `not_analyzed`).

[float]
=== Numeric field data types

`array`::
    Stores field values in memory using arrays.

`doc_values` (default unless doc values are disabled)::
    Computes and stores field data data-structures on disk at indexing time.

[float]
=== Geo point field data types

`array`::
    Stores latitudes and longitudes in arrays.

`doc_values` (default unless doc values are disabled)::
    Computes and stores field data data-structures on disk at indexing time.

[float]
[[global-ordinals]]
=== Global ordinals

Global ordinals are a data-structure on top of field data that maintains an
incremental numbering for all the terms in field data, in lexicographic order.
Each term has a unique number, and the number of term 'A' is lower than the number
of term 'B'. Global ordinals are only supported on string fields.

Field data on strings also has ordinals: a unique numbering for all terms
in a particular segment and field. Global ordinals build on top of this
by providing a mapping between the segment ordinals and the global ordinals,
the latter being unique across the entire shard.

Global ordinals can improve the execution time of search features that already use
segment ordinals, such as the terms aggregator. Often these search features
need to merge the per-segment ordinal results into a cross-segment terms result. With
global ordinals this mapping happens at field data load time instead of during each
query execution. Search features then only need to resolve the actual
term when building the (shard) response; during execution the unique numbering
that global ordinals provide is sufficient, and the actual terms are not needed.

Global ordinals for a specified field are tied to all the segments of a shard (Lucene index),
unlike field data for a specific field, which is tied to a single segment.
For this reason global ordinals need to be rebuilt in their entirety once new segments
become visible. This one-time cost would be incurred anyway without global ordinals, but
then it would be paid on each search execution instead!

The loading time of global ordinals depends on the number of terms in a field, but in general
it is low, since the source field data has already been loaded. The memory overhead of global
ordinals is small because they are very efficiently compressed. Eager loading of global ordinals
can move the loading time from the first search request to the refresh itself.

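The relationship between segment ordinals and global ordinals can be sketched in a few lines of Python. This is a toy model, not the actual implementation: the function names, the in-memory lists and the single-valued documents are all assumptions made for illustration, and no compression is modelled.

```python
def build_global_ordinals(segments):
    """segments: one sorted list of unique terms per segment.

    Returns (global_terms, mappings), where mappings[s][seg_ord] is the
    global ordinal of the term that has ordinal seg_ord in segment s.
    """
    global_terms = sorted(set(term for seg in segments for term in seg))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    mappings = [[global_ord[term] for term in seg] for seg in segments]
    return global_terms, mappings


def count_terms(segments_docs, mappings, num_global):
    """Count documents per global ordinal without looking at any term."""
    counts = [0] * num_global
    for seg_docs, mapping in zip(segments_docs, mappings):
        for seg_ord in seg_docs:  # one segment ordinal per document
            counts[mapping[seg_ord]] += 1
    return counts


# Two segments; a term's position in its list is its segment ordinal.
segments = [["apple", "cherry"], ["banana", "cherry", "date"]]
global_terms, mappings = build_global_ordinals(segments)

# Each document is represented by the segment ordinal of its single term.
segments_docs = [[0, 1, 1], [0, 1, 2, 2]]
counts = count_terms(segments_docs, mappings, len(global_terms))

# Terms are resolved only once, when the response is built.
result = {global_terms[g]: c for g, c in enumerate(counts) if c}
print(mappings)
print(result)
```

In this sketch the mapping is built once for all segments, which mirrors why it has to be rebuilt whenever a new segment becomes visible, while counting works purely on integers.
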
[float]
[[fielddata-loading]]
=== Fielddata loading

By default, field data is loaded lazily, i.e. the first time that a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow, since fielddata loading is a heavy
operation.

It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager"
        }
    }
}
--------------------------------------------------

Global ordinals can also be eagerly loaded:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager_global_ordinals"
        }
    }
}
--------------------------------------------------

With the above setting, both field data and global ordinals for a specific field
are eagerly loaded.

[float]
==== Disabling field data loading

Field data can take a lot of RAM, so it makes sense to disable field data
loading on the fields that don't need field data, for example those that are
used for full-text search only. To disable field data loading,
change the field data format to `disabled`. When disabled, all requests that
try to load field data, e.g. when they include aggregations and/or sorting,
return an error.

[source,js]
--------------------------------------------------
{
    "text": {
        "type": "string",
        "fielddata": {
            "format": "disabled"
        }
    }
}
--------------------------------------------------

The `disabled` format is supported by all field types.

[float]
[[field-data-filtering]]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency

The frequency filter allows you to only load terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number (when the number is bigger than 1.0) or as a percentage
(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "filter": {
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------

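To make the `min`/`max` interpretation concrete, here is a small Python sketch of how the bounds could be resolved for a single segment. It is a model of the rules described above, not the real filter: the function names are hypothetical, truncating percentages to whole documents and treating both bounds as inclusive are assumptions, and `min_segment_size` is deliberately left out.

```python
def resolve_bound(value, docs_with_value):
    """Turn a `min` or `max` setting into a document count.

    Values above 1.0 are read as absolute counts; anything else is read
    as a fraction of the docs that have a value for the field.
    Truncating to an int is an assumption of this sketch.
    """
    if value > 1.0:
        return int(value)
    return int(docs_with_value * value)


def frequency_filter(term_doc_freqs, docs_with_value, min_freq, max_freq):
    """Return the terms of one segment that would be loaded.

    Both bounds are treated as inclusive, which is also an assumption.
    """
    lo = resolve_bound(min_freq, docs_with_value)
    hi = resolve_bound(max_freq, docs_with_value)
    return sorted(t for t, df in term_doc_freqs.items() if lo <= df <= hi)


# One segment in which 1000 docs have a value for the field.
doc_freqs = {"rare": 2, "useful": 50, "common": 900}

# Percentages: roughly 1% to 10% of 1000 docs.
by_percentage = frequency_filter(doc_freqs, 1000, 0.01, 0.1)

# Absolute numbers: between 5 and 500 docs.
by_count = frequency_filter(doc_freqs, 1000, 5, 500)

print(by_percentage, by_count)
```

Because frequency is calculated per segment, the same term may be loaded from one segment and filtered out of another.
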
[float]
==== Filtering by regex

Terms can also be filtered by regular expression: only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------

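The difference between matching each term and matching the whole field value can be seen in a short Python sketch. It approximates the `whitespace` analyzer with `str.split()` and uses Python's `re` module in place of the actual regular expression engine; both substitutions, the function names and the sample tweet are assumptions made for illustration. Whether the pattern must match the whole term or only part of it is not stated above; for `^#.*` the two readings give the same result.

```python
import re


def whitespace_terms(text):
    """Rough stand-in for the whitespace analyzer: split on whitespace."""
    return text.split()


def regex_filter(terms, pattern):
    """Keep only the terms that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


tweet = "Indexing at scale with #elasticsearch and #lucene"
pattern = "^#.*"

# Applied per term: only the hashtags survive.
loaded = regex_filter(whitespace_terms(tweet), pattern)

# Applied to the whole field value, the same pattern matches nothing,
# because the tweet as a whole does not start with '#'.
whole_value_match = re.fullmatch(pattern, tweet)

print(loaded)
print(whole_value_match)
```
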
[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                },
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------
  213. --------------------------------------------------