[[index-modules-codec]]
== Codec module

Codecs define how documents are written to disk and read from disk. The
postings format is the part of the codec that is responsible for reading
and writing the term dictionary, postings lists and positions, as well as
the payloads and offsets stored in the postings list. The doc values format
is responsible for reading and writing column-stride storage for a field and
is typically used for sorting or faceting. When a field doesn't have doc
values enabled, it is still possible to sort or facet by loading field values
from the inverted index into main memory.

Configuring custom postings or doc values formats is an expert feature. The
built-in formats, described in the <<mapping-core-types,mapping section>>,
will most likely suit your needs.

**********************************
Only the default codec, postings format and doc values format are supported.
Other formats may break backward compatibility between minor versions of
Elasticsearch, requiring data to be reindexed.
**********************************

[float]
[[custom-postings]]
=== Configuring a custom postings format

A custom postings format can be defined in the `codec` part of the index
settings. The `codec` part can be configured when creating an index or
updating index settings. The following example defines a custom postings
format named `my_format`:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_format" : {
                        "type" : "pulsing",
                        "freq_cut_off" : "5"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------

Then, when defining your mapping, you can use the `my_format` name in the
`postings_format` option, as the example below illustrates:

[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
        }
    }
}
--------------------------------------------------

[float]
=== Available postings formats

[float]
[[direct-postings]]
==== Direct postings format

Wraps the default postings format for on-disk storage, but at read time
loads and stores all terms and postings directly in RAM. This postings
format makes no effort to compress the terms and postings lists and is
therefore memory intensive, but in exchange it gives a substantial increase
in search performance. Because it holds all term bytes as a single `byte[]`,
you cannot have more than 2.1GB worth of terms in a single segment.

This postings format offers the following parameters:

`min_skip_count`::
    The minimum number of terms with a shared prefix required before a
    skip pointer is written. The default is *8*.

`low_freq_cutoff`::
    Terms with a document frequency lower than this cutoff use a single
    array object representation for postings and positions. The default
    is *32*.

Type name: `direct`

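
As a minimal sketch, the settings fragment below registers a `direct`
postings format with both parameters written out at their documented
defaults. The format name `my_direct` is hypothetical, and passing the values
as strings simply mirrors the `freq_cut_off` example earlier in this section.
Treat it as an illustration rather than a tuned configuration.

[source,js]
--------------------------------------------------
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_direct" : {
                        "type" : "direct",
                        "min_skip_count" : "8",
                        "low_freq_cutoff" : "32"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

A field would then opt in by setting `"postings_format" : "my_direct"` in
its mapping.
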
[float]
[[memory-postings]]
==== Memory postings format

A postings format that stores terms and postings (docs, positions,
payloads) in RAM, using an FST. This postings format does write to disk,
but loads everything into memory when the index is read. The memory
postings format has the following options:

`pack_fst`::
    A boolean option that defines whether the in-memory structure should
    be packed once it is built. Packing reduces the size of the
    data structure in memory but requires more memory during building.
    Default is *false*.

`acceptable_overhead_ratio`::
    The compression ratio, specified as a float, that is used to compress
    internal structures. Example ratios: `0` (Compact, no memory overhead
    at all, but the returned implementation may be slow), `0.5` (Fast, at
    most 50% memory overhead, always select a reasonably fast
    implementation), `7` (Fastest, at most 700% memory overhead, no
    compression). Default is `0.2`.

Type name: `memory`

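
The following fragment is a sketch of how the two options might be combined:
it enables packing and keeps `acceptable_overhead_ratio` at its documented
default. The format name `my_memory` is hypothetical, and the exact value
types accepted (JSON boolean and number versus strings) are an assumption.

[source,js]
--------------------------------------------------
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_memory" : {
                        "type" : "memory",
                        "pack_fst" : true,
                        "acceptable_overhead_ratio" : 0.2
                    }
                }
            }
        }
    }
}
--------------------------------------------------

Enabling `pack_fst` trades extra memory at build time for a smaller resident
structure afterwards, so it is only worth considering when steady-state memory
is the tighter constraint.
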
[float]
[[bloom-postings]]
==== Bloom filter postings format

The bloom filter postings format wraps a delegate postings format and, on
top of it, creates a bloom filter that is written to disk. When the index
is opened, this bloom filter is loaded into memory and used to offer
"fast-fail" reads. This postings format is useful for low doc-frequency
fields such as primary keys. The bloom filter postings format has the
following options:

`delegate`::
    The name of the configured postings format that the bloom filter
    postings format will wrap.

`fpp`::
    The desired false positive probability, specified as a floating point
    number between 0 and 1.0. The `fpp` can be set separately for
    different expected numbers of insertions. For example, with the
    expression *10k=0.01,1m=0.03*, a segment with more than *1m* docs
    uses an fpp of *0.03*, and a segment with more than *10k* docs uses
    *0.01*. The final fallback value is always *0.03*. This example
    expression is also the default.

Type name: `bloom`

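
A possible configuration, sketched under a few assumptions: the names
`my_delegate` and `my_bloom` are hypothetical, the delegate is assumed to be
resolvable when it is declared in the same `postings_format` block, and the
`fpp` value is simply the documented default expression written out.

[source,js]
--------------------------------------------------
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_delegate" : {
                        "type" : "default"
                    },
                    "my_bloom" : {
                        "type" : "bloom",
                        "delegate" : "my_delegate",
                        "fpp" : "10k=0.01,1m=0.03"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

A primary-key style field would then reference `my_bloom`, not the delegate,
in its `postings_format` mapping option.
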
[[codec-bloom-load]]
[TIP]
==================================================

It can sometimes make sense to disable bloom filters. For instance, if you
are logging into an index per day and you have thousands of indices, the
bloom filters can take up a sizable amount of memory. For most queries you
are only interested in recent indices, so you may not mind CRUD operations
on older indices taking slightly longer.

In these cases you can disable loading of the bloom filter on a per-index
basis by updating the index settings:

[source,js]
--------------------------------------------------
PUT /old_index/_settings?index.codec.bloom.load=false
--------------------------------------------------

This setting, which defaults to `true`, can be updated on a live index. Note,
however, that changing the value will cause the index to be reopened, which
will invalidate any existing caches.

==================================================

[float]
[[pulsing-postings]]
==== Pulsing postings format

The pulsing implementation inlines the postings lists of very
low-frequency terms into the term dictionary, which improves lookup
performance for those terms. This postings format offers the following
parameters:

`min_block_size`::
    The minimum block size the default Lucene term dictionary uses to
    encode on-disk blocks. Defaults to *25*.

`max_block_size`::
    The maximum block size the default Lucene term dictionary uses to
    encode on-disk blocks. Defaults to *48*.

`freq_cut_off`::
    The document frequency cutoff at which pulsing inlines postings lists
    into the term dictionary. Terms with a document frequency less than
    or equal to the cutoff will be inlined. The default is *1*.

Type name: `pulsing`

[float]
[[default-postings]]
==== Default postings format

The default postings format has the following options:

`min_block_size`::
    The minimum block size the default Lucene term dictionary uses to
    encode on-disk blocks. Defaults to *25*.

`max_block_size`::
    The maximum block size the default Lucene term dictionary uses to
    encode on-disk blocks. Defaults to *48*.

Type name: `default`

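
For illustration, the fragment below declares a named variant of the default
postings format with both block sizes written out at their documented
defaults. The name `my_default` is hypothetical, and the string-valued
numbers follow the style of the `freq_cut_off` example earlier in this
section.

[source,js]
--------------------------------------------------
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_default" : {
                        "type" : "default",
                        "min_block_size" : "25",
                        "max_block_size" : "48"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

Declaring a named variant like this is only useful when you intend to change
one of the block sizes or to reference the format by name from another
setting.
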
[float]
=== Configuring a custom doc values format

A custom doc values format can be defined in the `codec` part of the index
settings. The `codec` part can be configured when creating an index or
updating index settings. The following example defines a custom doc values
format named `my_format`:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "doc_values_format" : {
                    "my_format" : {
                        "type" : "disk"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------

Then, when defining your mapping, you can use the `my_format` name in the
`doc_values_format` option, as the example below illustrates:

[source,js]
--------------------------------------------------
{
    "product" : {
        "properties" : {
            "price" : {"type" : "integer", "doc_values_format" : "my_format"}
        }
    }
}
--------------------------------------------------

[float]
=== Available doc values formats

[float]
==== Memory doc values format

A doc values format that stores all values in an FST in RAM. This format
does write to disk, but the whole data structure is loaded into memory when
reading the index. The memory doc values format has no options.

Type name: `memory`

[float]
==== Disk doc values format

A doc values format that stores and reads everything from disk. Although it
may be slightly slower than the default doc values format, it requires
almost no memory from the JVM. The disk doc values format has no options.

Type name: `disk`

[float]
==== Default doc values format

The default doc values format tries to strike a good compromise between
speed and memory usage by loading into memory only the data structures that
matter for performance. This makes it a good fit for most use cases. The
default doc values format has no options.

Type name: `default`