codec.asciidoc 5.6 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172
  1. [[index-modules-codec]]
  2. == Codec module
  3. Codecs define how documents are written to disk and read from disk. The
  4. postings format is the part of the codec that responsible for reading
  5. and writing the term dictionary, postings lists and positions, payloads
  6. and offsets stored in the postings list.
  7. Configuring custom postings formats is an expert feature and most likely
  8. using the builtin postings formats will suite your needs as is described
  9. in the <<mapping-core-types,mapping section>>
  10. [float]
  11. [[custom-postings]]
  12. === Configuring a custom postings format
  13. Custom postings format can be defined in the index settings in the
  14. `codec` part. The `codec` part can be configured when creating an index
  15. or updating index settings. An example on how to define your custom
  16. postings format:
  17. [source,js]
  18. --------------------------------------------------
  19. curl -XPUT 'http://localhost:9200/twitter/' -d '{
  20. "settings" : {
  21. "index" : {
  22. "codec" : {
  23. "postings_format" : {
  24. "my_format" : {
  25. "type" : "pulsing",
  26. "freq_cut_off" : "5"
  27. }
  28. }
  29. }
  30. }
  31. }
  32. }'
  33. --------------------------------------------------
  34. Then we defining your mapping your can use the `my_format` name in the
  35. `postings_format` option as the example below illustrates:
  36. [source,js]
  37. --------------------------------------------------
  38. {
  39. "person" : {
  40. "properties" : {
  41. "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
  42. }
  43. }
  44. }
  45. --------------------------------------------------
  46. [float]
  47. === Available postings formats
  48. [float]
  49. [[direct-postings]]
  50. ==== Direct postings format
  51. Wraps the default postings format for on-disk storage, but then at read
  52. time loads and stores all terms & postings directly in RAM. This
  53. postings format makes no effort to compress the terms and posting list
  54. and therefore is memory intensive, but because of this it gives a
  55. substantial increase in search performance. Because this holds all term
  56. bytes as a single byte[], you cannot have more than 2.1GB worth of terms
  57. in a single segment.
  58. This postings format offers the following parameters:
  59. `min_skip_count`::
  60. The minimum number terms with a shared prefix to
  61. allow a skip pointer to be written. The default is *8*.
  62. `low_freq_cutoff`::
  63. Terms with a lower document frequency use a
  64. single array object representation for postings and positions. The
  65. default is *32*.
  66. Type name: `direct`
  67. [float]
  68. [[memory-postings]]
  69. ==== Memory postings format
  70. A postings format that stores terms & postings (docs, positions,
  71. payloads) in RAM, using an FST. This postings format does write to disk,
  72. but loads everything into memory. The memory postings format has the
  73. following options:
  74. `pack_fst`::
  75. A boolean option that defines if the in memory structure
  76. should be packed once its build. Packed will reduce the size for the
  77. data-structure in memory but requires more memory during building.
  78. Default is *false*.
  79. `acceptable_overhead_ratio`::
  80. The compression ratio specified as a
  81. float, that is used to compress internal structures. Example ratios `0`
  82. (Compact, no memory overhead at all, but the returned implementation may
  83. be slow), `0.5` (Fast, at most 50% memory overhead, always select a
  84. reasonably fast implementation), `7` (Fastest, at most 700% memory
  85. overhead, no compression). Default is `0.2`.
  86. Type name: `memory`
  87. [float]
  88. [[bloom-postings]]
  89. ==== Bloom filter posting format
  90. The bloom filter postings format wraps a delegate postings format and on
  91. top of this creates a bloom filter that is written to disk. During
  92. opening this bloom filter is loaded into memory and used to offer
  93. "fast-fail" reads. This postings format is useful for low doc-frequency
  94. fields such as primary keys. The bloom filter postings format has the
  95. following options:
  96. `delegate`::
  97. The name of the configured postings format that the
  98. bloom filter postings format will wrap.
  99. `fpp`::
  100. The desired false positive probability specified as a
  101. floating point number between 0 and 1.0. The `fpp` can be configured for
  102. multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
  103. number docs per index segment is larger than *1m* then use *0.03* as fpp
  104. and if number of docs per segment is larger than *10k* use *0.01* as
  105. fpp. The last fallback value is always *0.03*. This example expression
  106. is also the default.
  107. Type name: `bloom`
  108. [float]
  109. [[pulsing-postings]]
  110. ==== Pulsing postings format
  111. The pulsing implementation in-lines the posting lists for very low
  112. frequent terms in the term dictionary. This is useful to improve lookup
  113. performance for low-frequent terms. This postings format offers the
  114. following parameters:
  115. `min_block_size`::
  116. The minimum block size the default Lucene term
  117. dictionary uses to encode on-disk blocks. Defaults to *25*.
  118. `max_block_size`::
  119. The maximum block size the default Lucene term
  120. dictionary uses to encode on-disk blocks. Defaults to *48*.
  121. `freq_cut_off`::
  122. The document frequency cut off where pulsing
  123. in-lines posting lists into the term dictionary. Terms with a document
  124. frequency less or equal to the cutoff will be in-lined. The default is
  125. *1*.
  126. Type name: `pulsing`
  127. [float]
  128. [[default-postings]]
  129. ==== Default postings format
  130. The default postings format has the following options:
  131. `min_block_size`::
  132. The minimum block size the default Lucene term
  133. dictionary uses to encode on-disk blocks. Defaults to *25*.
  134. `max_block_size`::
  135. The maximum block size the default Lucene term
  136. dictionary uses to encode on-disk blocks. Defaults to *48*.
  137. Type name: `default`