- [[index-modules-codec]]
- == Codec module
- Codecs define how documents are written to and read from disk. The
- postings format is the part of the codec that is responsible for reading
- and writing the term dictionary, postings lists and positions, payloads
- and offsets stored in the postings list.
- Configuring custom postings formats is an expert feature; the built-in
- postings formats described in the
- <<mapping-core-types,mapping section>> will most likely suit your needs.
- [float]
- === Configuring a custom postings format
- A custom postings format can be defined in the `codec` part of the index
- settings. The `codec` part can be configured when creating an index
- or when updating index settings. An example of how to define a custom
- postings format:
- [source,js]
- --------------------------------------------------
- curl -XPUT 'http://localhost:9200/twitter/' -d '{
-   "settings" : {
-     "index" : {
-       "codec" : {
-         "postings_format" : {
-           "my_format" : {
-             "type" : "pulsing",
-             "freq_cut_off" : "5"
-           }
-         }
-       }
-     }
-   }
- }'
- --------------------------------------------------
- Then, when defining your mapping, you can use the `my_format` name in the
- `postings_format` option, as the example below illustrates:
- [source,js]
- --------------------------------------------------
- {
-   "person" : {
-     "properties" : {
-       "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
-     }
-   }
- }
- --------------------------------------------------
- [float]
- === Available postings formats
- [float]
- ==== Direct postings format
- Wraps the default postings format for on-disk storage, but at read
- time loads and stores all terms and postings directly in RAM. This
- postings format makes no effort to compress the terms and postings lists
- and is therefore memory intensive, but in exchange it gives a
- substantial increase in search performance. Because all term bytes are
- held in a single byte[], you cannot have more than 2.1GB worth of terms
- in a single segment.
- This postings format offers the following parameters:
- `min_skip_count`::
- The minimum number of terms with a shared prefix required to
- allow a skip pointer to be written. The default is *8*.
- `low_freq_cutoff`::
- Terms with a document frequency at or below this cutoff use a
- single array object representation for postings and positions. The
- default is *32*.
- Type name: `direct`
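- As a minimal sketch, and assuming the same settings layout as in the
- custom postings format example above, a `direct` format could be
- configured as follows. The name `my_direct` is hypothetical, and the
- values shown are simply the documented defaults, written as strings to
- mirror the earlier example:
- [source,js]
- --------------------------------------------------
- {
-   "settings" : {
-     "index" : {
-       "codec" : {
-         "postings_format" : {
-           "my_direct" : {
-             "type" : "direct",
-             "min_skip_count" : "8",
-             "low_freq_cutoff" : "32"
-           }
-         }
-       }
-     }
-   }
- }
- --------------------------------------------------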
- [float]
- ==== Memory postings format
- A postings format that stores terms & postings (docs, positions,
- payloads) in RAM, using an FST. This postings format does write to disk,
- but loads everything into memory. The memory postings format has the
- following options:
- `pack_fst`::
- A boolean option that defines whether the in-memory structure
- should be packed once it is built. Packing reduces the size of the
- data structure in memory but requires more memory during building.
- Default is *false*.
- `acceptable_overhead_ratio`::
- The compression ratio, specified as a
- float, that is used to compress internal structures. Example ratios: `0`
- (Compact, no memory overhead at all, but the returned implementation may
- be slow), `0.5` (Fast, at most 50% memory overhead, always select a
- reasonably fast implementation), `7` (Fastest, at most 700% memory
- overhead, no compression). Default is `0.2`.
- Type name: `memory`
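- A sketch of a `memory` format definition, assuming the same settings
- layout as in the custom postings format example above. The name
- `my_memory` is hypothetical, and the option values are illustrative
- choices rather than recommendations:
- [source,js]
- --------------------------------------------------
- {
-   "settings" : {
-     "index" : {
-       "codec" : {
-         "postings_format" : {
-           "my_memory" : {
-             "type" : "memory",
-             "pack_fst" : "true",
-             "acceptable_overhead_ratio" : "0.5"
-           }
-         }
-       }
-     }
-   }
- }
- --------------------------------------------------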
- [float]
- ==== Bloom filter postings format
- The bloom filter postings format wraps a delegate postings format and, on
- top of it, creates a bloom filter that is written to disk. On
- opening, this bloom filter is loaded into memory and used to offer
- "fast-fail" reads. This postings format is useful for low doc-frequency
- fields such as primary keys. The bloom filter postings format has the
- following options:
- `delegate`::
- The name of the configured postings format that the
- bloom filter postings format will wrap.
- `fpp`::
- The desired false positive probability, specified as a
- floating point number between 0 and 1.0. The `fpp` can be configured for
- multiple expected insertion counts. Example expression: *10k=0.01,1m=0.03*.
- With this expression, *0.03* is used as the fpp if the number of docs per
- index segment is larger than *1m*, and *0.01* is used if the number of
- docs per segment is larger than *10k*. The last fallback value is always
- *0.03*. This example expression is also the default.
- Type name: `bloom`
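- A sketch of a `bloom` format definition, assuming the same settings
- layout as in the custom postings format example above. The name
- `my_bloom` is hypothetical, and referencing the built-in `default`
- format as the delegate is an assumption. The `fpp` expression is the
- documented default:
- [source,js]
- --------------------------------------------------
- {
-   "settings" : {
-     "index" : {
-       "codec" : {
-         "postings_format" : {
-           "my_bloom" : {
-             "type" : "bloom",
-             "delegate" : "default",
-             "fpp" : "10k=0.01,1m=0.03"
-           }
-         }
-       }
-     }
-   }
- }
- --------------------------------------------------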
- [float]
- ==== Pulsing postings format
- The pulsing implementation in-lines the postings lists of very
- low-frequency terms into the term dictionary. This is useful to improve
- lookup performance for low-frequency terms. This postings format offers
- the following parameters:
- `min_block_size`::
- The minimum block size the default Lucene term
- dictionary uses to encode on-disk blocks. Defaults to *25*.
- `max_block_size`::
- The maximum block size the default Lucene term
- dictionary uses to encode on-disk blocks. Defaults to *48*.
- `freq_cut_off`::
- The document frequency cutoff below which pulsing
- in-lines postings lists into the term dictionary. Terms with a document
- frequency less than or equal to the cutoff will be in-lined. The default is
- *1*.
- Type name: `pulsing`
- [float]
- ==== Default postings format
- The default postings format has the following options:
- `min_block_size`::
- The minimum block size the default Lucene term
- dictionary uses to encode on-disk blocks. Defaults to *25*.
- `max_block_size`::
- The maximum block size the default Lucene term
- dictionary uses to encode on-disk blocks. Defaults to *48*.
- Type name: `default`
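- A sketch of how the block sizes of the default format might be tuned,
- assuming the same settings layout as in the custom postings format
- example above. The name `my_default` is hypothetical, and the values
- shown are simply the documented defaults:
- [source,js]
- --------------------------------------------------
- {
-   "settings" : {
-     "index" : {
-       "codec" : {
-         "postings_format" : {
-           "my_default" : {
-             "type" : "default",
-             "min_block_size" : "25",
-             "max_block_size" : "48"
-           }
-         }
-       }
-     }
-   }
- }
- --------------------------------------------------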