|
@@ -1,277 +0,0 @@
|
|
|
-[[index-modules-codec]]
|
|
|
-== Codec module
|
|
|
-
|
|
|
-Codecs define how documents are written to disk and read from disk. The
|
|
|
-postings format is the part of the codec that is responsible for reading
|
|
|
-and writing the term dictionary, postings lists and positions, as well as the payloads
|
|
|
-and offsets stored in the postings list. The doc values format is
|
|
|
-responsible for reading column-stride storage for a field and is typically
|
|
|
-used for sorting or faceting. When a field doesn't have doc values enabled,
|
|
|
-it is still possible to sort or facet by loading field values from the
|
|
|
-inverted index into main memory.
|
|
|
-
|
|
|
-Configuring custom postings or doc values formats is an expert feature and
|
|
|
-most likely using the builtin formats will suit your needs as is described
|
|
|
-in the <<mapping-core-types,mapping section>>.
|
|
|
-
|
|
|
-[WARNING]
|
|
|
-Only the default codec, postings format and doc values format are supported:
|
|
|
-other formats may break backward compatibility between minor versions of
|
|
|
-Elasticsearch, requiring data to be reindexed.
|
|
|
-
|
|
|
-
|
|
|
-[float]
|
|
|
-[[custom-postings]]
|
|
|
-=== Configuring a custom postings format
|
|
|
-
|
|
|
-A custom postings format can be defined in the index settings in the
|
|
|
-`codec` part. The `codec` part can be configured when creating an index
|
|
|
-or updating index settings. An example on how to define your custom
|
|
|
-postings format:
|
|
|
-
|
|
|
-[source,js]
|
|
|
---------------------------------------------------
|
|
|
-curl -XPUT 'http://localhost:9200/twitter/' -d '{
|
|
|
- "settings" : {
|
|
|
- "index" : {
|
|
|
- "codec" : {
|
|
|
- "postings_format" : {
|
|
|
- "my_format" : {
|
|
|
- "type" : "pulsing",
|
|
|
- "freq_cut_off" : "5"
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
-}'
|
|
|
---------------------------------------------------
|
|
|
-
|
|
|
-Then when defining your mapping you can use the `my_format` name in the
|
|
|
-`postings_format` option as the example below illustrates:
|
|
|
-
|
|
|
-[source,js]
|
|
|
---------------------------------------------------
|
|
|
-{
|
|
|
- "person" : {
|
|
|
- "properties" : {
|
|
|
- "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
|
|
|
- }
|
|
|
- }
|
|
|
-}
|
|
|
---------------------------------------------------
|
|
|
-
|
|
|
-[float]
|
|
|
-=== Available postings formats
|
|
|
-
|
|
|
-[float]
|
|
|
-[[direct-postings]]
|
|
|
-==== Direct postings format
|
|
|
-
|
|
|
-Wraps the default postings format for on-disk storage, but then at read
|
|
|
-time loads and stores all terms & postings directly in RAM. This
|
|
|
-postings format makes no effort to compress the terms and posting list
|
|
|
-and therefore is memory intensive, but because of this it gives a
|
|
|
-substantial increase in search performance. Because this holds all term
|
|
|
-bytes as a single byte[], you cannot have more than 2.1GB worth of terms
|
|
|
-in a single segment.
|
|
|
-
|
|
|
-This postings format offers the following parameters:
|
|
|
-
|
|
|
-`min_skip_count`::
|
|
|
- The minimum number terms with a shared prefix to
|
|
|
- allow a skip pointer to be written. The default is *8*.
|
|
|
-
|
|
|
-`low_freq_cutoff`::
|
|
|
- Terms with a lower document frequency use a
|
|
|
- single array object representation for postings and positions. The
|
|
|
- default is *32*.
|
|
|
-
|
|
|
-Type name: `direct`
|
|
|
-
|
|
|
-[float]
|
|
|
-[[memory-postings]]
|
|
|
-==== Memory postings format
|
|
|
-
|
|
|
-A postings format that stores terms & postings (docs, positions,
|
|
|
-payloads) in RAM, using an FST. This postings format does write to disk,
|
|
|
-but loads everything into memory. The memory postings format has the
|
|
|
-following options:
|
|
|
-
|
|
|
-`pack_fst`::
|
|
|
- A boolean option that defines if the in memory structure
|
|
|
- should be packed once its build. Packed will reduce the size for the
|
|
|
- data-structure in memory but requires more memory during building.
|
|
|
- Default is *false*.
|
|
|
-
|
|
|
-`acceptable_overhead_ratio`::
|
|
|
- The compression ratio specified as a
|
|
|
- float, that is used to compress internal structures. Example ratios `0`
|
|
|
- (Compact, no memory overhead at all, but the returned implementation may
|
|
|
- be slow), `0.5` (Fast, at most 50% memory overhead, always select a
|
|
|
- reasonably fast implementation), `7` (Fastest, at most 700% memory
|
|
|
- overhead, no compression). Default is `0.2`.
|
|
|
-
|
|
|
-Type name: `memory`
|
|
|
-
|
|
|
-[float]
|
|
|
-[[bloom-postings]]
|
|
|
-==== Bloom filter posting format
|
|
|
-
|
|
|
-The bloom filter postings format wraps a delegate postings format and on
|
|
|
-top of this creates a bloom filter that is written to disk. During
|
|
|
-opening this bloom filter is loaded into memory and used to offer
|
|
|
-"fast-fail" reads. This postings format is useful for low doc-frequency
|
|
|
-fields such as primary keys. The bloom filter postings format has the
|
|
|
-following options:
|
|
|
-
|
|
|
-`delegate`::
|
|
|
- The name of the configured postings format that the
|
|
|
- bloom filter postings format will wrap.
|
|
|
-
|
|
|
-`fpp`::
|
|
|
- The desired false positive probability specified as a
|
|
|
- floating point number between 0 and 1.0. The `fpp` can be configured for
|
|
|
- multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
|
|
|
- number docs per index segment is larger than *1m* then use *0.03* as fpp
|
|
|
- and if number of docs per segment is larger than *10k* use *0.01* as
|
|
|
- fpp. The last fallback value is always *0.03*. This example expression
|
|
|
- is also the default.
|
|
|
-
|
|
|
-Type name: `bloom`
|
|
|
-
|
|
|
-[[codec-bloom-load]]
|
|
|
-[TIP]
|
|
|
-==================================================
|
|
|
-
|
|
|
-As of 1.4, the bloom filters are no longer loaded at search time by
|
|
|
-default: they consume RAM in proportion to the number of unique terms,
|
|
|
-which can quickly add up for certain use cases, and separate
|
|
|
-performance improvements have made the performance gains with bloom
|
|
|
-filters very small.
|
|
|
-
|
|
|
-You can enable loading of the bloom filter at search time on a
|
|
|
-per-index basis by updating the index settings:
|
|
|
-
|
|
|
-[source,js]
|
|
|
---------------------------------------------------
|
|
|
-PUT /old_index/_settings?index.codec.bloom.load=true
|
|
|
---------------------------------------------------
|
|
|
-
|
|
|
-This setting, which defaults to `false`, can be updated on a live index. Note,
|
|
|
-however, that changing the value will cause the index to be reopened, which
|
|
|
-will invalidate any existing caches.
|
|
|
-
|
|
|
-==================================================
|
|
|
-
|
|
|
-[float]
|
|
|
-[[pulsing-postings]]
|
|
|
-==== Pulsing postings format
|
|
|
-
|
|
|
-The pulsing implementation in-lines the posting lists for very low
|
|
|
-frequent terms in the term dictionary. This is useful to improve lookup
|
|
|
-performance for low-frequent terms. This postings format offers the
|
|
|
-following parameters:
|
|
|
-
|
|
|
-`min_block_size`::
|
|
|
- The minimum block size the default Lucene term
|
|
|
- dictionary uses to encode on-disk blocks. Defaults to *25*.
|
|
|
-
|
|
|
-`max_block_size`::
|
|
|
- The maximum block size the default Lucene term
|
|
|
- dictionary uses to encode on-disk blocks. Defaults to *48*.
|
|
|
-
|
|
|
-`freq_cut_off`::
|
|
|
- The document frequency cut off where pulsing
|
|
|
- in-lines posting lists into the term dictionary. Terms with a document
|
|
|
- frequency less or equal to the cutoff will be in-lined. The default is
|
|
|
- *1*.
|
|
|
-
|
|
|
-Type name: `pulsing`
|
|
|
-
|
|
|
-[float]
|
|
|
-[[default-postings]]
|
|
|
-==== Default postings format
|
|
|
-
|
|
|
-The default postings format has the following options:
|
|
|
-
|
|
|
-`min_block_size`::
|
|
|
- The minimum block size the default Lucene term
|
|
|
- dictionary uses to encode on-disk blocks. Defaults to *25*.
|
|
|
-
|
|
|
-`max_block_size`::
|
|
|
- The maximum block size the default Lucene term
|
|
|
- dictionary uses to encode on-disk blocks. Defaults to *48*.
|
|
|
-
|
|
|
-Type name: `default`
|
|
|
-
|
|
|
-[float]
|
|
|
-=== Configuring a custom doc values format
|
|
|
-
|
|
|
-Custom doc values format can be defined in the index settings in the
|
|
|
-`codec` part. The `codec` part can be configured when creating an index
|
|
|
-or updating index settings. An example on how to define your custom
|
|
|
-doc values format:
|
|
|
-
|
|
|
-[source,js]
|
|
|
---------------------------------------------------
|
|
|
-curl -XPUT 'http://localhost:9200/twitter/' -d '{
|
|
|
- "settings" : {
|
|
|
- "index" : {
|
|
|
- "codec" : {
|
|
|
- "doc_values_format" : {
|
|
|
- "my_format" : {
|
|
|
- "type" : "disk"
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
-}'
|
|
|
---------------------------------------------------
|
|
|
-
|
|
|
-Then we defining your mapping your can use the `my_format` name in the
|
|
|
-`doc_values_format` option as the example below illustrates:
|
|
|
-
|
|
|
-[source,js]
|
|
|
---------------------------------------------------
|
|
|
-{
|
|
|
- "product" : {
|
|
|
- "properties" : {
|
|
|
- "price" : {"type" : "integer", "doc_values_format" : "my_format"}
|
|
|
- }
|
|
|
- }
|
|
|
-}
|
|
|
---------------------------------------------------
|
|
|
-
|
|
|
-[float]
|
|
|
-=== Available doc values formats
|
|
|
-
|
|
|
-[float]
|
|
|
-==== Memory doc values format
|
|
|
-
|
|
|
-A doc values format that stores all values in a FST in RAM. This format does
|
|
|
-write to disk but the whole data-structure is loaded into memory when reading
|
|
|
-the index. The memory postings format has no options.
|
|
|
-
|
|
|
-Type name: `memory`
|
|
|
-
|
|
|
-[float]
|
|
|
-==== Disk doc values format
|
|
|
-
|
|
|
-A doc values format that stores and reads everything from disk. This is
|
|
|
-generally not a good idea to use it as it saves very little memory compared
|
|
|
-to the default doc values format although it can be significantly slower.
|
|
|
-The disk doc values format has no options.
|
|
|
-
|
|
|
-Type name: `disk`
|
|
|
-
|
|
|
-[float]
|
|
|
-==== Default doc values format
|
|
|
-
|
|
|
-The default doc values format tries to make a good compromise between speed and
|
|
|
-memory usage by only loading into memory data-structures that matter for
|
|
|
-performance. This makes this doc values format a good fit for most use-cases.
|
|
|
-The default doc values format has no options.
|
|
|
-
|
|
|
-Type name: `default`
|