123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325 |
- [[synthetic-source]]
- ==== Synthetic `_source`
- NOTE: This feature requires a https://www.elastic.co/subscriptions[subscription].
- Though very handy to have around, the source field takes up a significant amount
- of space on disk. Instead of storing source documents on disk exactly as you
- send them, Elasticsearch can reconstruct source content on the fly upon retrieval.
- To enable this feature, use the value `synthetic` for the index setting `index.mapping.source.mode`:
- [source,console,id=enable-synthetic-source-example]
- ----
- PUT idx
- {
- "settings": {
- "index": {
- "mapping": {
- "source": {
- "mode": "synthetic"
- }
- }
- }
- }
- }
- ----
- // TESTSETUP
- While this on-the-fly reconstruction is _generally_ slower than saving the source
- documents verbatim and loading them at query time, it saves a lot of storage
- space. Additional latency can be avoided by not loading `_source` field in queries when it is not needed.
- [[synthetic-source-fields]]
- ===== Supported fields
- Synthetic `_source` is supported by all field types. Depending on implementation details, field types have different
- properties when used with synthetic `_source`.
- <<synthetic-source-fields-native-list, Most field types>> construct synthetic `_source` using existing data, most
- commonly <<doc-values,`doc_values`>> and <<stored-fields, stored fields>>. For these field types, no additional space
- is needed to store the contents of `_source` field. Due to the storage layout of <<doc-values,`doc_values`>>, the
- generated `_source` field undergoes <<synthetic-source-modifications, modifications>> compared to the original document.
- For all other field types, the original value of the field is stored as is, in the same way as the `_source` field in
- non-synthetic mode. In this case there are no modifications and field data in `_source` is the same as in the original
- document. Similarly, malformed values of fields that use <<ignore-malformed,`ignore_malformed`>> or
- <<ignore-above,`ignore_above`>> need to be stored as is. This approach is less storage efficient since data needed for
- `_source` reconstruction is stored in addition to other data required to index the field (like `doc_values`).
- [[synthetic-source-restrictions]]
- ===== Synthetic `_source` restrictions
- Some field types have additional restrictions. These restrictions are documented in the **synthetic `_source`** section
- of the field type's <<mapping-types,documentation>>.
- Synthetic source is not supported in <<snapshots-source-only-repository,source-only>> snapshot repositories. To store indexes that use synthetic `_source`, choose a different repository type.
- [[synthetic-source-modifications]]
- ===== Synthetic `_source` modifications
- When synthetic `_source` is enabled, retrieved documents undergo some
- modifications compared to the original JSON.
- [[synthetic-source-modifications-leaf-arrays]]
- ====== Arrays moved to leaf fields
- Synthetic `_source` arrays are moved to leaves. For example:
- [source,console,id=synthetic-source-leaf-arrays-example]
- ----
- PUT idx/_doc/1
- {
- "foo": [
- {
- "bar": 1
- },
- {
- "bar": 2
- }
- ]
- }
- ----
- // TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
- Will become:
- [source,console-result]
- ----
- {
- "foo": {
- "bar": [1, 2]
- }
- }
- ----
- // TEST[s/^/{"_source":/ s/\n$/}/]
- This can cause some arrays to vanish:
- [source,console,id=synthetic-source-leaf-arrays-example-sneaky]
- ----
- PUT idx/_doc/1
- {
- "foo": [
- {
- "bar": 1
- },
- {
- "baz": 2
- }
- ]
- }
- ----
- // TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
- Will become:
- [source,console-result]
- ----
- {
- "foo": {
- "bar": 1,
- "baz": 2
- }
- }
- ----
- // TEST[s/^/{"_source":/ s/\n$/}/]
- [[synthetic-source-modifications-field-names]]
- ====== Fields named as they are mapped
- Synthetic source names fields as they are named in the mapping. When used
- with <<dynamic,dynamic mapping>>, fields with dots (`.`) in their names are, by
- default, interpreted as multiple objects, while dots in field names are
- preserved within objects that have <<subobjects>> disabled. For example:
- [source,console,id=synthetic-source-objecty-example]
- ----
- PUT idx/_doc/1
- {
- "foo.bar.baz": 1
- }
- ----
- // TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
- Will become:
- [source,console-result]
- ----
- {
- "foo": {
- "bar": {
- "baz": 1
- }
- }
- }
- ----
- // TEST[s/^/{"_source":/ s/\n$/}/]
- This impacts how source contents can be referenced in <<modules-scripting-using,scripts>>. For instance, referencing
- a script in its original source form will return null:
- [source,js]
- ----
- "script": { "source": """ emit(params._source['foo.bar.baz']) """ }
- ----
- // NOTCONSOLE
- Instead, source references need to be in line with the mapping structure:
- [source,js]
- ----
- "script": { "source": """ emit(params._source['foo']['bar']['baz']) """ }
- ----
- // NOTCONSOLE
- or simply
- [source,js]
- ----
- "script": { "source": """ emit(params._source.foo.bar.baz) """ }
- ----
- // NOTCONSOLE
- The following <<modules-scripting-fields, field APIs>> are preferable as, in addition to being agnostic to the
- mapping structure, they make use of docvalues if available and fall back to synthetic source only when needed. This
- reduces source synthesizing, a slow and costly operation.
- [source,js]
- ----
- "script": { "source": """ emit(field('foo.bar.baz').get(null)) """ }
- "script": { "source": """ emit($('foo.bar.baz', null)) """ }
- ----
- // NOTCONSOLE
- [[synthetic-source-modifications-alphabetical]]
- ====== Alphabetical sorting
- Synthetic `_source` fields are sorted alphabetically. The
- https://www.rfc-editor.org/rfc/rfc7159.html[JSON RFC] defines objects as
- "an unordered collection of zero or more name/value pairs" so applications
- shouldn't care but without synthetic `_source` the original ordering is
- preserved and some applications may, counter to the spec, do something with
- that ordering.
- [[synthetic-source-modifications-ranges]]
- ====== Representation of ranges
- Range field values (e.g. `long_range`) are always represented as inclusive on both sides with bounds adjusted
- accordingly. See <<range-synthetic-source-inclusive, examples>>.
- [[synthetic-source-precision-loss-for-point-types]]
- ====== Reduced precision of `geo_point` values
- Values of `geo_point` fields are represented in synthetic `_source` with reduced precision. See
- <<geo-point-synthetic-source, examples>>.
- [[synthetic-source-keep]]
- ====== Minimizing source modifications
- It is possible to avoid synthetic source modifications for a particular object or field, at extra storage cost.
- This is controlled through param `synthetic_source_keep` with the following option:
- - `none`: synthetic source diverges from the original source as described above (default).
- - `arrays`: arrays of the corresponding field or object preserve the original element ordering and duplicate elements.
- The synthetic source fragment for such arrays is not guaranteed to match the original source exactly, e.g. array
- `[1, 2, [5], [[4, [3]]], 5]` may appear as-is or in an equivalent format like `[1, 2, 5, 4, 3, 5]`. The exact format
- may change in the future, in an effort to reduce the storage overhead of this option.
- - `all`: the source for both singleton instances and arrays of the corresponding field or object gets recorded. When
- applied to objects, the source of all sub-objects and sub-fields gets captured. Furthermore, the original source of
- arrays gets captured and appears in synthetic source with no modifications.
- For instance:
- [source,console,id=create-index-with-synthetic-source-keep]
- ----
- PUT idx_keep
- {
- "settings": {
- "index": {
- "mapping": {
- "source": {
- "mode": "synthetic"
- }
- }
- }
- },
- "mappings": {
- "properties": {
- "path": {
- "type": "object",
- "synthetic_source_keep": "all"
- },
- "ids": {
- "type": "integer",
- "synthetic_source_keep": "arrays"
- }
- }
- }
- }
- ----
- // TEST
- [source,console,id=synthetic-source-keep-example]
- ----
- PUT idx_keep/_doc/1
- {
- "path": {
- "to": [
- { "foo": [3, 2, 1] },
- { "foo": [30, 20, 10] }
- ],
- "bar": "baz"
- },
- "ids": [ 200, 100, 300, 100 ]
- }
- ----
- // TEST[s/$/\nGET idx_keep\/_doc\/1?filter_path=_source\n/]
- returns the original source, with no array deduplication and sorting:
- [source,console-result]
- ----
- {
- "path": {
- "to": [
- { "foo": [3, 2, 1] },
- { "foo": [30, 20, 10] }
- ],
- "bar": "baz"
- },
- "ids": [ 200, 100, 300, 100 ]
- }
- ----
- // TEST[s/^/{"_source":/ s/\n$/}/]
- The option for capturing the source of arrays can be applied at index level, by setting
- `index.mapping.synthetic_source_keep` to `arrays`. This applies to all objects and fields in the index, except for
- the ones with explicit overrides of `synthetic_source_keep` set to `none`. In this case, the storage overhead grows
- with the number and sizes of arrays present in source of each document, naturally.
- [[synthetic-source-fields-native-list]]
- ===== Field types that support synthetic source with no storage overhead
- The following field types support synthetic source using data from <<doc-values,`doc_values`>> or
- <<stored-fields, stored fields>>, and require no additional storage space to construct the `_source` field.
- NOTE: If you enable the <<ignore-malformed,`ignore_malformed`>> or <<ignore-above,`ignore_above`>> settings, then
- additional storage is required to store ignored field values for these types.
- ** <<aggregate-metric-double-synthetic-source, `aggregate_metric_double`>>
- ** {plugins}/mapper-annotated-text-usage.html#annotated-text-synthetic-source[`annotated-text`]
- ** <<binary-synthetic-source,`binary`>>
- ** <<boolean-synthetic-source,`boolean`>>
- ** <<numeric-synthetic-source,`byte`>>
- ** <<date-synthetic-source,`date`>>
- ** <<date-nanos-synthetic-source,`date_nanos`>>
- ** <<dense-vector-synthetic-source,`dense_vector`>>
- ** <<numeric-synthetic-source,`double`>>
- ** <<flattened-synthetic-source, `flattened`>>
- ** <<numeric-synthetic-source,`float`>>
- ** <<geo-point-synthetic-source,`geo_point`>>
- ** <<numeric-synthetic-source,`half_float`>>
- ** <<histogram-synthetic-source,`histogram`>>
- ** <<numeric-synthetic-source,`integer`>>
- ** <<ip-synthetic-source,`ip`>>
- ** <<keyword-synthetic-source,`keyword`>>
- ** <<numeric-synthetic-source,`long`>>
- ** <<range-synthetic-source,`range` types>>
- ** <<numeric-synthetic-source,`scaled_float`>>
- ** <<numeric-synthetic-source,`short`>>
- ** <<text-synthetic-source,`text`>>
- ** <<version-synthetic-source,`version`>>
- ** <<wildcard-synthetic-source,`wildcard`>>
|