Docs for synthetic source (#87416)

This adds some basic docs for synthetic source both to get us started
documenting it and to show how I'd like to get it documented - with a
central section in the docs for `_source` and "satellite" sections in
each of the supported field types that link back to the central section.

[Preview](https://elasticsearch_87416.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/mapping-source-field.html#synthetic-source)
Nik Everett 3 years ago
commit b18bafb207

+ 11 - 0
docs/reference/mapping/fields/source-field.asciidoc

@@ -6,6 +6,17 @@ at index time. The `_source` field itself is not indexed (and thus is not
 searchable), but it is stored so that it can be returned when executing
 _fetch_ requests, like <<docs-get,get>> or <<search-search,search>>.
 
+ifeval::["{release-state}"=="unreleased"]
+If disk usage is important to you, consider
+<<synthetic-source,synthetic `_source`>>, which shrinks disk usage at the cost of
+slower fetches and support for only a subset of mappings, or (not recommended)
+<<disable-source-field,disabling the `_source` field>>, which also shrinks disk
+usage but disables many features.
+
+include::synthetic-source.asciidoc[]
+endif::[]
+
+
 [[disable-source-field]]
 ==== Disabling the `_source` field
 

+ 120 - 0
docs/reference/mapping/fields/synthetic-source.asciidoc

@@ -0,0 +1,120 @@
+[[synthetic-source]]
+==== Synthetic `_source`
+
+Though very handy to have around, the source field takes up a significant amount
+of space on disk. Instead of storing source documents on disk exactly as you
+send them, Elasticsearch can reconstruct source content on the fly upon retrieval.
+Enable this by setting `synthetic: true` in `_source`:
+
+[source,console,id=enable-synthetic-source-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": {
+      "synthetic": true
+    }
+  }
+}
+----
+// TESTSETUP
+
+While this on-the-fly reconstruction is *generally* slower than saving the source
+documents verbatim and loading them at query time, it saves a lot of storage
+space. There are a couple of restrictions to be aware of:
+
+* When you retrieve synthetic `_source` content it undergoes minor
+<<synthetic-source-modifications,modifications>> compared to the original JSON.
+* Synthetic `_source` can be used with indices that contain only these field
+types:
+
+** <<boolean-synthetic-source,`boolean`>>
+** <<numeric-synthetic-source,`byte`>>
+** <<numeric-synthetic-source,`double`>>
+** <<numeric-synthetic-source,`float`>>
+** <<geo-point-synthetic-source,`geo_point`>>
+** <<numeric-synthetic-source,`half_float`>>
+** <<numeric-synthetic-source,`integer`>>
+** <<ip-synthetic-source,`ip`>>
+** <<keyword-synthetic-source,`keyword`>>
+** <<numeric-synthetic-source,`long`>>
+** <<numeric-synthetic-source,`scaled_float`>>
+** <<numeric-synthetic-source,`short`>>
+** <<text-synthetic-source,`text`>> (with a `keyword` sub-field)
+
+[[synthetic-source-modifications]]
+===== Synthetic source modifications
+
+When synthetic `_source` is enabled, retrieved documents undergo some
+modifications compared to the original JSON.
+
+[[synthetic-source-modifications-leaf-arrays]]
+====== Arrays moved to leaf fields
+Synthetic `_source` moves arrays of objects down to their leaf fields. For example:
+
+[source,console,id=synthetic-source-leaf-arrays-example]
+----
+PUT idx/_doc/1
+{
+  "foo": [
+    {
+      "bar": 1
+    },
+    {
+      "bar": 2
+    }
+  ]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+
+[source,console-result]
+----
+{
+  "foo": {
+    "bar": [1, 2]
+  }
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+[[synthetic-source-modifications-field-names]]
+====== Fields named as they are mapped
+Synthetic source names fields as they are named in the mapping. When used
+with <<dynamic,dynamic mapping>>, fields with dots (`.`) in their names are, by
+default, interpreted as multiple objects, while dots in field names are
+preserved within objects that have <<subobjects>> disabled. For example:
+
+[source,console,id=synthetic-source-objecty-example]
+----
+PUT idx/_doc/1
+{
+  "foo.bar.baz": 1
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+
+[source,console-result]
+----
+{
+  "foo": {
+    "bar": {
+      "baz": 1
+    }
+  }
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+[[synthetic-source-modifications-alphabetical]]
+====== Alphabetical sorting
+Synthetic `_source` fields are sorted alphabetically. The
+https://www.rfc-editor.org/rfc/rfc7159.html[JSON RFC] defines objects as
+"an unordered collection of zero or more name/value pairs", so applications
+shouldn't care. But without synthetic `_source` the original ordering is
+preserved, and some applications may, counter to the spec, do something with
+that ordering.
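
Note on the three modifications described in this file: they can be modeled in a few lines of Python. This is only a sketch of the observable behavior, not how Elasticsearch implements it. The function names `collect_leaves` and `model_synthetic_source` are made up for illustration, per-type value sorting is not modeled, and a field that is both a leaf and an object (a mapping conflict) is not handled.

```python
def collect_leaves(value, path, leaves):
    """Gather every scalar under the dotted path of the field that holds it."""
    if isinstance(value, dict):
        for key, child in value.items():
            # A dotted key such as "foo.bar" is treated as nested objects.
            collect_leaves(child, path + key.split("."), leaves)
    elif isinstance(value, list):
        for item in value:
            # Array positions are dropped; only the field path is kept.
            collect_leaves(item, path, leaves)
    else:
        leaves.setdefault(tuple(path), []).append(value)


def model_synthetic_source(doc):
    """Rebuild a document from its leaf values, fields in alphabetical order."""
    leaves = {}
    collect_leaves(doc, [], leaves)
    root = {}
    for path in sorted(leaves):
        node = root
        for part in path[:-1]:
            node = node.setdefault(part, {})
        values = leaves[path]
        # A single value comes back as a scalar, several as an array.
        node[path[-1]] = values[0] if len(values) == 1 else values
    return root


print(model_synthetic_source({"foo": [{"bar": 1}, {"bar": 2}]}))
# {'foo': {'bar': [1, 2]}}
print(model_synthetic_source({"foo.bar.baz": 1}))
# {'foo': {'bar': {'baz': 1}}}
print(list(model_synthetic_source({"b": 1, "a": 2})))
# ['a', 'b']
```

The model rebuilds the document purely from (field path, values) pairs, which is why array structure above the leaves and the original key order cannot survive.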

+ 36 - 0
docs/reference/mapping/types/boolean.asciidoc

@@ -214,3 +214,39 @@ The following parameters are accepted by `boolean` fields:
 <<mapping-field-meta,`meta`>>::
 
     Metadata about the field.
+
+ifeval::["{release-state}"=="unreleased"]
+[[boolean-synthetic-source]]
+==== Synthetic source
+`boolean` fields support <<synthetic-source,synthetic `_source`>> in their
+default configuration. Synthetic `_source` cannot be used together with
+<<copy-to,`copy_to`>> or with <<doc-values,`doc_values`>> disabled.
+
+Synthetic source always sorts `boolean` fields. For example:
+[source,console,id=synthetic-source-boolean-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "synthetic": true },
+    "properties": {
+      "bool": { "type": "boolean" }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "bool": [true, false, true, false]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "bool": [false, false, true, true]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+endif::[]

+ 44 - 0
docs/reference/mapping/types/geo-point.asciidoc

@@ -203,3 +203,47 @@ For performance reasons, it is better to access the lat/lon values directly:
 def lat      = doc['location'].lat;
 def lon      = doc['location'].lon;
 --------------------------------------------------
+
+ifeval::["{release-state}"=="unreleased"]
+[[geo-point-synthetic-source]]
+==== Synthetic source
+`geo_point` fields support <<synthetic-source,synthetic `_source`>> in their
+default configuration. Synthetic `_source` cannot be used together with
+<<ignore-malformed,`ignore_malformed`>>, <<copy-to,`copy_to`>>, or with
+<<doc-values,`doc_values`>> disabled.
+
+Synthetic source always sorts `geo_point` fields (first by latitude and then
+longitude) and reduces them to their stored precision. For example:
+[source,console,id=synthetic-source-geo-point-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "synthetic": true },
+    "properties": {
+      "point": { "type": "geo_point" }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "point": [
+    {"lat":-90, "lon":-80},
+    {"lat":10, "lon":30}
+  ]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "point": [
+    {"lat":-90.0, "lon":-80.00000000931323},
+    {"lat":9.999999990686774, "lon":29.999999972060323}
+  ]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+endif::[]
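
Note on "reduces them to their stored precision": the values in the example response are consistent with each coordinate being quantized to a 32-bit grid. The Python sketch below assumes a step of 180/2^32 degrees for latitude and 360/2^32 for longitude, rounded down. Those constants and the names `quantize` and `model_geo_points` are assumptions made for illustration, and the boundary case of latitude 90 or longitude 180 is not handled.

```python
import math

# Assumed grid steps: the full coordinate range divided into 2**32 cells.
LAT_STEP = 180.0 / 2**32
LON_STEP = 360.0 / 2**32


def quantize(value, step):
    """Snap a coordinate down to the nearest multiple of the grid step."""
    return math.floor(value / step) * step


def model_geo_points(points):
    """Quantize each point, then sort by latitude and then longitude."""
    stored = [
        (quantize(p["lat"], LAT_STEP), quantize(p["lon"], LON_STEP))
        for p in points
    ]
    return [{"lat": lat, "lon": lon} for lat, lon in sorted(stored)]


for point in model_geo_points([{"lat": 10, "lon": 30}, {"lat": -90, "lon": -80}]):
    print(point)
```

Under these assumptions, `{"lat": 10, "lon": 30}` comes back slightly below its input values and the point with latitude -90 sorts first, as in the response above.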

+ 44 - 0
docs/reference/mapping/types/ip.asciidoc

@@ -156,3 +156,47 @@ GET my-index-000001/_search
   }
 }
 --------------------------------------------------
+
+ifeval::["{release-state}"=="unreleased"]
+[[ip-synthetic-source]]
+==== Synthetic source
+`ip` fields support <<synthetic-source,synthetic `_source`>> in their default
+configuration. Synthetic `_source` cannot be used together with
+<<ignore-malformed,`ignore_malformed`>>, <<copy-to,`copy_to`>>, or with
+<<doc-values,`doc_values`>> disabled.
+
+Synthetic source always sorts `ip` fields and removes duplicates. For example:
+[source,console,id=synthetic-source-ip-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "synthetic": true },
+    "properties": {
+      "ip": { "type": "ip" }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "ip": ["192.168.0.1", "192.168.0.1", "10.10.12.123",
+         "2001:db8::1:0:0:1", "::afff:4567:890a"]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+
+[source,console-result]
+----
+{
+  "ip": ["::afff:4567:890a", "10.10.12.123", "192.168.0.1", "2001:db8::1:0:0:1"]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+NOTE: IPv4 addresses are sorted as though they were IPv6 addresses prefixed by
+      `::ffff:0:0:0/96` as specified by
+      https://datatracker.ietf.org/doc/html/rfc6144[RFC 6144].
+
+endif::[]
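
Note on the `ip` ordering: the example response can be reproduced by comparing every address as a 128-bit number, with IPv4 addresses first embedded in IPv6 space. The sketch below uses Python's standard `ipaddress` module and assumes the IPv4-mapped form `::ffff:a.b.c.d` for the embedding. That choice, and the names `sort_key` and `model_ip_values`, are assumptions for illustration rather than a statement of how Elasticsearch stores the values.

```python
import ipaddress


def sort_key(text):
    """Return the address as a 128-bit integer, embedding IPv4 in IPv6 space."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        # Assumed embedding: ::ffff:a.b.c.d (IPv4-mapped IPv6 address).
        return (0xFFFF << 32) | int(addr)
    return int(addr)


def model_ip_values(values):
    """Drop duplicate addresses and sort the rest numerically."""
    unique = {sort_key(v): v for v in values}
    return [unique[key] for key in sorted(unique)]


print(model_ip_values([
    "192.168.0.1", "192.168.0.1", "10.10.12.123",
    "2001:db8::1:0:0:1", "::afff:4567:890a",
]))
# ['::afff:4567:890a', '10.10.12.123', '192.168.0.1', '2001:db8::1:0:0:1']
```

Deduplicating by numeric key rather than by string means two spellings of the same address would also collapse into one entry in this model.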

+ 40 - 0
docs/reference/mapping/types/keyword.asciidoc

@@ -173,6 +173,46 @@ Dimension fields have the following constraints:
 ====
 --
 
+ifeval::["{release-state}"=="unreleased"]
+[[keyword-synthetic-source]]
+==== Synthetic source
+`keyword` fields support <<synthetic-source,synthetic `_source`>> in their
+default configuration. Synthetic `_source` cannot be used together with
+<<ignore-above,`ignore_above`>>, a <<normalizer,`normalizer`>>,
+<<copy-to,`copy_to`>>, or with <<doc-values,`doc_values`>> disabled.
+
+Synthetic source always sorts `keyword` fields and removes duplicates. For
+example:
+[source,console,id=synthetic-source-keyword-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "synthetic": true },
+    "properties": {
+      "kwd": { "type": "keyword" }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "kwd": ["foo", "foo", "bar", "baz"]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+
+[source,console-result]
+----
+{
+  "kwd": ["bar", "baz", "foo"]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+endif::[]
+
 include::constant-keyword.asciidoc[]
 
 include::wildcard.asciidoc[]
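
Note on the `keyword` behavior: sorting plus duplicate removal is what one would get from reading the values back out of a sorted set. A minimal Python sketch follows. The function name `model_keyword_values` is hypothetical, and ordering by UTF-8 bytes is an assumption about the comparison used.

```python
def model_keyword_values(values):
    """Drop duplicates and sort the remaining strings by their UTF-8 bytes."""
    return sorted(set(values), key=lambda s: s.encode("utf-8"))


print(model_keyword_values(["foo", "foo", "bar", "baz"]))
# ['bar', 'baz', 'foo']
```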

+ 67 - 0
docs/reference/mapping/types/numeric.asciidoc

@@ -233,3 +233,70 @@ numeric field can't be both a time series dimension and a time series metric.
     sorting) will behave as if the document had a value of +2.3+. High values
     of `scaling_factor` improve accuracy but also increase space requirements.
     This parameter is required.
+
+ifeval::["{release-state}"=="unreleased"]
+[[numeric-synthetic-source]]
+==== Synthetic source
+All numeric fields except `unsigned_long` support <<synthetic-source,synthetic
+`_source`>> in their default configuration. Synthetic `_source` cannot be used
+together with <<ignore-malformed,`ignore_malformed`>>, <<copy-to,`copy_to`>>, or
+with <<doc-values,`doc_values`>> disabled.
+
+Synthetic source always sorts numeric fields. For example:
+[source,console,id=synthetic-source-numeric-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "synthetic": true },
+    "properties": {
+      "long": { "type": "long" }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "long": [0, 0, -123466, 87612]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "long": [-123466, 0, 0, 87612]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+Scaled floats always have their scaling factor applied, so the returned value is the stored, rounded one:
+[source,console,id=synthetic-source-scaled-float-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "synthetic": true },
+    "properties": {
+      "f": { "type": "scaled_float", "scaling_factor": 0.01 }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "f": 123
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+
+[source,console-result]
+----
+{
+  "f": 100.0
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+endif::[]
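
Note on the two numeric examples: the `long` response keeps both zeros, so the values are sorted but not deduplicated. The `scaled_float` result follows from multiplying by the scaling factor, rounding to a whole number, and dividing again. The Python sketch below models both. The function names are made up for illustration, and Python's `round` breaks exact ties differently from Java's `Math.round`, so treat tie cases as out of scope.

```python
def model_long_values(values):
    """Sort numeric values; duplicates are kept, as in the example response."""
    return sorted(values)


def model_scaled_float(value, scaling_factor):
    """Scale, round to a whole number, then divide the scaling factor back out."""
    stored = round(value * scaling_factor)
    return stored / scaling_factor


print(model_long_values([0, 0, -123466, 87612]))
# [-123466, 0, 0, 87612]
print(model_scaled_float(123, 0.01))
# 100.0
```

With a scaling factor of 0.01 the stored value for 123 is `round(1.23)`, which is 1, and dividing by 0.01 gives 100.0. A scaling factor below 1 therefore coarsens values instead of preserving decimals.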

+ 57 - 0
docs/reference/mapping/types/text.asciidoc

@@ -159,6 +159,63 @@ The following parameters are accepted by `text` fields:
 
     Metadata about the field.
 
+ifeval::["{release-state}"=="unreleased"]
+[[text-synthetic-source]]
+==== Synthetic source
+`text` fields support <<synthetic-source,synthetic `_source`>> if they have
+a `keyword` sub-field that supports synthetic `_source` and *do not* have
+<<copy-to,`copy_to`>>.
+
+Synthetic source always sorts `keyword` fields and removes duplicates, so
+`text` fields are sorted based on the sub-`keyword` field. For example:
+[source,console,id=synthetic-source-text-example]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "synthetic": true },
+    "properties": {
+      "text": {
+        "type": "text",
+        "fields": {
+          "raw": {
+            "type": "keyword"
+          }
+        }
+      }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "text": [
+    "the quick brown fox",
+    "the quick brown fox",
+    "jumped over the lazy dog"
+  ]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "text": [
+    "jumped over the lazy dog",
+    "the quick brown fox"
+  ]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+NOTE: Reordering text fields can have an effect on <<query-dsl-match-query-phrase,phrase>>
+      and <<span-queries,span>> queries. See the discussion about
+      <<position-increment-gap,`position_increment_gap`>> for more detail. You
+      can avoid this by making sure the `slop` parameter on the phrase queries
+      is lower than the `position_increment_gap`. This is the default.
+endif::[]
+
 [[fielddata-mapping-param]]
 ==== `fielddata` mapping parameter