
[DOCS] Document reindex for data streams (#58013)

Changes:

* Adds new 'Reindex with a data stream' section to 'Use a data stream'

* Makes the existing reindex API docs aware of data streams

* Rewrites the reindex glossary definition to include data streams
James Rodewig 5 years ago
parent
commit
d5e6b13151

+ 104 - 0
docs/reference/data-streams/use-a-data-stream.asciidoc

@@ -7,6 +7,7 @@ the following:
 * <<add-documents-to-a-data-stream>>
 * <<search-a-data-stream>>
 * <<manually-roll-over-a-data-stream>>
+* <<reindex-with-a-data-stream>>
 
 ////
 [source,console]
@@ -175,6 +176,109 @@ POST /logs/_rollover/
 // TEST[continued]
 ====
 
+[discrete]
+[[reindex-with-a-data-stream]]
+=== Reindex with a data stream
+
+You can use the <<docs-reindex,reindex API>> to copy documents to a data stream
+from an existing index, index alias, or data stream.
+
+A reindex copies documents from a _source_ to a _destination_. The source and
+destination can be any pre-existing index, index alias, or data stream. However,
+the source and destination must be different. You cannot reindex a data stream
+into itself.
+
+Because data streams are <<data-streams-append-only,append-only>>, a reindex
+request to a data stream destination must have an `op_type` of `create`. This
+means a reindex can only add new documents to a data stream. It cannot update
+existing documents in the data stream destination.
+
+A reindex can be used to:
+
+* Convert an existing index alias and collection of time-based indices into a
+  data stream.
+
+* Apply a new or updated <<create-a-data-stream-template,composable template>>
+  by reindexing an existing data stream into a new one. This applies mapping
+  and setting changes in the template to each document and backing index of the
+  data stream destination.
+
+TIP: If you only want to update the mappings or settings of a data stream's
+write index, we recommend you update the <<create-a-data-stream-template,data
+stream's template>> and perform a <<manually-roll-over-a-data-stream,rollover>>.
+
+.*Example*
+[%collapsible]
+====
+The following reindex request copies documents from the `archive` index alias to
+the existing `logs` data stream. Because the destination is a data stream, the
+request's `op_type` is `create`.
+
+////
+[source,console]
+----
+PUT /_bulk?refresh=wait_for
+{"create":{"_index" : "archive_1"}}
+{ "@timestamp": "2020-12-08T11:04:05.000Z" }
+{"create":{"_index" : "archive_2"}}
+{ "@timestamp": "2020-12-08T11:06:07.000Z" }
+{"create":{"_index" : "archive_2"}}
+{ "@timestamp": "2020-12-09T11:07:08.000Z" }
+{"create":{"_index" : "archive_2"}}
+{ "@timestamp": "2020-12-09T11:07:08.000Z" }
+
+POST /_aliases
+{
+  "actions" : [
+    { "add" : { "index" : "archive_1", "alias" : "archive" } },
+    { "add" : { "index" : "archive_2", "alias" : "archive", "is_write_index" : true} }
+  ]
+}
+----
+// TEST[continued]
+////
+
+[source,console]
+----
+POST /_reindex
+{
+  "source": {
+    "index": "archive"
+  },
+  "dest": {
+    "index": "logs",
+    "op_type": "create"
+  }
+}
+----
+// TEST[continued]
+====
+
+You can also reindex documents from a data stream to an index, index
+alias, or data stream.
+
+.*Example*
+[%collapsible]
+====
+The following reindex request copies documents from the `logs` data stream
+to the existing `archive` index alias. Because the destination is not a data
+stream, the `op_type` does not need to be specified.
+
+[source,console]
+----
+POST /_reindex
+{
+  "source": {
+    "index": "logs"
+  },
+  "dest": {
+    "index": "archive"
+  }
+}
+----
+// TEST[continued]
+====
+
 ////
 [source,console]
 ----

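Reviewer sketch (not part of the commit): the `op_type` rule added above can be expressed as a small request-body builder. This is a minimal sketch in Python, assuming the caller already knows whether the destination is a data stream. The helper name `build_reindex_body` and the `dest_is_data_stream` flag are hypothetical; the helper only builds the JSON body for `POST /_reindex` and does not contact a cluster.

```python
import json


def build_reindex_body(source, dest, dest_is_data_stream=False):
    """Build a request body for POST /_reindex.

    `dest_is_data_stream` is a hypothetical flag supplied by the caller;
    this helper does not query the cluster to find out.
    """
    if source == dest:
        # The docs state that the source and destination must be different.
        raise ValueError("source and destination must be different")
    body = {"source": {"index": source}, "dest": {"index": dest}}
    if dest_is_data_stream:
        # Data streams are append-only, so the op_type must be create.
        body["dest"]["op_type"] = "create"
    return body


if __name__ == "__main__":
    # Mirrors the first example in the diff: `archive` alias into `logs`.
    body = build_reindex_body("archive", "logs", dest_is_data_stream=True)
    print(json.dumps(body, indent=2))
```

The name comparison is only a string check. It would not detect an alias that resolves to the same underlying index as the destination, so it should be read as an illustration of the documented rule, not as a complete validation.
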
+ 48 - 35
docs/reference/docs/reindex.asciidoc

@@ -4,15 +4,19 @@
 <titleabbrev>Reindex</titleabbrev>
 ++++
 
-Copies documents from one index to another. 
+Copies documents from a _source_ to a _destination_.
+
+The source and destination can be any pre-existing index, index alias, or
+<<data-streams,data stream>>. However, the source and destination must be 
+different. For example, you cannot reindex a data stream into itself.
 
 [IMPORTANT]
 =================================================
 Reindex requires <<mapping-source-field,`_source`>> to be enabled for
-all documents in the source index.
+all documents in the source.
 
-You must set up the destination index before calling `_reindex`.
-Reindex does not copy the settings from the source index. 
+The destination must exist and should be configured as needed before you call `_reindex`.
+Reindex does not copy the settings from the source or its associated template. 
 Mappings, shard counts, replicas, and so on must be configured ahead of time.
 =================================================
 
@@ -66,25 +70,30 @@ POST _reindex
 [[docs-reindex-api-desc]]
 ==== {api-description-title}
 
-Extracts the document source from the source index and indexes the documents into the destination index. 
-You can copy all documents to the destination index, or reindex a subset of the documents. 
+Extracts the <<mapping-source-field,document source>> from the reindex request's source and indexes the documents into the destination. 
+You can copy all documents to the destination, or reindex a subset of the documents. 
 
 Just like <<docs-update-by-query,`_update_by_query`>>, `_reindex` gets a
-snapshot of the source index but its target must be a **different** index so
+snapshot of the source but its destination must be **different** so
 version conflicts are unlikely. The `dest` element can be configured like the
 index API to control optimistic concurrency control. Omitting
 `version_type` or setting it to `internal` causes Elasticsearch
-to blindly dump documents into the target, overwriting any that happen to have
+to blindly dump documents into the destination, overwriting any that happen to have
 the same ID.
 
 Setting `version_type` to `external` causes Elasticsearch to preserve the
 `version` from the source, create any documents that are missing, and update
-any documents that have an older version in the destination index than they do
-in the source index.
+any documents that have an older version in the destination than they do
+in the source.
 
 Setting `op_type` to `create` causes `_reindex` to only create missing
-documents in the target index. All existing documents will cause a version
-conflict. 
+documents in the destination. All existing documents will cause a version
+conflict.
+
+IMPORTANT: Because data streams are <<data-streams-append-only,append-only>>,
+any reindex request to a destination data stream must have an `op_type`
+of `create`. A reindex can only add new documents to a destination data stream.
+It cannot update existing documents in a destination data stream.
 
 By default, version conflicts abort the `_reindex` process. 
 To continue reindexing if there are conflicts, set the `"conflicts"` request body parameter to `proceed`. 
@@ -101,13 +110,13 @@ performs some preflight checks, launches the request, and returns a
 When you are done with a task, you should delete the task document so 
 {es} can reclaim the space.
 
-[[docs-reindex-many-indices]]
-===== Reindexing many indices
-If you have many indices to reindex it is generally better to reindex them
-one at a time rather than using a glob pattern to pick up many indices. That
+[[docs-reindex-from-multiple-sources]]
+===== Reindex from multiple sources
+If you have many sources to reindex, it is generally better to reindex them
+one at a time rather than using a glob pattern to pick up multiple sources. That
 way you can resume the process if there are any errors by removing the
-partially completed index and starting over at that index. It also makes
-parallelizing the process fairly simple: split the list of indices to reindex
+partially completed source and starting over. It also makes
+parallelizing the process fairly simple: split the list of sources to reindex
 and run each list in parallel.
 
 One-off bash scripts seem to work nicely for this:
@@ -283,10 +292,11 @@ which results in a sensible `total` like this one:
 }
 ----------------------------------------------------------------
 
-Setting `slices` to `auto` will let Elasticsearch choose the number of slices
-to use. This setting will use one slice per shard, up to a certain limit. If
-there are multiple source indices, it will choose the number of slices based
-on the index with the smallest number of shards.
+Setting `slices` to `auto` will let Elasticsearch choose the number of slices to
+use. This setting will use one slice per shard, up to a certain limit. If there
+are multiple sources, it will choose the number of
+slices based on the index or <<data-streams,backing index>> with the smallest
+number of shards.
 
 Adding `slices` to `_reindex` just automates the manual process used in the
 section above, creating sub-requests which means it has some quirks:
@@ -308,7 +318,7 @@ be larger than others. Expect larger slices to have a more even distribution.
 the point above about distribution being uneven and you should conclude that
 using `max_docs` with `slices` might not result in exactly `max_docs` documents
 being reindexed.
-* Each sub-request gets a slightly different snapshot of the source index,
+* Each sub-request gets a slightly different snapshot of the source,
 though these are all taken at approximately the same time.
 
 [[docs-reindex-picking-slices]]
@@ -352,7 +362,7 @@ Sets the routing on the bulk request sent for each match to all text after
 the `=`.
 
 For example, you can use the following request to copy all documents from
-the `source` index with the company name `cat` into the `dest` index with
+the `source` with the company name `cat` into the `dest` with
 routing set to `cat`.
 
 [source,console]
@@ -442,8 +452,8 @@ Defaults to `abort`.
 
 `source`::
 `index`:::
-(Required, string) The name of the index you are copying _from_. 
-Also accepts a comma-separated list of indices to reindex from multiple sources.  
+(Required, string) The name of the data stream, index, or index alias you are copying _from_. 
+Also accepts a comma-separated list to reindex from multiple sources.  
 
 `max_docs`:::
 (Optional, integer) The maximum number of documents to reindex.
@@ -491,7 +501,7 @@ Defaults to `true`.
 
 `dest`::
 `index`:::
-(Required, string) The name of the index you are copying _to_.
+(Required, string) The name of the data stream, index, or index alias you are copying _to_.
 
 `version_type`:::
 (Optional, enum) The versioning to use for the indexing operation.  
@@ -501,6 +511,9 @@ See <<index-version-types>> for more information.
 `op_type`::: 
 (Optional, enum) Set to create to only index documents that do not already exist (put if absent). 
 Valid values: `index`, `create`. Defaults to `index`.
++
+IMPORTANT: To reindex to a data stream destination, this argument must be
+`create`.
 
 `script`::
 `source`::: 
@@ -629,8 +642,8 @@ POST _reindex
 --------------------------------------------------
 // TEST[setup:twitter]
 
-[[docs-reindex-multiple-indices]]
-===== Reindex from multiple indices
+[[docs-reindex-multiple-sources]]
+===== Reindex from multiple sources
 
 The `index` attribute in `source` can be a list, allowing you to copy from lots 
 of sources in one request. This will copy documents from the
@@ -794,9 +807,9 @@ The previous method can also be used in conjunction with <<docs-reindex-change-n
 to load only the existing data into the new index and rename any fields if needed.
 
 [[docs-reindex-api-subset]]
-===== Extract a random subset of an index
+===== Extract a random subset of the source
 
-`_reindex` can be used to extract a random subset of an index for testing:
+`_reindex` can be used to extract a random subset of the source for testing:
 
 [source,console]
 ----------------------------------------------------------------
@@ -849,18 +862,18 @@ POST _reindex
 // TEST[setup:twitter]
 
 Just as in `_update_by_query`, you can set `ctx.op` to change the
-operation that is executed on the destination index:
+operation that is executed on the destination:
 
 `noop`::
 
 Set `ctx.op = "noop"` if your script decides that the document doesn't have
-to be indexed in the destination index. This no operation will be reported
+to be indexed in the destination. This no operation will be reported
 in the `noop` counter in the <<docs-reindex-api-response-body, response body>>.
 
 `delete`::
 
 Set `ctx.op = "delete"` if your script decides that the document must be
- deleted from the destination index. The deletion will be reported in the
+ deleted from the destination. The deletion will be reported in the
  `deleted` counter in the <<docs-reindex-api-response-body, response body>>.
 
 Setting `ctx.op` to anything else will return an error, as will setting any
@@ -876,7 +889,7 @@ change:
 
 Setting `_version` to `null` or clearing it from the `ctx` map is just like not
 sending the version in an indexing request; it will cause the document to be
-overwritten in the target index regardless of the version on the target or the
+overwritten in the destination regardless of the version on the destination or the
 version type you use in the `_reindex` request.
 
 [[reindex-from-remote]]

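Reviewer sketch (not part of the commit): the "Reindex from multiple sources" guidance above suggests splitting the list of sources and running each list in parallel. One way to do the split is a round-robin partition. This is a minimal sketch in Python; the helper name `split_sources` and the `workers` parameter are hypothetical, and it only partitions names. Issuing the actual `_reindex` requests is left to the caller.

```python
def split_sources(sources, workers):
    """Partition `sources` into at most `workers` round-robin lists."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Slice i takes every `workers`-th name, starting at offset i.
    chunks = [sources[i::workers] for i in range(workers)]
    # Drop empty lists when there are fewer sources than workers.
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    for chunk in split_sources(["a", "b", "c", "d", "e"], 2):
        # Each list would then be reindexed one source at a time.
        print(chunk)
```

Reindexing one source at a time within each list keeps the recovery step simple: only the source that failed needs to be retried.
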
+ 12 - 3
docs/reference/glossary.asciidoc

@@ -352,11 +352,20 @@ during the following processes:
 --
 
 [[glossary-reindex]] reindex ::
-
++
+--
 // tag::reindex-def[]
-To cycle through some or all documents in one or more indices, re-writing them into the same 
-or new index in a local or remote cluster. This is most commonly done to update mappings, or to upgrade {es} between two incompatible index versions.
+Copies documents from a _source_ to a _destination_. The source and
+destination can be any pre-existing index, index alias, or
+{ref}/data-streams.html[data stream].
+
+You can reindex all documents from a source or select a subset of documents to
+copy. You can also reindex to a destination in a remote cluster.
+
+A reindex is often performed to update mappings, change static index settings,
+or upgrade {es} between incompatible versions.
 // end::reindex-def[]
+--
 
 [[glossary-remote-cluster]] remote cluster ::