5 years ago · 43a503f3dc
--- a/docs/reference/data-management.asciidoc
+++ b/docs/reference/data-management.asciidoc
@@ -0,0 +1,33 @@
+[role="xpack"]
+[[data-management]]
+= Data management
+
+[partintro]
+--
+The data you store in {es} generally falls into one of two categories:
+
+* Content: a collection of items you want to search, such as a catalog of products
+* Time series data: a stream of continuously-generated timestamped data, such as log entries
+
+Content might be frequently updated,
+but the value of the content remains relatively constant over time.
+You want to be able to retrieve items quickly regardless of how old they are.
+
+Time series data keeps accumulating over time, so you need strategies for
+balancing the value of the data against the cost of storing it.
+As it ages, it tends to become less important and less-frequently accessed,
+so you can move it to less expensive, less performant hardware.
+For your oldest data, what matters is that you have access to the data.
+It's ok if queries take longer to complete.
+
+To help you manage your data, {es} enables you to:
+
+* Define <<data-tiers, multiple tiers>> of data nodes with different performance characteristics.
+* Automatically transition indices through the data tiers according to your performance needs and retention policies
+with <<index-lifecycle-management, {ilm}>> ({ilm-init}).
+* Leverage <<searchable-snapshots, searchable snapshots>> stored in a remote repository to provide resiliency
+for your older indices while reducing operating costs and maintaining search performance.
+* Perform <<async-search-intro, asynchronous searches>> of data stored on less-performant hardware.
+--
+
+include::datatiers.asciidoc[]
--- a/docs/reference/datatiers.asciidoc
+++ b/docs/reference/datatiers.asciidoc
@@ -1,100 +1,112 @@
 [role="xpack"]
 [[data-tiers]]
-=== Data tiers
-
-Common data lifecycle management patterns revolve around transitioning indices
-through multiple collections of nodes with different hardware characteristics in order
-to fulfil evolving CRUD, search, and aggregation needs as indices age. The concept
-of a tiered hardware architecture is not new in {es}.
-<<index-lifecycle-management, Index Lifecycle Management>> is instrumental in
-implementing tiered architectures by automating the managemnt of indices according to
-performance, resiliency and data retention requirements.
-<<overview-index-lifecycle-management, Hot/warm/cold>> architectures are common
-for timeseries data such as logging and metrics.
-
-A data tier is a collection of nodes with the same role. Data tiers are an integrated
-solution offering better support for optimising cost and improving performance.
-Formalized data tiers in ES allow configuration of the lifecycle and location of data
-in a hot/warm/cold topology without requiring the use of custom node attributes.
-Each tier formalises specific characteristics and data behaviours.
-
-The node roles that can currently define data tiers are:
-
-* <<data-content-node, data_content>>
-* <<data-hot-node, data_hot>>
-* <<data-warm-node, data_warm>>
-* <<data-cold-node, data_cold>>
-
-The more generic <<data-node, data role>> is not a data tier role, but
-it is the default node role if no roles are configured. If a node has the
-<<data-node, data>> role we treat the node as if it has all of the tier
-roles assigned.
+== Data tiers
 
-[[content-tier]]
-==== Content tier
+A _data tier_ is a collection of nodes with the same data role that
+typically share the same hardware profile:
 
-The content tier is made of one or more nodes that have the <<data-content-node, data_content>>
-role. A content tier is designed to store and search user created content. Non-timeseries data
-doesn't necessarily follow the hot-warm-cold path. The hardware profiles are quite different to
-the <<hot-tier, hot tier>>. User created content prioritises high CPU to support complex
-queries and aggregations in a timely manner, as opposed to the <<hot-tier, hot tier>> which
-prioritises high IO.
-The content data has very long data retention characteristics and from a resiliency perspective
-the indices in this tier should be configured to use one or more replicas.
+* <<content-tier, Content tier>> nodes handle the indexing and query load for content such as a product catalog.
+* <<hot-tier, Hot tier>> nodes handle the indexing load for time series data such as logs or metrics
+and hold your most recent, most-frequently-accessed data.
+* <<warm-tier, Warm tier>> nodes hold time series data that is accessed less-frequently
+and rarely needs to be updated.
+* <<cold-tier, Cold tier>> nodes hold time series data that is accessed occasionally and not normally updated.
 
-NOTE: new indices that are not part of <<data-streams, data streams>> will be automatically allocated to the
-<<content-tier>>
+When you index documents directly to a specific index, they remain on content tier nodes indefinitely.
 
-[[hot-tier]]
-==== Hot tier
+When you index documents to a data stream, they initially reside on hot tier nodes.
+You can configure <<index-lifecycle-management, {ilm}>> ({ilm-init}) policies
+to automatically transition your time series data through the hot, warm, and cold tiers
+according to your performance, resiliency and data retention requirements.
+
+A node's <<data-node, data role>> is configured in `elasticsearch.yml`.
+For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:
 
-The hot tier is made of one or more nodes that have the <<data-hot-node, data_hot>> role.
-It is the {es} entry point for timeseries data. This tier needs to be fast both for reads
-and writes, requiring more hardware resources such as SSD drives. The hot tier is usually
-hosting the data from recent days. From a resiliency perspective the indices in this
+[source,yaml]
+--------------------------------------------------
+node.roles: ["data_hot", "data_content"]
+--------------------------------------------------
+
+[discrete]
+[[content-tier]]
+=== Content tier
+
+Data stored in the content tier is generally a collection of items such as a product catalog or article archive.
+Unlike time series data, the value of the content remains relatively constant over time,
+so it doesn't make sense to move it to a tier with different performance characteristics as it ages.
+Content data typically has long data retention requirements, and you want to be able to retrieve
+items quickly regardless of how old they are.
+
+Content tier nodes are usually optimized for query performance--they prioritize processing power over IO throughput
+so they can process complex searches and aggregations and return results quickly.
+While they are also responsible for indexing, content data is generally not ingested at as high a rate
+as time series data such as logs and metrics. From a resiliency perspective the indices in this
 tier should be configured to use one or more replicas.
 
-NOTE: new indices that are part of a <<data-streams, data stream>> will be automatically allocated to the
-<<hot-tier>>
+New indices are automatically allocated to the <<content-tier>> unless they are part of a data stream.
+
+[discrete]
+[[hot-tier]]
+=== Hot tier
 
+The hot tier is the {es} entry point for time series data and holds your most-recent,
+most-frequently-searched time series data.
+Nodes in the hot tier need to be fast for both reads and writes,
+which requires more hardware resources and faster storage (SSDs).
+For resiliency, indices in the hot tier should be configured to use one or more replicas.
+
+New indices that are part of a <<data-streams, data stream>> are automatically allocated to the
+hot tier.
+
+[discrete]
 [[warm-tier]]
-==== Warm tier
+=== Warm tier
 
-The warm tier is made of one or more nodes that have the <<data-warm-node, data_warm>> role.
-This tier is where data goes once it is not queried as frequently as in the <<hot-tier, hot tier>>.
-It is a medium-fast tier that still allows data updates. The warm tier is usually
-hosting the data from recent weeks. From a resiliency perspective the indices in this
-tier should be configured to use one or more replicas.
+Time series data can move to the warm tier once it is being queried less frequently
+than the recently-indexed data in the hot tier.
+The warm tier typically holds data from recent weeks.
+Updates are still allowed, but likely infrequent.
+Nodes in the warm tier generally don't need to be as fast as those in the hot tier.
+For resiliency, indices in the warm tier should be configured to use one or more replicas.
 
+[discrete]
 [[cold-tier]]
-==== Cold tier
+=== Cold tier
 
-The cold tier is made of one or more nodes that have the <<data-cold-node, data_cold>> role.
-Once the data in the <<warm-tier, warm tier>> is not updated anymore it can transition to the
-cold tier. The cold tier is still a responsive query tier but as the data transitions into this
-tier it can be compressed, shrunken, or configured to have zero replicas and be backed by
-a <<ilm-searchable-snapshot, snapshot>>. The cold tier is usually hosting the data from recent
-months or years.
+Once data in the warm tier is no longer being updated, it can move to the cold tier.
+The cold tier typically holds the data from recent months or years.
+The cold tier is still a responsive query tier, but data in the cold tier is not normally updated.
+As data transitions into the cold tier it can be compressed and shrunken.
+For resiliency, indices in the cold tier can rely on
+<<ilm-searchable-snapshot, searchable snapshots>>, eliminating the need for replicas.
 
 [discrete]
 [[data-tier-allocation]]
 === Data tier index allocation
 
-When an index is created {es} will automatically allocate the index to the <<content-tier, Content tier>>
-if the index is not part of a <<data-streams, data stream>> or to the <<hot-tier, Hot tier>> if the index
-is part of a <<data-streams, data stream>>.
-{es} will configure the <<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
-to `data_content` or `data_hot` respectively.
+When you create an index, by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_content` to automatically allocate the index shards to the content tier.
+
+When {es} creates an index as part of a <<data-streams, data stream>>,
+by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_hot` to automatically allocate the index shards to the hot tier.
 
-These heuristics can be overridden by specifying any <<shard-allocation-filtering, shard allocation filtering>>
+You can override the automatic tier-based allocation by specifying
+<<shard-allocation-filtering, shard allocation filtering>>
 settings in the create index request or index template that matches the new index.
-Specifying any configuration, including `null`, for `index.routing.allocation.include._tier_preference` will
-also opt out of the automatic new index allocation to tiers.
+
+You can also explicitly set `index.routing.allocation.include._tier_preference`
+to opt out of the default tier-based allocation.
+If you set the tier preference to `null`, {es} ignores the data tier roles during allocation.
+
 [discrete]
 [[data-tier-migration]]
-=== Data tier index migration
+=== Automatic data tier migration
 
-<<index-lifecycle-management, Index Lifecycle Management>> automates the transition of managed
-indices through the available data tiers using the `migrate` action which is injected
-in every phase, unless it's manually specified in the phase or an
-<<ilm-allocate-action, allocate action>> modifying the allocation rules is manually configured.
+{ilm-init} automatically transitions managed
+indices through the available data tiers using the <<ilm-migrate-action, migrate>> action.
+By default, this action is automatically injected in every phase.
+You can explicitly specify the migrate action to override the default behavior,
+or use the <<ilm-allocate-action, allocate action>> to manually specify allocation rules.
--- a/docs/reference/index.asciidoc
+++ b/docs/reference/index.asciidoc
@@ -30,8 +30,6 @@ include::indices/index-templates.asciidoc[]
 
 include::data-streams/data-streams.asciidoc[]
 
-include::datatiers.asciidoc[]
-
 include::ingest.asciidoc[]
 
 include::search/search-your-data/search-your-data.asciidoc[]
@@ -46,6 +44,8 @@ include::sql/index.asciidoc[]
 
 include::scripting.asciidoc[]
 
+include::data-management.asciidoc[]
+
 include::ilm/index.asciidoc[]
 
 ifdef::permanently-unreleased-branch[]