[[indices-split-index]]
=== Split Index

The split index API allows you to split an existing index into a new index,
where each original primary shard is split into two or more primary shards in
the new index.

The number of times the index can be split (and the number of shards that each
original shard can be split into) is determined by the
`index.number_of_routing_shards` setting. The number of routing shards
specifies the hashing space that is used internally to distribute documents
across shards with consistent hashing. For instance, a 5 shard index with
`number_of_routing_shards` set to `30` (`5 x 2 x 3`) could be split by a
factor of `2` or `3`. In other words, it could be split as follows:

* `5` -> `10` -> `30` (split by 2, then by 3)
* `5` -> `15` -> `30` (split by 3, then by 2)
* `5` -> `30` (split by 6)
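
For example, an index like the one above could be created with the routing
shard space set explicitly at creation time (the index name
`my_splittable_index` below is only an illustrative placeholder):

[source,js]
--------------------------------------------------
PUT my_splittable_index
{
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_routing_shards": 30
  }
}
--------------------------------------------------
// CONSOLE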
While you can set the `index.number_of_routing_shards` setting explicitly at
index creation time, the default value depends upon the number of primary
shards in the original index. The default is designed to allow you to split
by factors of 2 up to a maximum of 1024 shards. However, the original number
of primary shards must be taken into account. For instance, an index created
with 5 primary shards could be split into 10, 20, 40, 80, 160, 320, or a
maximum of 640 shards (with a single split action or multiple split actions).

If the original index contains one primary shard (or a multi-shard index has
been <<indices-shrink-index,shrunk>> down to a single primary shard), then the
index may be split into an arbitrary number of shards greater than 1. The
properties of the default number of routing shards will then apply to the
newly split index.
[float]
==== How does splitting work?

Splitting works as follows:

* First, it creates a new target index with the same definition as the source
  index, but with a larger number of primary shards.

* Then it hard-links segments from the source index into the target index. (If
  the file system doesn't support hard-linking, then all segments are copied
  into the new index, which is a much more time consuming process.)
* Once the low-level files are created, all documents are hashed again in order
  to delete any documents that belong to a different shard.

* Finally, it recovers the target index as though it were a closed index which
  had just been re-opened.
[float]
[[incremental-resharding]]
==== Why doesn't Elasticsearch support incremental resharding?

Going from `N` shards to `N+1` shards, aka. incremental resharding, is indeed a
feature that is supported by many key-value stores. Adding a new shard and
pushing new data to this new shard only is not an option: this would likely be
an indexing bottleneck, and figuring out which shard a document belongs to
given its `_id`, which is necessary for get, delete and update requests, would
become quite complex. This means that we need to rebalance existing data using
a different hashing scheme.
The most common way that key-value stores do this efficiently is by using
consistent hashing. Consistent hashing only requires `1/N`-th of the keys to
be relocated when growing the number of shards from `N` to `N+1`. However,
Elasticsearch's unit of storage, the shard, is a Lucene index. Because of its
search-oriented data structure, taking a significant portion of a Lucene index,
be it only 5% of documents, deleting them and indexing them on another shard
typically comes at a much higher cost than with a key-value store. This cost
is kept reasonable when growing the number of shards by a multiplicative factor
as described in the above section: this allows Elasticsearch to perform the
split locally, which in turn makes it possible to perform the split at the
index level rather than reindexing the documents that need to move, and to use
hard links for efficient file copying.
In the case of append-only data, it is possible to get more flexibility by
creating a new index and pushing new data to it, while adding an alias that
covers both the old and the new index for read operations. Assuming that the
old and new indices have respectively +M+ and +N+ shards, this has no overhead
compared to searching an index that would have +M+N+ shards.
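
As a sketch of this pattern, a new index could be created and an alias updated
to span both indices; the `logs-000001`, `logs-000002`, and `logs-read` names
below are hypothetical:

[source,js]
--------------------------------------------------
PUT logs-000002

POST /_aliases
{
  "actions": [
    { "add": { "index": "logs-000001", "alias": "logs-read" } },
    { "add": { "index": "logs-000002", "alias": "logs-read" } }
  ]
}
--------------------------------------------------
// CONSOLE

New documents would then be indexed directly into `logs-000002`, while searches
against `logs-read` continue to cover both indices.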
[float]
==== Preparing an index for splitting

Create a new index:

[source,js]
--------------------------------------------------
PUT my_source_index
{
  "settings": {
    "index.number_of_shards" : 1
  }
}
--------------------------------------------------
// CONSOLE
In order to split an index, the index must be marked as read-only,
and have <<cluster-health,health>> `green`.

This can be achieved with the following request:

[source,js]
--------------------------------------------------
PUT /my_source_index/_settings
{
  "settings": {
    "index.blocks.write": true <1>
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

<1> Prevents write operations to this index while still allowing metadata
    changes like deleting the index.
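
Before issuing the split request, the health requirement can be checked, for
instance by waiting on the <<cluster-health,`cluster health` API>> for the
source index created above:

[source,js]
--------------------------------------------------
GET _cluster/health/my_source_index?wait_for_status=green&timeout=30s
--------------------------------------------------
// CONSOLE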
[float]
==== Splitting an index

To split `my_source_index` into a new index called `my_target_index`, issue
the following request:

[source,js]
--------------------------------------------------
POST my_source_index/_split/my_target_index
{
  "settings": {
    "index.number_of_shards": 2
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

The above request returns immediately once the target index has been added to
the cluster state -- it doesn't wait for the split operation to start.
[IMPORTANT]
=====================================

Indices can only be split if they satisfy the following requirements:

* The target index must not exist.

* The source index must have fewer primary shards than the target index.

* The number of primary shards in the target index must be a multiple of the
  number of primary shards in the source index.

* The node handling the split process must have sufficient free disk space to
  accommodate a second copy of the existing index.

=====================================
The `_split` API is similar to the <<indices-create-index, `create index` API>>
and accepts `settings` and `aliases` parameters for the target index:

[source,js]
--------------------------------------------------
POST my_source_index/_split/my_target_index
{
  "settings": {
    "index.number_of_shards": 5 <1>
  },
  "aliases": {
    "my_search_indices": {}
  }
}
--------------------------------------------------
// CONSOLE
// TEST[s/^/PUT my_source_index\n{"settings": {"index.blocks.write": true, "index.number_of_shards": "1"}}\n/]

<1> The number of shards in the target index. This must be a multiple of the
    number of shards in the source index.
NOTE: Mappings may not be specified in the `_split` request.

[float]
==== Monitoring the split process

The split process can be monitored with the <<cat-recovery,`_cat recovery`
API>>, or the <<cluster-health, `cluster health` API>> can be used to wait
until all primary shards have been allocated by setting the `wait_for_status`
parameter to `yellow`.
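
For example, the following requests report recovery progress for the target
index and block until all of its primary shards have been allocated (reusing
the `my_target_index` name from above):

[source,js]
--------------------------------------------------
GET _cat/recovery/my_target_index?v

GET _cluster/health/my_target_index?wait_for_status=yellow&timeout=60s
--------------------------------------------------
// CONSOLE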
The `_split` API returns as soon as the target index has been added to the
cluster state, before any shards have been allocated. At this point, all
shards are in the state `unassigned`. If, for any reason, the target index
can't be allocated, its primary shard will remain `unassigned` until it
can be allocated.

Once the primary shard is allocated, it moves to state `initializing`, and the
split process begins. When the split operation completes, the shard will
become `active`. At that point, Elasticsearch will try to allocate any
replicas and may decide to relocate the primary shard to another node.
[float]
==== Wait For Active Shards

Because the split operation creates a new index to split the shards to,
the <<create-index-wait-for-active-shards,wait for active shards>> setting
on index creation applies to the split index action as well.
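
For example, a split request can wait until each shard in the target index has
at least one active copy (the primary) before returning, by passing the
`wait_for_active_shards` parameter (a sketch reusing the index names from
above):

[source,js]
--------------------------------------------------
POST my_source_index/_split/my_target_index?wait_for_active_shards=1
{
  "settings": {
    "index.number_of_shards": 2
  }
}
--------------------------------------------------
// CONSOLE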
|