
Oversharding is also indices and fields (#81511)

Today the _Size your shards_ docs focus on shard size and count, but in
fact index count and field count are also important. This commit expands
these docs a bit to cover this observation too.
David Turner 3 years ago
1 changed file with 70 additions and 34 deletions: docs/reference/how-to/size-your-shards.asciidoc

@@ -1,24 +1,30 @@
 [[size-your-shards]]
 == Size your shards
 
-To protect against hardware failure and increase capacity, {es} stores copies of
-an index’s data across multiple shards on multiple nodes. The number and size of
-these shards can have a significant impact on your cluster's health. One common
-problem is _oversharding_, a situation in which a cluster with a large number of
-shards becomes unstable.
+Each index in {es} is divided into one or more shards, each of which may be
+replicated across multiple nodes to protect against hardware failures. If you
+are using <<data-streams>> then each data stream is backed by a sequence of
+indices. There is a limit to the amount of data you can store on a single
+node, so you can increase the capacity of your cluster by adding nodes and
+increasing the number of indices and shards to match. However, each index and
+shard has some overhead, and if you divide your data across too many shards
+then the overhead can become overwhelming. A cluster with too many indices or
+shards is said to suffer from _oversharding_. An oversharded cluster responds
+less efficiently to searches and, in extreme cases, may even become unstable.
 
 [discrete]
 [[create-a-sharding-strategy]]
 === Create a sharding strategy
 
-The best way to prevent oversharding and other shard-related issues
-is to create a sharding strategy. A sharding strategy helps you determine and
+The best way to prevent oversharding and other shard-related issues is to
+create a sharding strategy. A sharding strategy helps you determine and
 maintain the optimal number of shards for your cluster while limiting the size
 of those shards.

 Unfortunately, there is no one-size-fits-all sharding strategy. A strategy that
-works in one environment may not scale in another. A good sharding strategy must
-account for your infrastructure, use case, and performance expectations.
+works in one environment may not scale in another. A good sharding strategy
+must account for your infrastructure, use case, and performance expectations.
 
 The best way to create a sharding strategy is to benchmark your production data
 on production hardware using the same queries and indexing loads you'd see in
@@ -28,9 +34,9 @@ cluster sizing video]. As you test different shard configurations, use {kib}'s
 {kibana-ref}/elasticsearch-metrics.html[{es} monitoring tools] to track your
 cluster's stability and performance.
 
-The following sections provide some reminders and guidelines you should consider
-when designing your sharding strategy. If your cluster is already oversharded,
-see <<reduce-cluster-shard-count>>.
+The following sections provide some reminders and guidelines you should
+consider when designing your sharding strategy. If your cluster is already
+oversharded, see <<reduce-cluster-shard-count>>.
 
 [discrete]
 [[shard-sizing-considerations]]
@@ -49,17 +55,22 @@ thread pool>>. This can result in low throughput and slow search speeds.
 
 [discrete]
 [[each-shard-has-overhead]]
-==== Each shard has overhead
+==== Each index and shard has overhead
 
-Every shard uses memory and CPU resources. In most cases, a small
-set of large shards uses fewer resources than many small shards.
+Every index and every shard requires some memory and CPU resources. In most
+cases, a small set of large shards uses fewer resources than many small shards.
 
 Segments play a big role in a shard's resource usage. Most shards contain
 several segments, which store its index data. {es} keeps segment metadata in
-JVM heap memory so it can be quickly retrieved for searches. As a
-shard grows, its segments are <<index-modules-merge,merged>> into fewer, larger
-segments. This decreases the number of segments, which means less metadata is
-kept in heap memory.
+JVM heap memory so it can be quickly retrieved for searches. As a shard grows,
+its segments are <<index-modules-merge,merged>> into fewer, larger segments.
+This decreases the number of segments, which means less metadata is kept in
+heap memory.
+
+Every mapped field also carries some overhead in terms of memory usage and disk
+space. By default {es} will automatically create a mapping for every field in
+every document it indexes, but you can switch off this behaviour to
+<<explicit-mapping,take control of your mappings>>.
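+
+For example, a request along the following lines (the index name and fields
+are illustrative) creates an index with an explicit mapping and switches off
+dynamic mapping:
+
+[source,console]
+--------------------------------------------------
+PUT my-index-000001
+{
+  "mappings": {
+    "dynamic": false, <1>
+    "properties": {
+      "@timestamp": { "type": "date" },
+      "message":    { "type": "text" }
+    }
+  }
+}
+--------------------------------------------------
+<1> With `dynamic: false`, fields that are not listed under `properties` are
+not added to the mapping, although they still appear in `_source`.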
 
 [discrete]
 [[shard-auto-balance]]
@@ -110,7 +121,7 @@ Change the <<index-number-of-shards,`index.number_of_shards`>> setting in the
 data stream's <<data-streams-change-mappings-and-settings,matching index
 template>>.
 
-* *Want larger shards?* +
+* *Want larger shards or fewer backing indices?* +
 Increase your {ilm-init} policy's <<ilm-rollover,rollover threshold>>.

 * *Need indices that span shorter intervals?* +
@@ -124,13 +135,18 @@ Every new backing index is an opportunity to further tune your strategy.
 [[shard-size-recommendation]]
 ==== Aim for shard sizes between 10GB and 50GB
 
-Large shards may make a cluster less likely to recover from failure. When a node
-fails, {es} rebalances the node's shards across the data tier's remaining nodes.
-Large shards can be harder to move across a network and may tax node resources.
-
-While not a hard limit, shards between 10GB and 50GB tend to work well for logs
-and time series data. You may be able to use larger shards depending on
-your network and use case. Smaller shards may be appropriate for
+Larger shards take longer to recover after a failure. When a node fails, {es}
+rebalances the node's shards across the data tier's remaining nodes. This
+recovery process typically involves copying the shard contents across the
+network, so a 100GB shard will take twice as long to recover as a 50GB shard.
+In contrast, small shards carry proportionally more overhead and are less
+efficient to search. Searching fifty 1GB shards will take substantially more
+resources than searching a single 50GB shard containing the same data.
+
+There are no hard limits on shard size, but experience shows that shards
+between 10GB and 50GB typically work well for logs and time series data. You
+may be able to use larger shards depending on your network and use case.
+Smaller shards may be appropriate for
 {enterprise-search-ref}/index.html[Enterprise Search] and similar use cases.
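+
+For example, if you use {ilm-init}, a policy along these lines (the policy
+name and the 30-day trigger are only illustrative) keeps primary shards at
+roughly 50GB by rolling over to a new backing index:
+
+[source,console]
+--------------------------------------------------
+PUT _ilm/policy/my-rollover-policy
+{
+  "policy": {
+    "phases": {
+      "hot": {
+        "actions": {
+          "rollover": {
+            "max_primary_shard_size": "50gb", <1>
+            "max_age": "30d" <2>
+          }
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+<1> Roll over once the largest primary shard reaches 50GB.
+<2> Also roll over after 30 days so that slow data streams still rotate.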
 
 If you use {ilm-init}, set the <<ilm-rollover,rollover action>>'s
@@ -161,15 +177,15 @@ index                                 prirep shard store
 [[shard-count-recommendation]]
 ==== Aim for 20 shards or fewer per GB of heap memory
 
-The number of shards a node can hold is proportional to the node's
-heap memory. For example, a node with 30GB of heap memory should
-have at most 600 shards. The further below this limit you can keep your nodes,
-the better. If you find your nodes exceeding more than 20 shards per GB,
-consider adding another node.
+The number of shards a data node can hold is proportional to the node's heap
+memory. For example, a node with 30GB of heap memory should have at most 600
+shards. The further below this limit you can keep your nodes, the better. If
+you find your nodes exceeding 20 shards per GB, consider adding
+another node.
 
 Some system indices for {enterprise-search-ref}/index.html[Enterprise Search]
-are nearly empty and rarely used. Due to their low overhead, you shouldn't count
-shards for these indices toward a node's shard limit.
+are nearly empty and rarely used. Due to their low overhead, you shouldn't
+count shards for these indices toward a node's shard limit.
 
 To check the current size of each node's heap, use the <<cat-nodes,cat nodes
 API>>.
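+
+For example, the following request lists each node's name and heap limit.
+Multiplying the heap size in GB by 20 gives a rough upper bound on the number
+of shards that node should hold:
+
+[source,console]
+--------------------------------------------------
+GET _cat/nodes?v=true&h=name,heap.max
+--------------------------------------------------
+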
@@ -214,6 +230,26 @@ PUT my-index-000001/_settings
 --------------------------------------------------
 // TEST[setup:my_index]
 
+[discrete]
+[[avoid-unnecessary-fields]]
+==== Avoid unnecessary mapped fields
+
+By default {es} <<dynamic-mapping,automatically creates a mapping>> for every
+field in every document it indexes. Every mapped field corresponds to some data
+structures on disk which are needed for efficient search, retrieval, and
+aggregations on this field. Details about each mapped field are also held in
+memory. In many cases this overhead is unnecessary because a field is not used
+in any searches or aggregations. Use <<explicit-mapping>> instead of dynamic
+mapping to avoid creating fields that are never used. If a collection of fields
+is typically used together, consider using <<copy-to>> to consolidate them at
+index time. If a field is only rarely used, it may be better to make it a
+<<runtime,Runtime field>> instead.
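+
+For example, a mapping sketch like the following (index and field names are
+illustrative) uses `copy_to` so that searches can target the single
+`full_name` field instead of two separate fields:
+
+[source,console]
+--------------------------------------------------
+PUT my-index-000002
+{
+  "mappings": {
+    "properties": {
+      "first_name": { "type": "text", "copy_to": "full_name" }, <1>
+      "last_name":  { "type": "text", "copy_to": "full_name" },
+      "full_name":  { "type": "text" }
+    }
+  }
+}
+--------------------------------------------------
+<1> Values of `first_name` and `last_name` are copied into `full_name` at
+index time, so most searches only need to query one field.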
+
+You can get information about which fields are being used with the
+<<field-usage-stats>> API, and you can analyze the disk usage of mapped fields
+using the <<indices-disk-usage>> API. Note, however, that unnecessary mapped
+fields carry some memory overhead in addition to their disk usage.
+
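+For example, the following requests (the index name is illustrative) report
+field usage and per-field disk usage for a single index:
+
+[source,console]
+--------------------------------------------------
+GET my-index-000001/_field_usage_stats <1>
+
+POST my-index-000001/_disk_usage?run_expensive_tasks=true <2>
+--------------------------------------------------
+<1> Reports how often each field of `my-index-000001` has been used by recent
+searches and aggregations.
+<2> Analyzes how much disk space each field of the index occupies. The
+analysis can be expensive, so it must be enabled explicitly with
+`run_expensive_tasks=true`.
+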
 [discrete]
 [[reduce-cluster-shard-count]]
 === Reduce a cluster's shard count