[[fix-common-cluster-issues]]
== Fix common cluster issues

This guide describes how to fix common errors and problems with {es} clusters.

[discrete]
=== Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block

This error indicates a data node is critically low on disk space and has reached
the <<cluster-routing-flood-stage,flood-stage disk usage watermark>>. To prevent
a full disk, when a node reaches this watermark, {es} blocks writes to any index
with a shard on the node. If the block affects related system indices, {kib} and
other {stack} features may become unavailable.

{es} will automatically remove the write block when the affected node's disk
usage goes below the <<cluster-routing-watermark-high,high disk watermark>>. To
achieve this, {es} automatically moves some of the affected node's shards to
other nodes in the same data tier.

To verify that shards are moving off the affected node, use the <<cat-shards,cat
shards API>>.

[source,console]
----
GET _cat/shards?v=true
----

If shards remain on the node, use the <<cluster-allocation-explain,cluster
allocation explanation API>> to get an explanation for their allocation status.

[source,console]
----
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "my-node"
}
----
// TEST[s/^/PUT my-index\n/]
// TEST[s/"primary": false,/"primary": false/]
// TEST[s/"current_node": "my-node"//]

To immediately restore write operations, you can temporarily increase the disk
watermarks and remove the write block.

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

PUT */_settings?expand_wildcards=all
{
  "index.blocks.read_only_allow_delete": null
}
----
// TEST[s/^/PUT my-index\n/]

As a long-term solution, we recommend you add nodes to the affected data tiers
or upgrade existing nodes to increase disk space. To free up additional disk
space, you can delete unneeded indices using the <<indices-delete-index,delete
index API>>.

[source,console]
----
DELETE my-index
----
// TEST[s/^/PUT my-index\n/]

When a long-term solution is in place, reset or reconfigure the disk watermarks.

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}
----

[discrete]
[[circuit-breaker-errors]]
=== Circuit breaker errors

{es} uses <<circuit-breaker,circuit breakers>> to prevent nodes from running out
of JVM heap memory. If {es} estimates an operation would exceed a
circuit breaker, it stops the operation and returns an error.

By default, the <<parent-circuit-breaker,parent circuit breaker>> triggers at
95% JVM memory usage. To prevent errors, we recommend taking steps to reduce
memory pressure if usage consistently exceeds 85%.

[discrete]
[[diagnose-circuit-breaker-errors]]
==== Diagnose circuit breaker errors

**Error messages**

If a request triggers a circuit breaker, {es} returns an error with a `429` HTTP
status code.

[source,js]
----
{
  'error': {
    'type': 'circuit_breaking_exception',
    'reason': '[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]',
    'bytes_wanted': 123848638,
    'bytes_limit': 123273216,
    'durability': 'TRANSIENT'
  },
  'status': 429
}
----
// NOTCONSOLE

{es} also writes circuit breaker errors to <<logging,`elasticsearch.log`>>. This
is helpful when automated processes, such as allocation, trigger a circuit
breaker.

[source,txt]
----
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [num/numGB], which is larger than the limit of [num/numGB], usages [request=0/0b, fielddata=num/numKB, in_flight_requests=num/numGB, accounting=num/numGB]
----

**Check JVM memory usage**

If you've enabled Stack Monitoring, you can view JVM memory usage in {kib}. In
the main menu, click **Stack Monitoring**. On the Stack Monitoring **Overview**
page, click **Nodes**. The **JVM Heap** column lists the current memory usage
for each node.

You can also use the <<cat-nodes,cat nodes API>> to get the current
`heap.percent` for each node.

[source,console]
----
GET _cat/nodes?v=true&h=name,node*,heap*
----

To get the JVM memory usage for each circuit breaker, use the
<<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET _nodes/stats/breaker
----

[discrete]
[[prevent-circuit-breaker-errors]]
==== Prevent circuit breaker errors

**Reduce JVM memory pressure**

High JVM memory pressure often causes circuit breaker errors. See
<<high-jvm-memory-pressure>>.

**Avoid using fielddata on `text` fields**

For high-cardinality `text` fields, fielddata can use a large amount of JVM
memory. To avoid this, {es} disables fielddata on `text` fields by default. If
you've enabled fielddata and triggered the <<fielddata-circuit-breaker,fielddata
circuit breaker>>, consider disabling it and using a `keyword` field instead.
See <<fielddata>>.
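
For example, here is a minimal mapping sketch (the index and field names are
placeholders) that keeps the `text` field for full-text search and adds a
`keyword` sub-field you can aggregate and sort on without fielddata:

[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "my-field": {
        "type": "text", <1>
        "fields": {
          "keyword": { <2>
            "type": "keyword"
          }
        }
      }
    }
  }
}
----
<1> The `text` field stays available for full-text search, with fielddata left
disabled.
<2> Aggregate and sort on `my-field.keyword` instead of enabling fielddata.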

**Clear the fielddata cache**

If you've triggered the fielddata circuit breaker and can't disable fielddata,
use the <<indices-clearcache,clear cache API>> to clear the fielddata cache.
This may disrupt any in-flight searches that use fielddata.

[source,console]
----
POST _cache/clear?fielddata=true
----
// TEST[s/^/PUT my-index\n/]

[discrete]
[[high-cpu-usage]]
=== High CPU usage

{es} uses <<modules-threadpool,thread pools>> to manage CPU resources for
concurrent operations. High CPU usage typically means one or more thread pools
are running low on available threads.

If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
related to the thread pool. For example, if the `search` thread pool is
depleted, {es} will reject search requests until more threads are available.

[discrete]
[[diagnose-high-cpu-usage]]
==== Diagnose high CPU usage

**Check CPU usage**

include::{es-repo-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
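
If you'd rather check from the command line, the <<cat-nodes,cat nodes API>>
also reports recent CPU usage and load averages per node. A quick sketch,
sorted by the `cpu` column:

[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc&h=name,cpu,load_1m,load_5m,load_15m
----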

**Check hot threads**

If a node has high CPU usage, use the <<cluster-nodes-hot-threads,nodes hot
threads API>> to check for resource-intensive threads running on the node.

[source,console]
----
GET _nodes/my-node,my-other-node/hot_threads
----
// TEST[s/\/my-node,my-other-node//]

This API returns a breakdown of any hot threads in plain text.

[discrete]
[[reduce-cpu-usage]]
==== Reduce CPU usage

The following tips outline the most common causes of high CPU usage and their
solutions.

**Scale your cluster**

Heavy indexing and search loads can deplete smaller thread pools. To better
handle heavy workloads, add more nodes to your cluster or upgrade your existing
nodes to increase capacity.

**Spread out bulk requests**

While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
or <<search-multi-search,multi-search>> requests still require CPU resources. If
possible, submit smaller requests and allow more time between them.

**Cancel long-running searches**

Long-running searches can block threads in the `search` thread pool. To check
for these searches, use the <<tasks,task management API>>.

[source,console]
----
GET _tasks?actions=*search&detailed
----

The response's `description` contains the search request and its queries.
`running_time_in_nanos` shows how long the search has been running.

[source,console-result]
----
{
  "nodes" : {
    "oTUltX4IQMOUUVeiohTt8A" : {
      "name" : "my-node",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "tasks" : {
        "oTUltX4IQMOUUVeiohTt8A:464" : {
          "node" : "oTUltX4IQMOUUVeiohTt8A",
          "id" : 464,
          "type" : "transport",
          "action" : "indices:data/read/search",
          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
          "start_time_in_millis" : 4081771730000,
          "running_time_in_nanos" : 13991383,
          "cancellable" : true
        }
      }
    }
  }
}
----
// TESTRESPONSE[skip: no way to get tasks]

To cancel a search and free up resources, use the API's `_cancel` endpoint.

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
----

For additional tips on how to track and avoid resource-intensive searches, see
<<avoid-expensive-searches,Avoid expensive searches>>.

[discrete]
[[high-jvm-memory-pressure]]
=== High JVM memory pressure

High JVM memory usage can degrade cluster performance and trigger
<<circuit-breaker-errors,circuit breaker errors>>. To prevent this, we recommend
taking steps to reduce memory pressure if a node's JVM memory usage consistently
exceeds 85%.

[discrete]
[[diagnose-high-jvm-memory-pressure]]
==== Diagnose high JVM memory pressure

**Check JVM memory pressure**

include::{es-repo-dir}/tab-widgets/jvm-memory-pressure-widget.asciidoc[]
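
For a quick command-line check, the <<cluster-nodes-stats,node stats API>>
reports each node's current heap usage. This is a coarser signal than the
memory pressure calculation above, but it is a useful first look. A sketch:

[source,console]
----
GET _nodes/stats?filter_path=nodes.*.jvm.mem.heap_used_percent
----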

**Check garbage collection logs**

As memory usage increases, garbage collection becomes more frequent and takes
longer. You can track the frequency and length of garbage collection events in
<<logging,`elasticsearch.log`>>. For example, the following event states {es}
spent more than 50% (21 seconds) of the last 40 seconds performing garbage
collection.

[source,log]
----
[timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]
----

[discrete]
[[reduce-jvm-memory-pressure]]
==== Reduce JVM memory pressure

**Reduce your shard count**

Every shard uses memory. In most cases, a small set of large shards uses fewer
resources than many small shards. For tips on reducing your shard count, see
<<size-your-shards>>.
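
To get a feel for where your shards are concentrated, you can list each index
with its primary and replica shard counts and store size. A sketch using the
<<cat-indices,cat indices API>>:

[source,console]
----
GET _cat/indices?v=true&h=index,pri,rep,docs.count,store.size
----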

[[avoid-expensive-searches]]
**Avoid expensive searches**

Expensive searches can use large amounts of memory. To better track expensive
searches on your cluster, enable <<index-modules-slowlog,slow logs>>.

Expensive searches may have a large <<paginate-search-results,`size` argument>>,
use aggregations with a large number of buckets, or include
<<query-dsl-allow-expensive-queries,expensive queries>>. To prevent expensive
searches, consider the following setting changes:

* Lower the `size` limit using the
<<index-max-result-window,`index.max_result_window`>> index setting.

* Decrease the maximum number of allowed aggregation buckets using the
<<search-settings-max-buckets,`search.max_buckets`>> cluster setting.

* Disable expensive queries using the
<<query-dsl-allow-expensive-queries,`search.allow_expensive_queries`>> cluster
setting.

[source,console]
----
PUT _settings
{
  "index.max_result_window": 5000
}

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000,
    "search.allow_expensive_queries": false
  }
}
----
// TEST[s/^/PUT my-index\n/]

**Prevent mapping explosions**

Defining too many fields or nesting fields too deeply can lead to
<<mapping-limit-settings,mapping explosions>> that use large amounts of memory.
To prevent mapping explosions, use the <<mapping-settings-limit,mapping limit
settings>> to limit the number of field mappings.
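
For example, here is a sketch that tightens the field limit on a hypothetical
`my-index`. The default for `index.mapping.total_fields.limit` is 1000; the
value below is only illustrative:

[source,console]
----
PUT my-index/_settings
{
  "index.mapping.total_fields.limit": 500
}
----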

**Spread out bulk requests**

While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
or <<search-multi-search,multi-search>> requests can still create high JVM
memory pressure. If possible, submit smaller requests and allow more time
between them.

**Upgrade node memory**

Heavy indexing and search loads can cause high JVM memory pressure. To better
handle heavy workloads, upgrade your nodes to increase their memory capacity.

[discrete]
[[red-yellow-cluster-status]]
=== Red or yellow cluster status

A red or yellow cluster status indicates one or more shards are missing or
unallocated. These unassigned shards increase your risk of data loss and can
degrade cluster performance.

[discrete]
[[diagnose-cluster-status]]
==== Diagnose your cluster status

**Check your cluster status**

Use the <<cluster-health,cluster health API>>.

[source,console]
----
GET _cluster/health?filter_path=status,*_shards
----

A healthy cluster has a green `status` and zero `unassigned_shards`. A yellow
status means only replicas are unassigned. A red status means one or
more primary shards are unassigned.

**View unassigned shards**

To view unassigned shards, use the <<cat-shards,cat shards API>>.

[source,console]
----
GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state
----

Unassigned shards have a `state` of `UNASSIGNED`. The `prirep` value is `p` for
primary shards and `r` for replicas.

To understand why an unassigned shard is not being assigned and what action
you must take to allow {es} to assign it, use the
<<cluster-allocation-explain,cluster allocation explanation API>>.

[source,console]
----
GET _cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "my-node"
}
----
// TEST[s/^/PUT my-index\n/]
// TEST[s/"primary": false,/"primary": false/]
// TEST[s/"current_node": "my-node"//]

[discrete]
[[fix-red-yellow-cluster-status]]
==== Fix a red or yellow cluster status

A shard can become unassigned for several reasons. The following tips outline the
most common causes and their solutions.

**Re-enable shard allocation**

You typically disable allocation during a <<restart-cluster,restart>> or other
cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
be unable to assign shards. To re-enable allocation, reset the
`cluster.routing.allocation.enable` cluster setting.

[source,console]
----
PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : null
  }
}
----

**Recover lost nodes**

Shards often become unassigned when a data node leaves the cluster. This can
occur for several reasons, ranging from connectivity issues to hardware failure.
After you resolve the issue and recover the node, it will rejoin the cluster.
{es} will then automatically allocate any unassigned shards.

To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
allocation>> by one minute by default. If you've recovered a node and don't want
to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
API>> with no arguments to start the allocation process. The process runs
asynchronously in the background.

[source,console]
----
POST _cluster/reroute
----
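
If your nodes routinely take longer than the default one-minute delay to come
back, you can also raise the delay itself rather than rerouting. A sketch using
the `index.unassigned.node_left.delayed_timeout` index setting, with an
illustrative five-minute value:

[source,console]
----
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
----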

**Fix allocation settings**

Misconfigured allocation settings can result in an unassigned primary shard.
These settings include:

* <<shard-allocation-filtering,Shard allocation>> index settings
* <<cluster-shard-allocation-filtering,Allocation filtering>> cluster settings
* <<shard-allocation-awareness,Allocation awareness>> cluster settings

To review your allocation settings, use the <<indices-get-settings,get index
settings>> and <<cluster-get-settings,cluster get settings>> APIs.

[source,console]
----
GET my-index/_settings?flat_settings=true&include_defaults=true

GET _cluster/settings?flat_settings=true&include_defaults=true
----
// TEST[s/^/PUT my-index\n/]

You can change the settings using the <<indices-update-settings,update index
settings>> and <<cluster-update-settings,cluster update settings>> APIs.
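
For example, here is a sketch that clears a hypothetical index-level allocation
filter on `my-index`. The exact setting to reset depends on what the calls
above report:

[source,console]
----
PUT my-index/_settings
{
  "index.routing.allocation.require._name": null
}
----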

**Allocate or reduce replicas**

To protect against hardware failure, {es} will not assign a replica to the same
node as its primary shard. If no other data nodes are available to host the
replica, it remains unassigned. To fix this, you can:

* Add a data node to the same tier to host the replica.

* Change the `index.number_of_replicas` index setting to reduce the number of
replicas for each primary shard. We recommend keeping at least one replica per
primary.

[source,console]
----
PUT _settings
{
  "index.number_of_replicas": 1
}
----
// TEST[s/^/PUT my-index\n/]

**Free up or increase disk space**

{es} uses a <<disk-based-shard-allocation,low disk watermark>> to ensure data
nodes have enough disk space for incoming shards. By default, {es} does not
allocate shards to nodes using more than 85% of disk space.

To check the current disk space of your nodes, use the <<cat-allocation,cat
allocation API>>.

[source,console]
----
GET _cat/allocation?v=true&h=node,shards,disk.*
----

If your nodes are running low on disk space, you have a few options:

* Upgrade your nodes to increase disk space.

* Delete unneeded indices to free up space. If you use {ilm-init}, you can
update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
snapshots>> or add a delete phase. If you no longer need to search the data, you
can use a <<snapshot-restore,snapshot>> to store it off-cluster.

* If you no longer write to an index, use the <<indices-forcemerge,force merge
API>> or {ilm-init}'s <<ilm-forcemerge,force merge action>> to merge its
segments into larger ones.
+
[source,console]
----
POST my-index/_forcemerge
----
// TEST[s/^/PUT my-index\n/]

* If an index is read-only, use the <<indices-shrink-index,shrink index API>> or
{ilm-init}'s <<ilm-shrink,shrink action>> to reduce its primary shard count.
+
[source,console]
----
POST my-index/_shrink/my-shrunken-index
----
// TEST[s/^/PUT my-index\n{"settings":{"index.number_of_shards":2,"blocks.write":true}}\n/]

* If your node has a large disk capacity, you can increase the low disk
watermark or set it to an explicit byte value.
+
[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "30gb"
  }
}
----
// TEST[s/"30gb"/null/]

**Reduce JVM memory pressure**

Shard allocation requires JVM heap memory. High JVM memory pressure can trigger
<<circuit-breaker,circuit breakers>> that stop allocation and leave shards
unassigned. See <<high-jvm-memory-pressure>>.

**Recover data for a lost primary shard**

If a node containing a primary shard is lost, {es} can typically replace it
using a replica on another node. If you can't recover the node and replicas
don't exist or are irrecoverable, you'll need to re-add the missing data from a
<<snapshot-restore,snapshot>> or the original data source.

WARNING: Only use this option if node recovery is no longer possible. This
process allocates an empty primary shard. If the node later rejoins the cluster,
{es} will overwrite its primary shard with data from this newer empty shard,
resulting in data loss.

Use the <<cluster-reroute,cluster reroute API>> to manually allocate the
unassigned primary shard to another data node in the same tier. Set
`accept_data_loss` to `true`.

[source,console]
----
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my-index",
        "shard": 0,
        "node": "my-node",
        "accept_data_loss": "true"
      }
    }
  ]
}
----
// TEST[s/^/PUT my-index\n/]
// TEST[catch:bad_request]

If you backed up the missing index data to a snapshot, use the
<<restore-snapshot-api,restore snapshot API>> to restore the individual index.
Alternatively, you can index the missing data from the original data source.
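
For the snapshot path, a sketch of restoring a single index, assuming a
registered repository named `my_repository` and a snapshot named `my_snapshot`.
Delete or close the existing index before restoring over it:

[source,console]
----
POST _snapshot/my_repository/my_snapshot/_restore
{
  "indices": "my-index"
}
----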

[discrete]
[[rejected-requests]]
=== Rejected requests

When {es} rejects a request, it stops the operation and returns an error with a
`429` response code. Rejected requests are commonly caused by:

* A <<high-cpu-usage,depleted thread pool>>. A depleted `search` or `write`
thread pool returns a `TOO_MANY_REQUESTS` error message.

* A <<circuit-breaker-errors,circuit breaker error>>.

* High <<index-modules-indexing-pressure,indexing pressure>> that exceeds the
<<memory-limits,`indexing_pressure.memory.limit`>>. A sketch for checking this
follows the list.
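
To check whether indexing pressure is the culprit, you can pull the per-node
indexing pressure statistics from the node stats API. A sketch:

[source,console]
----
GET _nodes/stats/indexing_pressure
----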

[discrete]
[[check-rejected-tasks]]
==== Check rejected tasks

To check the number of rejected tasks for each thread pool, use the
<<cat-thread-pool,cat thread pool API>>. A high ratio of `rejected` to
`completed` tasks, particularly in the `search` and `write` thread pools, means
{es} regularly rejects requests.

[source,console]
----
GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
----

[discrete]
[[prevent-rejected-requests]]
==== Prevent rejected requests

**Fix high CPU and memory usage**

If {es} regularly rejects requests and other tasks, your cluster likely has high
CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
<<high-jvm-memory-pressure>>.

**Prevent circuit breaker errors**

If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
for tips on diagnosing and preventing them.

[discrete]
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and
put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long-running tasks can all contribute to a backlogged task queue.

[discrete]
[[diagnose-task-queue-backlog]]
==== Diagnose a task queue backlog

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.

You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
active threads in each thread pool and how many tasks are queued, rejected,
and completed.

[source,console]
----
GET /_cat/thread_pool?v=true&s=t,n&h=type,name,node_name,active,queue,rejected,completed
----

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up,
you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.

[source,console]
----
GET /_nodes/hot_threads
----

**Look for long-running tasks**

Long-running tasks can also cause a backlog.
You can use the <<tasks,task management API>> to get information about the tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.

[source,console]
----
GET /_tasks?filter_path=nodes.*.tasks
----

[discrete]
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog

**Increase available resources**

If tasks are progressing slowly and the queue is backing up,
you might need to take steps to <<reduce-cpu-usage,reduce CPU usage>>.

In some cases, increasing the thread pool size might help.
For example, the `force_merge` thread pool defaults to a single thread.
Increasing the size to 2 might help reduce a backlog of force merge requests.
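
The pool size is a static `thread_pool.force_merge.size` setting in
`elasticsearch.yml`, so changing it requires a node restart. Before doing that,
you can confirm that the backlog really is in that pool. A sketch using the
<<cat-thread-pool,cat thread pool API>>:

[source,console]
----
GET /_cat/thread_pool/force_merge?v=true&h=node_name,name,size,active,queue,rejected,completed
----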

**Cancel stuck tasks**

If you find the active task's hot thread isn't progressing and there's a backlog,
consider canceling the task.
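
For example, using a task ID reported by the task management API (the ID below
is the illustrative one from the earlier search example):

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
----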