
Split common cluster issues page into separate pages (#88495)

Abdon Pijpelink 3 years ago
parent
commit
26cc87360e

+ 95 - 0
docs/reference/troubleshooting/common-issues/circuit-breaker-errors.asciidoc

@@ -0,0 +1,95 @@
+[[circuit-breaker-errors]]
+=== Circuit breaker errors
+
+{es} uses <<circuit-breaker,circuit breakers>> to prevent nodes from running out
+of JVM heap memory. If {es} estimates an operation would exceed a
+circuit breaker, it stops the operation and returns an error.
+
+By default, the <<parent-circuit-breaker,parent circuit breaker>> triggers at
+95% JVM memory usage. To prevent errors, we recommend taking steps to reduce
+memory pressure if usage consistently exceeds 85%.
+
+[discrete]
+[[diagnose-circuit-breaker-errors]]
+==== Diagnose circuit breaker errors
+
+**Error messages**
+
+If a request triggers a circuit breaker, {es} returns an error with a `429` HTTP
+status code.
+
+[source,js]
+----
+{
+  "error": {
+    "type": "circuit_breaking_exception",
+    "reason": "[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]",
+    "bytes_wanted": 123848638,
+    "bytes_limit": 123273216,
+    "durability": "TRANSIENT"
+  },
+  "status": 429
+}
+----
+// NOTCONSOLE
+
+{es} also writes circuit breaker errors to <<logging,`elasticsearch.log`>>. This
+is helpful when automated processes, such as allocation, trigger a circuit
+breaker.
+
+[source,txt]
+----
+Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [num/numGB], which is larger than the limit of [num/numGB], usages [request=0/0b, fielddata=num/numKB, in_flight_requests=num/numGB, accounting=num/numGB]
+----
+
+**Check JVM memory usage**
+
+If you've enabled Stack Monitoring, you can view JVM memory usage in {kib}. In
+the main menu, click **Stack Monitoring**. On the Stack Monitoring **Overview**
+page, click **Nodes**. The **JVM Heap** column lists the current memory usage
+for each node.
+
+You can also use the <<cat-nodes,cat nodes API>> to get the current
+`heap.percent` for each node.
+
+[source,console]
+----
+GET _cat/nodes?v=true&h=name,node*,heap*
+----
+
+To get the JVM memory usage for each circuit breaker, use the
+<<cluster-nodes-stats,node stats API>>.
+
+[source,console]
+----
+GET _nodes/stats/breaker
+----
+
+[discrete]
+[[prevent-circuit-breaker-errors]]
+==== Prevent circuit breaker errors
+
+**Reduce JVM memory pressure**
+
+High JVM memory pressure often causes circuit breaker errors. See
+<<high-jvm-memory-pressure>>.
+
+**Avoid using fielddata on `text` fields**
+
+For high-cardinality `text` fields, fielddata can use a large amount of JVM
+memory. To avoid this, {es} disables fielddata on `text` fields by default. If
+you've enabled fielddata and triggered the <<fielddata-circuit-breaker,fielddata
+circuit breaker>>, consider disabling it and using a `keyword` field instead.
+See <<fielddata>>.
+
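+For example, a minimal mapping sketch (the index and field names are
+placeholders) that adds a `keyword` multi-field you can sort and aggregate on
+without enabling fielddata:
+
+[source,console]
+----
+PUT my-index-000001
+{
+  "mappings": {
+    "properties": {
+      "my-field": {
+        "type": "text",
+        "fields": {
+          "keyword": {
+            "type": "keyword"
+          }
+        }
+      }
+    }
+  }
+}
+----
+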
+**Clear the fielddata cache**
+
+If you've triggered the fielddata circuit breaker and can't disable fielddata,
+use the <<indices-clearcache,clear cache API>> to clear the fielddata cache.
+This may disrupt any in-flight searches that use fielddata.
+
+[source,console]
+----
+POST _cache/clear?fielddata=true
+----
+// TEST[s/^/PUT my-index\n/]

+ 84 - 0
docs/reference/troubleshooting/common-issues/disk-usage-exceeded.asciidoc

@@ -0,0 +1,84 @@
+[[disk-usage-exceeded]]
+=== Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block
+
+This error indicates a data node is critically low on disk space and has reached
+the <<cluster-routing-flood-stage,flood-stage disk usage watermark>>. To prevent
+a full disk, when a node reaches this watermark, {es} blocks writes to any index
+with a shard on the node. If the block affects related system indices, {kib} and
+other {stack} features may become unavailable.
+
+{es} will automatically remove the write block when the affected node's disk
+usage goes below the <<cluster-routing-watermark-high,high disk watermark>>. To
+achieve this, {es} automatically moves some of the affected node's shards to
+other nodes in the same data tier.
+
+To verify that shards are moving off the affected node, use the <<cat-shards,cat
+shards API>>.
+
+[source,console]
+----
+GET _cat/shards?v=true
+----
+
+If shards remain on the node, use the <<cluster-allocation-explain,cluster
+allocation explanation API>> to get an explanation for their allocation status.
+
+[source,console]
+----
+GET _cluster/allocation/explain
+{
+  "index": "my-index",
+  "shard": 0,
+  "primary": false,
+  "current_node": "my-node"
+}
+----
+// TEST[s/^/PUT my-index\n/]
+// TEST[s/"primary": false,/"primary": false/]
+// TEST[s/"current_node": "my-node"//]
+
+To immediately restore write operations, you can temporarily increase the disk
+watermarks and remove the write block.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.high": "95%",
+    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
+  }
+}
+
+PUT */_settings?expand_wildcards=all
+{
+  "index.blocks.read_only_allow_delete": null
+}
+----
+// TEST[s/^/PUT my-index\n/]
+
+As a long-term solution, we recommend you add nodes to the affected data tiers
+or upgrade existing nodes to increase disk space. To free up additional disk
+space, you can delete unneeded indices using the <<indices-delete-index,delete
+index API>>.
+
+[source,console]
+----
+DELETE my-index
+----
+// TEST[s/^/PUT my-index\n/]
+
+When a long-term solution is in place, reset or reconfigure the disk watermarks.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "cluster.routing.allocation.disk.watermark.low": null,
+    "cluster.routing.allocation.disk.watermark.high": null,
+    "cluster.routing.allocation.disk.watermark.flood_stage": null
+  }
+}
+----

+ 100 - 0
docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc

@@ -0,0 +1,100 @@
+[[high-cpu-usage]]
+=== High CPU usage
+
+{es} uses <<modules-threadpool,thread pools>> to manage CPU resources for
+concurrent operations. High CPU usage typically means one or more thread pools
+are running low on available threads.
+
+If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
+related to the thread pool. For example, if the `search` thread pool is
+depleted, {es} will reject search requests until more threads are available.
+
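+One quick way to see whether a thread pool is running low is to check its queue
+and rejection counts with the <<cat-thread-pool,cat thread pool API>> (a sketch;
+see also <<rejected-requests>>):
+
+[source,console]
+----
+GET _cat/thread_pool/search,write?v=true&h=name,active,queue,rejected,completed
+----
+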
+[discrete]
+[[diagnose-high-cpu-usage]]
+==== Diagnose high CPU usage
+
+**Check CPU usage**
+
+include::{es-repo-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
+
+**Check hot threads**
+
+If a node has high CPU usage, use the <<cluster-nodes-hot-threads,nodes hot
+threads API>> to check for resource-intensive threads running on the node.
+
+[source,console]
+----
+GET _nodes/my-node,my-other-node/hot_threads
+----
+// TEST[s/\/my-node,my-other-node//]
+
+This API returns a breakdown of any hot threads in plain text.
+
+[discrete]
+[[reduce-cpu-usage]]
+==== Reduce CPU usage
+
+The following tips outline the most common causes of high CPU usage and their
+solutions.
+
+**Scale your cluster**
+
+Heavy indexing and search loads can deplete smaller thread pools. To better
+handle heavy workloads, add more nodes to your cluster or upgrade your existing
+nodes to increase capacity.
+
+**Spread out bulk requests**
+
+While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
+or <<search-multi-search,multi-search>> requests still require CPU resources. If
+possible, submit smaller requests and allow more time between them.
+
+**Cancel long-running searches**
+
+Long-running searches can block threads in the `search` thread pool. To check
+for these searches, use the <<tasks,task management API>>.
+
+[source,console]
+----
+GET _tasks?actions=*search&detailed
+----
+
+The response's `description` contains the search request and its queries.
+`running_time_in_nanos` shows how long the search has been running.
+
+[source,console-result]
+----
+{
+  "nodes" : {
+    "oTUltX4IQMOUUVeiohTt8A" : {
+      "name" : "my-node",
+      "transport_address" : "127.0.0.1:9300",
+      "host" : "127.0.0.1",
+      "ip" : "127.0.0.1:9300",
+      "tasks" : {
+        "oTUltX4IQMOUUVeiohTt8A:464" : {
+          "node" : "oTUltX4IQMOUUVeiohTt8A",
+          "id" : 464,
+          "type" : "transport",
+          "action" : "indices:data/read/search",
+          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
+          "start_time_in_millis" : 4081771730000,
+          "running_time_in_nanos" : 13991383,
+          "cancellable" : true
+        }
+      }
+    }
+  }
+}
+----
+// TESTRESPONSE[skip: no way to get tasks]
+
+To cancel a search and free up resources, use the API's `_cancel` endpoint.
+
+[source,console]
+----
+POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
+----
+
+For additional tips on how to track and avoid resource-intensive searches, see
+<<avoid-expensive-searches,Avoid expensive searches>>.

+ 95 - 0
docs/reference/troubleshooting/common-issues/high-jvm-memory-pressure.asciidoc

@@ -0,0 +1,95 @@
+[[high-jvm-memory-pressure]]
+=== High JVM memory pressure
+
+High JVM memory usage can degrade cluster performance and trigger
+<<circuit-breaker-errors,circuit breaker errors>>. To prevent this, we recommend
+taking steps to reduce memory pressure if a node's JVM memory usage consistently
+exceeds 85%.
+
+[discrete]
+[[diagnose-high-jvm-memory-pressure]]
+==== Diagnose high JVM memory pressure
+
+**Check JVM memory pressure**
+
+include::{es-repo-dir}/tab-widgets/jvm-memory-pressure-widget.asciidoc[]
+
+**Check garbage collection logs**
+
+As memory usage increases, garbage collection becomes more frequent and takes
+longer. You can track the frequency and length of garbage collection events in
+<<logging,`elasticsearch.log`>>. For example, the following event states {es}
+spent more than 50% (21 seconds) of the last 40 seconds performing garbage
+collection.
+
+[source,log]
+----
+[timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]
+----
+
+[discrete]
+[[reduce-jvm-memory-pressure]]
+==== Reduce JVM memory pressure
+
+**Reduce your shard count**
+
+Every shard uses memory. In most cases, a small set of large shards uses fewer
+resources than many small shards. For tips on reducing your shard count, see
+<<size-your-shards>>.
+
+[[avoid-expensive-searches]]
+**Avoid expensive searches**
+
+Expensive searches can use large amounts of memory. To better track expensive
+searches on your cluster, enable <<index-modules-slowlog,slow logs>>.
+
+Expensive searches may have a large <<paginate-search-results,`size` argument>>,
+use aggregations with a large number of buckets, or include
+<<query-dsl-allow-expensive-queries,expensive queries>>. To prevent expensive
+searches, consider the following setting changes:
+
+* Lower the `size` limit using the
+<<index-max-result-window,`index.max_result_window`>> index setting.
+
+* Decrease the maximum number of allowed aggregation buckets using the
+<<search-settings-max-buckets,`search.max_buckets`>> cluster setting.
+
+* Disable expensive queries using the
+<<query-dsl-allow-expensive-queries,`search.allow_expensive_queries`>> cluster
+setting.
+
+[source,console]
+----
+PUT _settings
+{
+  "index.max_result_window": 5000
+}
+
+PUT _cluster/settings
+{
+  "persistent": {
+    "search.max_buckets": 20000,
+    "search.allow_expensive_queries": false
+  }
+}
+----
+// TEST[s/^/PUT my-index\n/]
+
+**Prevent mapping explosions**
+
+Defining too many fields or nesting fields too deeply can lead to
+<<mapping-limit-settings,mapping explosions>> that use large amounts of memory.
+To prevent mapping explosions, use the <<mapping-settings-limit,mapping limit
+settings>> to limit the number of field mappings.
+
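+For example, assuming a lower cap than the default of 1000 total fields fits
+your data, you could tighten the limit on an existing index (the index name and
+value here are placeholders):
+
+[source,console]
+----
+PUT my-index/_settings
+{
+  "index.mapping.total_fields.limit": 500
+}
+----
+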
+**Spread out bulk requests**
+
+While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
+or <<search-multi-search,multi-search>> requests can still create high JVM
+memory pressure. If possible, submit smaller requests and allow more time
+between them.
+
+**Upgrade node memory**
+
+Heavy indexing and search loads can cause high JVM memory pressure. To better
+handle heavy workloads, upgrade your nodes to increase their memory capacity.

+ 240 - 0
docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

@@ -0,0 +1,240 @@
+[[red-yellow-cluster-status]]
+=== Red or yellow cluster status
+
+A red or yellow cluster status indicates one or more shards are missing or
+unallocated. These unassigned shards increase your risk of data loss and can
+degrade cluster performance.
+
+[discrete]
+[[diagnose-cluster-status]]
+==== Diagnose your cluster status
+
+**Check your cluster status**
+
+Use the <<cluster-health,cluster health API>>.
+
+[source,console]
+----
+GET _cluster/health?filter_path=status,*_shards
+----
+
+A healthy cluster has a green `status` and zero `unassigned_shards`. A yellow
+status means only replicas are unassigned. A red status means one or
+more primary shards are unassigned.
+
+**View unassigned shards**
+
+To view unassigned shards, use the <<cat-shards,cat shards API>>.
+
+[source,console]
+----
+GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state
+----
+
+Unassigned shards have a `state` of `UNASSIGNED`. The `prirep` value is `p` for
+primary shards and `r` for replicas.
+
+To understand why an unassigned shard is not being assigned and what action
+you must take to allow {es} to assign it, use the
+<<cluster-allocation-explain,cluster allocation explanation API>>.
+
+[source,console]
+----
+GET _cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*
+{
+  "index": "my-index",
+  "shard": 0,
+  "primary": false,
+  "current_node": "my-node"
+}
+----
+// TEST[s/^/PUT my-index\n/]
+// TEST[s/"primary": false,/"primary": false/]
+// TEST[s/"current_node": "my-node"//]
+
+[discrete]
+[[fix-red-yellow-cluster-status]]
+==== Fix a red or yellow cluster status
+
+A shard can become unassigned for several reasons. The following tips outline the
+most common causes and their solutions.
+
+**Re-enable shard allocation**
+
+You typically disable allocation during a <<restart-cluster,restart>> or other
+cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
+be unable to assign shards. To re-enable allocation, reset the
+`cluster.routing.allocation.enable` cluster setting.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.routing.allocation.enable" : null
+  }
+}
+----
+
+**Recover lost nodes**
+
+Shards often become unassigned when a data node leaves the cluster. This can
+occur for several reasons, ranging from connectivity issues to hardware failure.
+After you resolve the issue and recover the node, it will rejoin the cluster.
+{es} will then automatically allocate any unassigned shards.
+
+To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
+allocation>> by one minute by default. If you've recovered a node and don’t want
+to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
+API>> with no arguments to start the allocation process. The process runs
+asynchronously in the background.
+
+[source,console]
+----
+POST _cluster/reroute
+----
+
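+Alternatively, if you expect the node to stay down for a while, you can adjust
+the delay itself with the <<delayed-allocation,`index.unassigned.node_left.delayed_timeout`>>
+index setting. A sketch, using a placeholder index name and a five-minute delay:
+
+[source,console]
+----
+PUT my-index/_settings
+{
+  "index.unassigned.node_left.delayed_timeout": "5m"
+}
+----
+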
+**Fix allocation settings**
+
+Misconfigured allocation settings can result in an unassigned primary shard.
+These settings include:
+
+* <<shard-allocation-filtering,Shard allocation>> index settings
+* <<cluster-shard-allocation-filtering,Allocation filtering>> cluster settings
+* <<shard-allocation-awareness,Allocation awareness>> cluster settings
+
+To review your allocation settings, use the <<indices-get-settings,get index
+settings>> and <<cluster-get-settings,cluster get settings>> APIs.
+
+[source,console]
+----
+GET my-index/_settings?flat_settings=true&include_defaults=true
+
+GET _cluster/settings?flat_settings=true&include_defaults=true
+----
+// TEST[s/^/PUT my-index\n/]
+
+You can change the settings using the <<indices-update-settings,update index
+settings>> and <<cluster-update-settings,cluster update settings>> APIs.
+
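+For example, if a leftover index-level allocation filter is blocking assignment,
+a sketch for clearing it (this assumes a `require._name` filter was set on the
+index earlier):
+
+[source,console]
+----
+PUT my-index/_settings
+{
+  "index.routing.allocation.require._name": null
+}
+----
+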
+**Allocate or reduce replicas**
+
+To protect against hardware failure, {es} will not assign a replica to the same
+node as its primary shard. If no other data nodes are available to host the
+replica, it remains unassigned. To fix this, you can:
+
+* Add a data node to the same tier to host the replica.
+
+* Change the `index.number_of_replicas` index setting to reduce the number of
+replicas for each primary shard. We recommend keeping at least one replica per
+primary.
+
+[source,console]
+----
+PUT _settings
+{
+  "index.number_of_replicas": 1
+}
+----
+// TEST[s/^/PUT my-index\n/]
+
+**Free up or increase disk space**
+
+{es} uses a <<disk-based-shard-allocation,low disk watermark>> to ensure data
+nodes have enough disk space for incoming shards. By default, {es} does not
+allocate shards to nodes using more than 85% of disk space.
+
+To check the current disk space of your nodes, use the <<cat-allocation,cat
+allocation API>>.
+
+[source,console]
+----
+GET _cat/allocation?v=true&h=node,shards,disk.*
+----
+
+If your nodes are running low on disk space, you have a few options:
+
+* Upgrade your nodes to increase disk space.
+
+* Delete unneeded indices to free up space. If you use {ilm-init}, you can
+update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
+snapshots>> or add a delete phase. If you no longer need to search the data, you
+can use a <<snapshot-restore,snapshot>> to store it off-cluster.
+
+* If you no longer write to an index, use the <<indices-forcemerge,force merge
+API>> or {ilm-init}'s <<ilm-forcemerge,force merge action>> to merge its
+segments into larger ones.
++
+[source,console]
+----
+POST my-index/_forcemerge
+----
+// TEST[s/^/PUT my-index\n/]
+
+* If an index is read-only, use the <<indices-shrink-index,shrink index API>> or
+{ilm-init}'s <<ilm-shrink,shrink action>> to reduce its primary shard count.
++
+[source,console]
+----
+POST my-index/_shrink/my-shrunken-index
+----
+// TEST[s/^/PUT my-index\n{"settings":{"index.number_of_shards":2,"blocks.write":true}}\n/]
+
+* If your node has a large disk capacity, you can increase the low disk
+watermark or set it to an explicit byte value.
++
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "cluster.routing.allocation.disk.watermark.low": "30gb"
+  }
+}
+----
+// TEST[s/"30gb"/null/]
+
+**Reduce JVM memory pressure**
+
+Shard allocation requires JVM heap memory. High JVM memory pressure can trigger
+<<circuit-breaker,circuit breakers>> that stop allocation and leave shards
+unassigned. See <<high-jvm-memory-pressure>>.
+
+**Recover data for a lost primary shard**
+
+If a node containing a primary shard is lost, {es} can typically replace it
+using a replica on another node. If you can't recover the node and replicas
+don't exist or are irrecoverable, you'll need to re-add the missing data from a
+<<snapshot-restore,snapshot>> or the original data source.
+
+WARNING: Only use this option if node recovery is no longer possible. This
+process allocates an empty primary shard. If the node later rejoins the cluster,
+{es} will overwrite its primary shard with data from this newer empty shard,
+resulting in data loss.
+
+Use the <<cluster-reroute,cluster reroute API>> to manually allocate the
+unassigned primary shard to another data node in the same tier. Set
+`accept_data_loss` to `true`.
+
+[source,console]
+----
+POST _cluster/reroute
+{
+  "commands": [
+    {
+      "allocate_empty_primary": {
+        "index": "my-index",
+        "shard": 0,
+        "node": "my-node",
+        "accept_data_loss": "true"
+      }
+    }
+  ]
+}
+----
+// TEST[s/^/PUT my-index\n/]
+// TEST[catch:bad_request]
+
+If you backed up the missing index data to a snapshot, use the
+<<restore-snapshot-api,restore snapshot API>> to restore the individual index.
+Alternatively, you can index the missing data from the original data source.
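+
+A sketch of such a restore, assuming a registered snapshot repository named
+`my_repository` and a snapshot named `my_snapshot` that contains the index. Note
+that you generally need to delete or close the existing index before restoring
+over it.
+
+[source,console]
+----
+POST _snapshot/my_repository/my_snapshot/_restore
+{
+  "indices": "my-index"
+}
+----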

+ 42 - 0
docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc

@@ -0,0 +1,42 @@
+[[rejected-requests]]
+=== Rejected requests
+
+When {es} rejects a request, it stops the operation and returns an error with a
+`429` response code. Rejected requests are commonly caused by:
+
+* A <<high-cpu-usage,depleted thread pool>>. A depleted `search` or `write`
+thread pool returns a `TOO_MANY_REQUESTS` error message.
+
+* A <<circuit-breaker-errors,circuit breaker error>>.
+
+* High <<index-modules-indexing-pressure,indexing pressure>> that exceeds the
+<<memory-limits,`indexing_pressure.memory.limit`>>.
+
+[discrete]
+[[check-rejected-tasks]]
+==== Check rejected tasks
+
+To check the number of rejected tasks for each thread pool, use the
+<<cat-thread-pool,cat thread pool API>>. A high ratio of `rejected` to
+`completed` tasks, particularly in the `search` and `write` thread pools, means
+{es} regularly rejects requests.
+
+[source,console]
+----
+GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
+----
+
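+If you suspect <<index-modules-indexing-pressure,indexing pressure>> rather than
+a depleted thread pool, one way to inspect the current indexing pressure on each
+node (a sketch using a response filter) is the <<cluster-nodes-stats,node stats
+API>>:
+
+[source,console]
+----
+GET _nodes/stats?filter_path=nodes.*.indexing_pressure
+----
+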
+[discrete]
+[[prevent-rejected-requests]]
+==== Prevent rejected requests
+
+**Fix high CPU and memory usage**
+
+If {es} regularly rejects requests and other tasks, your cluster likely has high
+CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
+<<high-jvm-memory-pressure>>.
+
+**Prevent circuit breaker errors**
+
+If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
+for tips on diagnosing and preventing them.

+ 65 - 0
docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc

@@ -0,0 +1,65 @@
+[[task-queue-backlog]]
+=== Task queue backlog
+
+A backlogged task queue can prevent tasks from completing and
+put the cluster into an unhealthy state.
+Resource constraints, a large number of tasks being triggered at once,
+and long-running tasks can all contribute to a backlogged task queue.
+
+[discrete]
+[[diagnose-task-queue-backlog]]
+==== Diagnose a task queue backlog
+
+**Check the thread pool status**
+
+A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>. 
+
+You can use the <<cat-thread-pool,cat thread pool API>> to
+see the number of active threads in each thread pool, along with how many tasks
+are queued, rejected, and completed.
+
+[source,console]
+----
+GET /_cat/thread_pool?v=true&s=t,n&h=type,name,node_name,active,queue,rejected,completed
+----
+
+**Inspect the hot threads on each node**
+
+If a particular thread pool queue is backed up,
+you can periodically poll the <<cluster-nodes-hot-threads,nodes hot threads>> API
+to determine whether the threads have sufficient resources and to gauge
+how quickly they are making progress.
+
+[source,console]
+----
+GET /_nodes/hot_threads
+----
+
+**Look for long-running tasks**
+
+Long-running tasks can also cause a backlog.
+You can use the <<tasks,task management API>> to get information about the tasks that are running.
+Check `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.
+
+[source,console]
+----
+GET /_tasks?filter_path=nodes.*.tasks
+----
+
+[discrete]
+[[resolve-task-queue-backlog]]
+==== Resolve a task queue backlog
+
+**Increase available resources** 
+
+If tasks are progressing slowly and the queue is backing up, 
+you might need to take steps to <<reduce-cpu-usage,reduce CPU usage>>.
+
+In some cases, increasing the thread pool size might help.
+For example, the `force_merge` thread pool defaults to a single thread.
+Increasing the size to 2 might help reduce a backlog of force merge requests.
+
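+Thread pool sizes are static settings, so a change like the following sketch
+goes in `elasticsearch.yml` on each node and requires a restart:
+
+[source,yaml]
+----
+# Increase the force_merge thread pool from the default of 1 thread to 2.
+thread_pool.force_merge.size: 2
+----
+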
+**Cancel stuck tasks**
+
+If you find the active task's hot thread isn't progressing and there's a backlog, 
+consider canceling the task. 
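+
+A sketch of canceling a task by ID with the <<tasks,task management API>>; the
+node and task IDs here are placeholders taken from the `_tasks` output:
+
+[source,console]
+----
+POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
+----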

+ 26 - 723
docs/reference/troubleshooting/fix-common-cluster-issues.asciidoc

@@ -3,736 +3,39 @@
 
 This guide describes how to fix common errors and problems with {es} clusters.
 
-[discrete]
-=== Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block
-
+<<disk-usage-exceeded,Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block>>::
 This error indicates a data node is critically low on disk space and has reached
-the <<cluster-routing-flood-stage,flood-stage disk usage watermark>>. To prevent
-a full disk, when a node reaches this watermark, {es} blocks writes to any index
-with a shard on the node. If the block affects related system indices, {kib} and
-other {stack} features may become unavailable.
-
-{es} will automatically remove the write block when the affected node's disk
-usage goes below the <<cluster-routing-watermark-high,high disk watermark>>. To
-achieve this, {es} automatically moves some of the affected node's shards to
-other nodes in the same data tier.
-
-To verify that shards are moving off the affected node, use the <<cat-shards,cat
-shards API>>.
-
-[source,console]
-----
-GET _cat/shards?v=true
-----
-
-If shards remain on the node, use the <<cluster-allocation-explain,cluster
-allocation explanation API>> to get an explanation for their allocation status.
-
-[source,console]
-----
-GET _cluster/allocation/explain
-{
-  "index": "my-index",
-  "shard": 0,
-  "primary": false,
-  "current_node": "my-node"
-}
-----
-// TEST[s/^/PUT my-index\n/]
-// TEST[s/"primary": false,/"primary": false/]
-// TEST[s/"current_node": "my-node"//]
-
-To immediately restore write operations, you can temporarily increase the disk
-watermarks and remove the write block.
-
-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent": {
-    "cluster.routing.allocation.disk.watermark.low": "90%",
-    "cluster.routing.allocation.disk.watermark.high": "95%",
-    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
-  }
-}
-
-PUT */_settings?expand_wildcards=all
-{
-  "index.blocks.read_only_allow_delete": null
-}
-----
-// TEST[s/^/PUT my-index\n/]
-
-As a long-term solution, we recommend you add nodes to the affected data tiers
-or upgrade existing nodes to increase disk space. To free up additional disk
-space, you can delete unneeded indices using the <<indices-delete-index,delete
-index API>>.
-
-[source,console]
-----
-DELETE my-index
-----
-// TEST[s/^/PUT my-index\n/]
-
-When a long-term solution is in place, reset or reconfigure the disk watermarks.
-
-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent": {
-    "cluster.routing.allocation.disk.watermark.low": null,
-    "cluster.routing.allocation.disk.watermark.high": null,
-    "cluster.routing.allocation.disk.watermark.flood_stage": null
-  }
-}
-----
-
-[discrete]
-[[circuit-breaker-errors]]
-=== Circuit breaker errors
-
-{es} uses <<circuit-breaker,circuit breakers>> to prevent nodes from running out
-of JVM heap memory. If Elasticsearch estimates an operation would exceed a
-circuit breaker, it stops the operation and returns an error.
-
-By default, the <<parent-circuit-breaker,parent circuit breaker>> triggers at
-95% JVM memory usage. To prevent errors, we recommend taking steps to reduce
-memory pressure if usage consistently exceeds 85%.
-
-[discrete]
-[[diagnose-circuit-breaker-errors]]
-==== Diagnose circuit breaker errors
-
-**Error messages**
-
-If a request triggers a circuit breaker, {es} returns an error with a `429` HTTP
-status code.
-
-[source,js]
-----
-{
-  'error': {
-    'type': 'circuit_breaking_exception',
-    'reason': '[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]',
-    'bytes_wanted': 123848638,
-    'bytes_limit': 123273216,
-    'durability': 'TRANSIENT'
-  },
-  'status': 429
-}
-----
-// NOTCONSOLE
-
-{es} also writes circuit breaker errors to <<logging,`elasticsearch.log`>>. This
-is helpful when automated processes, such as allocation, trigger a circuit
-breaker.
-
-[source,txt]
-----
-Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [num/numGB], which is larger than the limit of [num/numGB], usages [request=0/0b, fielddata=num/numKB, in_flight_requests=num/numGB, accounting=num/numGB]
-----
-
-**Check JVM memory usage**
-
-If you've enabled Stack Monitoring, you can view JVM memory usage in {kib}. In
-the main menu, click **Stack Monitoring**. On the Stack Monitoring **Overview**
-page, click **Nodes**. The **JVM Heap** column lists the current memory usage
-for each node.
-
-You can also use the <<cat-nodes,cat nodes API>> to get the current
-`heap.percent` for each node.
-
-[source,console]
-----
-GET _cat/nodes?v=true&h=name,node*,heap*
-----
-
-To get the JVM memory usage for each circuit breaker, use the
-<<cluster-nodes-stats,node stats API>>.
-
-[source,console]
-----
-GET _nodes/stats/breaker
-----
-
-[discrete]
-[[prevent-circuit-breaker-errors]]
-==== Prevent circuit breaker errors
-
-**Reduce JVM memory pressure**
-
-High JVM memory pressure often causes circuit breaker errors. See
-<<high-jvm-memory-pressure>>.
-
-**Avoid using fielddata on `text` fields**
-
-For high-cardinality `text` fields, fielddata can use a large amount of JVM
-memory. To avoid this, {es} disables fielddata on `text` fields by default. If
-you've enabled fielddata and triggered the <<fielddata-circuit-breaker,fielddata
-circuit breaker>>, consider disabling it and using a `keyword` field instead.
-See <<fielddata>>.
-
-**Clear the fieldata cache**
-
-If you've triggered the fielddata circuit breaker and can't disable fielddata,
-use the <<indices-clearcache,clear cache API>> to clear the fielddata cache.
-This may disrupt any in-flight searches that use fielddata.
-
-[source,console]
-----
-POST _cache/clear?fielddata=true
-----
-// TEST[s/^/PUT my-index\n/]
-
-[discrete]
-[[high-cpu-usage]]
-=== High CPU usage
-
-{es} uses <<modules-threadpool,thread pools>> to manage CPU resources for
-concurrent operations. High CPU usage typically means one or more thread pools
-are running low.
-
-If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
-related to the thread pool. For example, if the `search` thread pool is
-depleted, {es} will reject search requests until more threads are available.
-
-[discrete]
-[[diagnose-high-cpu-usage]]
-==== Diagnose high CPU usage
-
-**Check CPU usage**
-
-include::{es-repo-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
-
-**Check hot threads**
-
-If a node has high CPU usage, use the <<cluster-nodes-hot-threads,nodes hot
-threads API>> to check for resource-intensive threads running on the node.
-
-[source,console]
-----
-GET _nodes/my-node,my-other-node/hot_threads
-----
-// TEST[s/\/my-node,my-other-node//]
-
-This API returns a breakdown of any hot threads in plain text.
-
-[discrete]
-[[reduce-cpu-usage]]
-==== Reduce CPU usage
-
-The following tips outline the most common causes of high CPU usage and their
-solutions.
-
-**Scale your cluster**
-
-Heavy indexing and search loads can deplete smaller thread pools. To better
-handle heavy workloads, add more nodes to your cluster or upgrade your existing
-nodes to increase capacity.
-
-**Spread out bulk requests**
-
-While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
-or <<search-multi-search,multi-search>> requests still require CPU resources. If
-possible, submit smaller requests and allow more time between them.
-
-**Cancel long-running searches**
-
-Long-running searches can block threads in the `search` thread pool. To check
-for these searches, use the <<tasks,task management API>>.
-
-[source,console]
-----
-GET _tasks?actions=*search&detailed
-----
-
-The response's `description` contains the search request and its queries.
-`running_time_in_nanos` shows how long the search has been running.
-
-[source,console-result]
-----
-{
-  "nodes" : {
-    "oTUltX4IQMOUUVeiohTt8A" : {
-      "name" : "my-node",
-      "transport_address" : "127.0.0.1:9300",
-      "host" : "127.0.0.1",
-      "ip" : "127.0.0.1:9300",
-      "tasks" : {
-        "oTUltX4IQMOUUVeiohTt8A:464" : {
-          "node" : "oTUltX4IQMOUUVeiohTt8A",
-          "id" : 464,
-          "type" : "transport",
-          "action" : "indices:data/read/search",
-          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
-          "start_time_in_millis" : 4081771730000,
-          "running_time_in_nanos" : 13991383,
-          "cancellable" : true
-        }
-      }
-    }
-  }
-}
-----
-// TESTRESPONSE[skip: no way to get tasks]
-
-To cancel a search and free up resources, use the API's `_cancel` endpoint.
-
-[source,console]
-----
-POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
-----
-
-For additional tips on how to track and avoid resource-intensive searches, see
-<<avoid-expensive-searches,Avoid expensive searches>>.
-
-[discrete]
-[[high-jvm-memory-pressure]]
-=== High JVM memory pressure
-
-High JVM memory usage can degrade cluster performance and trigger
-<<circuit-breaker-errors,circuit breaker errors>>. To prevent this, we recommend
-taking steps to reduce memory pressure if a node's JVM memory usage consistently
-exceeds 85%.
-
-[discrete]
-[[diagnose-high-jvm-memory-pressure]]
-==== Diagnose high JVM memory pressure
-
-**Check JVM memory pressure**
-
-include::{es-repo-dir}/tab-widgets/jvm-memory-pressure-widget.asciidoc[]
-
-**Check garbage collection logs**
-
-As memory usage increases, garbage collection becomes more frequent and takes
-longer. You can track the frequency and length of garbage collection events in
-<<logging,`elasticsearch.log`>>. For example, the following event states {es}
-spent more than 50% (21 seconds) of the last 40 seconds performing garbage
-collection.
-
-[source,log]
-----
-[timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]
-----
-
-[discrete]
-[[reduce-jvm-memory-pressure]]
-==== Reduce JVM memory pressure
-
-**Reduce your shard count**
-
-Every shard uses memory. In most cases, a small set of large shards uses fewer
-resources than many small shards. For tips on reducing your shard count, see
-<<size-your-shards>>.
-
-[[avoid-expensive-searches]]
-**Avoid expensive searches**
-
-Expensive searches can use large amounts of memory. To better track expensive
-searches on your cluster, enable <<index-modules-slowlog,slow logs>>.
-
-Expensive searches may have a large <<paginate-search-results,`size` argument>>,
-use aggregations with a large number of buckets, or include
-<<query-dsl-allow-expensive-queries,expensive queries>>. To prevent expensive
-searches, consider the following setting changes:
-
-* Lower the `size` limit using the
-<<index-max-result-window,`index.max_result_window`>> index setting.
-
-* Decrease the maximum number of allowed aggregation buckets using the
-<<search-settings-max-buckets,search.max_buckets>> cluster setting.
-
-* Disable expensive queries using the
-<<query-dsl-allow-expensive-queries,`search.allow_expensive_queries`>> cluster
-setting.
-
-[source,console]
-----
-PUT _settings
-{
-  "index.max_result_window": 5000
-}
-
-PUT _cluster/settings
-{
-  "persistent": {
-    "search.max_buckets": 20000,
-    "search.allow_expensive_queries": false
-  }
-}
-----
-// TEST[s/^/PUT my-index\n/]
-
-**Prevent mapping explosions**
-
-Defining too many fields or nesting fields too deeply can lead to
-<<mapping-limit-settings,mapping explosions>> that use large amounts of memory.
-To prevent mapping explosions, use the <<mapping-settings-limit,mapping limit
-settings>> to limit the number of field mappings.
-
-**Spread out bulk requests**
+the flood-stage disk usage watermark.
 
-While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
-or <<search-multi-search,multi-search>> requests can still create high JVM
-memory pressure. If possible, submit smaller requests and allow more time
-between them.
+<<circuit-breaker-errors,Circuit breaker errors>>::
+{es} uses circuit breakers to prevent nodes from running out of JVM heap memory.
+If {es} estimates an operation would exceed a circuit breaker, it stops
+the operation and returns an error.
 
-**Upgrade node memory**
+<<high-cpu-usage,High CPU usage>>::
+{es} uses thread pools to manage CPU resources for concurrent operations. High
+CPU usage typically means one or more thread pools are running low on available
+threads.
 
-Heavy indexing and search loads can cause high JVM memory pressure. To better
-handle heavy workloads, upgrade your nodes to increase their memory capacity.
-
-[discrete]
-[[red-yellow-cluster-status]]
-=== Red or yellow cluster status
+<<high-jvm-memory-pressure,High JVM memory pressure>>::
+High JVM memory usage can degrade cluster performance and trigger circuit 
+breaker errors. 
 
+<<red-yellow-cluster-status,Red or yellow cluster status>>::
 A red or yellow cluster status indicates one or more shards are missing or
 unallocated. These unassigned shards increase your risk of data loss and can
 degrade cluster performance.
 
-[discrete]
-[[diagnose-cluster-status]]
-==== Diagnose your cluster status
-
-**Check your cluster status**
-
-Use the <<cluster-health,cluster health API>>.
-
-[source,console]
-----
-GET _cluster/health?filter_path=status,*_shards
-----
-
-A healthy cluster has a green `status` and zero `unassigned_shards`. A yellow
-status means only replicas are unassigned. A red status means one or
-more primary shards are unassigned.
-
-**View unassigned shards**
-
-To view unassigned shards, use the <<cat-shards,cat shards API>>.
-
-[source,console]
-----
-GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state
-----
-
-Unassigned shards have a `state` of `UNASSIGNED`. The `prirep` value is `p` for
-primary shards and `r` for replicas.
-
-To understand why an unassigned shard is not being assigned and what action
-you must take to allow {es} to assign it, use the
-<<cluster-allocation-explain,cluster allocation explanation API>>.
-
-[source,console]
-----
-GET _cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*
-{
-  "index": "my-index",
-  "shard": 0,
-  "primary": false,
-  "current_node": "my-node"
-}
-----
-// TEST[s/^/PUT my-index\n/]
-// TEST[s/"primary": false,/"primary": false/]
-// TEST[s/"current_node": "my-node"//]
-
-[discrete]
-[[fix-red-yellow-cluster-status]]
-==== Fix a red or yellow cluster status
-
-A shard can become unassigned for several reasons. The following tips outline the
-most common causes and their solutions.
-
-**Re-enable shard allocation**
-
-You typically disable allocation during a <<restart-cluster,restart>> or other
-cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
-be unable to assign shards. To re-enable allocation, reset the
-`cluster.routing.allocation.enable` cluster setting.
-
-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent" : {
-    "cluster.routing.allocation.enable" : null
-  }
-}
-----
-
-**Recover lost nodes**
-
-Shards often become unassigned when a data node leaves the cluster. This can
-occur for several reasons, ranging from connectivity issues to hardware failure.
-After you resolve the issue and recover the node, it will rejoin the cluster.
-{es} will then automatically allocate any unassigned shards.
-
-To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
-allocation>> by one minute by default. If you've recovered a node and don’t want
-to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
-API>> with no arguments to start the allocation process. The process runs
-asynchronously in the background.
-
-[source,console]
-----
-POST _cluster/reroute
-----
-
-**Fix allocation settings**
-
-Misconfigured allocation settings can result in an unassigned primary shard.
-These settings include:
-
-* <<shard-allocation-filtering,Shard allocation>> index settings
-* <<cluster-shard-allocation-filtering,Allocation filtering>> cluster settings
-* <<shard-allocation-awareness,Allocation awareness>> cluster settings
-
-To review your allocation settings, use the <<indices-get-settings,get index
-settings>> and <<cluster-get-settings,cluster get settings>> APIs.
-
-[source,console]
-----
-GET my-index/_settings?flat_settings=true&include_defaults=true
-
-GET _cluster/settings?flat_settings=true&include_defaults=true
-----
-// TEST[s/^/PUT my-index\n/]
-
-You can change the settings using the <<indices-update-settings,update index
-settings>> and <<cluster-update-settings,cluster update settings>> APIs.
-
-**Allocate or reduce replicas**
-
-To protect against hardware failure, {es} will not assign a replica to the same
-node as its primary shard. If no other data nodes are available to host the
-replica, it remains unassigned. To fix this, you can:
-
-* Add a data node to the same tier to host the replica.
-
-* Change the `index.number_of_replicas` index setting to reduce the number of
-replicas for each primary shard. We recommend keeping at least one replica per
-primary.
-
-[source,console]
-----
-PUT _settings
-{
-  "index.number_of_replicas": 1
-}
-----
-// TEST[s/^/PUT my-index\n/]
-
-**Free up or increase disk space**
-
-{es} uses a <<disk-based-shard-allocation,low disk watermark>> to ensure data
-nodes have enough disk space for incoming shards. By default, {es} does not
-allocate shards to nodes using more than 85% of disk space.
-
-To check the current disk space of your nodes, use the <<cat-allocation,cat
-allocation API>>.
-
-[source,console]
-----
-GET _cat/allocation?v=true&h=node,shards,disk.*
-----
-
-If your nodes are running low on disk space, you have a few options:
-
-* Upgrade your nodes to increase disk space.
-
-* Delete unneeded indices to free up space. If you use {ilm-init}, you can
-update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
-snapshots>> or add a delete phase. If you no longer need to search the data, you
-can use a <<snapshot-restore,snapshot>> to store it off-cluster.
-
-* If you no longer write to an index, use the <<indices-forcemerge,force merge
-API>> or {ilm-init}'s <<ilm-forcemerge,force merge action>> to merge its
-segments into larger ones.
-+
-[source,console]
-----
-POST my-index/_forcemerge
-----
-// TEST[s/^/PUT my-index\n/]
-
-* If an index is read-only, use the <<indices-shrink-index,shrink index API>> or
-{ilm-init}'s <<ilm-shrink,shrink action>> to reduce its primary shard count.
-+
-[source,console]
-----
-POST my-index/_shrink/my-shrunken-index
-----
-// TEST[s/^/PUT my-index\n{"settings":{"index.number_of_shards":2,"blocks.write":true}}\n/]
-
-* If your node has a large disk capacity, you can increase the low disk
-watermark or set it to an explicit byte value.
-+
-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent": {
-    "cluster.routing.allocation.disk.watermark.low": "30gb"
-  }
-}
-----
-// TEST[s/"30gb"/null/]
-
-**Reduce JVM memory pressure**
-
-Shard allocation requires JVM heap memory. High JVM memory pressure can trigger
-<<circuit-breaker,circuit breakers>> that stop allocation and leave shards
-unassigned. See <<high-jvm-memory-pressure>>.
-
-**Recover data for a lost primary shard**
-
-If a node containing a primary shard is lost, {es} can typically replace it
-using a replica on another node. If you can't recover the node and replicas
-don't exist or are irrecoverable, you'll need to re-add the missing data from a
-<<snapshot-restore,snapshot>> or the original data source.
-
-WARNING: Only use this option if node recovery is no longer possible. This
-process allocates an empty primary shard. If the node later rejoins the cluster,
-{es} will overwrite its primary shard with data from this newer empty shard,
-resulting in data loss.
-
-Use the <<cluster-reroute,cluster reroute API>> to manually allocate the
-unassigned primary shard to another data node in the same tier. Set
-`accept_data_loss` to `true`.
-
-[source,console]
-----
-POST _cluster/reroute
-{
-  "commands": [
-    {
-      "allocate_empty_primary": {
-        "index": "my-index",
-        "shard": 0,
-        "node": "my-node",
-        "accept_data_loss": "true"
-      }
-    }
-  ]
-}
-----
-// TEST[s/^/PUT my-index\n/]
-// TEST[catch:bad_request]
-
-If you backed up the missing index data to a snapshot, use the
-<<restore-snapshot-api,restore snapshot API>> to restore the individual index.
-Alternatively, you can index the missing data from the original data source.
-
-[discrete]
-[[rejected-requests]]
-=== Rejected requests
-
+<<rejected-requests,Rejected requests>>::
 When {es} rejects a request, it stops the operation and returns an error with a
-`429` response code. Rejected requests are commonly caused by:
-
-* A <<high-cpu-usage,depleted thread pool>>. A depleted `search` or `write`
-thread pool returns a `TOO_MANY_REQUESTS` error message.
-
-* A <<circuit-breaker-errors,circuit breaker error>>.
-
-* High <<index-modules-indexing-pressure,indexing pressure>> that exceeds the
-<<memory-limits,`indexing_pressure.memory.limit`>>.
-
-[discrete]
-[[check-rejected-tasks]]
-==== Check rejected tasks
-
-To check the number of rejected tasks for each thread pool, use the
-<<cat-thread-pool,cat thread pool API>>. A high ratio of `rejected` to
-`completed` tasks, particularly in the `search` and `write` thread pools, means
-{es} regularly rejects requests.
-
-[source,console]
-----
-GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
-----
-
-[discrete]
-[[prevent-rejected-requests]]
-==== Prevent rejected requests
-
-**Fix high CPU and memory usage**
-
-If {es} regularly rejects requests and other tasks, your cluster likely has high
-CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
-<<high-jvm-memory-pressure>>.
-
-**Prevent circuit breaker errors**
-
-If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
-for tips on diagnosing and preventing them.
-
-[discrete]
-[[task-queue-backlog]]
-=== Task queue backlog
-
-A backlogged task queue can prevent tasks from completing and 
-put the cluster into an unhealthy state. 
-Resource constraints, a large number of tasks being triggered at once,
-and long running tasks can all contribute to a backlogged task queue.
-
-[discrete]
-[[diagnose-task-queue-backlog]]
-==== Diagnose a task queue backlog
-
-**Check the thread pool status**
-
-A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>. 
-
-You can use the <<cat-thread-pool,cat thread pool API>> to 
-see the number of active threads in each thread pool and
-how many tasks are queued, how many have been rejected, and how many have completed. 
-
-[source,console]
-----
-GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
-----
-
-**Inspect the hot threads on each node**
-
-If a particular thread pool queue is backed up, 
-you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API 
-to determine if the thread has sufficient 
-resources to progress and gauge how quickly it is progressing.
-
-[source,console]
-----
-GET /_nodes/hot_threads
-----
-
-**Look for long running tasks**
-
-Long-running tasks can also cause a backlog. 
-You can use the <<tasks,task management>> API to get information about the tasks that are running. 
-Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete. 
-
-[source,console]
-----
-GET /_tasks?filter_path=nodes.*.tasks
-----
-
-[discrete]
-[[resolve-task-queue-backlog]]
-==== Resolve a task queue backlog
-
-**Increase available resources** 
-
-If tasks are progressing slowly and the queue is backing up, 
-you might need to take steps to <<reduce-cpu-usage>>. 
-
-In some cases, increasing the thread pool size might help.
-For example, the `force_merge` thread pool defaults to a single thread.
-Increasing the size to 2 might help reduce a backlog of force merge requests.
-
-**Cancel stuck tasks**
-
-If you find the active task's hot thread isn't progressing and there's a backlog, 
-consider canceling the task. 
+`429` response code.
+
+<<task-queue-backlog,Task queue backlog>>::
+A backlogged task queue can prevent tasks from completing and put the cluster 
+into an unhealthy state. 
+
+include::common-issues/disk-usage-exceeded.asciidoc[]
+include::common-issues/circuit-breaker-errors.asciidoc[]
+include::common-issues/high-cpu-usage.asciidoc[]
+include::common-issues/high-jvm-memory-pressure.asciidoc[]
+include::common-issues/red-yellow-cluster-status.asciidoc[]
+include::common-issues/rejected-requests.asciidoc[]
+include::common-issues/task-queue-backlog.asciidoc[]