[[fix-common-cluster-issues]]
== Fix common cluster issues

This guide describes how to fix common errors and problems with {es} clusters.

[discrete]
=== Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block

This error indicates a data node is critically low on disk space and has reached
the <<cluster-routing-flood-stage,flood-stage disk usage watermark>>. To prevent
a full disk, when a node reaches this watermark, {es} blocks writes to any index
with a shard on the node. If the block affects related system indices, {kib} and
other {stack} features may become unavailable.

{es} will automatically remove the write block when the affected node's disk
usage goes below the <<cluster-routing-watermark-high,high disk watermark>>. To
achieve this, {es} automatically moves some of the affected node's shards to
other nodes in the same data tier.

To verify that shards are moving off the affected node, use the <<cat-shards,cat
shards API>>.

[source,console]
----
GET _cat/shards?v=true
----

If shards remain on the node, use the <<cluster-allocation-explain,cluster
allocation explanation API>> to get an explanation for their allocation status.

[source,console]
----
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "my-node"
}
----
// TEST[s/^/PUT my-index\n/]
// TEST[s/"primary": false,/"primary": false/]
// TEST[s/"current_node": "my-node"//]

To immediately restore write operations, you can temporarily increase the disk
watermarks and remove the write block.

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.low.max_headroom": "100GB",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.high.max_headroom": "20GB",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "5GB",
    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "5GB"
  }
}

PUT */_settings?expand_wildcards=all
{
  "index.blocks.read_only_allow_delete": null
}
----
// TEST[s/^/PUT my-index\n/]

As a long-term solution, we recommend you add nodes to the affected data tiers
or upgrade existing nodes to increase disk space. To free up additional disk
space, you can delete unneeded indices using the <<indices-delete-index,delete
index API>>.

[source,console]
----
DELETE my-index
----
// TEST[s/^/PUT my-index\n/]

When a long-term solution is in place, reset or reconfigure the disk watermarks.

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.low.max_headroom": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.high.max_headroom": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null,
    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": null,
    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": null,
    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": null
  }
}
----

[discrete]
[[circuit-breaker-errors]]
=== Circuit breaker errors

{es} uses <<circuit-breaker,circuit breakers>> to prevent nodes from running out
of JVM heap memory. If {es} estimates an operation would exceed a circuit
breaker, it stops the operation and returns an error.

By default, the <<parent-circuit-breaker,parent circuit breaker>> triggers at
95% JVM memory usage. To prevent errors, we recommend taking steps to reduce
memory pressure if usage consistently exceeds 85%.

[discrete]
[[diagnose-circuit-breaker-errors]]
==== Diagnose circuit breaker errors

**Error messages**

If a request triggers a circuit breaker, {es} returns an error with a `429` HTTP
status code.

[source,js]
----
{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]",
    "bytes_wanted": 123848638,
    "bytes_limit": 123273216,
    "durability": "TRANSIENT"
  },
  "status": 429
}
----
// NOTCONSOLE

{es} also writes circuit breaker errors to <<logging,`elasticsearch.log`>>. This
is helpful when automated processes, such as allocation, trigger a circuit
breaker.

[source,txt]
----
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [num/numGB], which is larger than the limit of [num/numGB], usages [request=0/0b, fielddata=num/numKB, in_flight_requests=num/numGB, accounting=num/numGB]
----

**Check JVM memory usage**

If you've enabled Stack Monitoring, you can view JVM memory usage in {kib}. In
the main menu, click **Stack Monitoring**. On the Stack Monitoring **Overview**
page, click **Nodes**. The **JVM Heap** column lists the current memory usage
for each node.

You can also use the <<cat-nodes,cat nodes API>> to get the current
`heap.percent` for each node.

[source,console]
----
GET _cat/nodes?v=true&h=name,node*,heap*
----

See <<high-jvm-memory-pressure>> for more details.

To get the JVM memory usage for each circuit breaker, use the
<<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET _nodes/stats/breaker
----

[discrete]
[[prevent-circuit-breaker-errors]]
==== Prevent circuit breaker errors

**Reduce JVM memory pressure**

High JVM memory pressure often causes circuit breaker errors. See
<<high-jvm-memory-pressure>>.

**Avoid using fielddata on `text` fields**

For high-cardinality `text` fields, fielddata can use a large amount of JVM
memory. To avoid this, {es} disables fielddata on `text` fields by default. If
you've enabled fielddata and triggered the <<fielddata-circuit-breaker,fielddata
circuit breaker>>, consider disabling it and using a `keyword` field instead.
See <<fielddata-mapping-param>>.
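
For example, if you need to aggregate on a field, you can map it as a `keyword`
instead of enabling fielddata on a `text` field. A minimal sketch, using a
hypothetical index and field name:

[source,console]
----
PUT my-new-index
{
  "mappings": {
    "properties": {
      "my-field": {
        "type": "keyword"
      }
    }
  }
}
----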

**Clear the fielddata cache**

If you've triggered the fielddata circuit breaker and can't disable fielddata,
use the <<indices-clearcache,clear cache API>> to clear the fielddata cache.
This may disrupt any in-flight searches that use fielddata.

[source,console]
----
POST _cache/clear?fielddata=true
----
// TEST[s/^/PUT my-index\n/]

[discrete]
[[high-cpu-usage]]
=== High CPU usage

{es} uses <<modules-threadpool,thread pools>> to manage CPU resources for
concurrent operations. High CPU usage typically means one or more thread pools
are running low on available threads.

If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
related to the thread pool. For example, if the `search` thread pool is
depleted, {es} will reject search requests until more threads are available.

[discrete]
[[diagnose-high-cpu-usage]]
==== Diagnose high CPU usage

**Check CPU usage**

include::{es-repo-dir}/tab-widgets/cpu-usage-widget.asciidoc[]

**Check hot threads**

If a node has high CPU usage, use the <<cluster-nodes-hot-threads,nodes hot
threads API>> to check for resource-intensive threads running on the node.

[source,console]
----
GET _nodes/my-node,my-other-node/hot_threads
----
// TEST[s/\/my-node,my-other-node//]

This API returns a breakdown of any hot threads in plain text.

[discrete]
[[reduce-cpu-usage]]
==== Reduce CPU usage

The following tips outline the most common causes of high CPU usage and their
solutions.

**Scale your cluster**

Heavy indexing and search loads can deplete smaller thread pools. To better
handle heavy workloads, add more nodes to your cluster or upgrade your existing
nodes to increase capacity.

**Spread out bulk requests**

While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
or <<search-multi-search,multi-search>> requests still require CPU resources. If
possible, submit smaller requests and allow more time between them.
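
For example, instead of sending one bulk request with thousands of documents,
you might split it into smaller batches like the following and pause between
them. The index name and documents are hypothetical.

[source,console]
----
POST my-index/_bulk
{ "index": {} }
{ "message": "first document in a small batch" }
{ "index": {} }
{ "message": "second document in a small batch" }
----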

**Cancel long-running searches**

Long-running searches can block threads in the `search` thread pool. To check
for these searches, use the <<tasks,task management API>>.

[source,console]
----
GET _tasks?actions=*search&detailed
----

The response's `description` contains the search request and its queries.
`running_time_in_nanos` shows how long the search has been running.

[source,console-result]
----
{
  "nodes" : {
    "oTUltX4IQMOUUVeiohTt8A" : {
      "name" : "my-node",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "tasks" : {
        "oTUltX4IQMOUUVeiohTt8A:464" : {
          "node" : "oTUltX4IQMOUUVeiohTt8A",
          "id" : 464,
          "type" : "transport",
          "action" : "indices:data/read/search",
          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
          "start_time_in_millis" : 4081771730000,
          "running_time_in_nanos" : 13991383,
          "cancellable" : true
        }
      }
    }
  }
}
----
// TESTRESPONSE[skip: no way to get tasks]

To cancel a search and free up resources, use the API's `_cancel` endpoint.

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
----

For additional tips on how to track and avoid resource-intensive searches, see
<<avoid-expensive-searches,Avoid expensive searches>>.

[discrete]
[[high-jvm-memory-pressure]]
=== High JVM memory pressure

High JVM memory usage can degrade cluster performance and trigger
<<circuit-breaker-errors,circuit breaker errors>>. To prevent this, we recommend
taking steps to reduce memory pressure if a node's JVM memory usage consistently
exceeds 85%.

[discrete]
[[diagnose-high-jvm-memory-pressure]]
==== Diagnose high JVM memory pressure

**Check JVM memory pressure**

include::{es-repo-dir}/tab-widgets/jvm-memory-pressure-widget.asciidoc[]

**Check garbage collection logs**

As memory usage increases, garbage collection becomes more frequent and takes
longer. You can track the frequency and length of garbage collection events in
<<logging,`elasticsearch.log`>>. For example, the following event states {es}
spent more than 50% (21 seconds) of the last 40 seconds performing garbage
collection.

[source,log]
----
[timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]
----

[discrete]
[[reduce-jvm-memory-pressure]]
==== Reduce JVM memory pressure

**Reduce your shard count**

Every shard uses memory. In most cases, a small set of large shards uses fewer
resources than many small shards. For tips on reducing your shard count, see
<<size-your-shards>>.

[[avoid-expensive-searches]]
**Avoid expensive searches**

Expensive searches can use large amounts of memory. To better track expensive
searches on your cluster, enable <<index-modules-slowlog,slow logs>>.
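
For example, you might log any search that exceeds a threshold you choose. This
sketch assumes an existing index named `my-index`; the threshold values are
illustrative.

[source,console]
----
PUT my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
----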

Expensive searches may have a large <<paginate-search-results,`size` argument>>,
use aggregations with a large number of buckets, or include
<<query-dsl-allow-expensive-queries,expensive queries>>. To prevent expensive
searches, consider the following setting changes:

* Lower the `size` limit using the
<<index-max-result-window,`index.max_result_window`>> index setting.

* Decrease the maximum number of allowed aggregation buckets using the
<<search-settings-max-buckets,`search.max_buckets`>> cluster setting.

* Disable expensive queries using the
<<query-dsl-allow-expensive-queries,`search.allow_expensive_queries`>> cluster
setting.

[source,console]
----
PUT _settings
{
  "index.max_result_window": 5000
}

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000,
    "search.allow_expensive_queries": false
  }
}
----
// TEST[s/^/PUT my-index\n/]

**Prevent mapping explosions**

Defining too many fields or nesting fields too deeply can lead to
<<mapping-limit-settings,mapping explosions>> that use large amounts of memory.
To prevent mapping explosions, use the <<mapping-settings-limit,mapping limit
settings>> to limit the number of field mappings.
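
For example, the following sketch lowers the field and depth limits for a
hypothetical index named `my-index`. The values are illustrative; choose limits
that fit your expected mappings.

[source,console]
----
PUT my-index/_settings
{
  "index.mapping.total_fields.limit": 500,
  "index.mapping.depth.limit": 5
}
----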

**Spread out bulk requests**

While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
or <<search-multi-search,multi-search>> requests can still create high JVM
memory pressure. If possible, submit smaller requests and allow more time
between them.

**Upgrade node memory**

Heavy indexing and search loads can cause high JVM memory pressure. To better
handle heavy workloads, upgrade your nodes to increase their memory capacity.

[discrete]
[[red-yellow-cluster-status]]
=== Red or yellow cluster status

A red or yellow cluster status indicates one or more shards are missing or
unallocated. These unassigned shards increase your risk of data loss and can
degrade cluster performance.

[discrete]
[[diagnose-cluster-status]]
==== Diagnose your cluster status

**Check your cluster status**

Use the <<cluster-health,cluster health API>>.

[source,console]
----
GET _cluster/health?filter_path=status,*_shards
----

A healthy cluster has a green `status` and zero `unassigned_shards`. A yellow
status means only replicas are unassigned. A red status means one or more
primary shards are unassigned.

**View unassigned shards**

To view unassigned shards, use the <<cat-shards,cat shards API>>.

[source,console]
----
GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state
----

Unassigned shards have a `state` of `UNASSIGNED`. The `prirep` value is `p` for
primary shards and `r` for replicas.

To understand why an unassigned shard is not being assigned and what action
you must take to allow {es} to assign it, use the
<<cluster-allocation-explain,cluster allocation explanation API>>.

[source,console]
----
GET _cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "my-node"
}
----
// TEST[s/^/PUT my-index\n/]
// TEST[s/"primary": false,/"primary": false/]
// TEST[s/"current_node": "my-node"//]

[discrete]
[[fix-red-yellow-cluster-status]]
==== Fix a red or yellow cluster status

A shard can become unassigned for several reasons. The following tips outline
the most common causes and their solutions.

**Re-enable shard allocation**

You typically disable allocation during a <<restart-cluster,restart>> or other
cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
be unable to assign shards. To re-enable allocation, reset the
`cluster.routing.allocation.enable` cluster setting.

[source,console]
----
PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : null
  }
}
----

**Recover lost nodes**

Shards often become unassigned when a data node leaves the cluster. This can
occur for several reasons, ranging from connectivity issues to hardware failure.
After you resolve the issue and recover the node, it will rejoin the cluster.
{es} will then automatically allocate any unassigned shards.

To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
allocation>> by one minute by default. If you've recovered a node and don't want
to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
API>> with no arguments to start the allocation process. The process runs
asynchronously in the background.

[source,console]
----
POST _cluster/reroute?metric=none
----

**Fix allocation settings**

Misconfigured allocation settings can result in an unassigned primary shard.
These settings include:

* <<shard-allocation-filtering,Shard allocation>> index settings
* <<cluster-shard-allocation-filtering,Allocation filtering>> cluster settings
* <<shard-allocation-awareness,Allocation awareness>> cluster settings

To review your allocation settings, use the <<indices-get-settings,get index
settings>> and <<cluster-get-settings,cluster get settings>> APIs.

[source,console]
----
GET my-index/_settings?flat_settings=true&include_defaults=true

GET _cluster/settings?flat_settings=true&include_defaults=true
----
// TEST[s/^/PUT my-index\n/]

You can change the settings using the <<indices-update-settings,update index
settings>> and <<cluster-update-settings,cluster update settings>> APIs.
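
For example, if the cluster allocation explanation API reports that a leftover
allocation filter is blocking assignment, you can reset the filter. This sketch
assumes a hypothetical `index.routing.allocation.require._name` filter on
`my-index`.

[source,console]
----
PUT my-index/_settings
{
  "index.routing.allocation.require._name": null
}
----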

**Allocate or reduce replicas**

To protect against hardware failure, {es} will not assign a replica to the same
node as its primary shard. If no other data nodes are available to host the
replica, it remains unassigned. To fix this, you can:

* Add a data node to the same tier to host the replica.

* Change the `index.number_of_replicas` index setting to reduce the number of
replicas for each primary shard. We recommend keeping at least one replica per
primary.

[source,console]
----
PUT _settings
{
  "index.number_of_replicas": 1
}
----
// TEST[s/^/PUT my-index\n/]

**Free up or increase disk space**

{es} uses a <<disk-based-shard-allocation,low disk watermark>> to ensure data
nodes have enough disk space for incoming shards. By default, {es} does not
allocate shards to nodes using more than 85% of disk space.

To check the current disk space of your nodes, use the <<cat-allocation,cat
allocation API>>.

[source,console]
----
GET _cat/allocation?v=true&h=node,shards,disk.*
----

If your nodes are running low on disk space, you have a few options:

* Upgrade your nodes to increase disk space.

* Delete unneeded indices to free up space. If you use {ilm-init}, you can
update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
snapshots>> or add a delete phase. If you no longer need to search the data, you
can use a <<snapshot-restore,snapshot>> to store it off-cluster.

* If you no longer write to an index, use the <<indices-forcemerge,force merge
API>> or {ilm-init}'s <<ilm-forcemerge,force merge action>> to merge its
segments into larger ones.
+
[source,console]
----
POST my-index/_forcemerge
----
// TEST[s/^/PUT my-index\n/]

* If an index is read-only, use the <<indices-shrink-index,shrink index API>> or
{ilm-init}'s <<ilm-shrink,shrink action>> to reduce its primary shard count.
+
[source,console]
----
POST my-index/_shrink/my-shrunken-index
----
// TEST[s/^/PUT my-index\n{"settings":{"index.number_of_shards":2,"blocks.write":true}}\n/]

* If your node has a large disk capacity, you can increase the low disk
watermark or set it to an explicit byte value.
+
[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "30gb"
  }
}
----
// TEST[s/"30gb"/null/]

**Reduce JVM memory pressure**

Shard allocation requires JVM heap memory. High JVM memory pressure can trigger
<<circuit-breaker,circuit breakers>> that stop allocation and leave shards
unassigned. See <<high-jvm-memory-pressure>>.

**Recover data for a lost primary shard**

If a node containing a primary shard is lost, {es} can typically replace it
using a replica on another node. If you can't recover the node and replicas
don't exist or are irrecoverable, you'll need to re-add the missing data from a
<<snapshot-restore,snapshot>> or the original data source.

WARNING: Only use this option if node recovery is no longer possible. This
process allocates an empty primary shard. If the node later rejoins the cluster,
{es} will overwrite its primary shard with data from this newer empty shard,
resulting in data loss.

Use the <<cluster-reroute,cluster reroute API>> to manually allocate the
unassigned primary shard to another data node in the same tier. Set
`accept_data_loss` to `true`.

[source,console]
----
POST _cluster/reroute?metric=none
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my-index",
        "shard": 0,
        "node": "my-node",
        "accept_data_loss": "true"
      }
    }
  ]
}
----
// TEST[s/^/PUT my-index\n/]
// TEST[catch:bad_request]

If you backed up the missing index data to a snapshot, use the
<<restore-snapshot-api,restore snapshot API>> to restore the individual index.
Alternatively, you can index the missing data from the original data source.
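
A minimal sketch of the restore call, assuming a registered snapshot repository
named `my-repository` that contains a snapshot named `my-snapshot`. Delete or
close the existing index before restoring it.

[source,console]
----
POST _snapshot/my-repository/my-snapshot/_restore
{
  "indices": "my-index"
}
----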

[discrete]
[[rejected-requests]]
=== Rejected requests

When {es} rejects a request, it stops the operation and returns an error with a
`429` response code. Rejected requests are commonly caused by:

* A <<high-cpu-usage,depleted thread pool>>. A depleted `search` or `write`
thread pool returns a `TOO_MANY_REQUESTS` error message.

* A <<circuit-breaker-errors,circuit breaker error>>.

* High <<index-modules-indexing-pressure,indexing pressure>> that exceeds the
<<memory-limits,`indexing_pressure.memory.limit`>>.

[discrete]
[[check-rejected-tasks]]
==== Check rejected tasks

To check the number of rejected tasks for each thread pool, use the
<<cat-thread-pool,cat thread pool API>>. A high ratio of `rejected` to
`completed` tasks, particularly in the `search` and `write` thread pools, means
{es} regularly rejects requests.

[source,console]
----
GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
----

[discrete]
[[prevent-rejected-requests]]
==== Prevent rejected requests

**Fix high CPU and memory usage**

If {es} regularly rejects requests and other tasks, your cluster likely has high
CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
<<high-jvm-memory-pressure>>.

**Prevent circuit breaker errors**

If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
for tips on diagnosing and preventing them.

[discrete]
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and put the cluster
into an unhealthy state. Resource constraints, a large number of tasks being
triggered at once, and long-running tasks can all contribute to a backlogged
task queue.

[discrete]
[[diagnose-task-queue-backlog]]
==== Diagnose a task queue backlog

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in
<<rejected-requests,rejected requests>>. You can use the <<cat-thread-pool,cat
thread pool API>> to see the number of active threads in each thread pool and
how many tasks are queued, how many have been rejected, and how many have
completed.

[source,console]
----
GET /_cat/thread_pool?v=true&s=t,n&h=type,name,node_name,active,queue,rejected,completed
----

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up, you can periodically poll the
<<cluster-nodes-hot-threads,nodes hot threads API>> to determine whether the
thread has sufficient resources to progress and to gauge how quickly it is
progressing.

[source,console]
----
GET /_nodes/hot_threads
----

**Look for long-running tasks**

Long-running tasks can also cause a backlog. You can use the <<tasks,task
management>> API to get information about the tasks that are running. Check the
`running_time_in_nanos` to identify tasks that are taking an excessive amount of
time to complete.

[source,console]
----
GET /_tasks?filter_path=nodes.*.tasks
----

[discrete]
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog

**Increase available resources**

If tasks are progressing slowly and the queue is backing up, you might need to
take steps to <<reduce-cpu-usage,reduce CPU usage>>. In some cases, increasing
the thread pool size might help. For example, the `force_merge` thread pool
defaults to a single thread. Increasing the size to 2 might help reduce a
backlog of force merge requests.
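
Thread pool sizes are static node settings, so a change like the following
sketch belongs in `elasticsearch.yml` and requires a node restart. The value is
illustrative.

[source,yaml]
----
thread_pool.force_merge.size: 2
----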

**Cancel stuck tasks**

If you find the active task's hot thread isn't progressing and there's a
backlog, consider canceling the task.
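
To cancel a task, pass its ID to the task management API's `_cancel` endpoint.
The ID below reuses the hypothetical task from the earlier example; get real IDs
from the task list.

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
----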
|