
  1. [[task-queue-backlog]]
  2. === Task queue backlog
  3. A backlogged task queue can prevent tasks from completing and put the cluster
  4. into an unhealthy state. Resource constraints, a large number of tasks being
  5. triggered at once, and long-running tasks can all contribute to a backlogged
  6. task queue.
  7. [discrete]
  8. [[diagnose-task-queue-backlog]]
  9. ==== Diagnose a task queue backlog
  10. **Check the thread pool status**
  11. A <<high-cpu-usage,depleted thread pool>> can result in
  12. <<rejected-requests,rejected requests>>.
  13. Thread pool depletion might be restricted to a specific <<data-tiers,data tier>>. If <<hotspotting,hot spotting>> is occurring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.
  14. You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
  15. active threads in each thread pool and how many tasks are queued, how many
  16. have been rejected, and how many have completed.
  17. [source,console]
  18. ----
  19. GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
  20. ----
  21. The `active` and `queue` statistics are instantaneous, while the `rejected` and
  22. `completed` statistics are cumulative from node startup.
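
To narrow the output to the pools that are backing up, you can name them in the request path. The following is a minimal sketch that assumes the `write` and `search` pools are the ones of interest; substitute the pool names that showed a non-zero `queue` in the previous response.

[source,console]
----
# Assumption: `write` and `search` are the pools under investigation.
GET /_cat/thread_pool/write,search?v&s=node_name,name&h=node_name,name,active,queue,rejected,completed
----

Because `rejected` is cumulative, compare the value across two polls: a count that keeps rising indicates that the node is still rejecting work.
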
  23. **Inspect the hot threads on each node**
  24. If a particular thread pool queue is backed up, you can periodically poll the
  25. <<cluster-nodes-hot-threads,Nodes hot threads>> API to determine whether the pool's
  26. threads have sufficient resources to progress and to gauge how quickly they are progressing.
  27. [source,console]
  28. ----
  29. GET /_nodes/hot_threads
  30. ----
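
The hot threads request can be scoped to a single node and tuned with query parameters. The sketch below assumes a node named `my-node` (a hypothetical name; use a node that showed a growing `queue`) and asks for the five busiest threads sampled over a one-second interval.

[source,console]
----
# Assumption: `my-node` is a placeholder node name.
GET /_nodes/my-node/hot_threads?threads=5&interval=1s
----

Running the same request several times and comparing the stack traces can show whether a thread is stuck in the same place or moving through its work.
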
  31. **Look for long-running node tasks**
  32. Long-running tasks can also cause a backlog. You can use the <<tasks,task
  33. management>> API to get information about the node tasks that are running.
  34. Check the `running_time_in_nanos` to identify tasks that are taking an
  35. excessive amount of time to complete.
  36. [source,console]
  37. ----
  38. GET /_tasks?pretty=true&human=true&detailed=true
  39. ----
  40. If a particular `action` is suspected, you can filter the tasks further. The most common long-running tasks are <<docs-bulk,bulk index>>- or search-related.
  41. * Filter for <<docs-bulk,bulk index>> actions:
  42. +
  43. [source,console]
  44. ----
  45. GET /_tasks?human&detailed&actions=indices:data/write/bulk
  46. ----
  47. * Filter for search actions:
  48. +
  49. [source,console]
  50. ----
  51. GET /_tasks?human&detailed&actions=indices:data/read/search
  52. ----
  53. The API response may contain additional task fields, including `description` and `headers`, which provide the task parameters, target, and requestor. You can use this information to perform further diagnosis.
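
Once a suspicious task has been identified, its details can be retrieved by task ID. The task ID below is hypothetical; use an ID from the task management API response, in the form `<node_id>:<task_number>`.

[source,console]
----
# Assumption: placeholder task ID.
GET /_tasks/oTUltX4IQMOUUVeiohTt8A:12345
----
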
  54. **Look for long-running cluster tasks**
  55. A task backlog might also appear as a delay in synchronizing the cluster state. You
  56. can use the <<cluster-pending,cluster pending tasks API>> to get information
  57. about the cluster state update tasks that are waiting to run.
  58. [source,console]
  59. ----
  60. GET /_cluster/pending_tasks
  61. ----
  62. Check the `time_in_queue` value of each task (`timeInQueue` in the cat pending
  63. tasks API) to identify tasks that have been waiting an excessive amount of time.
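
The same queue is also available in tabular form through the cat pending tasks API, which can be easier to scan when many tasks are waiting. As a sketch:

[source,console]
----
# Tabular view of pending cluster-level tasks, with column headers.
GET /_cat/pending_tasks?v
----

The response lists each pending task's `insertOrder`, `timeInQueue`, `priority`, and `source`.
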
  64. [discrete]
  65. [[resolve-task-queue-backlog]]
  66. ==== Resolve a task queue backlog
  67. **Increase available resources**
  68. If tasks are progressing slowly and the queue is backing up,
  69. you might need to take steps to <<reduce-cpu-usage>>.
  70. In some cases, increasing the thread pool size might help.
  71. For example, the `force_merge` thread pool defaults to a single thread.
  72. Increasing the size to 2 might help reduce a backlog of force merge requests.
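
Before changing a pool's size, it can help to confirm its current configuration. The following sketch uses the cat thread pool API to show the configured size of the `force_merge` pool on each node; the column selection is one reasonable choice, not a required set.

[source,console]
----
# Show the configured thread count and current load of the force_merge pool.
GET /_cat/thread_pool/force_merge?v&h=node_name,name,type,size,active,queue,rejected
----

Thread pool sizes are static node settings, so a change such as `thread_pool.force_merge.size: 2` would be made in `elasticsearch.yml` and takes effect after the node restarts.
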
  73. **Cancel stuck tasks**
  74. If you find the active task's hot thread isn't progressing and there's a backlog,
  75. consider canceling the task.
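
As a sketch, a task can be canceled through the task management API by its ID. The ID below is hypothetical; take the real one from the `GET /_tasks` response, and note that only tasks reported with `"cancellable": true` can be canceled.

[source,console]
----
# Assumption: placeholder task ID.
POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----

Cancellation is cooperative, so the task may keep running for a short time until it next checks whether it has been canceled.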