
(Docs+) Flush out Resource+Task troubleshooting (#111773) (#112818)

* (Docs+) Flush out Resource+Task troubleshooting

---------

Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
Co-authored-by: David Turner <david.turner@elastic.co>
Stef Nestor 1 year ago
commit d039c280af

+ 2 - 1
docs/reference/modules/indices/circuit_breaker.asciidoc

@@ -175,7 +175,8 @@ an `OutOfMemory` exception which would bring down the node.
 To prevent this from happening, a special <<circuit-breaker, circuit breaker>> is used,
 which limits the memory allocation during the execution of a <<eql-sequences, sequence>>
 query. When the breaker is triggered, an `org.elasticsearch.common.breaker.CircuitBreakingException`
-is thrown and a descriptive error message is returned to the user.
+is thrown and a descriptive error message including `circuit_breaking_exception`
+is returned to the user.
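+
+For illustration only, an abbreviated error response might look similar to the
+following; the exact `reason` text depends on the tripped breaker and the
+configured limits:
+
+[source,js]
+----
+{
+  "error": {
+    "type": "circuit_breaking_exception",
+    "reason": "[eql_sequence] Data too large, data for [...] would be larger than the limit of [...]"
+  },
+  "status": 429
+}
+----
+// NOTCONSOLE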
 
 This <<circuit-breaker, circuit breaker>> can be configured using the following settings:
 

+ 12 - 22
docs/reference/tab-widgets/cpu-usage.asciidoc

@@ -1,30 +1,20 @@
 // tag::cloud[]
-From your deployment menu, click **Performance**. The page's **CPU Usage** chart
-shows your deployment's CPU usage as a percentage.
+* (Recommended) Enable {cloud}/ec-monitoring-setup.html[logs and metrics]. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page. 
++
+You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
 
-High CPU usage can also deplete your CPU credits. CPU credits let {ess} provide
-smaller clusters with a performance boost when needed. The **CPU credits**
-chart shows your remaining CPU credits, measured in seconds of CPU time.
+* From your deployment menu, view the {cloud}/ec-saas-metrics-accessing.html[**Performance**] page. On this page, you can check two key metrics:
+** **CPU usage**: Your deployment's CPU usage, represented as a percentage.
+** **CPU credits**: Your remaining CPU credits, measured in seconds of CPU time.
 
-You can also use the <<cat-nodes,cat nodes API>> to get the current CPU usage
-for each node.
-
-// tag::cpu-usage-cat-nodes[]
-[source,console]
-----
-GET _cat/nodes?v=true&s=cpu:desc
-----
-
-The response's `cpu` column contains the current CPU usage as a percentage. The
-`name` column contains the node's name.
-// end::cpu-usage-cat-nodes[]
+{ess} grants {cloud}/ec-vcpu-boost-instance.html[CPU credits] per deployment
+to provide smaller clusters with performance boosts when needed. High CPU
+usage can deplete these credits, which might lead to {cloud}/ec-scenario_why_is_performance_degrading_over_time.html[performance degradation] and {cloud}/ec-scenario_why_are_my_cluster_response_times_suddenly_so_much_worse.html[increased cluster response times].
 
 // end::cloud[]
 
 // tag::self-managed[]
-
-Use the <<cat-nodes,cat nodes API>> to get the current CPU usage for each node.
-
-include::cpu-usage.asciidoc[tag=cpu-usage-cat-nodes]
-
+* Enable <<monitoring-overview,{es} monitoring>>. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page.
++
+You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
 // end::self-managed[]

+ 21 - 2
docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc

@@ -9,12 +9,29 @@ If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
 related to the thread pool. For example, if the `search` thread pool is
 depleted, {es} will reject search requests until more threads are available.
 
+You might experience high CPU usage if a <<data-tiers,data tier>>, and therefore the nodes assigned to it, is handling more traffic than other tiers. This imbalance in resource utilization is also known as <<hotspotting,hot spotting>>.
+
 [discrete]
 [[diagnose-high-cpu-usage]]
 ==== Diagnose high CPU usage
 
 **Check CPU usage**
 
+You can check the CPU usage per node using the <<cat-nodes,cat nodes API>>:
+
+// tag::cpu-usage-cat-nodes[]
+[source,console]
+----
+GET _cat/nodes?v=true&s=cpu:desc
+----
+
+The response's `cpu` column contains the current CPU usage as a percentage.
+The `name` column contains the node's name. Elevated but transient CPU usage is
+normal. However, if CPU usage is elevated for an extended duration, it should be
+investigated.
+
+To track CPU usage over time, we recommend enabling monitoring:
+
 include::{es-ref-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
 
 **Check hot threads**
@@ -24,11 +41,13 @@ threads API>> to check for resource-intensive threads running on the node.
 
 [source,console]
 ----
-GET _nodes/my-node,my-other-node/hot_threads
+GET _nodes/hot_threads
 ----
 // TEST[s/\/my-node,my-other-node//]
 
-This API returns a breakdown of any hot threads in plain text.
+This API returns a breakdown of any hot threads in plain text. High CPU usage
+frequently correlates with <<task-queue-backlog,a long-running task or a
+backlog of tasks>>.
 
 [discrete]
 [[reduce-cpu-usage]]

+ 45 - 7
docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc

@@ -23,9 +23,52 @@ To check the number of rejected tasks for each thread pool, use the
 
 [source,console]
 ----
-GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
+GET /_cat/thread_pool?v=true&h=id,name,queue,active,rejected,completed
 ----
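+
+For illustration only, a truncated response might resemble the following,
+where the `write` thread pool has a non-empty queue and an elevated number of
+rejections:
+
+[source,txt]
+----
+id                     name    queue active rejected completed
+0EWUhXeBQtaVGlexUeVwMg search      0      2        0     24906
+0EWUhXeBQtaVGlexUeVwMg write     100      8     1024     52341
+----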
 
+`write` thread pool rejections frequently surface in the response of the
+failing request and in the correlated {es} logs as an
+`EsRejectedExecutionException` that mentions either
+`QueueResizingEsThreadPoolExecutor` or `queue capacity`.
+
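+For illustration only, such a rejection might be reported in an error response
+similar to the following; the exact wording depends on the thread pool and
+queue configuration:
+
+[source,js]
+----
+{
+  "error": {
+    "type": "es_rejected_execution_exception",
+    "reason": "rejected execution of ... on QueueResizingEsThreadPoolExecutor[name = .../write, queue capacity = 10000, ...]"
+  },
+  "status": 429
+}
+----
+// NOTCONSOLE
+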
+These errors are often related to <<task-queue-backlog,backlogged tasks>>.
+
+[discrete]
+[[check-circuit-breakers]]
+==== Check circuit breakers
+
+To check the number of tripped <<circuit-breaker,circuit breakers>>, use the
+<<cluster-nodes-stats,node stats API>>.
+
+[source,console]
+----
+GET /_nodes/stats/breaker
+----
+
+These statistics are cumulative from node startup. For more information, see
+<<circuit-breaker,circuit breaker errors>>.
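+
+For illustration only, a truncated response might look similar to the
+following; a non-zero `tripped` counter means the corresponding breaker has
+been triggered since the node started:
+
+[source,js]
+----
+{
+  "nodes": {
+    "<node_id>": {
+      "breakers": {
+        "parent": {
+          "limit_size_in_bytes": 518979584,
+          "estimated_size_in_bytes": 123456789,
+          "overhead": 1.0,
+          "tripped": 3
+        }
+      }
+    }
+  }
+}
+----
+// NOTCONSOLE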
+
+[discrete]
+[[check-indexing-pressure]]
+==== Check indexing pressure
+
+To check the number of <<index-modules-indexing-pressure,indexing pressure>>
+rejections, use the <<cluster-nodes-stats,node stats API>>.
+
+[source,console]
+----
+GET _nodes/stats?human&filter_path=nodes.*.indexing_pressure
+----
+
+These statistics are cumulative from node startup.
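+
+For illustration only, a truncated response might look similar to the
+following; non-zero `*_rejections` counters indicate that requests have been
+rejected because of indexing pressure:
+
+[source,js]
+----
+{
+  "nodes": {
+    "<node_id>": {
+      "indexing_pressure": {
+        "memory": {
+          "total": {
+            "coordinating_rejections": 8,
+            "primary_rejections": 0,
+            "replica_rejections": 0
+          }
+        }
+      }
+    }
+  }
+}
+----
+// NOTCONSOLE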
+
+Indexing pressure rejections appear as an `EsRejectedExecutionException` and
+indicate that the request was rejected due to `coordinating_and_primary_bytes`,
+`coordinating`, `primary`, or `replica` limits.
+
+These errors are often related to <<task-queue-backlog,backlogged tasks>>,
+<<docs-bulk,bulk index>> sizing, or the ingest target's
+<<index-modules,`refresh_interval` setting>>.
+
 [discrete]
 [[prevent-rejected-requests]]
 ==== Prevent rejected requests
@@ -34,9 +77,4 @@ GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
 
 If {es} regularly rejects requests and other tasks, your cluster likely has high
 CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
-<<high-jvm-memory-pressure>>.
-
-**Prevent circuit breaker errors**
-
-If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
-for tips on diagnosing and preventing them.
+<<high-jvm-memory-pressure>>.

+ 55 - 17
docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc

@@ -1,10 +1,10 @@
 [[task-queue-backlog]]
 === Task queue backlog
 
-A backlogged task queue can prevent tasks from completing and 
-put the cluster into an unhealthy state. 
-Resource constraints, a large number of tasks being triggered at once,
-and long running tasks can all contribute to a backlogged task queue.
+A backlogged task queue can prevent tasks from completing and put the cluster
+into an unhealthy state. Resource constraints, a large number of tasks being
+triggered at once, and long running tasks can all contribute to a backlogged
+task queue.
 
 [discrete]
 [[diagnose-task-queue-backlog]]
@@ -12,39 +12,77 @@ and long running tasks can all contribute to a backlogged task queue.
 
 **Check the thread pool status**
 
-A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>. 
+A <<high-cpu-usage,depleted thread pool>> can result in
+<<rejected-requests,rejected requests>>. 
 
-You can use the <<cat-thread-pool,cat thread pool API>> to 
-see the number of active threads in each thread pool and
-how many tasks are queued, how many have been rejected, and how many have completed. 
+Thread pool depletion might be restricted to a specific <<data-tiers,data tier>>. If <<hotspotting,hot spotting>> is occurring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.
+
+You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
+active threads in each thread pool and how many tasks are queued, how many
+have been rejected, and how many have completed.
 
 [source,console]
 ----
 GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
 ----
 
+The `active` and `queue` statistics are instantaneous while the `rejected` and
+`completed` statistics are cumulative from node startup.
+
 **Inspect the hot threads on each node**
 
-If a particular thread pool queue is backed up, 
-you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API 
-to determine if the thread has sufficient 
-resources to progress and gauge how quickly it is progressing.
+If a particular thread pool queue is backed up, you can periodically poll the
+<<cluster-nodes-hot-threads,Nodes hot threads>> API to determine if the thread
+has sufficient resources to progress and gauge how quickly it is progressing.
 
 [source,console]
 ----
 GET /_nodes/hot_threads
 ----
 
-**Look for long running tasks**
+**Look for long running node tasks**
+
+Long-running tasks can also cause a backlog. You can use the <<tasks,task
+management>> API to get information about the node tasks that are running.
+Check the `running_time_in_nanos` to identify tasks that are taking an
+excessive amount of time to complete.
+
+[source,console]
+----
+GET /_tasks?pretty=true&human=true&detailed=true
+----
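+
+For illustration only, each task entry in the response resembles the following
+truncated example; compare `running_time_in_nanos` (or the human-readable
+`running_time`) across tasks to spot outliers:
+
+[source,js]
+----
+{
+  "node": "<node_id>",
+  "id": 12345,
+  "type": "transport",
+  "action": "indices:data/write/bulk",
+  "start_time_in_millis": 1718208000000,
+  "running_time_in_nanos": 13991383,
+  "running_time": "13.9ms",
+  "cancellable": false
+}
+----
+// NOTCONSOLE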
 
-Long-running tasks can also cause a backlog. 
-You can use the <<tasks,task management>> API to get information about the tasks that are running. 
-Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete. 
+If a particular `action` is suspected, you can filter the tasks further. The most common long-running tasks are <<docs-bulk,bulk index>>- or search-related.
 
+* Filter for <<docs-bulk,bulk index>> actions:
++
 [source,console]
 ----
-GET /_tasks?filter_path=nodes.*.tasks
+GET /_tasks?human&detailed&actions=indices:data/write/bulk
+----
+
+* Filter for search actions:
++
+[source,console]
 ----
+GET /_tasks?human&detailed&actions=indices:data/read/search
+----
+
+The API response may contain additional task fields, including `description` and `header`, which provide the task's parameters, target, and requestor. You can use this information for further diagnosis.
+
+**Look for long running cluster tasks**
+
+A task backlog might also appear as a delay in synchronizing the cluster state.
+You can use the <<cluster-pending,cluster pending tasks API>> to get
+information about pending cluster state update tasks.
+
+[source,console]
+----
+GET /_cluster/pending_tasks
+----
+
+Check the `time_in_queue_millis` value (or the human-readable `time_in_queue`)
+to identify tasks that are taking an excessive amount of time to complete.
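+
+For illustration only, a pending task entry might look similar to the
+following:
+
+[source,js]
+----
+{
+  "tasks": [
+    {
+      "insert_order": 101,
+      "priority": "URGENT",
+      "source": "create-index [my-index], cause [api]",
+      "time_in_queue_millis": 86,
+      "time_in_queue": "86ms"
+    }
+  ]
+}
+----
+// NOTCONSOLE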
 
 [discrete]
 [[resolve-task-queue-backlog]]