
Add technical docs on diagnosing instability etc (#85074)

Copies some internal troubleshooting docs to the reference manual for
wider use.

Co-authored-by: James Rodewig <james.rodewig@gmail.com>
David Turner 3 years ago
commit 6a273886e9

+ 52 - 0
docs/reference/modules/discovery/discovery.asciidoc

@@ -27,6 +27,58 @@ enough masterless master-eligible nodes to complete an election. If neither of
 these occur quickly enough then the node will retry after
 `discovery.find_peers_interval` which defaults to `1s`.
 
+Once a master is elected, it will normally remain as the elected master until
+it is deliberately stopped. It may also stop acting as the master if
+<<cluster-fault-detection,fault detection>> determines the cluster to be
+faulty. When a node stops being the elected master, it begins the discovery
+process again.
+
+[[modules-discovery-troubleshooting]]
+==== Troubleshooting discovery
+
+In most cases, the discovery process completes quickly, and the master node
+remains elected for a long period of time. If the cluster has no master for
+more than a few seconds or the master is unstable, the logs for each node will
+contain information explaining why:
+
+* All nodes repeatedly log messages, via a logger called
+`org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`,
+indicating that a master cannot be discovered or elected. By default, these
+messages are logged every 10 seconds.
+
+* If a node wins the election, it logs a message containing
+`elected-as-master`. If this happens repeatedly, the master node is unstable.
+
+* When a node discovers the master or believes the master to have failed, it
+logs a message containing `master node changed`.
+
+* If a node is unable to discover or elect a master for several minutes, it
+starts to report additional details about the failures in its logs. Be sure to
+capture log messages covering at least five minutes of discovery problems.
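+
+As a first pass, you can scan each node's logs for these messages. The
+following sketch assumes a package-based install whose logs are under
+`/var/log/elasticsearch`; adjust the path and pattern to suit your deployment.
+
+[source,sh]
+----
+# Hypothetical log location; adjust to match your deployment.
+grep -E "ClusterFormationFailureHelper|elected-as-master|master node changed" \
+  /var/log/elasticsearch/*.log
+----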
+
+If your cluster doesn't have a stable master, many of its features won't work
+correctly. The cluster may report many kinds of errors to clients and in its
+logs. You must fix the master node's instability first: none of these other
+issues can be resolved while the master node is unstable.
+
+The logs from the `ClusterFormationFailureHelper` may indicate that a master
+election requires a certain set of nodes and that it has not discovered enough
+nodes to form a quorum. If so, you must address the reason preventing {es} from
+discovering the missing nodes. The missing nodes are needed to reconstruct the
+cluster metadata. Without the cluster metadata, the data in your cluster is
+meaningless. The cluster metadata is stored on a subset of the master-eligible
+nodes in the cluster. If a quorum cannot be discovered then the missing nodes
+were the ones holding the cluster metadata. If you cannot bring the missing
+nodes back into the cluster, start a new cluster and restore data from a recent
+snapshot. Refer to <<modules-discovery-quorums>> for more information.
+
+The logs from the `ClusterFormationFailureHelper` may also indicate that it has
+discovered a possible quorum of master-eligible nodes. If so, the usual reason
+that the cluster cannot elect a master is that one of the other nodes cannot
+discover a quorum. Inspect the logs on the other master-eligible nodes and
+ensure that every node has discovered a quorum.
+
 [[built-in-hosts-providers]]
 ==== Seed hosts providers
 

+ 314 - 5
docs/reference/modules/discovery/fault-detection.asciidoc

@@ -26,8 +26,317 @@ its data path is unhealthy then it is removed from the cluster until the data
 path recovers. You can control this behavior with the
 <<modules-discovery-settings,`monitor.fs.health` settings>>.
 
-[[cluster-fault-detection-cluster-state-publishing]]
-The elected master node will also remove nodes from the cluster if nodes are unable
-to apply an updated cluster state within a reasonable time, which is by default
-120s after the cluster state update started. See
-<<cluster-state-publishing, cluster state publishing>> for a more detailed description.
+[[cluster-fault-detection-cluster-state-publishing]] The elected master node
+will also remove nodes from the cluster if nodes are unable to apply an updated
+cluster state within a reasonable time. The timeout defaults to 2 minutes
+starting from the beginning of the cluster state update. Refer to
+<<cluster-state-publishing>> for a more detailed description.
+
+[[cluster-fault-detection-troubleshooting]]
+==== Troubleshooting an unstable cluster
+
+Normally, a node will only leave a cluster if deliberately shut down. If a node
+leaves the cluster unexpectedly, it's important to address the cause. A cluster
+in which nodes leave unexpectedly is unstable and can create several issues.
+For instance:
+
+* The cluster health may be yellow or red.
+
+* Some shards will be initializing and other shards may be failing.
+
+* Search, indexing, and monitoring operations may fail and report exceptions in
+logs.
+
+* The `.security` index may be unavailable, blocking access to the cluster.
+
+* The master may appear busy due to frequent cluster state updates.
+
+To troubleshoot a cluster in this state, first ensure the cluster has a
+<<modules-discovery-troubleshooting,stable master>>. Next, investigate the
+nodes that are unexpectedly leaving the cluster before addressing anything
+else: other issues cannot be solved until the cluster has a stable master node
+and stable node membership.
+
+Diagnostics and statistics are usually not useful in an unstable cluster. These
+tools only offer a view of the state of the cluster at a single point in time.
+Instead, look at the cluster logs to see the pattern of behaviour over time.
+Focus particularly on logs from the elected master. When a node leaves the
+cluster, logs for the elected master include a message like this (with line
+breaks added for clarity):
+
+[source,text]
+----
+[2022-03-21T11:02:35,513][INFO ][o.e.c.s.MasterService    ]
+    [instance-0000000000] node-left[
+        {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
+            reason: disconnected,
+        {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
+            reason: disconnected
+        ], term: 14, version: 1653415, ...
+----
+
+This message says that the `MasterService` on the elected master
+(`instance-0000000000`) is processing a `node-left` task. It lists the nodes
+that are being removed and the reasons for their removal. Other nodes may log
+similar messages, but report fewer details:
+
+[source,text]
+----
+[2020-01-29T11:02:36,985][INFO ][o.e.c.s.ClusterApplierService]
+    [instance-0000000001] removed {
+        {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
+        {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
+    }, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
+----
+
+Focus on the message from the `MasterService`, which is only emitted on the
+elected master and contains more details. If you don't see the messages from
+the `MasterService`, check that:
+
+* You're looking at the logs for the elected master node.
+
+* The logs cover the correct time period.
+
+* Logging is enabled at `INFO` level.
+
+Nodes will also log a message containing `master node changed` whenever they
+start or stop following the elected master. You can use these messages to
+determine each node's view of the state of the master over time.
+
+If a node restarts, it will leave the cluster and then join the cluster again.
+When it rejoins, the `MasterService` will log that it is processing a
+`node-join` task. You can tell from the master logs that the node was restarted
+because the `node-join` message will indicate that it is
+`joining after restart`. In older {es} versions, you can also determine that a
+node restarted by looking at the second "ephemeral" ID in the `node-left` and
+subsequent `node-join` messages. This ephemeral ID is different each time the
+node starts up. If a node is unexpectedly restarting, you'll need to look at
+the node's logs to see why it is shutting down.
+
+If the node did not restart then you should look at the reason for its
+departure, which is reported for each node in the `node-left` message. There
+are three possible reasons:
+
+* `disconnected`: The connection from the master node to the removed node was
+closed.
+
+* `lagging`: The master published a cluster state update, but the removed node
+did not apply it within the permitted timeout. By default, this timeout is 2
+minutes. Refer to <<modules-discovery-settings>> for information about the
+settings which control this mechanism.
+
+* `followers check retry count exceeded`: The master sent a number of
+consecutive health checks to the removed node. These checks were rejected or
+timed out. By default, each health check times out after 10 seconds and {es}
+removes the node after three consecutive failed health checks. Refer to
+<<modules-discovery-settings>> for information about the settings which
+control this mechanism.
+
+===== Diagnosing `disconnected` nodes
+
+{es} is designed to run on a fairly reliable network. It opens a number of TCP
+connections between nodes and expects these connections to remain open forever.
+If a connection is closed then {es} will try to reconnect, so the occasional
+blip should have limited impact on the cluster even if the affected node
+briefly leaves the cluster. In contrast, repeatedly-dropped connections will
+severely affect its operation.
+
+The connections from the elected master node to every other node in the cluster
+are particularly important. The elected master never spontaneously closes its
+outbound connections to other nodes. Similarly, once a connection is fully
+established, a node never spontaneously closes its inbound connections unless
+the node is shutting down.
+
+If you see a node unexpectedly leave the cluster with the `disconnected`
+reason, something other than {es} likely caused the connection to close. A
+common cause is a misconfigured firewall with an improper timeout or another
+policy that's <<long-lived-connections,incompatible with {es}>>. It could also
+be caused by general connectivity issues, such as packet loss due to faulty
+hardware or network congestion. If you're an advanced user, you can get more
+detailed information about network exceptions by configuring the following
+loggers:
+
+[source,yaml]
+----
+logger.org.elasticsearch.transport.TcpTransport: DEBUG
+logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
+----
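+
+These loggers can also be enabled dynamically, without restarting any nodes,
+using the cluster update settings API. The following `curl` sketch assumes the
+cluster is reachable at `localhost:9200`; add whatever credentials and TLS
+options your deployment requires, and set the values back to `null` afterwards
+to restore the default log levels.
+
+[source,sh]
+----
+# Assumes the cluster is reachable at localhost:9200; adjust the URL and
+# add credentials/TLS options as needed for your deployment.
+curl -X PUT "localhost:9200/_cluster/settings" \
+  -H 'Content-Type: application/json' -d'
+{
+  "persistent": {
+    "logger.org.elasticsearch.transport.TcpTransport": "DEBUG",
+    "logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport": "DEBUG"
+  }
+}
+'
+----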
+
+In extreme cases, you may need to take packet captures using `tcpdump` to
+determine whether messages between nodes are being dropped or rejected by some
+other device on the network.
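+
+For example, the following `tcpdump` sketch captures all transport traffic to
+and from one node. The interface, output file, and address are assumptions to
+adapt to your network, and `9300` is only the default transport port.
+
+[source,sh]
+----
+# Hypothetical interface, file name, and address; 9300 is the default
+# transport port. Adjust all of these to match your network.
+tcpdump -i any -w transport-capture.pcap host 192.0.2.10 and port 9300
+----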
+
+===== Diagnosing `lagging` nodes
+
+{es} needs every node to process cluster state updates reasonably quickly. If a
+node takes too long to process a cluster state update, it can be harmful to the
+cluster. The master will remove these nodes with the `lagging` reason. Refer to
+<<modules-discovery-settings>> for information about the settings which control
+this mechanism.
+
+Lagging is typically caused by performance issues on the removed node. However,
+a node may also lag due to severe network delays. To rule out network delays,
+ensure that `net.ipv4.tcp_retries2` is <<system-config-tcpretries,configured
+properly>>. Log messages that contain `warn threshold` may provide more
+information about the root cause.
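+
+As a sketch, you can check and temporarily lower this setting on a Linux host
+as shown below. The value is only an example: follow
+<<system-config-tcpretries>> for the recommended configuration and make any
+change permanent via `/etc/sysctl.conf` or an equivalent mechanism.
+
+[source,sh]
+----
+# Show the current retransmission limit.
+sysctl net.ipv4.tcp_retries2
+
+# Example of lowering it until the next reboot (run as root); pick the
+# value recommended in the referenced documentation.
+sysctl -w net.ipv4.tcp_retries2=5
+----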
+
+If you're an advanced user, you can get more detailed information about what
+the node was doing when it was removed by configuring the following logger:
+
+[source,yaml]
+----
+logger.org.elasticsearch.cluster.coordination.LagDetector: DEBUG
+----
+
+When this logger is enabled, {es} will attempt to run the
+<<cluster-nodes-hot-threads>> API on the faulty node and report the results in
+the logs on the elected master.
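+
+You can also call this API yourself against a suspect node. The following
+`curl` sketch assumes the cluster is reachable at `localhost:9200` and reuses
+the node name from the example log messages above; substitute your own node
+name and credentials.
+
+[source,sh]
+----
+# Node name taken from the example logs above; substitute your own.
+curl -X GET "localhost:9200/_nodes/instance-0000000004/hot_threads"
+----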
+
+===== Diagnosing `follower check retry count exceeded` nodes
+
+{es} needs every node to respond to network messages successfully and
+reasonably quickly. If a node rejects requests or does not respond at all then
+it can be harmful to the cluster. If enough consecutive checks fail then the
+master will remove the node with reason `follower check retry count exceeded`
+and will indicate in the `node-left` message how many of the consecutive
+unsuccessful checks failed and how many of them timed out. Refer to
+<<modules-discovery-settings>> for information about the settings which control
+this mechanism.
+
+Timeouts and failures may be due to network delays or performance problems on
+the affected nodes. Ensure that `net.ipv4.tcp_retries2` is
+<<system-config-tcpretries,configured properly>> to eliminate network delays as
+a possible cause for this kind of instability. Log messages containing
+`warn threshold` may give further clues about the cause of the instability.
+
+If the last check failed with an exception then the exception is reported, and
+typically indicates the problem that needs to be addressed. If any of the
+checks timed out, it may be necessary to understand the detailed sequence of
+steps involved in a successful check. Here is an example of such a sequence:
+
+. The master's `FollowerChecker`, running on thread
+`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
+the check request message to a follower node.
+
+. The master's `TransportService` running on thread
+`elasticsearch[master][transport_worker][T#2]` passes the check request message
+onto the operating system.
+
+. The operating system on the master converts the message into one or more
+packets and sends them out over the network.
+
+. Miscellaneous routers, firewalls, and other devices between the master node
+and the follower node forward the packets, possibly fragmenting or
+defragmenting them on the way.
+
+. The operating system on the follower node receives the packets and notifies
+{es} that they've been received.
+
+. The follower's `TransportService`, running on thread
+`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
+It then reconstructs and processes the check request. Usually, the check
+quickly succeeds. If so, the same thread immediately constructs a response and
+passes it back to the operating system.
+
+. If the check doesn't immediately succeed (for example, an election started
+recently) then:
+
+.. The follower's `FollowerChecker`, running on thread
+`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
+constructs a response and tells the `TransportService` to send the response
+back to the master.
+
+.. The follower's `TransportService`, running on thread
+`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
+operating system.
+
+. The operating system on the follower converts the response into one or more
+packets and sends them out over the network.
+
+. Miscellaneous routers, firewalls, and other devices between master and
+follower forward the packets, possibly fragmenting or defragmenting them on the
+way.
+
+. The operating system on the master receives the packets and notifies {es}
+that they've been received.
+
+. The master's `TransportService`, running on thread
+`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
+reconstructs the check response, and processes it as long as the check didn't
+already time out.
+
+There are a lot of different things that can delay the completion of a check
+and cause it to time out. Here are some examples for each step:
+
+. There may be a long garbage collection (GC) or virtual machine (VM) pause
+after passing the check request to the `TransportService`.
+
+. There may be a long wait for the specific `transport_worker` thread to become
+available, or there may be a long GC or VM pause before passing the check
+request onto the operating system.
+
+. A system fault (for example, a broken network card) on the master may delay
+sending the message over the network, possibly indefinitely.
+
+. Intermediate devices may delay, drop, or corrupt packets along the way. The
+operating system on the master will wait and retransmit any unacknowledged or
+corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
+<<system-config-tcpretries,reducing this value>> since the default represents a
+very long delay.
+
+. A system fault (for example, a broken network card) on the follower may delay
+receiving the message from the network.
+
+. There may be a long wait for the specific `transport_worker` thread to become
+available, or there may be a long GC or VM pause during the processing of the
+request on the follower.
+
+. There may be a long wait for the `cluster_coordination` thread to become
+available, or for the specific `transport_worker` thread to become available
+again. There may also be a long GC or VM pause during the processing of the
+request.
+
+. A system fault (for example, a broken network card) on the follower may delay
+sending the response over the network.
+
+. Intermediate devices may delay, drop, or corrupt packets along the way again,
+causing retransmissions.
+
+. A system fault (for example, a broken network card) on the master may delay
+receiving the response from the network.
+
+. There may be a long wait for the specific `transport_worker` thread to become
+available to process the response, or a long GC or VM pause.
+
+To determine why follower checks are timing out, we can narrow down the reason
+for the delay as follows:
+
+* GC pauses are recorded in the GC logs that {es} emits by default, and also
+usually by the `JvmMonitorService` in the main node logs. Use these logs to
+confirm whether or not GC is resulting in delays.
+
+* VM pauses also affect other processes on the same host. A VM pause also
+typically causes a discontinuity in the system clock, which {es} will report in
+its logs.
+
+* Packet captures will reveal system-level and network-level faults, especially
+if you capture the network traffic simultaneously at the elected master and the
+faulty node. The connection used for follower checks is not used for any other
+traffic so it can be easily identified from the flow pattern alone, even if TLS
+is in use: almost exactly every second there will be a few hundred bytes sent
+each way, first the request by the master and then the response by the
+follower. You should be able to observe any retransmissions, packet loss, or
+other delays on such a connection.
+
+* Long waits for particular threads to be available can be identified by taking
+stack dumps (for example, using `jstack`) or a profiling trace (for example,
+using Java Flight Recorder) in the few seconds leading up to a node departure.
+The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
+bear in mind that this API also requires a number of `transport_worker` and
+`generic` threads across all the nodes in the cluster. The API may be affected
+by the very problem you're trying to diagnose. `jstack` is much more reliable
+since it doesn't require any JVM threads. The threads involved in the follower
+checks are `transport_worker` and `cluster_coordination` threads, for which
+there should never be a long wait. There may also be evidence of long waits for
+threads in the {es} logs. Refer to <<modules-network-threading-model>> for more
+information.
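+
+For example, the following sketch captures a stack dump every second so that
+you can inspect the dumps taken just before a node departure. It assumes the
+JDK tools `jps` and `jstack` are on your `PATH` and that `jps` reports the
+{es} server process under the main class name `Elasticsearch`; adjust as
+needed for your environment.
+
+[source,sh]
+----
+# Assumes jps lists the server process under the main class "Elasticsearch".
+ES_PID="$(jps | awk '/Elasticsearch/ {print $1}')"
+
+# Take a stack dump every second until interrupted; once the node has left
+# the cluster, inspect the dumps captured just beforehand.
+while true; do
+  jstack "$ES_PID" > "stacks-$(date +%s).txt"
+  sleep 1
+done
+----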

+ 2 - 0
docs/reference/modules/network.asciidoc

@@ -252,3 +252,5 @@ include::http.asciidoc[]
 include::transport.asciidoc[]
 
 include::network/tracers.asciidoc[]
+
+include::network/threading.asciidoc[]

+ 110 - 0
docs/reference/modules/network/threading.asciidoc

@@ -0,0 +1,110 @@
+[[modules-network-threading-model]]
+==== Networking threading model
+
+This section describes the threading model used by the networking subsystem in
+{es}. This information isn't required to use {es}, but it may be useful to
+advanced users who are diagnosing network problems in a cluster.
+
+{es} nodes communicate over a collection of TCP channels that together form a
+transport connection. {es} clients communicate with the cluster over HTTP,
+which also uses one or more TCP channels. Each of these TCP channels is owned
+by exactly one of the `transport_worker` threads in the node. This owning
+thread is chosen when the channel is opened and remains the same for the
+lifetime of the channel.
+
+Each `transport_worker` thread has sole responsibility for sending and
+receiving data over the channels it owns. One of the `transport_worker` threads
+is also responsible for accepting new incoming transport connections, and one
+is responsible for accepting new HTTP connections.
+
+If a thread in {es} wants to send data over a particular channel, it passes the
+data to the owning `transport_worker` thread for the actual transmission.
+
+Normally the `transport_worker` threads will not completely handle the messages
+they receive. Instead, they will do a small amount of preliminary processing
+and then dispatch (hand off) the message to a different
+<<modules-threadpool,threadpool>> for the rest of their handling. For instance,
+bulk messages are dispatched to the `write` threadpool, searches are dispatched
+to one of the `search` threadpools, and requests for statistics and other
+management tasks are mostly dispatched to the `management` threadpool. However
+in some cases the processing of a message is expected to be so quick that {es}
+will do all of the processing on the `transport_worker` thread rather than
+incur the overhead of dispatching it elsewhere.
+
+By default, there is one `transport_worker` thread per CPU. In contrast, there
+may sometimes be tens of thousands of TCP channels. If data arrives on a TCP
+channel and its owning `transport_worker` thread is busy, the data isn't
+processed until the thread finishes whatever it is doing. Similarly, outgoing
+data are not sent over a channel until the owning `transport_worker` thread is
+free. This means that every `transport_worker` thread must frequently be idle.
+An idle `transport_worker` looks something like this in a stack dump:
+
+[source,text]
+----
+"elasticsearch[instance-0000000004][transport_worker][T#1]" #32 daemon prio=5 os_prio=0 cpu=9645.94ms elapsed=501.63s tid=0x00007fb83b6307f0 nid=0x1c4 runnable  [0x00007fb7b8ffe000]
+   java.lang.Thread.State: RUNNABLE
+	at sun.nio.ch.EPoll.wait(java.base@17.0.2/Native Method)
+	at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.2/EPollSelectorImpl.java:118)
+	at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.2/SelectorImpl.java:129)
+	- locked <0x00000000c443c518> (a sun.nio.ch.Util$2)
+	- locked <0x00000000c38f7700> (a sun.nio.ch.EPollSelectorImpl)
+	at sun.nio.ch.SelectorImpl.select(java.base@17.0.2/SelectorImpl.java:146)
+	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:813)
+	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
+	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
+	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
+	at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)
+----
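+
+To check this on a live node, you can take a stack dump with `jstack` and
+filter it for the `transport_worker` threads. This is a sketch: it assumes the
+JDK tools `jps` and `jstack` are on your `PATH` and that `jps` reports the
+{es} server process under the main class name `Elasticsearch`.
+
+[source,sh]
+----
+# Hypothetical way to find the node's PID; adjust for your environment.
+ES_PID="$(jps | awk '/Elasticsearch/ {print $1}')"
+
+# Print each transport_worker thread header and the top of its stack.
+jstack "$ES_PID" | grep -A 12 'transport_worker'
+----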
+
+In the <<cluster-nodes-hot-threads>> API an idle `transport_worker` thread is
+reported like this:
+
+[source,text]
+----
+   100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000004][transport_worker][T#1]'
+     10/10 snapshots sharing following 9 elements
+       java.base@17.0.2/sun.nio.ch.EPoll.wait(Native Method)
+       java.base@17.0.2/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:118)
+       java.base@17.0.2/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:129)
+       java.base@17.0.2/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:146)
+       io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:813)
+       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
+       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
+       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
+       java.base@17.0.2/java.lang.Thread.run(Thread.java:833)
+----
+
+Note that `transport_worker` threads should always be in state `RUNNABLE`, even
+when waiting for input, because they block in the native `EPoll#wait` method.
+This means the hot threads API will report these threads at 100% overall
+utilisation. This is normal, and the breakdown of time into `cpu=` and `other=`
+fractions shows how much time the thread spent running and waiting for input
+respectively.
+
+If a `transport_worker` thread is not frequently idle, it may build up a
+backlog of work. This can cause delays in processing messages on the channels
+that it owns. It's hard to predict exactly which work will be delayed:
+
+* There are many more channels than threads. If work related to one channel is
+causing delays to its worker thread, all other channels owned by that thread
+will also suffer delays.
+
+* The mapping from TCP channels to worker threads is fixed but arbitrary. Each
+channel is assigned an owning thread in a round-robin fashion when the channel
+is opened. Each worker thread is responsible for many different kinds of
+channel.
+
+* There are many channels open between each pair of nodes. For each request,
+{es} will choose from the appropriate channels in a round-robin fashion. Some
+requests may end up on a channel owned by a delayed worker while other
+identical requests will be sent on a channel that's working smoothly.
+
+If the backlog builds up too far, some messages may be delayed by many seconds.
+The node might even <<cluster-fault-detection,fail its health checks>> and be
+removed from the cluster. Sometimes, you can find evidence of busy
+`transport_worker` threads using the <<cluster-nodes-hot-threads>> API.
+However, this API itself sends network messages so may not work correctly if
+the `transport_worker` threads are too busy. It is more reliable to use
+`jstack` to obtain stack dumps or use Java Flight Recorder to obtain a
+profiling trace. These tools are independent of any work the JVM is performing.
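+
+As a sketch, you can start such a profiling trace with `jcmd`, which controls
+Java Flight Recorder in the bundled JDK. The duration and output path below
+are arbitrary examples, and the PID lookup makes the same assumptions as
+above.
+
+[source,sh]
+----
+# Assumes jps/jcmd from the JDK are available; the duration and output
+# path are arbitrary examples.
+ES_PID="$(jps | awk '/Elasticsearch/ {print $1}')"
+jcmd "$ES_PID" JFR.start duration=60s filename=/tmp/es-recording.jfr
+----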