@@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.
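
As a rough illustration of how one of these settings is applied, the sketch
below uses the cluster settings API to lower the slow-path logging threshold;
the setting name, its dynamic nature, and the endpoint and credentials shown
here are assumptions to adapt to your own deployment.

[source,sh]
----
# Illustrative only: lower the filesystem health check's slow-path logging
# threshold so that slower-than-usual I/O is reported earlier. Assumes the
# setting is dynamic in your version and the cluster is reachable locally.
curl -u "elastic:$ELASTIC_PASSWORD" -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "monitor.fs.health.slow_path_logging_threshold": "2s"
    }
  }'
----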

-[[cluster-fault-detection-cluster-state-publishing]] The elected master node
+[[cluster-fault-detection-cluster-state-publishing]]
+The elected master node
will also remove nodes from the cluster if nodes are unable to apply an updated
cluster state within a reasonable time. The timeout defaults to 2 minutes
starting from the beginning of the cluster state update. Refer to
@@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a
is unexpectedly restarting, look at the node's logs to see why it is shutting
down.

+The <<health-api>> API on the affected node will also provide some useful
+information about the situation.
+
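
For example, a minimal request against the affected node might look like the
sketch below; the `_health_report` endpoint assumes a recent {es} version, and
the node address and credentials are placeholders.

[source,sh]
----
# Illustrative only: ask the affected node directly for its health report.
# Replace the address and credentials with values from your environment.
curl -u "elastic:$ELASTIC_PASSWORD" \
  "https://affected-node.example.com:9200/_health_report?pretty"
----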
If the node did not restart then you should look at the reason for its
departure more closely. Each reason has different troubleshooting steps,
described below. There are three possible reasons:
@@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing

If the last check failed with an exception then the exception is reported, and
typically indicates the problem that needs to be addressed. If any of the
-checks timed out, it may be necessary to understand the detailed sequence of
-steps involved in a successful check. Here is an example of such a sequence:
-
-. The master's `FollowerChecker`, running on thread
-`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
-the check request message to a follower node.
-
-. The master's `TransportService` running on thread
-`elasticsearch[master][transport_worker][T#2]` passes the check request message
-onto the operating system.
-
-. The operating system on the master converts the message into one or more
-packets and sends them out over the network.
-
-. Miscellaneous routers, firewalls, and other devices between the master node
-and the follower node forward the packets, possibly fragmenting or
-defragmenting them on the way.
-
-. The operating system on the follower node receives the packets and notifies
-{es} that they've been received.
-
-. The follower's `TransportService`, running on thread
-`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
-It then reconstructs and processes the check request. Usually, the check
-quickly succeeds. If so, the same thread immediately constructs a response and
-passes it back to the operating system.
-
-. If the check doesn't immediately succeed (for example, an election started
-recently) then:
-
-.. The follower's `FollowerChecker`, running on thread
-`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
-constructs a response and tells the `TransportService` to send the response
-back to the master.
-
-.. The follower's `TransportService`, running on thread
-`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
-operating system.
-
-. The operating system on the follower converts the response into one or more
-packets and sends them out over the network.
-
-. Miscellaneous routers, firewalls, and other devices between master and
-follower forward the packets, possibly fragmenting or defragmenting them on the
-way.
-
-. The operating system on the master receives the packets and notifies {es}
-that they've been received.
-
-. The master's `TransportService`, running on thread
-`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
-reconstructs the check response, and processes it as long as the check didn't
-already time out.
-
-There are a lot of different things that can delay the completion of a check
-and cause it to time out. Here are some examples for each step:
-
-. There may be a long garbage collection (GC) or virtual machine (VM) pause
-after passing the check request to the `TransportService`.
-
-. There may be a long wait for the specific `transport_worker` thread to become
-available, or there may be a long GC or VM pause before passing the check
-request onto the operating system.
-
-. A system fault (for example, a broken network card) on the master may delay
-sending the message over the network, possibly indefinitely.
-
-. Intermediate devices may delay, drop, or corrupt packets along the way. The
-operating system for the master will wait and retransmit any unacknowledged or
-corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
-<<system-config-tcpretries,reducing this value>> since the default represents a
-very long delay.
-
-. A system fault (for example, a broken network card) on the follower may delay
-receiving the message from the network.
-
-. There may be a long wait for the specific `transport_worker` thread to become
-available, or there may be a long GC or VM pause during the processing of the
-request on the follower.
-
-. There may be a long wait for the `cluster_coordination` thread to become
-available, or for the specific `transport_worker` thread to become available
-again. There may also be a long GC or VM pause during the processing of the
-request.
-
-. A system fault (for example, a broken network card) on the follower may delay
-sending the response from the network.
-
-. Intermediate devices may delay, drop, or corrupt packets along the way again,
-causing retransmissions.
-
-. A system fault (for example, a broken network card) on the master may delay
-receiving the message from the network.
-
-. There may be a long wait for the specific `transport_worker` thread to become
-available to process the response, or a long GC or VM pause.
-
-To determine why follower checks are timing out, we can narrow down the reason
-for the delay as follows:
+checks timed out then narrow down the problem as follows.

-* GC pauses are recorded in the GC logs that {es} emits by default, and also
-usually by the `JvmMonitorService` in the main node logs. Use these logs to
-confirm whether or not GC is resulting in delays.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]

-* VM pauses also affect other processes on the same host. A VM pause also
-typically causes a discontinuity in the system clock, which {es} will report in
-its logs.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]

-* Packet captures will reveal system-level and network-level faults, especially
-if you capture the network traffic simultaneously at the elected master and the
-faulty node. The connection used for follower checks is not used for any other
-traffic so it can be easily identified from the flow pattern alone, even if TLS
-is in use: almost exactly every second there will be a few hundred bytes sent
-each way, first the request by the master and then the response by the
-follower. You should be able to observe any retransmissions, packet loss, or
-other delays on such a connection.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

-* Long waits for particular threads to be available can be identified by taking
-stack dumps (for example, using `jstack`) or a profiling trace (for example,
-using Java Flight Recorder) in the few seconds leading up to a node departure.
-+
By default the follower checks will time out after 30s, so if node departures
are unpredictable then capture stack dumps every 15s to be sure that at least
one stack dump was taken at the right time.
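
For example, a simple capture loop along these lines may help; the PID lookup
and the output path are illustrative only, and `jstack` should come from the
same JDK that runs {es}.

[source,sh]
----
# Illustrative only: take a thread dump of the Elasticsearch JVM every 15
# seconds so at least one dump falls within a 30s follower-check timeout.
ES_PID="$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n 1)"
while true; do
  jstack "$ES_PID" > "/tmp/es-threads-$(date +%Y%m%dT%H%M%S).txt"
  sleep 15
done
----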
-+
-The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
-bear in mind that this API also requires a number of `transport_worker` and
-`generic` threads across all the nodes in the cluster. The API may be affected
-by the very problem you're trying to diagnose. `jstack` is much more reliable
-since it doesn't require any JVM threads.
-+
-The threads involved in the follower checks are `transport_worker` and
-`cluster_coordination` threads, for which there should never be a long wait.
-There may also be evidence of long waits for threads in the {es} logs. See
-<<modules-network-threading-model>> for more information.

===== Diagnosing `ShardLockObtainFailedException` failures