@@ -26,8 +26,317 @@ its data path is unhealthy then it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.

-[[cluster-fault-detection-cluster-state-publishing]]
-The elected master node will also remove nodes from the cluster if nodes are unable
-to apply an updated cluster state within a reasonable time, which is by default
-120s after the cluster state update started. See
-<<cluster-state-publishing, cluster state publishing>> for a more detailed description.
+[[cluster-fault-detection-cluster-state-publishing]] The elected master node
+will also remove nodes from the cluster if nodes are unable to apply an updated
+cluster state within a reasonable time. The timeout defaults to 2 minutes
+starting from the beginning of the cluster state update. Refer to
+<<cluster-state-publishing>> for a more detailed description.
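+
+This 2-minute window corresponds to the defaults of the publish timeout and the
+follower lag timeout. As a rough sketch, these are the relevant settings with
+their default values (refer to <<modules-discovery-settings>> for details):
+
+[source,yaml]
+----
+cluster.publish.timeout: 30s
+cluster.follower_lag.timeout: 90s
+----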
+
+[[cluster-fault-detection-troubleshooting]]
+==== Troubleshooting an unstable cluster
+
+Normally, a node will only leave a cluster if deliberately shut down. If a node
+leaves the cluster unexpectedly, it's important to address the cause. A cluster
+in which nodes leave unexpectedly is unstable and can create several issues.
+For instance:
+
+* The cluster health may be yellow or red.
+
+* Some shards will be initializing and other shards may be failing.
+
+* Search, indexing, and monitoring operations may fail and report exceptions in
+logs.
+
+* The `.security` index may be unavailable, blocking access to the cluster.
+
+* The master may appear busy due to frequent cluster state updates.
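+
+As a quick first check, you can confirm the cluster health and see which node
+is currently the elected master. For example (illustrative requests only):
+
+[source,console]
+----
+GET _cluster/health
+
+GET _cat/master?v=true
+----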
+
+To troubleshoot a cluster in this state, first ensure the cluster has a
+<<modules-discovery-troubleshooting,stable master>>. Next, focus on the nodes
+unexpectedly leaving the cluster ahead of all other issues. It will not be
+possible to solve other issues until the cluster has a stable master node and
+stable node membership.
+
+Diagnostics and statistics are usually not useful in an unstable cluster. These
+tools only offer a view of the state of the cluster at a single point in time.
+Instead, look at the cluster logs to see the pattern of behavior over time.
+Focus particularly on logs from the elected master. When a node leaves the
+cluster, logs for the elected master include a message like this (with line
+breaks added for clarity):
+
+[source,text]
+----
+[2022-03-21T11:02:35,513][INFO ][o.e.c.s.MasterService ]
+ [instance-0000000000] node-left[
+ {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
+ reason: disconnected,
+ {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
+ reason: disconnected
+ ], term: 14, version: 1653415, ...
+----
+
+This message says that the `MasterService` on the elected master
+(`instance-0000000000`) is processing a `node-left` task. It lists the nodes
+that are being removed and the reasons for their removal. Other nodes may log
+similar messages, but report fewer details:
+
+[source,text]
+----
+[2020-01-29T11:02:36,985][INFO ][o.e.c.s.ClusterApplierService]
+ [instance-0000000001] removed {
+ {instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
+ {tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
+ }, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
+----
+
+Focus on the message from the `MasterService`, which is only emitted on the
+elected master and contains more details. If you don't see the messages from
+the `MasterService`, check that:
+
+* You're looking at the logs for the elected master node.
+
+* The logs cover the correct time period.
+
+* Logging is enabled at `INFO` level.
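+
+If the logger level was changed, you can restore it on a running cluster. A
+minimal sketch using a persistent cluster setting:
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "logger.org.elasticsearch.cluster.service.MasterService": "INFO"
+  }
+}
+----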
+
+Nodes will also log a message containing `master node changed` whenever they
+start or stop following the elected master. You can use these messages to
+determine each node's view of the state of the master over time.
+
+If a node restarts, it will leave the cluster and then join the cluster again.
+When it rejoins, the `MasterService` will log that it is processing a
+`node-join` task. You can tell from the master logs that the node was restarted
+because the `node-join` message will indicate that it is
+`joining after restart`. In older {es} versions, you can also determine that a
+node restarted by looking at the second "ephemeral" ID in the `node-left` and
+subsequent `node-join` messages. This ephemeral ID is different each time the
+node starts up. If a node is unexpectedly restarting, you'll need to look at
+the node's logs to see why it is shutting down.
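+
+As an additional cross-check, node uptimes reveal recent restarts. For example:
+
+[source,console]
+----
+GET _cat/nodes?v=true&h=name,uptime,master,node.role
+----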
+
+If the node did not restart then you should look at the reason for its
+departure in the `node-left` message, which is reported after each node. There
+are three possible reasons:
+
+* `disconnected`: The connection from the master node to the removed node was
+closed.
+
+* `lagging`: The master published a cluster state update, but the removed node
+did not apply it within the permitted timeout. By default, this timeout is 2
+minutes. Refer to <<modules-discovery-settings>> for information about the
+settings which control this mechanism.
+
+* `followers check retry count exceeded`: The master sent a number of
+consecutive health checks to the removed node. These checks were rejected or
+timed out. By default, each health check times out after 10 seconds and {es}
+removes the node after three consecutively failed health checks. Refer to
+<<modules-discovery-settings>> for information about the settings which
+control this mechanism.
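+
+For the `lagging` and `followers check retry count exceeded` reasons, the
+behavior described above corresponds to the following settings and their
+default values (a sketch; refer to <<modules-discovery-settings>> for the full
+list):
+
+[source,yaml]
+----
+cluster.fault_detection.follower_check.interval: 1s
+cluster.fault_detection.follower_check.timeout: 10s
+cluster.fault_detection.follower_check.retry_count: 3
+cluster.follower_lag.timeout: 90s
+----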
+
+===== Diagnosing `disconnected` nodes
+
+{es} is designed to run on a fairly reliable network. It opens a number of TCP
+connections between nodes and expects these connections to remain open forever.
+If a connection is closed then {es} will try to reconnect, so the occasional
+blip should have limited impact on the cluster even if the affected node
+briefly leaves the cluster. In contrast, repeatedly-dropped connections will
+severely affect the cluster's operation.
+
+The connections from the elected master node to every other node in the cluster
+are particularly important. The elected master never spontaneously closes its
+outbound connections to other nodes. Similarly, once a connection is fully
+established, a node never spontaneously closes its inbound connections unless
+the node is shutting down.
+
+If you see a node unexpectedly leave the cluster with the `disconnected`
+reason, something other than {es} likely caused the connection to close. A
+common cause is a misconfigured firewall with an improper timeout or another
+policy that's <<long-lived-connections,incompatible with {es}>>. It could also
+be caused by general connectivity issues, such as packet loss due to faulty
+hardware or network congestion. If you're an advanced user, you can get more
+detailed information about network exceptions by configuring the following
+loggers:
+
+[source,yaml]
+----
+logger.org.elasticsearch.transport.TcpTransport: DEBUG
+logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
+----
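+
+You can also change these logger levels on a running cluster without editing
+`elasticsearch.yml`. A minimal sketch using a persistent cluster setting:
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "logger.org.elasticsearch.transport.TcpTransport": "DEBUG"
+  }
+}
+----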
+
+In extreme cases, you may need to take packet captures using `tcpdump` to
+determine whether messages between nodes are being dropped or rejected by some
+other device on the network.
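+
+For example, a sketch of such a capture, assuming the default transport port
+`9300` and writing to a file for later analysis:
+
+[source,sh]
+----
+tcpdump -i any -w transport.pcap tcp port 9300
+----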
+
+===== Diagnosing `lagging` nodes
+
+{es} needs every node to process cluster state updates reasonably quickly. If a
+node takes too long to process a cluster state update, it can be harmful to the
+cluster. The master will remove these nodes with the `lagging` reason. Refer to
+<<modules-discovery-settings>> for information about the settings which control
+this mechanism.
+
+Lagging is typically caused by performance issues on the removed node. However,
+a node may also lag due to severe network delays. To rule out network delays,
+ensure that `net.ipv4.tcp_retries2` is <<system-config-tcpretries,configured
+properly>>. Log messages that contain `warn threshold` may provide more
+information about the root cause.
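+
+As a sketch, you can inspect and lower this kernel setting on Linux as follows
+(refer to <<system-config-tcpretries>> for the recommended value):
+
+[source,sh]
+----
+# show the current value
+sysctl net.ipv4.tcp_retries2
+
+# apply a lower value until the next reboot
+sysctl -w net.ipv4.tcp_retries2=5
+----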
+
+If you're an advanced user, you can get more detailed information about what
+the node was doing when it was removed by configuring the following logger:
+
+[source,yaml]
+----
+logger.org.elasticsearch.cluster.coordination.LagDetector: DEBUG
+----
+
+When this logger is enabled, {es} will attempt to run the
+<<cluster-nodes-hot-threads>> API on the faulty node and report the results in
+the logs on the elected master.
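+
+You can also run the hot threads API against the lagging node yourself while
+the problem is occurring. A minimal sketch, where `<node_id>` is the name or ID
+of the affected node:
+
+[source,console]
+----
+GET _nodes/<node_id>/hot_threads
+----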
+
+===== Diagnosing `followers check retry count exceeded` nodes
+
+{es} needs every node to respond to network messages successfully and
+reasonably quickly. If a node rejects requests or does not respond at all then
+it can be harmful to the cluster. If enough consecutive checks fail then the
+master will remove the node with reason `followers check retry count exceeded`
+and will indicate in the `node-left` message how many of the consecutive
+unsuccessful checks failed and how many of them timed out. Refer to
+<<modules-discovery-settings>> for information about the settings which control
+this mechanism.
+
+Timeouts and failures may be due to network delays or performance problems on
+the affected nodes. Ensure that `net.ipv4.tcp_retries2` is
+<<system-config-tcpretries,configured properly>> to eliminate network delays as
+a possible cause for this kind of instability. Log messages containing
+`warn threshold` may give further clues about the cause of the instability.
+
+If the last check failed with an exception then the exception is reported, and
+typically indicates the problem that needs to be addressed. If any of the
+checks timed out, it may be necessary to understand the detailed sequence of
+steps involved in a successful check. Here is an example of such a sequence:
+
+. The master's `FollowerChecker`, running on thread
+`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
+the check request message to a follower node.
+
+. The master's `TransportService`, running on thread
+`elasticsearch[master][transport_worker][T#2]`, passes the check request message
+onto the operating system.
+
+. The operating system on the master converts the message into one or more
+packets and sends them out over the network.
+
+. Miscellaneous routers, firewalls, and other devices between the master node
+and the follower node forward the packets, possibly fragmenting or
+defragmenting them on the way.
+
+. The operating system on the follower node receives the packets and notifies
+{es} that they've been received.
+
+. The follower's `TransportService`, running on thread
+`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
+It then reconstructs and processes the check request. Usually, the check
+quickly succeeds. If so, the same thread immediately constructs a response and
+passes it back to the operating system.
+
+. If the check doesn't immediately succeed (for example, an election started
+recently) then:
+
+.. The follower's `FollowerChecker`, running on thread
+`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
+constructs a response and tells the `TransportService` to send the response
+back to the master.
+
+.. The follower's `TransportService`, running on thread
+`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
+operating system.
+
+. The operating system on the follower converts the response into one or more
+packets and sends them out over the network.
+
+. Miscellaneous routers, firewalls, and other devices between master and
+follower forward the packets, possibly fragmenting or defragmenting them on the
+way.
+
+. The operating system on the master receives the packets and notifies {es}
+that they've been received.
+
+. The master's `TransportService`, running on thread
+`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
+reconstructs the check response, and processes it as long as the check didn't
+already time out.
+
+There are a lot of different things that can delay the completion of a check
+and cause it to time out. Here are some examples for each step:
+
+. There may be a long garbage collection (GC) or virtual machine (VM) pause
+after passing the check request to the `TransportService`.
+
+. There may be a long wait for the specific `transport_worker` thread to become
+available, or there may be a long GC or VM pause before passing the check
+request onto the operating system.
+
+. A system fault (for example, a broken network card) on the master may delay
+sending the message over the network, possibly indefinitely.
+
+. Intermediate devices may delay, drop, or corrupt packets along the way. The
+operating system for the master will wait and retransmit any unacknowledged or
+corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
+<<system-config-tcpretries,reducing this value>> since the default represents a
+very long delay.
+
+. A system fault (for example, a broken network card) on the follower may delay
+receiving the message from the network.
+
+. There may be a long wait for the specific `transport_worker` thread to become
+available, or there may be a long GC or VM pause during the processing of the
+request on the follower.
+
+. There may be a long wait for the `cluster_coordination` thread to become
+available, or for the specific `transport_worker` thread to become available
+again. There may also be a long GC or VM pause during the processing of the
+request.
+
+. A system fault (for example, a broken network card) on the follower may delay
+sending the response over the network.
+
+. Intermediate devices may delay, drop, or corrupt packets along the way again,
+causing retransmissions.
+
+. A system fault (for example, a broken network card) on the master may delay
+receiving the message from the network.
+
+. There may be a long wait for the specific `transport_worker` thread to become
+available to process the response, or a long GC or VM pause.
+
+To determine why follower checks are timing out, we can narrow down the reason
+for the delay as follows:
+
+* GC pauses are recorded in the GC logs that {es} emits by default, and also
+usually by the `JvmMonitorService` in the main node logs. Use these logs to
+confirm whether or not GC is resulting in delays.
+
+* VM pauses also affect other processes on the same host. A VM pause also
+typically causes a discontinuity in the system clock, which {es} will report in
+its logs.
+
+* Packet captures will reveal system-level and network-level faults, especially
+if you capture the network traffic simultaneously at the elected master and the
+faulty node. The connection used for follower checks is not used for any other
+traffic so it can be easily identified from the flow pattern alone, even if TLS
+is in use: almost exactly every second there will be a few hundred bytes sent
+each way, first the request by the master and then the response by the
+follower. You should be able to observe any retransmissions, packet loss, or
+other delays on such a connection.
+
+* Long waits for particular threads to be available can be identified by taking
+stack dumps (for example, using `jstack`) or a profiling trace (for example,
+using Java Flight Recorder) in the few seconds leading up to a node departure.
+The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
+bear in mind that this API also requires a number of `transport_worker` and
+`generic` threads across all the nodes in the cluster. The API may be affected
+by the very problem you're trying to diagnose. `jstack` is much more reliable
+since it doesn't require any JVM threads. The threads involved in the follower
+checks are `transport_worker` and `cluster_coordination` threads, for which
+there should never be a long wait. There may also be evidence of long waits for
+threads in the {es} logs. Refer to <<modules-network-threading-model>> for more
+information.
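+
+As an illustration, here is one way to collect a series of stack dumps with
+`jstack` while the problem is happening. The process ID lookup and file names
+are examples only:
+
+[source,sh]
+----
+# find the Elasticsearch process ID, for example with jps
+ES_PID=$(jps | awk '/Elasticsearch/ {print $1}')
+
+# take a stack dump every second for sixty seconds
+for i in $(seq 1 60); do
+  jstack "$ES_PID" > "jstack-$(date +%H%M%S).txt"
+  sleep 1
+done
+----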