[[cluster-fault-detection]]
=== Cluster fault detection

The elected master periodically checks each of the nodes in the cluster to
ensure that they are still connected and healthy. Each node in the cluster also
periodically checks the health of the elected master. These checks are known
respectively as _follower checks_ and _leader checks_.

{es} allows these checks to occasionally fail or time out without taking any
action. It considers a node to be faulty only after a number of consecutive
checks have failed. You can control fault detection behavior with
<<modules-discovery-settings,`cluster.fault_detection.*` settings>>.
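
For example, these are the fault detection settings as they might appear in
`elasticsearch.yml`. The values shown are the usual defaults and are listed
only for illustration; treat them as an assumption and confirm the current
defaults in <<modules-discovery-settings>> before changing anything.

[source,yaml]
----
# Illustrative sketch of the fault detection settings with their assumed defaults.
cluster.fault_detection.follower_check.interval: 1s    # how often the elected master checks each node
cluster.fault_detection.follower_check.timeout: 10s    # how long to wait for each follower check
cluster.fault_detection.follower_check.retry_count: 3  # consecutive failures before the node is removed
cluster.fault_detection.leader_check.interval: 1s      # how often each node checks the elected master
cluster.fault_detection.leader_check.timeout: 10s      # how long to wait for each leader check
cluster.fault_detection.leader_check.retry_count: 3    # consecutive failures before restarting discovery
----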

If the elected master detects that a node has disconnected, however, this
situation is treated as an immediate failure. The master bypasses the timeout
and retry setting values and attempts to remove the node from the cluster.
Similarly, if a node detects that the elected master has disconnected, this
situation is treated as an immediate failure. The node bypasses the timeout and
retry settings and restarts its discovery phase to try and find or elect a new
master.

[[cluster-fault-detection-filesystem-health]]
Additionally, each node periodically verifies that its data path is healthy by
writing a small file to disk and then deleting it again. If a node discovers
its data path is unhealthy then it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.
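
For example, the filesystem health checks can be tuned with settings like the
following in `elasticsearch.yml`. The values are the assumed defaults and are
shown only as a sketch; confirm them in <<modules-discovery-settings>> for your
version.

[source,yaml]
----
# Illustrative sketch; the exact defaults are an assumption.
monitor.fs.health.enabled: true                    # enable the periodic data path health check
monitor.fs.health.refresh_interval: 120s           # how often the small probe file is written and deleted
monitor.fs.health.slow_path_logging_threshold: 5s  # log a warning if a check takes longer than this
----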

[[cluster-fault-detection-cluster-state-publishing]]
The elected master node will also remove nodes from the cluster if nodes are
unable to apply an updated cluster state within a reasonable time. The timeout
defaults to 2 minutes starting from the beginning of the cluster state update.
Refer to <<cluster-state-publishing>> for a more detailed description.
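
The 2-minute default is the combination of two settings, sketched below with
their assumed default values. This is only an illustration of where the figure
comes from; check <<modules-discovery-settings>> before tuning either value.

[source,yaml]
----
# Assumed defaults; the publish timeout plus the follower lag timeout add up to 2 minutes.
cluster.publish.timeout: 30s       # time allowed for the cluster state publication itself
cluster.follower_lag.timeout: 90s  # additional time a node may take to apply the published state
----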

[[cluster-fault-detection-troubleshooting]]
==== Troubleshooting an unstable cluster

Normally, a node will only leave a cluster if deliberately shut down. If a node
leaves the cluster unexpectedly, it's important to address the cause. A cluster
in which nodes leave unexpectedly is unstable and can create several issues.
For instance:

* The cluster health may be yellow or red.
* Some shards will be initializing and other shards may be failing.
* Search, indexing, and monitoring operations may fail and report exceptions in
logs.
* The `.security` index may be unavailable, blocking access to the cluster.
* The master may appear busy due to frequent cluster state updates.

To troubleshoot a cluster in this state, first ensure the cluster has a
<<discovery-troubleshooting,stable master>>. Next, focus on the nodes
unexpectedly leaving the cluster ahead of all other issues. It will not be
possible to solve other issues until the cluster has a stable master node and
stable node membership.

Diagnostics and statistics are usually not useful in an unstable cluster. These
tools only offer a view of the state of the cluster at a single point in time.
Instead, look at the cluster logs to see the pattern of behavior over time.
Focus particularly on logs from the elected master. When a node leaves the
cluster, logs for the elected master include a message like this (with line
breaks added to make it easier to read):

[source,text]
----
[2022-03-21T11:02:35,513][INFO ][o.e.c.c.NodeLeftExecutor] [instance-0000000000]
node-left: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}]
with reason [disconnected]
----

This message says that the `NodeLeftExecutor` on the elected master
(`instance-0000000000`) processed a `node-left` task, identifying the node that
was removed and the reason for its removal. When the node joins the cluster
again, logs for the elected master will include a message like this (with line
breaks added to make it easier to read):

[source,text]
----
[2022-03-21T11:02:59,892][INFO ][o.e.c.c.NodeJoinExecutor] [instance-0000000000]
node-join: [{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{UNw_RuazQCSBskWZV8ID_w}{172.27.47.21}{172.27.47.21:19054}{m}]
with reason [joining after restart, removed [24s] ago with reason [disconnected]]
----

This message says that the `NodeJoinExecutor` on the elected master
(`instance-0000000000`) processed a `node-join` task, identifying the node that
was added to the cluster and the reason for the task.

Other nodes may log similar messages, but report fewer details:

[source,text]
----
[2020-01-29T11:02:36,985][INFO ][o.e.c.s.ClusterApplierService]
[instance-0000000001] removed {
{instance-0000000004}{bfcMDTiDRkietFb9v_di7w}{aNlyORLASam1ammv2DzYXA}{172.27.47.21}{172.27.47.21:19054}{m}
{tiebreaker-0000000003}{UNw_RuazQCSBskWZV8ID_w}{bltyVOQ-RNu20OQfTHSLtA}{172.27.161.154}{172.27.161.154:19251}{mv}
}, term: 14, version: 1653415, reason: Publication{term=14, version=1653415}
----

These messages are not especially useful for troubleshooting, so focus on the
ones from the `NodeLeftExecutor` and `NodeJoinExecutor` which are only emitted
on the elected master and which contain more details. If you don't see the
messages from the `NodeLeftExecutor` and `NodeJoinExecutor`, check that:

* You're looking at the logs for the elected master node.
* The logs cover the correct time period.
* Logging is enabled at `INFO` level.

Nodes will also log a message containing `master node changed` whenever they
start or stop following the elected master. You can use these messages to
determine each node's view of the state of the master over time.
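
For example, on a self-managed node that writes logs to a file you can pull out
just these membership-related messages. The log path below is an assumption;
adjust it for your installation or for the way your orchestration platform
exposes logs.

[source,sh]
----
# Hypothetical log location; adjust for your installation.
LOG=/var/log/elasticsearch/my-cluster.log

# node-left and node-join messages appear only on the elected master;
# "master node changed" shows each node's view of the master over time.
grep -E 'node-left|node-join|master node changed' "$LOG"
----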

If a node restarts, it will leave the cluster and then join the cluster again.
When it rejoins, the `NodeJoinExecutor` will log that it processed a
`node-join` task indicating that the node is `joining after restart`. If a node
is unexpectedly restarting, look at the node's logs to see why it is shutting
down.

If the node did not restart then you should look at the reason for its
departure more closely. Each reason has different troubleshooting steps,
described below. There are three possible reasons:

* `disconnected`: The connection from the master node to the removed node was
closed.
* `lagging`: The master published a cluster state update, but the removed node
did not apply it within the permitted timeout. By default, this timeout is 2
minutes. Refer to <<modules-discovery-settings>> for information about the
settings which control this mechanism.
* `follower check retry count exceeded`: The master sent a number of
consecutive health checks to the removed node. These checks were rejected or
timed out. By default, each health check times out after 10 seconds and {es}
removes the node after three consecutively failed health checks. Refer to
<<modules-discovery-settings>> for information about the settings which
control this mechanism.

===== Diagnosing `disconnected` nodes

Nodes typically leave the cluster with reason `disconnected` when they shut
down, but if they rejoin the cluster without restarting then there is some
other problem.

{es} is designed to run on a fairly reliable network. It opens a number of TCP
connections between nodes and expects these connections to remain open forever.
If a connection is closed then {es} will try and reconnect, so the occasional
blip should have limited impact on the cluster even if the affected node
briefly leaves the cluster. In contrast, repeatedly-dropped connections will
severely affect its operation.

The connections from the elected master node to every other node in the cluster
are particularly important. The elected master never spontaneously closes its
outbound connections to other nodes. Similarly, once a connection is fully
established, a node never spontaneously closes its inbound connections unless
the node is shutting down.

If you see a node unexpectedly leave the cluster with the `disconnected`
reason, something other than {es} likely caused the connection to close. A
common cause is a misconfigured firewall with an improper timeout or another
policy that's <<long-lived-connections,incompatible with {es}>>. It could also
be caused by general connectivity issues, such as packet loss due to faulty
hardware or network congestion. If you're an advanced user, you can get more
detailed information about network exceptions by configuring the following
loggers:

[source,yaml]
----
logger.org.elasticsearch.transport.TcpTransport: DEBUG
logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
----

In extreme cases, you may need to take packet captures using `tcpdump` to
determine whether messages between nodes are being dropped or rejected by some
other device on the network.
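
For example, the following capture is a minimal sketch. It assumes the default
transport port (`9300`) and an interface named `eth0`, both of which may differ
in your environment. Capturing at both the elected master and the affected node
at the same time makes the comparison much easier.

[source,sh]
----
# Capture inter-node transport traffic to a file for offline analysis.
# The port and interface are assumptions; adjust them for your deployment.
tcpdump -i eth0 -w /tmp/es-transport.pcap 'tcp port 9300'
----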

===== Diagnosing `lagging` nodes

{es} needs every node to process cluster state updates reasonably quickly. If a
node takes too long to process a cluster state update, it can be harmful to the
cluster. The master will remove these nodes with the `lagging` reason. Refer to
<<modules-discovery-settings>> for information about the settings which control
this mechanism.

Lagging is typically caused by performance issues on the removed node. However,
a node may also lag due to severe network delays. To rule out network delays,
ensure that `net.ipv4.tcp_retries2` is <<system-config-tcpretries,configured
properly>>. Log messages that contain `warn threshold` may provide more
information about the root cause.
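
On Linux you can check and adjust this kernel setting with `sysctl`, for
example as sketched below. The value `5` follows the recommendation in
<<system-config-tcpretries>>; the file name used to persist it is an
assumption, so adapt it to your configuration management.

[source,sh]
----
# Inspect the current retransmission setting.
sysctl net.ipv4.tcp_retries2

# Apply the recommended lower value until the next reboot.
sysctl -w net.ipv4.tcp_retries2=5

# Persist it across reboots (the file name is an example).
echo 'net.ipv4.tcp_retries2=5' > /etc/sysctl.d/99-elasticsearch.conf
----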

If you're an advanced user, you can get more detailed information about what
the node was doing when it was removed by configuring the following logger:

[source,yaml]
----
logger.org.elasticsearch.cluster.coordination.LagDetector: DEBUG
----

When this logger is enabled, {es} will attempt to run the
<<cluster-nodes-hot-threads>> API on the faulty node and report the results in
the logs on the elected master. The results are compressed, encoded, and split
into chunks to avoid truncation:

[source,text]
----
[DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 1]: H4sIAAAAAAAA/x...
[DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 2]: p7x3w1hmOQVtuV...
[DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 3]: v7uTboMGDbyOy+...
[DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] [part 4]: 4tse0RnPnLeDNN...
[DEBUG][o.e.c.c.LagDetector ] [master] hot threads from node [{node}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] lagging at version [183619] despite commit of cluster state version [183620] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
----

To reconstruct the output, base64-decode the data and decompress it using
`gzip`. For instance, on Unix-like systems:

[source,sh]
----
cat lagdetector.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
----

===== Diagnosing `follower check retry count exceeded` nodes

Nodes sometimes leave the cluster with reason `follower check retry count
exceeded` when they shut down, but if they rejoin the cluster without
restarting then there is some other problem.

{es} needs every node to respond to network messages successfully and
reasonably quickly. If a node rejects requests or does not respond at all then
it can be harmful to the cluster. If enough consecutive checks fail then the
master will remove the node with reason `follower check retry count exceeded`
and will indicate in the `node-left` message how many of the consecutive
unsuccessful checks failed and how many of them timed out. Refer to
<<modules-discovery-settings>> for information about the settings which control
this mechanism.

Timeouts and failures may be due to network delays or performance problems on
the affected nodes. Ensure that `net.ipv4.tcp_retries2` is
<<system-config-tcpretries,configured properly>> to eliminate network delays as
a possible cause for this kind of instability. Log messages containing
`warn threshold` may give further clues about the cause of the instability.

If the last check failed with an exception then the exception is reported, and
typically indicates the problem that needs to be addressed. If any of the
checks timed out, it may be necessary to understand the detailed sequence of
steps involved in a successful check. Here is an example of such a sequence:

. The master's `FollowerChecker`, running on thread
`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
the check request message to a follower node.
. The master's `TransportService` running on thread
`elasticsearch[master][transport_worker][T#2]` passes the check request message
onto the operating system.
. The operating system on the master converts the message into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between the master node
and the follower node forward the packets, possibly fragmenting or
defragmenting them on the way.
. The operating system on the follower node receives the packets and notifies
{es} that they've been received.
. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
It then reconstructs and processes the check request. Usually, the check
quickly succeeds. If so, the same thread immediately constructs a response and
passes it back to the operating system.
. If the check doesn't immediately succeed (for example, an election started
recently) then:
.. The follower's `FollowerChecker`, running on thread
`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
constructs a response and tells the `TransportService` to send the response
back to the master.
.. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
operating system.
. The operating system on the follower converts the response into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between master and
follower forward the packets, possibly fragmenting or defragmenting them on the
way.
. The operating system on the master receives the packets and notifies {es}
that they've been received.
. The master's `TransportService`, running on thread
`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
reconstructs the check response, and processes it as long as the check didn't
already time out.

There are a lot of different things that can delay the completion of a check
and cause it to time out. Here are some examples for each step:

. There may be a long garbage collection (GC) or virtual machine (VM) pause
after passing the check request to the `TransportService`.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause before passing the check
request onto the operating system.
. A system fault (for example, a broken network card) on the master may delay
sending the message over the network, possibly indefinitely.
. Intermediate devices may delay, drop, or corrupt packets along the way. The
operating system for the master will wait and retransmit any unacknowledged or
corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
<<system-config-tcpretries,reducing this value>> since the default represents a
very long delay.
. A system fault (for example, a broken network card) on the follower may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause during the processing of the
request on the follower.
. There may be a long wait for the `cluster_coordination` thread to become
available, or for the specific `transport_worker` thread to become available
again. There may also be a long GC or VM pause during the processing of the
request.
. A system fault (for example, a broken network card) on the follower may delay
sending the response over the network.
. Intermediate devices may delay, drop, or corrupt packets along the way again,
causing retransmissions.
. A system fault (for example, a broken network card) on the master may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available to process the response, or a long GC or VM pause.

To determine why follower checks are timing out, we can narrow down the reason
for the delay as follows:

* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to a node departure;
see the sketch after this list.
+
By default the follower checks will time out after 30s, so if node departures
are unpredictable then capture stack dumps every 15s to be sure that at least
one stack dump was taken at the right time.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in the follower checks are `transport_worker` and
`cluster_coordination` threads, for which there should never be a long wait.
There may also be evidence of long waits for threads in the {es} logs. See
<<modules-network-threading-model>> for more information.
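
As referenced above, here is a minimal sketch of capturing periodic stack dumps
with `jstack`. The way the {es} process ID is discovered and the output
location are assumptions; adjust both for your installation.

[source,sh]
----
# Take a stack dump of the Elasticsearch JVM every 15 seconds so that at least
# one dump falls inside the default 30s follower check window.
# The pgrep pattern and the output directory are assumptions.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n 1)
while true; do
  jstack "$ES_PID" > "/tmp/jstack-$(date +%Y%m%dT%H%M%S).txt"
  sleep 15
done
----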

===== Diagnosing `ShardLockObtainFailedException` failures

If a node leaves and rejoins the cluster then {es} will usually shut down and
re-initialize its shards. If the shards do not shut down quickly enough then
{es} may fail to re-initialize them due to a `ShardLockObtainFailedException`.

To gather more information about the reason for shards shutting down slowly,
configure the following logger:

[source,yaml]
----
logger.org.elasticsearch.env.NodeEnvironment: DEBUG
----

When this logger is enabled, {es} will attempt to run the
<<cluster-nodes-hot-threads>> API whenever it encounters a
`ShardLockObtainFailedException`. The results are compressed, encoded, and
split into chunks to avoid truncation:

[source,text]
----
[DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 1]: H4sIAAAAAAAA/x...
[DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 2]: p7x3w1hmOQVtuV...
[DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 3]: v7uTboMGDbyOy+...
[DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] [part 4]: 4tse0RnPnLeDNN...
[DEBUG][o.e.e.NodeEnvironment ] [master] hot threads while failing to obtain shard lock for [index][0] (gzip compressed, base64-encoded, and split into 4 parts on preceding log lines)
----

To reconstruct the output, base64-decode the data and decompress it using
`gzip`. For instance, on Unix-like systems:

[source,sh]
----
cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
----