- [[cluster-fault-detection]]
- === Cluster fault detection
- The elected master periodically checks each of the nodes in the cluster to
- ensure that they are still connected and healthy. Each node in the cluster also
- periodically checks the health of the elected master. These checks are known
- respectively as _follower checks_ and _leader checks_.
- Elasticsearch allows these checks to occasionally fail or time out without
- taking any action. It considers a node to be faulty only after a number of
- consecutive checks have failed. You can control fault detection behavior with
- <<modules-discovery-settings,`cluster.fault_detection.*` settings>>.
- If the elected master detects that a node has disconnected, however, this
- situation is treated as an immediate failure. The master bypasses the timeout
- and retry setting values and attempts to remove the node from the cluster.
- Similarly, if a node detects that the elected master has disconnected, this
- situation is treated as an immediate failure. The node bypasses the timeout and
- retry settings and restarts its discovery phase to try to find or elect a new
- master.
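The two rules above (tolerate isolated failures, but treat a disconnect as an immediate failure) can be sketched in a few lines of Python. This is an illustrative model only, not Elasticsearch's implementation: `CheckResult`, `FaultTracker`, and the `retry_count` value of 3 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class CheckResult(Enum):
    OK = "ok"
    FAILED = "failed"              # the check returned an error or timed out
    DISCONNECTED = "disconnected"  # the connection itself was lost


@dataclass
class FaultTracker:
    """Tracks consecutive check failures for one peer (hypothetical helper)."""

    retry_count: int = 3  # assumed value for illustration
    consecutive_failures: int = 0

    def record(self, result: CheckResult) -> bool:
        """Return True if the peer should now be treated as faulty."""
        if result is CheckResult.DISCONNECTED:
            # A disconnect bypasses the retry budget entirely.
            return True
        if result is CheckResult.OK:
            # Any success resets the streak; only consecutive failures count.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.retry_count


tracker = FaultTracker(retry_count=3)
sequence = [
    CheckResult.FAILED,
    CheckResult.OK,
    CheckResult.FAILED,
    CheckResult.FAILED,
    CheckResult.FAILED,
]
print([tracker.record(r) for r in sequence])
# prints [False, False, False, False, True]
```

The single success in the second position resets the counter, so the peer is only reported as faulty after three failures in a row.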
- [[cluster-fault-detection-filesystem-health]]
- Additionally, each node periodically verifies that its data path is healthy by
- writing a small file to disk and then deleting it again. If a node discovers
- that its data path is unhealthy, it is removed from the cluster until the data
- path recovers. You can control this behavior with the
- <<modules-discovery-settings,`monitor.fs.health` settings>>.
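The write-then-delete probe described above might look roughly like the following Python sketch. It is a simplified illustration under stated assumptions, not the code Elasticsearch runs: the function name `data_path_is_healthy` and the probe file prefix are hypothetical.

```python
import os
import tempfile


def data_path_is_healthy(data_path: str) -> bool:
    """Write a small temporary file under data_path, sync it, then delete it."""
    try:
        fd, probe = tempfile.mkstemp(prefix=".health_probe_", dir=data_path)
        try:
            os.write(fd, b"healthy")
            os.fsync(fd)  # force the bytes to disk so a failing device is noticed
        finally:
            os.close(fd)
            os.remove(probe)
        return True
    except OSError:
        # Any I/O error (missing path, read-only or failing disk) counts as unhealthy.
        return False


with tempfile.TemporaryDirectory() as path:
    print(data_path_is_healthy(path))                         # healthy directory
    print(data_path_is_healthy(os.path.join(path, "gone")))   # path does not exist
```

A real health check would run this on a schedule and would usually also treat an unusually slow write as a warning sign.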
- [[cluster-fault-detection-cluster-state-publishing]]
- The elected master also removes a node from the cluster if that node cannot
- apply an updated cluster state within a reasonable time, which by default is
- 120s after the cluster state update started. See
- <<cluster-state-publishing,cluster state publishing>> for a more detailed description.
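As a rough model of this rule, the sketch below picks out the nodes that missed the deadline for a given cluster state update. Only the 120s default comes from the text above. The function `lagging_nodes`, its parameters, and the node names are hypothetical, and the real master tracks this per published state rather than from a simple dictionary.

```python
from typing import Dict, List, Optional

APPLY_DEADLINE_SECONDS = 120.0  # default deadline mentioned in the text


def lagging_nodes(
    applied_after: Dict[str, Optional[float]],
    elapsed: float,
    deadline: float = APPLY_DEADLINE_SECONDS,
) -> List[str]:
    """Return the nodes that did not apply the state within the deadline.

    applied_after maps a node name to the number of seconds after the update
    started at which the node applied the state, or None if it has not yet.
    elapsed is the number of seconds since the update started.
    """
    if elapsed < deadline:
        return []  # still inside the grace period, so nobody is removed yet
    return sorted(
        node
        for node, seconds in applied_after.items()
        if seconds is None or seconds > deadline
    )


acks = {"node-a": 0.4, "node-b": None, "node-c": 130.0}
print(lagging_nodes(acks, elapsed=60.0))   # prints []
print(lagging_nodes(acks, elapsed=150.0))  # prints ['node-b', 'node-c']
```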