12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364 |
- [[cluster-fault-detection]]
- === Cluster fault detection
- The elected master periodically checks each of the nodes in the cluster to
- ensure that they are still connected and healthy. Each node in the cluster also
- periodically checks the health of the elected master. These checks are known
- respectively as _follower checks_ and _leader checks_.
- Elasticsearch allows these checks to occasionally fail or timeout without
- taking any action. It considers a node to be faulty only after a number of
- consecutive checks have failed. You can control fault detection behavior with
- <<modules-discovery-settings,`cluster.fault_detection.*` settings>>.
- If the elected master detects that a node has disconnected, however, this
- situation is treated as an immediate failure. The master bypasses the timeout
- and retry setting values and attempts to remove the node from the cluster.
- Similarly, if a node detects that the elected master has disconnected, this
- situation is treated as an immediate failure. The node bypasses the timeout and
- retry settings and restarts its discovery phase to try and find or elect a new
- master.
- [[cluster-fault-detection-filesystem-health]]
- Additionally, each node periodically verifies that its data path is healthy by
- writing a small file to disk and then deleting it again. If a node discovers
- its data path is unhealthy then it is removed from the cluster until the data
- path recovers. You can control this behavior with the
- <<modules-discovery-settings,`monitor.fs.health` settings>>.
- [[cluster-fault-detection-cluster-state-publishing]]
- The elected master node
- will also remove nodes from the cluster if nodes are unable to apply an updated
- cluster state within a reasonable time. The timeout defaults to 2 minutes
- starting from the beginning of the cluster state update. Refer to
- <<cluster-state-publishing>> for a more detailed description.
- [[cluster-fault-detection-troubleshooting]]
- ==== Troubleshooting an unstable cluster
- See <<troubleshooting-unstable-cluster>>.
- [discrete]
- ===== Diagnosing `disconnected` nodes
- See <<troubleshooting-unstable-cluster-disconnected>>.
- [discrete]
- ===== Diagnosing `lagging` nodes
- See <<troubleshooting-unstable-cluster-lagging>>.
- [discrete]
- ===== Diagnosing `follower check retry count exceeded` nodes
- See <<troubleshooting-unstable-cluster-follower-check>>.
- [discrete]
- ===== Diagnosing `ShardLockObtainFailedException` failures
- See <<troubleshooting-unstable-cluster-shardlockobtainfailedexception>>.
- [discrete]
- ===== Diagnosing other network disconnections
- See <<troubleshooting-unstable-cluster-network>>.
|