fault-detection.asciidoc 2.5 KB

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364
  1. [[cluster-fault-detection]]
  2. === Cluster fault detection
  3. The elected master periodically checks each of the nodes in the cluster to
  4. ensure that they are still connected and healthy. Each node in the cluster also
  5. periodically checks the health of the elected master. These checks are known
  6. respectively as _follower checks_ and _leader checks_.
  7. Elasticsearch allows these checks to occasionally fail or timeout without
  8. taking any action. It considers a node to be faulty only after a number of
  9. consecutive checks have failed. You can control fault detection behavior with
  10. <<modules-discovery-settings,`cluster.fault_detection.*` settings>>.
  11. If the elected master detects that a node has disconnected, however, this
  12. situation is treated as an immediate failure. The master bypasses the timeout
  13. and retry setting values and attempts to remove the node from the cluster.
  14. Similarly, if a node detects that the elected master has disconnected, this
  15. situation is treated as an immediate failure. The node bypasses the timeout and
  16. retry settings and restarts its discovery phase to try and find or elect a new
  17. master.
  18. [[cluster-fault-detection-filesystem-health]]
  19. Additionally, each node periodically verifies that its data path is healthy by
  20. writing a small file to disk and then deleting it again. If a node discovers
  21. its data path is unhealthy then it is removed from the cluster until the data
  22. path recovers. You can control this behavior with the
  23. <<modules-discovery-settings,`monitor.fs.health` settings>>.
  24. [[cluster-fault-detection-cluster-state-publishing]]
  25. The elected master node
  26. will also remove nodes from the cluster if nodes are unable to apply an updated
  27. cluster state within a reasonable time. The timeout defaults to 2 minutes
  28. starting from the beginning of the cluster state update. Refer to
  29. <<cluster-state-publishing>> for a more detailed description.
  30. [[cluster-fault-detection-troubleshooting]]
  31. ==== Troubleshooting an unstable cluster
  32. See <<troubleshooting-unstable-cluster>>.
  33. [discrete]
  34. ===== Diagnosing `disconnected` nodes
  35. See <<troubleshooting-unstable-cluster-disconnected>>.
  36. [discrete]
  37. ===== Diagnosing `lagging` nodes
  38. See <<troubleshooting-unstable-cluster-lagging>>.
  39. [discrete]
  40. ===== Diagnosing `follower check retry count exceeded` nodes
  41. See <<troubleshooting-unstable-cluster-follower-check>>.
  42. [discrete]
  43. ===== Diagnosing `ShardLockObtainFailedException` failures
  44. See <<troubleshooting-unstable-cluster-shardlockobtainfailedexception>>.
  45. [discrete]
  46. ===== Diagnosing other network disconnections
  47. See <<troubleshooting-unstable-cluster-network>>.