[[discovery-troubleshooting]]
== Troubleshooting discovery

In most cases, the discovery and election process completes quickly, and the
master node remains elected for a long period of time.

If your cluster doesn't have a stable master, many of its features won't work
correctly and {es} will report errors to clients and in its logs. You must fix
the master node's instability before addressing these other issues. It will not
be possible to solve any other issues while there is no elected master node or
the elected master node is unstable.

If your cluster has a stable master but some nodes can't discover or join it,
these nodes will report errors to clients and in their logs. You must address
the obstacles preventing these nodes from joining the cluster before addressing
other issues. It will not be possible to solve any other issues reported by
these nodes while they are unable to join the cluster.

If the cluster has no elected master node for more than a few seconds, the
master is unstable, or some nodes are unable to discover or join a stable
master, then {es} will record information in its logs explaining why. If the
problems persist for more than a few minutes, {es} will record additional
information in its logs. To properly troubleshoot discovery and election
problems, collect and analyse logs covering at least five minutes from all
nodes.

The following sections describe some common discovery and election problems.

[discrete]
[[discovery-no-master]]
=== No master is elected

When a node wins the master election, it logs a message containing
`elected-as-master` and all nodes log a message containing
`master node changed` identifying the new elected master node.

If there is no elected master node and no node can win an election, all
nodes will repeatedly log messages about the problem using a logger called
`org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
default, this happens every 10 seconds.

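
As a rough aid when reading through collected logs, the following Python sketch
tallies the three markers mentioned above. It assumes plain-text logs with one
message per line. The function name `classify_log_lines`, the labels in
`MARKERS`, and the sample lines are hypothetical and are not actual {es} output.

```python
from collections import Counter

# Substrings named in this section. The labels on the left are made up for
# this sketch and have no meaning to Elasticsearch.
MARKERS = {
    "election_won": "elected-as-master",
    "master_changed": "master node changed",
    "formation_failure": "ClusterFormationFailureHelper",
}


def classify_log_lines(lines):
    """Count how many lines contain each marker substring."""
    counts = Counter()
    for line in lines:
        for label, needle in MARKERS.items():
            if needle in line:
                counts[label] += 1
    return counts


# Hypothetical, simplified lines for illustration only.
sample = [
    "[WARN ][ClusterFormationFailureHelper] example failure description",
    "[INFO ] example message containing elected-as-master",
    "[INFO ] example message containing master node changed",
]

for label, count in classify_log_lines(sample).items():
    print(label, count)
```

A steadily growing `formation_failure` count with no `election_won` entries on
any node is consistent with the situation this section describes.
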
Master elections only involve master-eligible nodes, so focus your attention on
the master-eligible nodes in this situation. These nodes' logs will indicate
the requirements for a master election, such as the discovery of a certain set
of nodes. The <<health-api>> API on these nodes will also provide useful
information about the situation.

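
The health report can also be read programmatically. The sketch below is a
minimal example that assumes the report is served at `/_health_report`, that it
contains a `master_is_stable` indicator with `status` and `symptom` fields, and
that the node is reachable at `http://localhost:9200` without authentication.
Check these assumptions against the <<health-api>> documentation for your
version. The function names are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_health_report(base_url="http://localhost:9200", timeout=10):
    """Fetch the health report from one node.

    Assumption: an unsecured node at base_url; add TLS and credentials as
    your cluster requires.
    """
    with urlopen(f"{base_url}/_health_report", timeout=timeout) as response:
        return json.load(response)


def summarize_master_stability(report):
    """Return (status, symptom) for the master_is_stable indicator.

    Falls back to ("unknown", "") if the indicator is absent from the report.
    """
    indicator = report.get("indicators", {}).get("master_is_stable", {})
    return indicator.get("status", "unknown"), indicator.get("symptom", "")
```

For example, `summarize_master_stability(fetch_health_report())` could be run
against each master-eligible node in turn and the results compared.
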
If the logs or the health report indicate that {es} can't discover enough nodes
to form a quorum, you must address the reasons preventing {es} from discovering
the missing nodes. The missing nodes are needed to reconstruct the cluster
metadata. Without the cluster metadata, the data in your cluster is
meaningless. The cluster metadata is stored on a subset of the master-eligible
nodes in the cluster. If a quorum can't be discovered, the missing nodes were
the ones holding the cluster metadata.

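
To make the quorum requirement concrete, the following sketch works through the
arithmetic, assuming that a quorum is a strict majority of the master-eligible
nodes in the voting configuration (see <<modules-discovery-quorums>>). The
function names are illustrative.

```python
def quorum_size(voting_nodes):
    """Smallest number of nodes that forms a strict majority of voting_nodes."""
    return voting_nodes // 2 + 1


def can_form_quorum(voting_nodes, discovered_nodes):
    """True if the discovered voting nodes are enough for a majority."""
    return discovered_nodes >= quorum_size(voting_nodes)


for n in (1, 3, 5):
    print(f"{n} voting nodes need {quorum_size(n)} to form a quorum")
```

Under this assumption, a cluster with three voting nodes can elect a master
while one of them is missing, but not while two are missing.
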
Ensure there are enough nodes running to form a quorum and that every node can
communicate with every other node over the network. {es} will report additional
details about network connectivity if the election problems persist for more
than a few minutes. If you can't start enough nodes to form a quorum, start a
new cluster and restore data from a recent snapshot. Refer to
<<modules-discovery-quorums>> for more information.

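
One quick, partial way to check the connectivity requirement above is to try
opening a TCP connection to each node's transport address. The sketch below
assumes the transport port is 9300 and uses hypothetical function names. A
successful connection only shows that the port is reachable from the machine
running the script. It does not prove that {es} nodes can communicate, so treat
it as a first check and run it from every node.

```python
import socket


def can_connect(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable(hosts, port=9300, timeout=3.0):
    """Return the hosts that did not accept a TCP connection."""
    return [host for host in hosts if not can_connect(host, port, timeout)]
```

For example, `unreachable(["node-1", "node-2", "node-3"])` returns the subset
of those (hypothetical) host names that could not be reached.
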
If the logs or the health report indicate that {es} _has_ discovered a possible
quorum of nodes, the typical reason that the cluster can't elect a master is
that one of the other nodes can't discover a quorum. Inspect the logs on the
other master-eligible nodes and ensure that they have all discovered enough
nodes to form a quorum.

If the logs suggest that discovery or master elections are failing due to
timeouts or network-related issues then narrow down the problem as follows.

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

[discrete]
[[discovery-master-unstable]]
=== Master is elected but unstable

When a node wins the master election, it logs a message containing
`elected-as-master`. If this happens repeatedly, the elected master node is
unstable. In this situation, focus on the logs from the master-eligible nodes
to understand why the election winner stops being the master and triggers
another election. If the logs suggest that the master is unstable due to
timeouts or network-related issues then narrow down the problem as follows.

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

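
To spot the repeated `elected-as-master` messages described in this section,
the following sketch flags logs in which several elections complete within a
short window. It assumes each log line starts with a bracketed timestamp such
as `[2024-01-01T10:00:00,000]`. The five-minute window and the threshold of two
elections are arbitrary illustrative choices, not {es} defaults, and the
function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: [YYYY-MM-DDTHH:MM:SS,mmm]
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}),\d+\]")


def election_times(lines):
    """Return the timestamps of lines that contain `elected-as-master`."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and "elected-as-master" in line:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return times


def looks_unstable(lines, window=timedelta(minutes=5), threshold=2):
    """True if at least `threshold` elections fall within any single `window`."""
    times = sorted(election_times(lines))
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False
```

Pass in the combined log lines from all master-eligible nodes, since each
election is logged only by the node that won it.
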
[discrete]
[[discovery-cannot-join-master]]
=== Node cannot discover or join stable master

If there is a stable elected master but a node can't discover or join its
cluster, it will repeatedly log messages about the problem using the
`ClusterFormationFailureHelper` logger. The <<health-api>> API on the affected
node will also provide useful information about the situation. Other log
messages on the affected node and the elected master may provide additional
information about the problem. If the logs suggest that the node cannot
discover or join the cluster due to timeouts or network-related issues then
narrow down the problem as follows.

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

[discrete]
[[discovery-node-leaves]]
=== Node joins cluster and leaves again

If a node joins the cluster but {es} determines it to be faulty then it will be
removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
for more information.
  92. for more information.