tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not the node is experiencing high heap usage with long GC
pauses. If so, <<high-jvm-memory-pressure,the troubleshooting guide for high
heap usage>> has some suggestions for further investigation but typically you
will need to capture a heap dump and the <<gc-logging,garbage collector logs>>
during a time of high heap usage to fully understand the problem.
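+
As a rough sketch, assuming a package-based install whose logs live under
`/var/log/elasticsearch` and a cluster named `my-cluster` (both are assumptions;
adjust the paths for your environment), you could look for GC overhead warnings
and the dedicated GC log files like this:
+
[source,sh]
----
# GC overhead warnings written to the main node log by the JVM monitor
grep "\[gc\]" /var/log/elasticsearch/my-cluster.log

# the dedicated GC logs (gc.log plus its rotated files) emitted by the JVM
ls -l /var/log/elasticsearch/gc.log*
----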
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs. If you see evidence of other processes pausing at the same time, or
unexpected clock discontinuities, investigate the infrastructure on which you
are running {es}.
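+
One way to spot such clock discontinuities, using the same hypothetical log
location as above and assuming the warning text is unchanged in your version,
is to search for warnings about the absolute clock:
+
[source,sh]
----
# Elasticsearch warns when its timer thread sleeps far longer than expected,
# which is a common symptom of a VM pause
grep -i "absolute clock" /var/log/elasticsearch/my-cluster.log
----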
end::troubleshooting-network-timeouts-gc-vm[]
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes and
analyse it alongside the {es} logs from those nodes. You should be able to
observe any retransmissions, packet loss, or other delays on the connections
between the nodes.
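+
For example, a minimal capture sketch, assuming the default transport port
`9300` (an assumption; adjust the port and interface for your network), run at
the same time on each relevant node:
+
[source,sh]
----
# capture inter-node transport traffic; stop with Ctrl-C once the problem has
# reproduced, then analyse the .pcap files from all nodes side by side
tcpdump -i any -w "es-transport-$(hostname).pcap" tcp port 9300
----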
end::troubleshooting-network-timeouts-packet-capture-elections[]
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node and analyse it alongside the {es} logs from those nodes. The
connection used for follower checks is not used for any other traffic so it can
be easily identified from the flow pattern alone, even if TLS is in use: almost
exactly every second there will be a few hundred bytes sent each way, first the
request by the master and then the response by the follower. You should be able
to observe any retransmissions, packet loss, or other delays on such a
connection.
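+
A sketch of such a capture, run on the faulty node, assuming the default
transport port `9300` and using the hypothetical variable `MASTER_IP` for the
elected master's address:
+
[source,sh]
----
# capture the transport traffic between this node and the elected master; the
# follower-check connection can then be picked out by its regular
# once-per-second flow pattern
tcpdump -i any -w follower-checks.pcap host "$MASTER_IP" and tcp port 9300
----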
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps of the main {es} process (for example, using `jstack`) or a
profiling trace (for example, using Java Flight Recorder) in the few seconds
leading up to the relevant log message.
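+
For example, a minimal sketch using `jps` to find the JVM's process ID (stored
here in a hypothetical `ES_PID` variable) and taking a few dumps in quick
succession:
+
[source,sh]
----
# list running JVMs to identify the Elasticsearch process ID
jps -l

# take five stack dumps, one second apart, around the time of the timeout
for i in 1 2 3 4 5; do
  jstack "$ES_PID" > "jstack-$(hostname)-$i.txt"
  sleep 1
done
----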
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
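+
If you do try the API, a sketch of the call, assuming a node reachable at
`localhost:9200`:
+
[source,sh]
----
# ask every node for its hottest threads; note that serving this request itself
# needs free transport_worker and generic threads on each node
curl -sS "http://localhost:9200/_nodes/hot_threads"
----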
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs, particularly in warning logs from
`org.elasticsearch.transport.InboundHandler`. See
<<modules-network-threading-model>> for more information.
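+
A quick way to look for those warnings, assuming a hypothetical log path of
`/var/log/elasticsearch/my-cluster.log`:
+
[source,sh]
----
# warnings from the transport layer about inbound messages that took too long
# to handle, which can indicate that threads were not promptly available
grep "InboundHandler" /var/log/elasticsearch/my-cluster.log
----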
end::troubleshooting-network-timeouts-threads[]