tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether the node is experiencing high heap usage with long GC pauses.
If so, <<high-jvm-memory-pressure,the troubleshooting guide for high heap
usage>> has some suggestions for further investigation, but typically you will
need to capture a heap dump and the <<gc-logging,garbage collector logs>>
during a time of high heap usage to fully understand the problem. The sketch
at the end of this list shows one way to collect this evidence.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs. If you see evidence of other processes pausing at the same time, or
unexpected clock discontinuities, investigate the infrastructure on which you
are running {es}.
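+
The following is a minimal sketch of collecting the heap dump and GC logs
described above, assuming the JDK tools are on the `PATH`, that `$ES_PID`
holds the process ID of the {es} JVM, and that the output paths (placeholders)
are adjusted for your installation:
+
[source,sh]
----
# Capture a heap dump of the running {es} JVM during a period of high heap
# usage ($ES_PID and the output path are placeholders).
jmap -dump:format=b,file=/tmp/es-heap.hprof $ES_PID

# Copy the GC logs, which {es} writes to its logs directory in the default
# configuration.
cp /path/to/elasticsearch/logs/gc.log* /tmp/gc-logs/
----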
end::troubleshooting-network-timeouts-gc-vm[]
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes and
analyse it alongside the {es} logs from those nodes. You should be able to
observe any retransmissions, packet loss, or other delays on the connections
between the nodes.
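+
As a minimal sketch, assuming the default transport port of `9300` and that
`tcpdump` is available on the hosts, you might run the following on every
relevant node at the same time and then compare the resulting files:
+
[source,sh]
----
# Record full packets of inter-node transport traffic; the output file name
# is a placeholder. Stop the capture with Ctrl-C once the problem recurs.
tcpdump -i any -s 0 -w /tmp/es-transport-$(hostname).pcap tcp port 9300
----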
end::troubleshooting-network-timeouts-packet-capture-elections[]
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node and analyse it alongside the {es} logs from those nodes. The
connection used for follower checks is not used for any other traffic so it can
be easily identified from the flow pattern alone, even if TLS is in use: almost
exactly every second there will be a few hundred bytes sent each way, first the
request by the master and then the response by the follower. You should be able
to observe any retransmissions, packet loss, or other delays on such a
connection.
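+
As a sketch, assuming the default transport port of `9300`, that `tcpdump` is
available, and that `10.0.0.5` stands in for the faulty node's address, you
might capture on the elected master and then skim the capture for the
once-per-second request/response pattern described above:
+
[source,sh]
----
# On the elected master, capture only traffic to and from the faulty node
# (addresses and file names are placeholders).
tcpdump -i any -s 0 -w /tmp/follower-checks.pcap 'tcp port 9300 and host 10.0.0.5'

# Read the capture back; retransmissions and long gaps between the
# once-per-second request/response pairs stand out in the timestamps.
tcpdump -n -r /tmp/follower-checks.pcap | head -n 100
----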
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps of the main {es} process (for example, using `jstack`) or a
profiling trace (for example, using Java Flight Recorder) in the few seconds
leading up to the relevant log message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster, so it may be affected by
the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs, particularly in warning logs from
`org.elasticsearch.transport.InboundHandler`. See
<<modules-network-threading-model>> for more information.
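+
As a minimal sketch, assuming `$ES_PID` holds the process ID of the {es} JVM,
the bundled JDK tools are on the `PATH`, and the HTTP endpoint and credentials
below are placeholders for your own cluster, you might collect this evidence
as follows:
+
[source,sh]
----
# Take a stack dump of the {es} JVM; repeat every second or so during the
# period leading up to the relevant log message.
jstack $ES_PID > /tmp/es-jstack-$(date +%s).txt

# Alternatively, record a short Java Flight Recorder profiling trace.
jcmd $ES_PID JFR.start duration=60s filename=/tmp/es-recording.jfr

# The hot threads API, if the cluster is healthy enough to serve it.
curl -u elastic "http://localhost:9200/_nodes/hot_threads?threads=9999"
----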
end::troubleshooting-network-timeouts-threads[]