// tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not the node is experiencing high heap usage with long GC
pauses. If so, <<high-jvm-memory-pressure,the troubleshooting guide for high
heap usage>> has some suggestions for further investigation but typically you
will need to capture a heap dump during a time of high heap usage to fully
understand the problem.
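+
For example, assuming `jmap` from the JDK is available on the node, `$ES_PID`
holds the process ID of the {es} JVM, and `/tmp/es-heap.hprof` is an acceptable
output path (both are placeholders), a heap dump could be captured with a
command along these lines:
+
[source,sh]
----
# Dump the heap of the running JVM; the "live" option first forces a full GC
# so that only reachable objects appear in the dump
jmap -dump:live,format=b,file=/tmp/es-heap.hprof $ES_PID
----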
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs. If you see evidence of other processes pausing at the same time, or
unexpected clock discontinuities, investigate the infrastructure on which you
are running {es}.
// end::troubleshooting-network-timeouts-gc-vm[]
// tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes. You
should be able to observe any retransmissions, packet loss, or other delays on
the connections between the nodes.
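+
As a sketch, assuming the cluster uses the default transport port of `9300`
and that `/tmp/transport.pcap` is a suitable output path, a capture could be
taken on each relevant node with `tcpdump`:
+
[source,sh]
----
# Record inter-node transport traffic on every interface, with full-length
# packets, into a pcap file for later analysis (for example in Wireshark)
tcpdump -i any -s 0 -w /tmp/transport.pcap tcp port 9300
----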
// end::troubleshooting-network-timeouts-packet-capture-elections[]
// tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
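+
For instance, assuming the default transport port of `9300` and using
`198.51.100.1` as a placeholder for the address of the other node in the pair,
a capture on the faulty node (and, symmetrically, on the elected master) could
look like this:
+
[source,sh]
----
# Capture all transport traffic exchanged with the other node of interest;
# the follower-check connection can then be picked out from the pcap by its
# roughly once-per-second request/response cadence
tcpdump -i any -s 0 -w /tmp/follower-checks.pcap tcp port 9300 and host 198.51.100.1
----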
// end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
// tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to the relevant log
message.
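+
For instance, assuming `$ES_PID` holds the process ID of the {es} JVM and the
JDK tools are on the path, a stack dump or a short flight recording could be
captured along these lines:
+
[source,sh]
----
# Print the stack traces of every thread in the JVM
jstack $ES_PID

# Start a 60-second Java Flight Recorder recording and write it to a file
jcmd $ES_PID JFR.start duration=60s filename=/tmp/es-recording.jfr
----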
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster, so it may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
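+
If you do call the hot threads API, and assuming the node's HTTP interface is
reachable at `localhost:9200` without security enabled (adjust the URL and
credentials for your deployment), the request can be made with `curl`:
+
[source,sh]
----
# Ask every node in the cluster to report its hottest threads
curl -s "localhost:9200/_nodes/hot_threads"
----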
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs, in particular in the warning logs emitted by
`org.elasticsearch.transport.InboundHandler`. See
<<modules-network-threading-model>> for more information.
// end::troubleshooting-network-timeouts-threads[]