More detail around packet captures (#111835)

Clarify that it's best to analyse the captures alongside the node logs,
and spell out in a bit more detail how to use packet captures and logs
to pin down the cause of a `disconnected` node.
David Turner, 1 year ago
parent
commit
e5fd63bbb8

+ 8 - 7
docs/reference/modules/discovery/fault-detection.asciidoc

@@ -168,9 +168,8 @@ reason, something other than {es} likely caused the connection to close. A
 common cause is a misconfigured firewall with an improper timeout or another
 policy that's <<long-lived-connections,incompatible with {es}>>. It could also
 be caused by general connectivity issues, such as packet loss due to faulty
-hardware or network congestion. If you're an advanced user, you can get more
-detailed information about network exceptions by configuring the following
-loggers:
+hardware or network congestion. If you're an advanced user, configure the
+following loggers to get more detailed information about network exceptions:
 
 [source,yaml]
 ----
@@ -178,9 +177,11 @@ logger.org.elasticsearch.transport.TcpTransport: DEBUG
 logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
 ----
 
-In extreme cases, you may need to take packet captures using `tcpdump` to
-determine whether messages between nodes are being dropped or rejected by some
-other device on the network.
+If these logs do not show enough information to diagnose the problem, obtain a
+packet capture simultaneously from the nodes at both ends of an unstable
+connection and analyse it alongside the {es} logs from those nodes to determine
+if traffic between the nodes is being disrupted by another device on the
+network.
 
 [discrete]
 ===== Diagnosing `lagging` nodes
@@ -299,4 +300,4 @@ To reconstruct the output, base64-decode the data and decompress it using
 ----
 cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
 ----
-//end::troubleshooting[]
+//end::troubleshooting[]

+ 11 - 9
docs/reference/troubleshooting/network-timeouts.asciidoc

@@ -16,20 +16,22 @@ end::troubleshooting-network-timeouts-gc-vm[]
 
 tag::troubleshooting-network-timeouts-packet-capture-elections[]
 * Packet captures will reveal system-level and network-level faults, especially
-if you capture the network traffic simultaneously at all relevant nodes. You
-should be able to observe any retransmissions, packet loss, or other delays on
-the connections between the nodes.
+if you capture the network traffic simultaneously at all relevant nodes and
+analyse it alongside the {es} logs from those nodes. You should be able to
+observe any retransmissions, packet loss, or other delays on the connections
+between the nodes.
 end::troubleshooting-network-timeouts-packet-capture-elections[]
 
 tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
 * Packet captures will reveal system-level and network-level faults, especially
 if you capture the network traffic simultaneously at the elected master and the
-faulty node. The connection used for follower checks is not used for any other
-traffic so it can be easily identified from the flow pattern alone, even if TLS
-is in use: almost exactly every second there will be a few hundred bytes sent
-each way, first the request by the master and then the response by the
-follower. You should be able to observe any retransmissions, packet loss, or
-other delays on such a connection.
+faulty node and analyse it alongside the {es} logs from those nodes. The
+connection used for follower checks is not used for any other traffic so it can
+be easily identified from the flow pattern alone, even if TLS is in use: almost
+exactly every second there will be a few hundred bytes sent each way, first the
+request by the master and then the response by the follower. You should be able
+to observe any retransmissions, packet loss, or other delays on such a
+connection.
 end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
 
 tag::troubleshooting-network-timeouts-threads[]
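
Note on the flow pattern described in the fault-detection text above: the follower-check connection is said to carry a few hundred bytes each way almost exactly once per second. The following is a minimal sketch of how that pattern could be picked out of a capture that has already been exported as `(timestamp, source, destination, payload_bytes)` records. The function name `periodic_flows`, the record layout, the 0.2-second tolerance, and the 1000-byte payload ceiling are all assumptions made for illustration and are not part of {es} or of any capture tool. The sketch also assumes that each request and each response fits in a single packet.

```python
from collections import defaultdict
from statistics import median


def periodic_flows(packets, period=1.0, tolerance=0.2, max_payload=1000):
    """Find connections whose requests recur roughly every `period` seconds.

    `packets` is an iterable of (timestamp_seconds, src, dst, payload_bytes)
    tuples (a hypothetical export format). Returns a dict that maps each
    matching endpoint pair to the list of request gaps that exceeded
    period + tolerance, which are candidates for delays or lost packets.
    """
    flows = defaultdict(list)
    for ts, src, dst, size in packets:
        if size > 0:  # ignore bare ACKs, which carry no payload
            flows[frozenset((src, dst))].append((ts, src, size))

    result = {}
    for key, records in flows.items():
        records.sort()
        # Assumption: whoever sends the first payload is the requester.
        initiator = records[0][1]
        senders = {src for _, src, _ in records}
        requests = [ts for ts, src, _ in records if src == initiator]
        # Need traffic in both directions and enough requests to measure.
        if len(senders) < 2 or len(requests) < 3:
            continue
        # Assumed ceiling standing in for "a few hundred bytes".
        if any(size > max_payload for _, _, size in records):
            continue
        gaps = [b - a for a, b in zip(requests, requests[1:])]
        if abs(median(gaps) - period) <= tolerance:
            result[tuple(sorted(key))] = [
                g for g in gaps if g > period + tolerance
            ]
    return result
```

Any long gaps reported for a matching flow are only a starting point. As the commit message says, they are best interpreted alongside the {es} logs from the nodes at both ends of the connection.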