
Small changes to corruption troubleshooting docs (#95265)

- Mention that third-party software may be to blame too
- Mention `strace` as a last resort
- Minor rewordings
David Turner 2 years ago
parent
commit
b4b9292ce9
1 changed file with 17 additions and 8 deletions

+ 17 - 8
docs/reference/troubleshooting/corruption-issues.asciidoc

@@ -32,16 +32,17 @@ The files that make up a Lucene index are written sequentially from start to
 end and then never modified or overwritten. This access pattern means the
 checksum computation is very simple and can happen on-the-fly as the file is
 initially written, and also makes it very unlikely that an incorrect checksum
-is due to a userspace bug at the time the file was written. The routine that
-computes the checksum is straightforward, widely used, and very well-tested, so
-you can be very confident that a checksum mismatch really does indicate that
-the data read from disk is different from the data that {es} previously wrote.
+is due to a userspace bug at the time the file was written. The part of {es}
+that computes the checksum is straightforward, widely used, and very
+well-tested, so you can be very confident that a checksum mismatch really does
+indicate that the data read from disk is different from the data that {es}
+previously wrote.
 
 The files that make up a Lucene index are written in full before they are used.
 If a file is needed to recover an index after a restart then your storage
 system previously confirmed to {es} that this file was durably synced to disk.
 On Linux this means that the `fsync()` system call returned successfully. {es}
-sometimes detects that an index is corrupt because a file needed for recovery
+sometimes reports that an index is corrupt because a file needed for recovery
 has been truncated or is missing its footer. This indicates that your storage
 system acknowledges durable writes incorrectly.
 
@@ -64,6 +65,8 @@ work correctly.
 - Faulty hardware, which may include the drive itself, the RAID controller,
   your RAM or CPU.
 
+- Third-party software which modifies the files that {es} writes.
+
 Data corruption typically doesn't result in other evidence of problems apart
 from the checksum mismatch. Do not interpret this as an indication that your
 storage subsystem is working correctly and therefore that {es} itself caused
@@ -76,12 +79,15 @@ using something other than {es} and look for data integrity errors. On Linux
 the `fio` and `stress-ng` tools can both generate challenging I/O workloads and
 verify the integrity of the data they write. Use version 0.12.01 or newer of
 `stress-ng` since earlier versions do not have strong enough integrity checks.
-You can check that durable writes persist across power outages using a script
-such as https://gist.github.com/bradfitz/3172656[`diskchecker.pl`].
+Verify that durable writes persist across power outages using a script such as
+https://gist.github.com/bradfitz/3172656[`diskchecker.pl`]. Alternatively, use
+a tool such as `strace` to observe the sequence of syscalls that {es} makes
+when writing data and confirm that this sequence does not explain the reported
+corruption.
 
 To narrow down the source of the corruptions, systematically change components
 in your cluster's environment until the corruptions stop. The details will
-depend on the exact configuration of your hardware, but may include the
+depend on the exact configuration of your environment, but may include the
 following:
 
 - Try a different filesystem or a different kernel.
@@ -90,3 +96,6 @@ following:
   model or manufacturer.
 
 - Try different firmware versions for each hardware component.
+
+- Remove any third-party software which may modify the contents of the {es}
+  data path.
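
The checksum reasoning in the changed document (files written sequentially, a checksum computed on the fly, a footer checked on read, `fsync()` before the file is relied on) can be illustrated with a minimal Python sketch. This is not how {es} or Lucene actually lay out their files: the 4-byte CRC32 footer and the names `write_with_checksum` and `verify` are assumptions made purely for illustration.

```python
import os
import struct
import zlib

# Hypothetical footer layout: a single big-endian unsigned 32-bit CRC32.
FOOTER = struct.Struct(">I")


def write_with_checksum(path, chunks):
    """Write chunks sequentially, updating a running CRC32 as each is written."""
    crc = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            crc = zlib.crc32(chunk, crc)  # on-the-fly checksum, no re-read needed
        f.write(FOOTER.pack(crc))
        f.flush()
        os.fsync(f.fileno())  # ask the storage system to confirm durability
    return crc


def verify(path):
    """Return True only if the footer is present and matches the body's CRC32."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < FOOTER.size:
        return False  # truncated: too short to contain a footer at all
    body, footer = data[:-FOOTER.size], data[-FOOTER.size:]
    (stored,) = FOOTER.unpack(footer)
    return zlib.crc32(body) == stored
```

Because the file is never modified after `write_with_checksum` returns, a later `verify` failure in this sketch can only mean that the bytes read back differ from the bytes written, which mirrors the argument the documentation makes about checksum mismatches and truncated files.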