
Add troubleshooting docs about data corruption (#88760)

Adds some docs giving more detailed background about what data
corruption really means and some suggestions about how to narrow down
the root cause.

Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
David Turner 3 years ago
commit 7103053f03

+ 2 - 0
docs/reference/troubleshooting.asciidoc

@@ -67,3 +67,5 @@ include::monitoring/troubleshooting.asciidoc[]
 include::transform/troubleshooting.asciidoc[leveloffset=+1]
 
 include::../../x-pack/docs/en/watcher/troubleshooting.asciidoc[]
+
+include::troubleshooting/corruption-issues.asciidoc[]

+ 92 - 0
docs/reference/troubleshooting/corruption-issues.asciidoc

@@ -0,0 +1,92 @@
+[[corruption-troubleshooting]]
+== Troubleshooting corruption
+
+{es} expects that the data it reads from disk is exactly the data it previously
+wrote. If it detects that the data on disk is different from what it wrote then
+it will report some kind of exception such as:
+
+- `org.apache.lucene.index.CorruptIndexException`
+- `org.elasticsearch.gateway.CorruptStateException`
+- `org.elasticsearch.index.translog.TranslogCorruptedException`
+
+Typically these exceptions happen due to a checksum mismatch. Most of the data
+that {es} writes to disk is followed by a checksum using a simple algorithm
+known as CRC32, which is fast to compute and good at detecting the kinds of
+random corruption that may happen when using faulty storage. A CRC32 checksum
+mismatch definitely indicates that something is faulty, although of course a
+matching checksum doesn't prove the absence of corruption.
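+
+For illustration only, the following sketch shows the kind of check involved,
+assuming a simplified layout in which the last eight bytes of a file hold a
+CRC32 checksum of everything before them. The class and method names are
+invented for this example; it is not the actual on-disk format or verification
+code used by {es} or Lucene.
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.CRC32;
+
+public class ChecksumCheck {
+
+    // Reads a file whose last eight bytes hold a CRC32 checksum of the
+    // preceding bytes, recomputes the checksum, and compares the two values.
+    static void verify(Path file) throws IOException {
+        byte[] bytes = Files.readAllBytes(file);
+        int dataLength = bytes.length - Long.BYTES;
+        if (dataLength < 0) {
+            throw new IOException("file is truncated: no room for a checksum footer");
+        }
+
+        CRC32 crc = new CRC32();
+        crc.update(bytes, 0, dataLength);
+        long actual = crc.getValue();
+
+        long expected = ByteBuffer.wrap(bytes, dataLength, Long.BYTES).getLong();
+        if (actual != expected) {
+            // {es} reports this situation with an exception such as CorruptIndexException.
+            throw new IOException("checksum mismatch: expected " + expected + " but got " + actual);
+        }
+    }
+}
+----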
+
+Verifying a checksum is expensive since it involves reading every byte of the
+file, which takes significant effort and might evict more useful data from the
+filesystem cache, so systems typically don't verify the checksum on a file very
+often. This is why you tend to encounter a corruption exception only when
+something unusual is happening. For instance, corruptions are often detected
+during merges, shard movements, and snapshots. This does not mean that these
+processes are causing the corruption: they are examples of the rare times when
+reading a whole file is necessary. {es} takes the opportunity to verify the
+checksum at the same time, and this is when the corruption is detected and
+reported. It doesn't indicate the cause of the corruption or when it happened.
+Corruptions can remain undetected for many months.
+
+The files that make up a Lucene index are written sequentially from start to
+end and then never modified or overwritten. This access pattern means the
+checksum computation is very simple and can happen on-the-fly as the file is
+initially written, and also makes it very unlikely that an incorrect checksum
+is due to a userspace bug at the time the file was written. The routine that
+computes the checksum is straightforward, widely used, and very well-tested, so
+you can be very confident that a checksum mismatch really does indicate that
+the data read from disk is different from the data that {es} previously wrote.
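+
+The following sketch illustrates this write pattern using standard JDK classes.
+It is not the code that {es} or Lucene actually runs, and it assumes the same
+simplified layout as above, with the checksum stored in the last eight bytes of
+the file.
+
+[source,java]
+----
+import java.io.IOException;
+import java.io.OutputStream;
+import java.nio.ByteBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.CRC32;
+import java.util.zip.CheckedOutputStream;
+
+public class AppendOnlyWrite {
+
+    // Writes the data sequentially, accumulating the CRC32 as a side effect of
+    // the write itself, then appends the checksum for readers to verify later.
+    static void writeWithChecksum(Path file, byte[] data) throws IOException {
+        try (OutputStream out = Files.newOutputStream(file)) {
+            CheckedOutputStream checked = new CheckedOutputStream(out, new CRC32());
+            checked.write(data); // the checksum accumulates as the bytes stream past
+
+            // The checksum is known as soon as the final data byte is written,
+            // with no need for a second pass over the file.
+            long checksum = checked.getChecksum().getValue();
+            out.write(ByteBuffer.allocate(Long.BYTES).putLong(checksum).array());
+        }
+    }
+}
+----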
+
+The files that make up a Lucene index are written in full before they are used.
+If a file is needed to recover an index after a restart then your storage
+system previously confirmed to {es} that this file was durably synced to disk.
+On Linux this means that the `fsync()` system call returned successfully. {es}
+sometimes detects that an index is corrupt because a file needed for recovery
+has been truncated or is missing its footer. This indicates that your storage
+system acknowledges durable writes incorrectly.
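+
+The following sketch shows what a durable write looks like from the JVM, using
+a hypothetical helper rather than the code {es} actually runs. On Linux,
+`FileChannel#force` is implemented using the `fsync()` system call, and once it
+returns successfully the storage system has promised that the data will survive
+a power loss.
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+
+public class DurableWrite {
+
+    // Writes a buffer and then asks the operating system to make it durable.
+    static void writeDurably(Path file, byte[] data) throws IOException {
+        try (FileChannel channel = FileChannel.open(file,
+                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
+            ByteBuffer buffer = ByteBuffer.wrap(data);
+            while (buffer.hasRemaining()) {
+                channel.write(buffer);
+            }
+            // force(true) does not return until the data and metadata are durable.
+            channel.force(true);
+        }
+    }
+}
+----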
+
+There are many possible explanations for {es} detecting corruption in your
+cluster. Databases like {es} generate a challenging I/O workload that may find
+subtle infrastructural problems which other tests may miss. {es} is known to
+expose the following problems as file corruptions:
+
+- Filesystem bugs, especially in newer and nonstandard filesystems which might
+  not have seen enough real-world production usage to be confident that they
+  work correctly.
+
+- https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux[Kernel bugs].
+
+- Bugs in firmware running on the drive or RAID controller.
+
+- Incorrect configuration, for instance configuring `fsync()` to report success
+  before all durable writes have completed.
+
+- Faulty hardware, which may include the drive itself, the RAID controller,
+  your RAM or CPU.
+
+Data corruption typically doesn't result in other evidence of problems apart
+from the checksum mismatch. Do not interpret this as an indication that your
+storage subsystem is working correctly and therefore that {es} itself caused
+the corruption. It is rare for faulty storage to show any evidence of problems
+apart from the data corruption, but data corruption itself is a very strong
+indicator that your storage subsystem is not working correctly.
+
+To rule out {es} as the source of data corruption, generate an I/O workload
+using something other than {es} and look for data integrity errors. On Linux
+the `fio` and `stress-ng` tools can both generate challenging I/O workloads and
+verify the integrity of the data they write. Use version 0.12.01 or newer of
+`stress-ng` since earlier versions do not have strong enough integrity checks.
+You can check that durable writes persist across power outages using a script
+such as https://gist.github.com/bradfitz/3172656[`diskchecker.pl`].
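+
+The following toy program illustrates the shape of such a test: it writes files
+of pseudo-random data and reads them back to check every byte. It is only a
+sketch: without direct I/O, heavy concurrency, or a power cycle it mostly
+exercises the filesystem cache, which is why purpose-built tools such as `fio`,
+`stress-ng` and `diskchecker.pl` exist.
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.Arrays;
+import java.util.Random;
+
+public class ToyIntegrityTest {
+
+    // Repeatedly writes files of pseudo-random data, reads them back, and
+    // checks every byte. Any mismatch means the storage returned different
+    // data from the data that was written.
+    public static void main(String[] args) throws IOException {
+        Path dir = Files.createTempDirectory("integrity-test");
+        Random random = new Random(0); // fixed seed so failures are reproducible
+        for (int i = 0; i < 100; i++) {
+            byte[] expected = new byte[1 << 20];
+            random.nextBytes(expected);
+            Path file = dir.resolve("file-" + i);
+            Files.write(file, expected);
+            byte[] actual = Files.readAllBytes(file);
+            if (Arrays.equals(expected, actual) == false) {
+                throw new IOException("data read from " + file + " differs from the data written");
+            }
+        }
+        System.out.println("no mismatches detected");
+    }
+}
+----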
+
+To narrow down the source of the corruptions, systematically change components
+in your cluster's environment until the corruptions stop. The details will
+depend on the exact configuration of your hardware, but may include the
+following:
+
+- Try a different filesystem or a different kernel.
+
+- Try changing each hardware component in turn, ideally changing to a different
+  model or manufacturer.
+
+- Try different firmware versions for each hardware component.