@@ -0,0 +1,92 @@
+[[corruption-troubleshooting]]
+== Troubleshooting corruption
+
+{es} expects that the data it reads from disk is exactly the data it previously
+wrote. If it detects that the data on disk is different from what it wrote then
+it will report some kind of exception such as:
+
+- `org.apache.lucene.index.CorruptIndexException`
+- `org.elasticsearch.gateway.CorruptStateException`
+- `org.elasticsearch.index.translog.TranslogCorruptedException`
+
+Typically these exceptions happen due to a checksum mismatch. Most of the data
+that {es} writes to disk is followed by a checksum using a simple algorithm
+known as CRC32, which is fast to compute and good at detecting the kinds of
+random corruption that may happen when using faulty storage. A CRC32 checksum
+mismatch definitely indicates that something is faulty, although of course a
+matching checksum doesn't prove the absence of corruption.
+
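+As an illustration only, the following sketch uses the standard
+`java.util.zip.CRC32` class to show the principle: the checksum recorded at
+write time no longer matches once a single bit of the data changes. It is not
+{es}'s actual checksumming code.
+
+[source,java]
+----
+import java.util.zip.CRC32;
+
+public class ChecksumDemo {
+    // Compute the CRC32 checksum of the given bytes.
+    static long crc32(byte[] data) {
+        CRC32 crc = new CRC32();
+        crc.update(data);
+        return crc.getValue();
+    }
+
+    public static void main(String[] args) {
+        byte[] data = "data written by the application".getBytes();
+        long stored = crc32(data); // checksum recorded when the data was written
+
+        data[0] ^= 0x01; // simulate a single flipped bit on faulty storage
+
+        // On read, the recomputed checksum no longer matches the stored one.
+        System.out.println(stored == crc32(data)
+            ? "checksum OK"
+            : "checksum mismatch: the data differs from what was written");
+    }
+}
+----
+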
+Verifying a checksum is expensive since it involves reading every byte of the
+file, which takes significant effort and might evict more useful data from the
+filesystem cache, so systems typically don't verify the checksum on a file very
+often. This is why you tend to encounter a corruption exception only when
+something unusual is happening. For instance, corruptions are often detected
+during merges, shard movements, and snapshots. This does not mean that these
+processes are causing corruption: they are examples of the rare times where
+reading a whole file is necessary. {es} takes the opportunity to verify the
+checksum at the same time, and this is when the corruption is detected and
+reported. It doesn't indicate the cause of the corruption or when it happened.
+Corruptions can remain undetected for many months.
+
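+For example, Lucene exposes this whole-file verification directly. The sketch
+below assumes a Lucene dependency on the classpath and uses a hypothetical
+index path and file name; `CodecUtil.checksumEntireFile` reads every byte of
+the file and throws `CorruptIndexException` if the recomputed checksum does not
+match the one stored in the file's footer.
+
+[source,java]
+----
+import java.nio.file.Path;
+
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.index.CorruptIndexException;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.store.IOContext;
+import org.apache.lucene.store.IndexInput;
+
+public class VerifyFileDemo {
+    public static void main(String[] args) throws Exception {
+        // Hypothetical index directory and file name.
+        try (Directory dir = FSDirectory.open(Path.of("/path/to/index"));
+             IndexInput in = dir.openInput("_0.cfs", IOContext.DEFAULT)) {
+            // Reads every byte of the file and compares the recomputed CRC32
+            // against the checksum stored in the file's footer.
+            CodecUtil.checksumEntireFile(in);
+            System.out.println("checksum OK");
+        } catch (CorruptIndexException e) {
+            System.out.println("corruption detected: " + e.getMessage());
+        }
+    }
+}
+----
+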
+The files that make up a Lucene index are written sequentially from start to
+end and then never modified or overwritten. This access pattern means the
+checksum computation is very simple and can happen on-the-fly as the file is
+initially written, and also makes it very unlikely that an incorrect checksum
+is due to a userspace bug at the time the file was written. The routine that
+computes the checksum is straightforward, widely used, and very well-tested, so
+you can be very confident that a checksum mismatch really does indicate that
+the data read from disk is different from the data that {es} previously wrote.
+
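+As a sketch of this write-time pattern (using the standard
+`java.util.zip.CheckedOutputStream` rather than the actual Lucene code), the
+checksum can accumulate on the fly while the file is written sequentially and
+then be recorded at the end of the file:
+
+[source,java]
+----
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.CRC32;
+import java.util.zip.CheckedOutputStream;
+
+public class StreamingChecksumDemo {
+    public static void main(String[] args) throws IOException {
+        Path file = Path.of("example-segment.dat"); // hypothetical file name
+        CRC32 crc = new CRC32();
+        try (CheckedOutputStream checked =
+                 new CheckedOutputStream(Files.newOutputStream(file), crc);
+             DataOutputStream out = new DataOutputStream(checked)) {
+            // Write the file sequentially; the checksum accumulates as we go.
+            for (int i = 0; i < 1000; i++) {
+                out.writeInt(i);
+            }
+            // Record the accumulated checksum at the end of the file so that a
+            // later full read can recompute it and compare.
+            out.writeLong(crc.getValue());
+        }
+    }
+}
+----
+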
+The files that make up a Lucene index are written in full before they are used.
+If a file is needed to recover an index after a restart then your storage
+system previously confirmed to {es} that this file was durably synced to disk.
+On Linux this means that the `fsync()` system call returned successfully. {es}
+sometimes detects that an index is corrupt because a file needed for recovery
+has been truncated or is missing its footer. This indicates that your storage
+system acknowledges durable writes incorrectly.
+
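+On the JVM the durable sync is requested through `FileChannel#force`, which
+corresponds to `fsync()` on Linux. The sketch below is a simplified
+illustration of the write-then-sync step rather than {es}'s recovery code; once
+`force(true)` returns successfully, the file's contents are expected to survive
+a crash or power loss:
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+
+public class DurableWriteDemo {
+    public static void main(String[] args) throws IOException {
+        Path file = Path.of("example-recovery.dat"); // hypothetical file name
+        try (FileChannel channel = FileChannel.open(file,
+                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
+            channel.write(ByteBuffer.wrap(
+                "must survive a crash".getBytes(StandardCharsets.UTF_8)));
+            // force(true) asks the OS to durably sync data and metadata, i.e.
+            // fsync() on Linux. Once this call returns, a truncated or
+            // footer-less file after a restart means the storage system broke
+            // its durability promise.
+            channel.force(true);
+        }
+    }
+}
+----
+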
+There are many possible explanations for {es} detecting corruption in your
+cluster. Databases like {es} generate a challenging I/O workload that may find
+subtle infrastructural problems which other tests may miss. {es} is known to
+expose the following problems as file corruptions:
+
+- Filesystem bugs, especially in newer and nonstandard filesystems which might
+  not have seen enough real-world production usage to be confident that they
+  work correctly.
+
+- https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux[Kernel bugs].
+
+- Bugs in firmware running on the drive or RAID controller.
+
+- Incorrect configuration, for instance configuring `fsync()` to report success
+  before all durable writes have completed.
+
+- Faulty hardware, which may include the drive itself, the RAID controller,
+  your RAM or CPU.
+
+Data corruption typically doesn't result in other evidence of problems apart
+from the checksum mismatch. Do not interpret this as an indication that your
+storage subsystem is working correctly and therefore that {es} itself caused
+the corruption. It is rare for faulty storage to show any evidence of problems
+apart from the data corruption, but data corruption itself is a very strong
+indicator that your storage subsystem is not working correctly.
+
+To rule out {es} as the source of data corruption, generate an I/O workload
+using something other than {es} and look for data integrity errors. On Linux
+the `fio` and `stress-ng` tools can both generate challenging I/O workloads and
+verify the integrity of the data they write. Use version 0.12.01 or newer of
+`stress-ng` since earlier versions do not have strong enough integrity checks.
+You can check that durable writes persist across power outages using a script
+such as https://gist.github.com/bradfitz/3172656[`diskchecker.pl`].
+
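+The essence of what these tools check is a write-then-read-back comparison. The
+following sketch shows that pattern in a much simplified form; it is not a
+substitute for `fio`, `stress-ng`, or `diskchecker.pl`, and the file name is
+just an example:
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+import java.util.Arrays;
+import java.util.Random;
+
+public class WriteVerifyDemo {
+    public static void main(String[] args) throws IOException {
+        Path file = Path.of("integrity-test.dat"); // hypothetical test file
+        byte[] pattern = new byte[1 << 20];        // 1 MiB of pseudo-random data
+        new Random(42).nextBytes(pattern);         // fixed seed, reproducible
+
+        // Write the pattern and ask the storage system to make it durable.
+        try (FileChannel channel = FileChannel.open(file,
+                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
+            ByteBuffer buf = ByteBuffer.wrap(pattern);
+            while (buf.hasRemaining()) {
+                channel.write(buf);
+            }
+            channel.force(true);
+        }
+
+        // Read it back and compare: any difference means the storage subsystem
+        // returned data other than what was written.
+        byte[] readBack = Files.readAllBytes(file);
+        System.out.println(Arrays.equals(pattern, readBack)
+            ? "data intact"
+            : "integrity error: read data differs from written data");
+    }
+}
+----
+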
+To narrow down the source of the corruptions, systematically change components
+in your cluster's environment until the corruptions stop. The details will
+depend on the exact configuration of your hardware, but may include the
+following:
+
+- Try a different filesystem or a different kernel.
+
+- Try changing each hardware component in turn, ideally changing to a different
+  model or manufacturer.
+
+- Try different firmware versions for each hardware component.