
Add troubleshooting docs about data corruption (#88760)

Adds some docs giving more detailed background about what data
corruption really means and some suggestions about how to narrow down
the root cause.

Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
David Turner 3 years ago
commit 7103053f03

+ 2 - 0
docs/reference/troubleshooting.asciidoc

@@ -67,3 +67,5 @@ include::monitoring/troubleshooting.asciidoc[]
 include::transform/troubleshooting.asciidoc[leveloffset=+1]
 
 include::../../x-pack/docs/en/watcher/troubleshooting.asciidoc[]
+
+include::troubleshooting/corruption-issues.asciidoc[]

+ 92 - 0
docs/reference/troubleshooting/corruption-issues.asciidoc

@@ -0,0 +1,92 @@
+[[corruption-troubleshooting]]
+== Troubleshooting corruption
+
+{es} expects that the data it reads from disk is exactly the data it previously
+wrote. If it detects that the data on disk is different from what it wrote then
+it will report some kind of exception such as:
+
+- `org.apache.lucene.index.CorruptIndexException`
+- `org.elasticsearch.gateway.CorruptStateException`
+- `org.elasticsearch.index.translog.TranslogCorruptedException`
+
+Typically these exceptions happen due to a checksum mismatch. Most of the data
+that {es} writes to disk is followed by a checksum using a simple algorithm
+known as CRC32, which is fast to compute and good at detecting the kinds of
+random corruption that may happen when using faulty storage. A CRC32 checksum
+mismatch definitely indicates that something is faulty, although of course a
+matching checksum doesn't prove the absence of corruption.
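+
+For illustration only, the following sketch shows the kind of check involved,
+assuming a simplified layout in which the last eight bytes of a file hold a
+CRC32 checksum of everything before them. The class and method names are
+invented for this example; it is not the actual on-disk format or verification
+code used by {es} or Lucene.
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.CRC32;
+
+public class ChecksumCheck {
+
+    // Reads a file whose last eight bytes hold a CRC32 checksum of the
+    // preceding bytes, recomputes the checksum, and compares the two values.
+    static void verify(Path file) throws IOException {
+        byte[] bytes = Files.readAllBytes(file);
+        int dataLength = bytes.length - Long.BYTES;
+        if (dataLength < 0) {
+            throw new IOException("file is truncated: no room for a checksum footer");
+        }
+
+        CRC32 crc = new CRC32();
+        crc.update(bytes, 0, dataLength);
+        long actual = crc.getValue();
+
+        long expected = ByteBuffer.wrap(bytes, dataLength, Long.BYTES).getLong();
+        if (actual != expected) {
+            // {es} reports this situation with an exception such as CorruptIndexException.
+            throw new IOException("checksum mismatch: expected " + expected + " but got " + actual);
+        }
+    }
+}
+----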
+
+Verifying a checksum is expensive since it involves reading every byte of the
+file, which takes significant effort and might evict more useful data from the
+filesystem cache, so systems typically don't verify the checksum on a file very
+often. This is why you tend to encounter a corruption exception only when
+something unusual is happening. For instance, corruptions are often detected
+during merges, shard movements, and snapshots. This does not mean that these
+processes are causing the corruption: they are examples of the rare times when
+reading a whole file is necessary. {es} takes the opportunity to verify the
+checksum at the same time, and this is when the corruption is detected and
+reported. It doesn't indicate the cause of the corruption or when it happened.
+Corruptions can remain undetected for many months.
+
+The files that make up a Lucene index are written sequentially from start to
+end and then never modified or overwritten. This access pattern means the
+checksum computation is very simple and can happen on-the-fly as the file is
+initially written, and also makes it very unlikely that an incorrect checksum
+is due to a userspace bug at the time the file was written. The routine that
+computes the checksum is straightforward, widely used, and very well-tested, so
+you can be very confident that a checksum mismatch really does indicate that
+the data read from disk is different from the data that {es} previously wrote.
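+
+The following sketch illustrates this write pattern using standard JDK classes.
+It is not the code that {es} or Lucene actually runs, and it assumes the same
+simplified layout as above, with the checksum stored in the last eight bytes of
+the file.
+
+[source,java]
+----
+import java.io.IOException;
+import java.io.OutputStream;
+import java.nio.ByteBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.CRC32;
+import java.util.zip.CheckedOutputStream;
+
+public class AppendOnlyWrite {
+
+    // Writes the data sequentially, accumulating the CRC32 as a side effect of
+    // the write itself, then appends the checksum for readers to verify later.
+    static void writeWithChecksum(Path file, byte[] data) throws IOException {
+        try (OutputStream out = Files.newOutputStream(file)) {
+            CheckedOutputStream checked = new CheckedOutputStream(out, new CRC32());
+            checked.write(data); // the checksum accumulates as the bytes stream past
+
+            // The checksum is known as soon as the final data byte is written,
+            // with no need for a second pass over the file.
+            long checksum = checked.getChecksum().getValue();
+            out.write(ByteBuffer.allocate(Long.BYTES).putLong(checksum).array());
+        }
+    }
+}
+----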
+
+The files that make up a Lucene index are written in full before they are used.
+If a file is needed to recover an index after a restart then your storage
+system previously confirmed to {es} that this file was durably synced to disk.
+On Linux this means that the `fsync()` system call returned successfully. {es}
+sometimes detects that an index is corrupt because a file needed for recovery
+has been truncated or is missing its footer. This indicates that your storage
+system acknowledges durable writes incorrectly.
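+
+The following sketch shows what a durable write looks like from the JVM, using
+a hypothetical helper rather than the code {es} actually runs. On Linux,
+`FileChannel#force` is implemented using the `fsync()` system call, and once it
+returns successfully the storage system has promised that the data will survive
+a power loss.
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+
+public class DurableWrite {
+
+    // Writes a buffer and then asks the operating system to make it durable.
+    static void writeDurably(Path file, byte[] data) throws IOException {
+        try (FileChannel channel = FileChannel.open(file,
+                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
+            ByteBuffer buffer = ByteBuffer.wrap(data);
+            while (buffer.hasRemaining()) {
+                channel.write(buffer);
+            }
+            // force(true) does not return until the data and metadata are durable.
+            channel.force(true);
+        }
+    }
+}
+----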
+
+There are many possible explanations for {es} detecting corruption in your
+cluster. Databases like {es} generate a challenging I/O workload that may find
+subtle infrastructural problems which other tests may miss. {es} is known to
+expose the following problems as file corruptions:
+
+- Filesystem bugs, especially in newer and nonstandard filesystems which might
+  not have seen enough real-world production usage to be confident that they
+  work correctly.
+
+- https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux[Kernel bugs].
+
+- Bugs in firmware running on the drive or RAID controller.
+
+- Incorrect configuration, for instance configuring `fsync()` to report success
+  before all durable writes have completed.
+
+- Faulty hardware, which may include the drive itself, the RAID controller,
+  your RAM or CPU.
+
+Data corruption typically doesn't result in other evidence of problems apart
+from the checksum mismatch. Do not interpret this as an indication that your
+storage subsystem is working correctly and therefore that {es} itself caused
+the corruption. It is rare for faulty storage to show any evidence of problems
+apart from the data corruption, but data corruption itself is a very strong
+indicator that your storage subsystem is not working correctly.
+
+To rule out {es} as the source of data corruption, generate an I/O workload
+using something other than {es} and look for data integrity errors. On Linux
+the `fio` and `stress-ng` tools can both generate challenging I/O workloads and
+verify the integrity of the data they write. Use version 0.12.01 or newer of
+`stress-ng` since earlier versions do not have strong enough integrity checks.
+You can check that durable writes persist across power outages using a script
+such as https://gist.github.com/bradfitz/3172656[`diskchecker.pl`].
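+
+The following toy program illustrates the shape of such a test: it writes files
+of pseudo-random data and reads them back to check every byte. It is only a
+sketch: without direct I/O, heavy concurrency, or a power cycle it mostly
+exercises the filesystem cache, which is why purpose-built tools such as `fio`,
+`stress-ng` and `diskchecker.pl` exist.
+
+[source,java]
+----
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.Arrays;
+import java.util.Random;
+
+public class ToyIntegrityTest {
+
+    // Repeatedly writes files of pseudo-random data, reads them back, and
+    // checks every byte. Any mismatch means the storage returned different
+    // data from the data that was written.
+    public static void main(String[] args) throws IOException {
+        Path dir = Files.createTempDirectory("integrity-test");
+        Random random = new Random(0); // fixed seed so failures are reproducible
+        for (int i = 0; i < 100; i++) {
+            byte[] expected = new byte[1 << 20];
+            random.nextBytes(expected);
+            Path file = dir.resolve("file-" + i);
+            Files.write(file, expected);
+            byte[] actual = Files.readAllBytes(file);
+            if (Arrays.equals(expected, actual) == false) {
+                throw new IOException("data read from " + file + " differs from the data written");
+            }
+        }
+        System.out.println("no mismatches detected");
+    }
+}
+----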
+
+To narrow down the source of the corruptions, systematically change components
+in your cluster's environment until the corruptions stop. The details will
+depend on the exact configuration of your hardware, but may include the
+following:
+
+- Try a different filesystem or a different kernel.
+
+- Try changing each hardware component in turn, ideally changing to a different
+  model or manufacturer.
+
+- Try different firmware versions for each hardware component.