[[corruption-troubleshooting]]
== Troubleshooting corruption

{es} expects that the data it reads from disk is exactly the data it previously
wrote. If it detects that the data on disk is different from what it wrote then
it will report some kind of exception such as:

- `org.apache.lucene.index.CorruptIndexException`
- `org.elasticsearch.gateway.CorruptStateException`
- `org.elasticsearch.index.translog.TranslogCorruptedException`

Typically these exceptions happen due to a checksum mismatch. Most of the data
that {es} writes to disk is followed by a checksum computed using a simple
algorithm known as CRC32, which is fast to compute and good at detecting the
kinds of random corruption that may happen when using faulty storage. A CRC32
checksum mismatch definitely indicates that something is faulty, although of
course a matching checksum doesn't prove the absence of corruption.
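
The sensitivity of CRC32 to small changes can be illustrated with Python's
standard `zlib` module. This is only an illustration: {es} uses Lucene's own
Java CRC32 implementation, not this code.

[source,python]
----
import zlib

# Compute the CRC32 checksum of some data.
data = bytearray(b"the quick brown fox jumps over the lazy dog")
original_checksum = zlib.crc32(data)

# Simulate the kind of random corruption faulty storage can introduce:
# flip a single bit somewhere in the middle of the data.
data[20] ^= 0x01

# The checksum no longer matches, so the corruption is detected.
corrupted_checksum = zlib.crc32(data)
print(original_checksum != corrupted_checksum)  # True
----
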
Verifying a checksum is expensive since it involves reading every byte of the
file, which takes significant effort and might evict more useful data from the
filesystem cache, so systems typically don't verify the checksum on a file very
often. This is why you tend only to encounter a corruption exception when
something unusual is happening. For instance, corruptions are often detected
during merges, shard movements, and snapshots. This does not mean that these
processes are causing corruption: they are examples of the rare times when
reading a whole file is necessary, and {es} takes the opportunity to verify the
checksum at the same time. The moment of detection tells you nothing about the
cause of the corruption or when it happened: corruptions can remain undetected
for many months.

The files that make up a Lucene index are written sequentially from start to
end and then never modified or overwritten. This access pattern means the
checksum computation is very simple and can happen on-the-fly as the file is
initially written, and also makes it very unlikely that an incorrect checksum
is due to a userspace bug at the time the file was written. The routine that
computes the checksum is straightforward, widely used, and very well-tested, so
you can be very confident that a checksum mismatch really does indicate that
the data read from disk is different from the data that {es} previously wrote.
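
This append-only pattern can be sketched in Python: compute the checksum
incrementally while writing, append it as a footer, and verify it later by
re-reading the whole file. This is a deliberately simplified illustration
with a placeholder file name; Lucene's actual file format and footer layout
are more involved.

[source,python]
----
import struct
import zlib

def write_with_footer(path, payload):
    """Write payload sequentially, computing CRC32 on the fly,
    then append the checksum as a 4-byte big-endian footer."""
    checksum = zlib.crc32(payload)
    with open(path, "wb") as f:
        f.write(payload)
        f.write(struct.pack(">I", checksum))

def verify_footer(path):
    """Re-read the whole file and check the stored footer.
    Returns True only if the data matches the checksum."""
    with open(path, "rb") as f:
        blob = f.read()
    if len(blob) < 4:
        return False  # truncated: the footer is missing entirely
    payload, footer = blob[:-4], blob[-4:]
    (stored,) = struct.unpack(">I", footer)
    return zlib.crc32(payload) == stored

# Placeholder path, for illustration only.
write_with_footer("/tmp/example.bin", b"some index data")
print(verify_footer("/tmp/example.bin"))  # True
----

Because the file is never modified after it is written, the reader can treat
any footer mismatch, or a missing footer, as proof that the bytes on disk are
not the bytes that were written.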

The files that make up a Lucene index are written in full before they are used.
If a file is needed to recover an index after a restart then your storage
system previously confirmed to {es} that this file was durably synced to disk.
On Linux this means that the `fsync()` system call returned successfully. {es}
sometimes detects that an index is corrupt because a file needed for recovery
has been truncated or is missing its footer. This indicates that your storage
system acknowledges durable writes incorrectly.
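
On Linux, durably persisting a newly created file involves syncing both the
file's contents and its parent directory entry, which can be sketched as
follows. This is an illustration of the system calls involved, not how {es}
itself manages its files, and the path is a placeholder.

[source,python]
----
import os

def durable_write(path, payload):
    """Write a file and ask the OS to confirm it is durably on disk."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to persist the file data
    # Also sync the parent directory so the new directory entry survives a
    # power loss; without this the file's data may be durable but the file
    # itself unreachable after a crash.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

durable_write("/tmp/translog.bin", b"operation log entry")
----

If the storage stack reports these `fsync()` calls as successful before the
data is truly durable, then after a power loss a file can turn out to be
truncated or absent, exactly as described above.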

There are many possible explanations for {es} detecting corruption in your
cluster. Databases like {es} generate a challenging I/O workload that may find
subtle infrastructural problems which other tests may miss. {es} is known to
expose the following problems as file corruptions:

- Filesystem bugs, especially in newer and nonstandard filesystems which might
not have seen enough real-world production usage to be confident that they
work correctly.
- https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux[Kernel bugs].
- Bugs in firmware running on the drive or RAID controller.
- Incorrect configuration, for instance configuring `fsync()` to report success
before all durable writes have completed.
- Faulty hardware, which may include the drive itself, the RAID controller,
your RAM or CPU.

Data corruption typically doesn't result in other evidence of problems apart
from the checksum mismatch. Do not interpret this as an indication that your
storage subsystem is working correctly and therefore that {es} itself caused
the corruption. It is rare for faulty storage to show any evidence of problems
apart from the data corruption, but data corruption itself is a very strong
indicator that your storage subsystem is not working correctly.

To rule out {es} as the source of data corruption, generate an I/O workload
using something other than {es} and look for data integrity errors. On Linux
the `fio` and `stress-ng` tools can both generate challenging I/O workloads and
verify the integrity of the data they write. Use version 0.12.01 or newer of
`stress-ng` since earlier versions do not have strong enough integrity checks.
You can check that durable writes persist across power outages using a script
such as https://gist.github.com/bradfitz/3172656[`diskchecker.pl`].
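
As one hedged example of such a workload, an `fio` job file that writes data
with embedded checksums and then re-reads and verifies it might look like the
following. The file path and sizes are placeholders; check the `fio`
documentation for the options supported by your version.

[source,ini]
----
; verify-job.fio: write a file with embedded CRC32C checksums, then
; re-read it and verify the checksums. Run with: fio verify-job.fio
[global]
; Linux native asynchronous I/O, bypassing the page cache so the
; workload exercises the storage device itself
ioengine=libaio
direct=1
bs=4k
; placeholder size; adjust to suit your disk
size=1g

[write-and-verify]
rw=write
; embed CRC32C checksums in the written data
verify=crc32c
; re-read and verify after the write phase completes
do_verify=1
; placeholder path on the storage under test
filename=/data/fio-testfile
----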

To narrow down the source of the corruptions, systematically change components
in your cluster's environment until the corruptions stop. The details will
depend on the exact configuration of your hardware, but may include the
following:

- Try a different filesystem or a different kernel.
- Try changing each hardware component in turn, ideally changing to a different
model or manufacturer.
- Try different firmware versions for each hardware component.