[role="xpack"]
[[index-lifecycle-error-handling]]
== Troubleshooting {ilm} errors

When {ilm-init} executes a lifecycle policy, errors can occur
while performing the necessary index operations for a step.
When this happens, {ilm-init} moves the index to an `ERROR` step.
If {ilm-init} cannot resolve the error automatically, execution is halted
until you resolve the underlying issues with the policy, index, or cluster.

For example, you might have a `shrink-index` policy that shrinks an index to four shards once it
is at least five days old:

[source,console]
--------------------------------------------------
PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 4
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST

Nothing prevents you from applying the `shrink-index` policy to a new
index that has only two shards:

[source,console]
--------------------------------------------------
PUT /my-index-000001
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index"
  }
}
--------------------------------------------------
// TEST[continued]

After five days, {ilm-init} attempts to shrink `my-index-000001` from two shards to four shards.
Because the shrink action cannot _increase_ the number of shards, this operation fails
and {ilm-init} moves `my-index-000001` to the `ERROR` step.

You can use the <<ilm-explain-lifecycle,{ilm-init} Explain API>> to get information about
what went wrong:

[source,console]
--------------------------------------------------
GET /my-index-000001/_ilm/explain
--------------------------------------------------
// TEST[continued]

Which returns the following information:

[source,console-result]
--------------------------------------------------
{
  "indices" : {
    "my-index-000001" : {
      "index" : "my-index-000001",
      "managed" : true,
      "policy" : "shrink-index",                <1>
      "lifecycle_date_millis" : 1541717265865,
      "age": "5.1d",                            <2>
      "phase" : "warm",                         <3>
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",                      <4>
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",                         <5>
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",                 <6>
      "step_info" : {
        "type" : "illegal_argument_exception",  <7>
        "reason" : "the number of target shards [4] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "shrink-index",
        "phase_definition" : {                  <8>
          "min_age" : "5d",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 4
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[skip:no way to know if we will get this response immediately]

<1> The policy being used to manage the index: `shrink-index`
<2> The index age: 5.1 days
<3> The phase the index is currently in: `warm`
<4> The current action: `shrink`
<5> The step the index is currently in: `ERROR`
<6> The step that failed to execute: `shrink`
<7> The type of error and a description of that error
<8> The definition of the current phase from the `shrink-index` policy

To resolve this, you could update the policy to shrink the index to a single shard after five days:

[source,console]
--------------------------------------------------
PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]

[discrete]
=== Retrying failed lifecycle policy steps

Once you fix the problem that put an index in the `ERROR` step,
you might need to explicitly tell {ilm-init} to retry the step:

[source,console]
--------------------------------------------------
POST /my-index-000001/_ilm/retry
--------------------------------------------------
// TEST[skip:we can't be sure the index is ready to be retried at this point]

{ilm-init} subsequently attempts to re-run the step that failed.
You can use the <<ilm-explain-lifecycle,{ilm-init} Explain API>> to monitor the progress.

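If you are not sure which indices are stuck, the explain API can be narrowed to indices in an error state.
The following is a minimal sketch; the `*` target is only an illustration, and it assumes your version
supports the `only_errors` query parameter:

[source,console]
--------------------------------------------------
GET /*/_ilm/explain?only_errors=true <1>
--------------------------------------------------
// TEST[skip:illustrative example]

<1> Restricts the response to managed indices that are in the `ERROR` step or that reference a missing policy.
Replace `*` with a narrower index pattern if you only care about a subset of indices.
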
[discrete]
=== Common {ilm-init} errors

Here's how to resolve the most common errors reported in the `ERROR` step.

TIP: Problems with rollover aliases are a common cause of errors.
Consider using <<data-streams, data streams>> instead of managing rollover with aliases.

[discrete]
==== Rollover alias [x] can point to multiple indices, found duplicated alias [x] in index template [z]

The target rollover alias is specified in an index template's `index.lifecycle.rollover_alias` setting.
You need to explicitly configure this alias _one time_ when you
<<ilm-gs-alias-bootstrap, bootstrap the initial index>>.
The rollover action then manages setting and updating the alias to
<<rollover-index-api-desc, roll over>> to each subsequent index.

Do not explicitly configure this same alias in the aliases section of an index template.

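As a rough sketch of the one-time bootstrap step, the request below creates the first index and
designates it as the write index for the alias.
The index name `my-rollover-index-000001` and the alias `my-alias` are hypothetical placeholders,
not names taken from your cluster:

[source,console]
--------------------------------------------------
PUT /my-rollover-index-000001 <1>
{
  "aliases": {
    "my-alias": {             <2>
      "is_write_index": true
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]

<1> Hypothetical index name. It ends in digits so that rollover can increment it.
<2> Hypothetical alias. It should match the value of `index.lifecycle.rollover_alias` in the index template,
and it should not also appear in the template's aliases section.
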
[discrete]
==== index.lifecycle.rollover_alias [x] does not point to index [y]

Either the index is using the wrong alias or the alias does not exist.

Check the `index.lifecycle.rollover_alias` <<indices-get-settings, index setting>>.
To see what aliases are configured, use <<cat-alias, _cat/aliases>>.

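The two checks above might look like the following, using the `my-index-000001` index from the earlier example
as a stand-in for the index named in the error:

[source,console]
--------------------------------------------------
GET /my-index-000001/_settings/index.lifecycle.rollover_alias <1>

GET _cat/aliases?v=true <2>
--------------------------------------------------
// TEST[skip:illustrative example]

<1> Returns only the `index.lifecycle.rollover_alias` setting for the index, if it is set.
<2> Lists the configured aliases and the indices they point to. The `v` parameter adds column headings.
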
[discrete]
==== Setting [index.lifecycle.rollover_alias] for index [y] is empty or not defined

The `index.lifecycle.rollover_alias` setting must be configured for the rollover action to work.

Update the index settings to set `index.lifecycle.rollover_alias`.

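A minimal sketch of that update, assuming the affected index is `my-index-000001` and that a
hypothetical alias named `my-alias` already points to it:

[source,console]
--------------------------------------------------
PUT /my-index-000001/_settings
{
  "index.lifecycle.rollover_alias": "my-alias" <1>
}
--------------------------------------------------
// TEST[skip:illustrative example]

<1> `my-alias` is a placeholder. Use the alias that actually points to the index.
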
[discrete]
==== Alias [x] has more than one write index [y,z]

Only one index can be designated as the write index for a particular alias.

Use the <<indices-aliases, aliases>> API to set `is_write_index:false` for all but one index.

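For illustration, assume a hypothetical alias `my-alias` that currently marks both `my-index-000001`
and `my-index-000002` as write indices. One way to leave only the newer index as the write index is:

[source,console]
--------------------------------------------------
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-index-000001",
        "alias": "my-alias",
        "is_write_index": false <1>
      }
    },
    {
      "add": {
        "index": "my-index-000002",
        "alias": "my-alias",
        "is_write_index": true  <2>
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative example]

<1> The older index keeps the alias but no longer accepts writes through it.
<2> The index that should receive new writes. All names here are placeholders.

Both actions are applied atomically, so the alias is never left without a write index.
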
[discrete]
==== index name [x] does not match pattern ^.*-\d+

The index name must match the regex pattern `^.*-\d+` for the rollover action to work.
The most common problem is that the index name does not contain trailing digits.
For example, `my-index` does not match the pattern requirement.

Append a numeric value to the index name, for example `my-index-000001`.

[discrete]
==== CircuitBreakingException: [x] data too large, data for [y]

This indicates that the cluster is hitting resource limits.

Before continuing to set up {ilm-init}, you'll need to take steps to alleviate the resource issues.
For more information, see <<circuit-breaker-errors>>.

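As a starting point for the investigation, the node stats API can report circuit breaker usage per node.
This is only a sketch of one possible first check:

[source,console]
--------------------------------------------------
GET _nodes/stats/breaker <1>
--------------------------------------------------
// TEST[skip:illustrative example]

<1> For each node, returns the configured limit, the estimated memory in use, and the number of times
each circuit breaker has tripped.
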
[discrete]
==== High disk watermark [x] exceeded on [y]

This indicates that the cluster is running out of disk space.
This can happen when you don't have {ilm} set up to roll over from hot to warm nodes.

Consider adding nodes, upgrading your hardware, or deleting unneeded indices.

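
To see which nodes are close to the watermark before deciding what to remove, you could check
per-node disk usage. A minimal sketch:

[source,console]
--------------------------------------------------
GET _cat/allocation?v=true <1>
--------------------------------------------------
// TEST[skip:illustrative example]

<1> Shows, for each data node, the number of shards allocated to it along with disk used, disk available,
and the percentage of disk in use.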