[role="xpack"]
[testenv="basic"]
[[index-lifecycle-error-handling]]
== Troubleshooting {ilm} errors

When {ilm-init} executes a lifecycle policy, errors can occur
while it performs the index operations required for a step.
When this happens, {ilm-init} moves the index to an `ERROR` step.
If {ilm-init} cannot resolve the error automatically, execution is halted
until you resolve the underlying issues with the policy, index, or cluster.

For example, you might have a `shrink-index` policy that shrinks an index to four shards once it
is at least five days old:

[source,console]
--------------------------------------------------
PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 4
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST

Nothing prevents you from applying the `shrink-index` policy to a new
index that has only two shards:

[source,console]
--------------------------------------------------
PUT /my-index-000001
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index"
  }
}
--------------------------------------------------
// TEST[continued]

After five days, {ilm-init} attempts to shrink `my-index-000001` from two shards to four shards.
Because the shrink action cannot _increase_ the number of shards, this operation fails
and {ilm-init} moves `my-index-000001` to the `ERROR` step.

You can use the <<ilm-explain-lifecycle,{ilm-init} Explain API>> to get information about
what went wrong:

[source,console]
--------------------------------------------------
GET /my-index-000001/_ilm/explain
--------------------------------------------------
// TEST[continued]

The request returns the following information:

[source,console-result]
--------------------------------------------------
{
  "indices" : {
    "my-index-000001" : {
      "index" : "my-index-000001",
      "managed" : true,
      "policy" : "shrink-index", <1>
      "lifecycle_date_millis" : 1541717265865,
      "age": "5.1d", <2>
      "phase" : "warm", <3>
      "phase_time_millis" : 1541717272601,
      "action" : "shrink", <4>
      "action_time_millis" : 1541717272601,
      "step" : "ERROR", <5>
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink", <6>
      "step_info" : {
        "type" : "illegal_argument_exception", <7>
        "reason" : "the number of target shards [4] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "shrink-index",
        "phase_definition" : { <8>
          "min_age" : "5d",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 4
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[skip:no way to know if we will get this response immediately]

<1> The policy being used to manage the index: `shrink-index`
<2> The index age: 5.1 days
<3> The phase the index is currently in: `warm`
<4> The current action: `shrink`
<5> The step the index is currently in: `ERROR`
<6> The step that failed to execute: `shrink`
<7> The type of error and a description of that error.
<8> The definition of the current phase from the `shrink-index` policy

To resolve this, you could update the policy to shrink the index to a single shard after five days:

[source,console]
--------------------------------------------------
PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]

[discrete]
=== Retrying failed lifecycle policy steps

Once you fix the problem that put an index in the `ERROR` step,
you might need to explicitly tell {ilm-init} to retry the step:

[source,console]
--------------------------------------------------
POST /my-index-000001/_ilm/retry
--------------------------------------------------
// TEST[skip:we can't be sure the index is ready to be retried at this point]

{ilm-init} subsequently attempts to re-run the step that failed.
You can use the <<ilm-explain-lifecycle,{ilm-init} Explain API>> to monitor the progress.
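
If several indices share a policy, it can help to check them all at once.
The following request is a minimal sketch that assumes your indices match the hypothetical pattern `my-index-*`.
The `only_errors` query parameter limits the response to managed indices that are in an error state:

[source,console]
--------------------------------------------------
GET /my-index-*/_ilm/explain?only_errors=true
--------------------------------------------------

An index that no longer appears in this response has left the `ERROR` step.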

[discrete]
=== Common {ilm-init} errors

Here's how to resolve the most common errors reported in the `ERROR` step.

TIP: Problems with rollover aliases are a common cause of errors.
Consider using <<data-streams, data streams>> instead of managing rollover with aliases.

[discrete]
==== Rollover alias [x] can point to multiple indices, found duplicated alias [x] in index template [z]

The target rollover alias is specified in an index template's `index.lifecycle.rollover_alias` setting.
You need to explicitly configure this alias _one time_ when you
<<ilm-gs-alias-bootstrap, bootstrap the initial index>>.
The rollover action then manages setting and updating the alias to
<<rollover-index-api-desc, roll over>> to each subsequent index.

Do not explicitly configure this same alias in the aliases section of an index template.
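
As a sketch of the one-time bootstrap step, the following request creates the initial index and designates it as the write index for the alias.
The alias name `my-alias` is a placeholder, not a value from your cluster; substitute the alias named in your template's `index.lifecycle.rollover_alias` setting:

[source,console]
--------------------------------------------------
PUT my-index-000001
{
  "aliases": {
    "my-alias": {
      "is_write_index": true
    }
  }
}
--------------------------------------------------

Because the alias is attached here, on the index itself, the index template only needs the `index.lifecycle.rollover_alias` setting and no `aliases` entry for it.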

[discrete]
==== index.lifecycle.rollover_alias [x] does not point to index [y]

Either the index is using the wrong alias or the alias does not exist.

Check the `index.lifecycle.rollover_alias` <<indices-get-settings, index setting>>.
To see what aliases are configured, use <<cat-alias, _cat/aliases>>.
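
For example, the two checks might look like the following requests.
The index name `my-index-000001` and the alias name `my-alias` are placeholders for your own names:

[source,console]
--------------------------------------------------
GET /my-index-000001/_settings/index.lifecycle.rollover_alias

GET _cat/aliases/my-alias?v=true
--------------------------------------------------

Compare the alias returned by the first request with the indices listed by the second.
If the index is missing from the list, the setting and the alias are out of sync.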

[discrete]
==== Setting [index.lifecycle.rollover_alias] for index [y] is empty or not defined

The `index.lifecycle.rollover_alias` setting must be configured for the rollover action to work.

Update the index settings to set `index.lifecycle.rollover_alias`.
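
A minimal sketch of that update, assuming the index is `my-index-000001` and the rollover alias is the placeholder `my-alias`:

[source,console]
--------------------------------------------------
PUT /my-index-000001/_settings
{
  "index.lifecycle.rollover_alias": "my-alias"
}
--------------------------------------------------

The alias itself must also exist and point to the index; changing the setting does not create it.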

[discrete]
==== Alias [x] has more than one write index [y,z]

Only one index can be designated as the write index for a particular alias.

Use the <<indices-aliases, aliases>> API to set `is_write_index:false` for all but one index.
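
For illustration, the following request assumes two indices, `my-index-000001` and `my-index-000002`, both attached to the placeholder alias `my-alias`.
It keeps the newer index as the write index and clears the flag on the older one:

[source,console]
--------------------------------------------------
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-index-000001",
        "alias": "my-alias",
        "is_write_index": false
      }
    },
    {
      "add": {
        "index": "my-index-000002",
        "alias": "my-alias",
        "is_write_index": true
      }
    }
  ]
}
--------------------------------------------------

Which index should remain the write index depends on your data; the choice shown here is only an assumption.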

[discrete]
==== index name [x] does not match pattern ^.*-\d+

The index name must match the regex pattern `^.*-\d+` for the rollover action to work.
The most common problem is that the index name does not contain trailing digits.
For example, `my-index` does not match the pattern requirement.

Append a numeric value to the index name, for example `my-index-000001`.

[discrete]
==== CircuitBreakingException: [x] data too large, data for [y]

This indicates that the cluster is hitting resource limits.

Before continuing to set up {ilm-init}, you'll need to take steps to alleviate the resource issues.
For more information, see <<circuit-breaker-errors>>.
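
One way to see which circuit breaker is under pressure is to request the breaker section of the node stats.
This is a starting point for diagnosis rather than a fix:

[source,console]
--------------------------------------------------
GET _nodes/stats/breaker
--------------------------------------------------

For each node, the response reports the configured limit, the estimated memory in use, and how many times each breaker has tripped.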

[discrete]
==== High disk watermark [x] exceeded on [y]

This indicates that the cluster is running out of disk space.
This can happen when you don't have {ilm} set up to roll over from hot to warm nodes.

Consider adding nodes, upgrading your hardware, or deleting unneeded indices.
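
To see how disk usage is distributed before deciding which option to take, you can list the shard allocation and disk usage for each node:

[source,console]
--------------------------------------------------
GET _cat/allocation?v=true
--------------------------------------------------

The `disk.percent` and `disk.avail` columns show which nodes are closest to the watermark.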