Add shards capacity troubleshooting guide (#95208)

Pablo Alcantar Morales 2 years ago
parent
commit
253fe6325d

+ 1 - 1
docs/reference/health/health.asciidoc

@@ -381,7 +381,7 @@ watermark threshold>>.
     The `invocations_since_last_success` key will report a map where the unhealthy policy
    name is the key and its corresponding number of failed invocations is the value.
 
-[[health-api-response-details-shards-usage]]
+[[health-api-response-details-shards-capacity]]
 ===== shards_capacity
 `data`::
 (map) A view with information about the current capacity of shards for data nodes that do not belong to the frozen tier.

+ 108 - 0
docs/reference/tab-widgets/troubleshooting/troubleshooting-shards-capacity-widget.asciidoc

@@ -0,0 +1,108 @@
+
+[discrete]
+=== Cluster is close to reaching the configured maximum number of shards for data nodes.
+
+The <<cluster-max-shards-per-node,`cluster.max_shards_per_node`>> cluster
+setting limits the maximum number of open shards for a cluster, only counting data nodes
+that do not belong to the frozen tier.
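+
+If you want to see the currently configured limits before making any changes, one
+option is to retrieve the cluster settings, including defaults. This is a minimal
+sketch that relies on the standard `include_defaults` and `flat_settings` query
+parameters; look for the `cluster.max_shards_per_node` and
+`cluster.max_shards_per_node.frozen` keys in the response:
+
+[source,console]
+----
+GET _cluster/settings?include_defaults=true&flat_settings=true
+----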
+
+This symptom indicates that action should be taken; otherwise, the creation of new
+indices or the upgrade of the cluster could be blocked.
+
+If you're confident your changes won't destabilize the cluster, you can
+temporarily increase the limit using the <<cluster-update-settings,cluster update settings API>>:
+
+++++
+<div class="tabs" data-tab-group="host">
+  <div role="tablist" aria-label="Troubleshoot shards capacity for non-frozen nodes">
+    <button role="tab"
+            aria-selected="true"
+            aria-controls="cloud-tab-shards-capacity-non-frozen"
+            id="cloud-shards-capacity-non-frozen">
+      Elasticsearch Service
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="self-managed-tab-shards-capacity-non-frozen"
+            id="self-managed-shards-capacity-non-frozen"
+            tabindex="-1">
+      Self-managed
+    </button>
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="cloud-tab-shards-capacity-non-frozen"
+       aria-labelledby="cloud-shards-capacity-non-frozen">
+++++
+
+include::troubleshooting-shards-capacity.asciidoc[tag=non-frozen-nodes-cloud]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="self-managed-tab-shards-capacity-non-frozen"
+       aria-labelledby="self-managed-shards-capacity-non-frozen"
+       hidden="">
+++++
+
+include::troubleshooting-shards-capacity.asciidoc[tag=non-frozen-nodes-self-managed]
+
+++++
+  </div>
+</div>
+++++
+
+[discrete]
+=== Cluster is close to reaching the configured maximum number of shards for frozen nodes.
+
+The <<cluster-max-shards-per-node-frozen,`cluster.max_shards_per_node.frozen`>> cluster
+setting limits the maximum number of open shards for a cluster, only counting data nodes
+that belong to the frozen tier.
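+
+If you only want to look at the frozen-tier numbers, one option is to filter the
+health API response down to the `frozen` details shown in the examples below. This
+is a minimal sketch that relies on the common `filter_path` response-filtering
+parameter:
+
+[source,console]
+----
+GET _health_report/shards_capacity?filter_path=indicators.shards_capacity.details.frozen
+----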
+
+This symptom indicates that action should be taken; otherwise, the creation of new
+indices or the upgrade of the cluster could be blocked.
+
+If you're confident your changes won't destabilize the cluster, you can
+temporarily increase the limit using the <<cluster-update-settings,cluster update settings API>>:
+
+++++
+<div class="tabs" data-tab-group="host">
+  <div role="tablist" aria-label="Troubleshoot shards capacity for frozen nodes">
+    <button role="tab"
+            aria-selected="true"
+            aria-controls="cloud-tab-shards-capacity-frozen"
+            id="cloud-shards-capacity">
+      Elasticsearch Service
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="self-managed-tab-shards-capacity-frozen"
+            id="self-managed-shards-capacity-frozen"
+            tabindex="-1">
+      Self-managed
+    </button>
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="cloud-tab-shards-capacity-frozen"
+       aria-labelledby="cloud-shards-capacity-frozen">
+++++
+
+include::troubleshooting-shards-capacity.asciidoc[tag=frozen-nodes-cloud]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="self-managed-tab-shards-capacity-frozen"
+       aria-labelledby="self-managed-shards-capacity-frozen"
+       hidden="">
+++++
+
+include::troubleshooting-shards-capacity.asciidoc[tag=frozen-nodes-self-managed]
+
+++++
+  </div>
+</div>
+++++

+ 458 - 0
docs/reference/tab-widgets/troubleshooting/troubleshooting-shards-capacity.asciidoc

@@ -0,0 +1,458 @@
+
+// tag::non-frozen-nodes-cloud[]
+
+**Use {kib}**
+
+//tag::kibana-api-ex[]
+. Log in to the {ess-console}[{ecloud} console].
++
+
+. On the **Elasticsearch Service** panel, click the name of your deployment.
++
+
+NOTE: If the name of your deployment is disabled, your {kib} instances might be
+unhealthy, in which case please contact https://support.elastic.co[Elastic Support].
+If your deployment doesn't include {kib}, all you need to do is
+{cloud}/ec-access-kibana.html[enable it first].
+
+. Open your deployment's side navigation menu (located under the Elastic logo in the upper-left corner)
+and go to **Dev Tools > Console**.
++
+[role="screenshot"]
+image::images/kibana-console.png[{kib} Console,align="center"]
++
+. Check the current status of the cluster according to the shards capacity indicator:
++
+[source,console]
+----
+GET _health_report/shards_capacity
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "yellow",
+      "symptom": "Cluster is close to reaching the configured maximum number of shards for data nodes.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1000, <1>
+          "current_used_shards": 988 <2>
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3000,
+          "current_used_shards": 0
+        }
+      },
+      "impacts": [
+        ...
+      ],
+      "diagnosis": [
+        ...
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
++
+<1> Current value of the setting `cluster.max_shards_per_node`
+<2> Current number of open shards across the cluster
++
+. Update the <<cluster-max-shards-per-node,`cluster.max_shards_per_node`>> setting to a higher value:
++
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node": 1200
+  }
+}
+----
++
+This increase should only be temporary. As a long-term solution, we recommend
+you add nodes to the oversharded data tier or
+<<reduce-cluster-shard-count,reduce your cluster's shard count>> on nodes that do not belong
+to the frozen tier.
+
+. To verify that the change has fixed the issue, you can get the current
+status of the `shards_capacity` indicator by checking the `data` section of the
+<<health-api-example,health API>>:
++
+[source,console]
+----
+GET _health_report/shards_capacity
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "green",
+      "symptom": "The cluster has enough room to add new shards.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1000
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3000
+        }
+      }
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
+
+. When a long-term solution is in place, we recommend you reset the
+`cluster.max_shards_per_node` limit.
++
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node": null
+  }
+}
+----
+
+// end::non-frozen-nodes-cloud[]
+
+// tag::non-frozen-nodes-self-managed[]
+
+Check the current status of the cluster according to the shards capacity indicator:
+
+[source,console]
+----
+GET _health_report/shards_capacity
+----
+
+The response will look like this:
+
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "yellow",
+      "symptom": "Cluster is close to reaching the configured maximum number of shards for data nodes.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1000, <1>
+          "current_used_shards": 988 <2>
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3000
+        }
+      },
+      "impacts": [
+        ...
+      ],
+      "diagnosis": [
+        ...
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
+<1> Current value of the setting `cluster.max_shards_per_node`
+<2> Current number of open shards across the cluster
+
+Using the <<cluster-update-settings,cluster update settings API>>, update the
+<<cluster-max-shards-per-node,`cluster.max_shards_per_node`>> setting:
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node": 1200
+  }
+}
+----
+
+This increase should only be temporary. As a long-term solution, we recommend
+you add nodes to the oversharded data tier or
+<<reduce-cluster-shard-count,reduce your cluster's shard count>> on nodes that do not belong
+to the frozen tier.
+
+To verify that the change has fixed the issue, you can get the current
+status of the `shards_capacity` indicator by checking the `data` section of the
+<<health-api-example,health API>>:
+
+[source,console]
+----
+GET _health_report/shards_capacity
+----
+
+The response will look like this:
+
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "green",
+      "symptom": "The cluster has enough room to add new shards.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1200
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3000
+        }
+      }
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
+
+When a long-term solution is in place, we recommend you reset the
+`cluster.max_shards_per_node` limit.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node": null
+  }
+}
+----
+// end::non-frozen-nodes-self-managed[]
+
+// tag::frozen-nodes-cloud[]
+
+**Use {kib}**
+
+//tag::kibana-api-ex[]
+. Log in to the {ess-console}[{ecloud} console].
++
+
+. On the **Elasticsearch Service** panel, click the name of your deployment.
++
+
+NOTE: If the name of your deployment is disabled, your {kib} instances might be
+unhealthy, in which case please contact https://support.elastic.co[Elastic Support].
+If your deployment doesn't include {kib}, all you need to do is
+{cloud}/ec-access-kibana.html[enable it first].
+
+. Open your deployment's side navigation menu (located under the Elastic logo in the upper-left corner)
+and go to **Dev Tools > Console**.
++
+[role="screenshot"]
+image::images/kibana-console.png[{kib} Console,align="center"]
++
+. Check the current status of the cluster according to the shards capacity indicator:
++
+[source,console]
+----
+GET _health_report/shards_capacity
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "yellow",
+      "symptom": "Cluster is close to reaching the configured maximum number of shards for frozen nodes.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1000
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3000, <1>
+          "current_used_shards": 2998 <2>
+        }
+      },
+      "impacts": [
+        ...
+      ],
+      "diagnosis": [
+        ...
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
+<1> Current value of the setting `cluster.max_shards_per_node.frozen`
+<2> Current number of open shards used by frozen nodes across the cluster
++
+
+. Update the <<cluster-max-shards-per-node-frozen,`cluster.max_shards_per_node.frozen`>> setting:
++
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node.frozen": 3200
+  }
+}
+----
++
+This increase should only be temporary. As a long-term solution, we recommend
+you add nodes to the oversharded data tier or
+<<reduce-cluster-shard-count,reduce your cluster's shard count>> on nodes that belong
+to the frozen tier.
+
+. To verify that the change has fixed the issue, you can get the current
+status of the `shards_capacity` indicator by checking the `frozen` section of the
+<<health-api-example,health API>>:
++
+[source,console]
+----
+GET _health_report/shards_capacity
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "green",
+      "symptom": "The cluster has enough room to add new shards.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1000
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3200
+        }
+      }
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
++
+. When a long-term solution is in place, we recommend you reset the
+`cluster.max_shards_per_node.frozen` limit.
++
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node.frozen": null
+  }
+}
+----
+
+// end::frozen-nodes-cloud[]
+
+// tag::frozen-nodes-self-managed[]
+
+Check the current status of the cluster according to the shards capacity indicator:
+
+[source,console]
+----
+GET _health_report/shards_capacity
+----
+
+The response will look like this:
+
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "yellow",
+      "symptom": "Cluster is close to reaching the configured maximum number of shards for frozen nodes.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1000
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3000, <1>
+          "current_used_shards": 2998 <2>
+        }
+      },
+      "impacts": [
+        ...
+      ],
+      "diagnosis": [
+        ...
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
+<1> Current value of the setting `cluster.max_shards_per_node.frozen`.
+<2> Current number of open shards used by frozen nodes across the cluster.
+
+Using the <<cluster-update-settings,cluster update settings API>>, update the
+<<cluster-max-shards-per-node-frozen,`cluster.max_shards_per_node.frozen`>> setting:
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node.frozen": 3200
+  }
+}
+----
+
+This increase should only be temporary. As a long-term solution, we recommend
+you add nodes to the oversharded data tier or
+<<reduce-cluster-shard-count,reduce your cluster's shard count>> on nodes that belong
+to the frozen tier.
+
+To verify that the change has fixed the issue, you can get the current
+status of the `shards_capacity` indicator by checking the `frozen` section of the
+<<health-api-example,health API>>:
+
+[source,console]
+----
+GET _health_report/shards_capacity
+----
+
+The response will look like this:
+
+[source,console-result]
+----
+{
+  "cluster_name": "...",
+  "indicators": {
+    "shards_capacity": {
+      "status": "green",
+      "symptom": "The cluster has enough room to add new shards.",
+      "details": {
+        "data": {
+          "max_shards_in_cluster": 1000
+        },
+        "frozen": {
+          "max_shards_in_cluster": 3200
+        }
+      }
+    }
+  }
+}
+----
+// TESTRESPONSE[skip:the result is for illustrating purposes only]
+
+When a long-term solution is in place, we recommend you reset the
+`cluster.max_shards_per_node.frozen` limit.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.max_shards_per_node.frozen": null
+  }
+}
+----
+// end::frozen-nodes-self-managed[]

+ 3 - 0
docs/reference/troubleshooting.asciidoc

@@ -54,6 +54,7 @@ fix problems that an {es} deployment might encounter.
 * <<transform-troubleshooting,Troubleshooting transforms>>
 * <<watcher-troubleshooting,Troubleshooting Watcher>>
 * <<troubleshooting-searches,Troubleshooting searches>>
+* <<troubleshooting-shards-capacity-issues,Troubleshooting shards capacity health issues>>
 
 If none of these solutions relate to your issue, you can still get help:
 
@@ -123,3 +124,5 @@ include::transform/troubleshooting.asciidoc[leveloffset=+1]
 include::../../x-pack/docs/en/watcher/troubleshooting.asciidoc[]
 
 include::troubleshooting/troubleshooting-searches.asciidoc[]
+
+include::troubleshooting/troubleshooting-shards-capacity.asciidoc[]

+ 10 - 0
docs/reference/troubleshooting/troubleshooting-shards-capacity.asciidoc

@@ -0,0 +1,10 @@
+[[troubleshooting-shards-capacity-issues]]
+== Troubleshooting shards capacity health issues
+
+{es} limits the maximum number of shards that can be held on each node using the
+<<cluster-max-shards-per-node,`cluster.max_shards_per_node`>> and
+<<cluster-max-shards-per-node-frozen,`cluster.max_shards_per_node.frozen`>> settings.
+The current shards capacity of the cluster is available in the
+<<health-api-response-details-shards-capacity,shards capacity section of the health API>>.
+
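+For a quick first check, you can query the shards capacity indicator directly;
+the sections below walk through interpreting the response and adjusting the
+limits for non-frozen and frozen nodes:
+
+[source,console]
+----
+GET _health_report/shards_capacity
+----
+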
+include::{es-repo-dir}/tab-widgets/troubleshooting/troubleshooting-shards-capacity-widget.asciidoc[]