
Disk indicator troubleshooting guides (#90504)

Mary Gouseti · 3 years ago · commit cfd23d512f
20 changed files with 655 additions and 1 deletions
  1. BIN
      docs/reference/images/troubleshooting/disk/autoscaling_banner.png
  2. BIN
      docs/reference/images/troubleshooting/disk/autoscaling_limits_banner.png
  3. BIN
      docs/reference/images/troubleshooting/disk/enable_autoscaling.png
  4. BIN
      docs/reference/images/troubleshooting/disk/increase-disk-capacity-master-node.png
  5. BIN
      docs/reference/images/troubleshooting/disk/increase-disk-capacity-other-node.png
  6. BIN
      docs/reference/images/troubleshooting/disk/reached_autoscaling_limits.png
  7. BIN
      docs/reference/images/troubleshooting/disk/reduce_replicas.png
  8. 9 1
      docs/reference/settings/health-diagnostic-settings.asciidoc
  9. 40 0
      docs/reference/tab-widgets/troubleshooting/disk/decrease-data-node-disk-usage-widget.asciidoc
  10. 140 0
      docs/reference/tab-widgets/troubleshooting/disk/decrease-data-node-disk-usage.asciidoc
  11. 40 0
      docs/reference/tab-widgets/troubleshooting/disk/increase-data-node-capacity-widget.asciidoc
  12. 110 0
      docs/reference/tab-widgets/troubleshooting/disk/increase-data-node-capacity.asciidoc
  13. 40 0
      docs/reference/tab-widgets/troubleshooting/disk/increase-master-node-capacity-widget.asciidoc
  14. 89 0
      docs/reference/tab-widgets/troubleshooting/disk/increase-master-node-capacity.asciidoc
  15. 40 0
      docs/reference/tab-widgets/troubleshooting/disk/increase-other-node-capacity-widget.asciidoc
  16. 94 0
      docs/reference/tab-widgets/troubleshooting/disk/increase-other-node-capacity.asciidoc
  17. 13 0
      docs/reference/troubleshooting.asciidoc
  18. 23 0
      docs/reference/troubleshooting/disk/fix-data-node-out-of-disk.asciidoc
  19. 8 0
      docs/reference/troubleshooting/disk/fix-master-node-out-of-disk.asciidoc
  20. 9 0
      docs/reference/troubleshooting/disk/fix-other-node-out-of-disk.asciidoc

BIN
docs/reference/images/troubleshooting/disk/autoscaling_banner.png


BIN
docs/reference/images/troubleshooting/disk/autoscaling_limits_banner.png


BIN
docs/reference/images/troubleshooting/disk/enable_autoscaling.png


BIN
docs/reference/images/troubleshooting/disk/increase-disk-capacity-master-node.png


BIN
docs/reference/images/troubleshooting/disk/increase-disk-capacity-other-node.png


BIN
docs/reference/images/troubleshooting/disk/reached_autoscaling_limits.png


BIN
docs/reference/images/troubleshooting/disk/reduce_replicas.png


+ 9 - 1
docs/reference/settings/health-diagnostic-settings.asciidoc

@@ -16,7 +16,7 @@ is not recommended to change any of these from their default values.
 a master at all, before moving on with other checks. Defaults to `30s` (30 seconds).
 
 `master_history.max_age`::
-(<<static-cluster-setting,Static>>) The timeframe we record the master history 
+(<<static-cluster-setting,Static>>) The timeframe we record the master history
 to be used for diagnosing the cluster health. Master node changes older than this time will not be considered when
 diagnosing the cluster health. Defaults to `30m` (30 minutes).
 
@@ -27,3 +27,11 @@ Defaults to `4`.
 `health.master_history.no_master_transitions_threshold`::
 (<<static-cluster-setting,Static>>) The number of transitions to no master witnessed by a node that indicates the cluster is not healthy.
 Defaults to `4`.
+
+`health.node.enabled`::
+(<<cluster-update-settings,Dynamic>>) Enables the health node, which allows the health API to provide indications about
+cluster-wide health aspects such as disk space.
+
+`health.reporting.local.monitor.interval`::
+(<<cluster-update-settings,Dynamic>>) Determines the interval at which each node of the cluster monitors aspects of
+its local health, such as its disk usage.
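+
+Since both settings are dynamic, they can be updated at runtime with the <<cluster-update-settings, cluster update
+settings API>>. The values below are illustrative only; adapt them to your environment.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent": {
+    "health.node.enabled": true,
+    "health.reporting.local.monitor.interval": "30s"
+  }
+}
+----
+// TEST[skip:illustration purposes only]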

+ 40 - 0
docs/reference/tab-widgets/troubleshooting/disk/decrease-data-node-disk-usage-widget.asciidoc

@@ -0,0 +1,40 @@
+++++
+<div class="tabs" data-tab-group="host">
+  <div role="tablist" aria-label="Restore from snapshot">
+    <button role="tab"
+            aria-selected="true"
+            aria-controls="cloud-tab-decrease-disk-usage"
+            id="cloud-decrease-disk-usage">
+      Elasticsearch Service
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="self-managed-tab-decrease-disk-usage"
+            id="self-managed-decrease-disk-usage"
+            tabindex="-1">
+      Self-managed
+    </button>
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="cloud-tab-decrease-disk-usage"
+       aria-labelledby="cloud-decrease-disk-usage">
+++++
+
+include::decrease-data-node-disk-usage.asciidoc[tag=cloud]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="self-managed-tab-decrease-disk-usage"
+       aria-labelledby="self-managed-decrease-disk-usage"
+       hidden="">
+++++
+
+include::decrease-data-node-disk-usage.asciidoc[tag=self-managed]
+
+++++
+  </div>
+</div>
+++++

+ 140 - 0
docs/reference/tab-widgets/troubleshooting/disk/decrease-data-node-disk-usage.asciidoc

@@ -0,0 +1,140 @@
+// tag::cloud[]
+**Use {kib}**
+
+//tag::kibana-api-ex[]
+. Log in to the {ess-console}[{ecloud} console].
++
+
+. On the **Elasticsearch Service** panel, click the name of your deployment.
++
+
+NOTE: If the name of your deployment is disabled, your {kib} instances might be
+unhealthy, in which case contact https://support.elastic.co[Elastic Support].
+If your deployment doesn't include {kib}, all you need to do is
+{cloud}/ec-access-kibana.html[enable it first].
++
+. Open your deployment's side navigation menu (placed under the Elastic logo in the upper left corner)
+and go to **Stack Management > Index Management**.
+
+. In the list of all your indices, click the `Replicas` column twice to sort the indices by their number of replicas,
+starting with the one that has the most. Go through the indices one by one, picking the least important indices with
+the highest number of replicas.
++
+WARNING: Reducing the replicas of an index can potentially reduce search throughput and data redundancy.
++
+. For each index you chose, click its name, then on the panel that appears click `Edit settings`, lower
+`index.number_of_replicas` to the desired value, and click `Save`.
++
+[role="screenshot"]
+image::images/troubleshooting/disk/reduce_replicas.png[Reducing replicas,align="center"]
++
+. Continue this process until the cluster is healthy again.
+
+// end::cloud[]
+
+// tag::self-managed[]
+In order to estimate how many replicas need to be removed, first you need to determine the amount of disk space that
+needs to be released.
+
+. First, retrieve the relevant disk thresholds that will indicate how much space should be released. The
+relevant thresholds are the <<cluster-routing-watermark-high, high watermark>> for all the tiers apart from the frozen
+one and the <<cluster-routing-flood-stage-frozen, frozen flood stage watermark>> for the frozen tier. The following
+example demonstrates disk shortage in the hot tier, so we will only retrieve the high watermark:
++
+[source,console]
+----
+GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "defaults": {
+    "cluster": {
+      "routing": {
+        "allocation": {
+          "disk": {
+            "watermark": {
+              "high": "90%",
+              "high.max_headroom": "150GB"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+----
+// TEST[skip:illustration purposes only]
++
+The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
+more than 150GB available; read more on how this threshold works <<cluster-routing-watermark-high, here>>.
+
+. The next step is to find out the current disk usage; this will indicate how much space should be freed. For simplicity,
+our example has one node, but you can apply the same steps to every node that is over the relevant threshold.
++
+[source,console]
+----
+GET _cat/allocation?v&s=disk.avail&h=node,disk.percent,disk.avail,disk.total,disk.used,disk.indices,shards
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+node                disk.percent disk.avail disk.total disk.used disk.indices shards
+instance-0000000000           91     4.6gb       35gb    31.1gb       29.9gb    111
+----
+// TEST[skip:illustration purposes only]
+
+. The high watermark configuration indicates that the disk usage needs to drop below 90%. Consider allowing some
+padding, so the node will not go over the threshold in the near future. In this example, let's release approximately
+7GB, which would bring the disk usage of `instance-0000000000` down to roughly 70%.
+
+. The next step is to list all the indices and choose which replicas to reduce.
++
+NOTE: The following command orders the indices by descending number of replicas and primary store size. This helps you
+choose which replicas to reduce, under the assumption that the more replicas an index has, the smaller the risk of
+removing a copy, and the bigger the replica, the more space will be released. It does not take any functional
+requirements into consideration, so treat it as a suggestion only.
++
+[source,console]
+----
+GET _cat/indices?v&s=rep:desc,pri.store.size:desc&h=health,index,pri,rep,store.size,pri.store.size
+----
++
+The response will look like:
++
+[source,console-result]
+----
+health index                                                      pri rep store.size pri.store.size
+green  my_index                                                     2   3      9.9gb          3.3gb
+green  my_other_index                                               2   3      1.8gb        470.3mb
+green  search-products                                              2   3    278.5kb         69.6kb
+green  logs-000001                                                  1   0      7.7gb          7.7gb
+----
+// TEST[skip:illustration purposes only]
++
+. In the list above we see that if we reduce the number of replicas of `my_index` and `my_other_index` to 1, we will
+release the required disk space. It is not necessary to reduce the replicas of `search-products`, and `logs-000001` does
+not have any replicas anyway. Reduce the replicas of one or more indices with the <<indices-update-settings,
+index update settings API>>:
++
+WARNING: Reducing the replicas of an index can potentially reduce search throughput and data redundancy.
++
+[source,console]
+----
+PUT my_index,my_other_index/_settings
+{
+  "index.number_of_replicas": 1
+}
+----
+// TEST[skip:illustration purposes only]
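+
+To verify that enough space has been released, you can re-run the allocation check from the earlier step and confirm
+that `disk.percent` has dropped below the high watermark:
+
+[source,console]
+----
+GET _cat/allocation?v&s=disk.avail&h=node,disk.percent,disk.avail,disk.total,disk.used,disk.indices,shards
+----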
+// end::self-managed[]
+
+IMPORTANT: After reducing the replicas, make sure there are still enough replicas to meet your search performance and
+reliability requirements. If not, at your earliest convenience (i) consider using
+<<overview-index-lifecycle-management, Index Lifecycle Management>> to manage the retention of your time series data
+more efficiently, (ii) reduce the amount of data you have by disabling the `source` or removing less important data,
+or (iii) increase your disk capacity.
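+
+For example, if you choose to manage retention with Index Lifecycle Management, a minimal policy that simply deletes
+time series indices after a retention period could look like the sketch below. The policy name and the `30d` age are
+illustrative only; adapt them to your requirements.
+
+[source,console]
+----
+PUT _ilm/policy/my-retention-policy
+{
+  "policy": {
+    "phases": {
+      "delete": {
+        "min_age": "30d",
+        "actions": {
+          "delete": {}
+        }
+      }
+    }
+  }
+}
+----
+// TEST[skip:illustration purposes only]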

+ 40 - 0
docs/reference/tab-widgets/troubleshooting/disk/increase-data-node-capacity-widget.asciidoc

@@ -0,0 +1,40 @@
+++++
+<div class="tabs" data-tab-group="host">
+  <div role="tablist" aria-label="Increase data node capacity">
+    <button role="tab"
+            aria-selected="true"
+            aria-controls="cloud-tab-increase-data-node-capacity"
+            id="cloud-increase-data-node-capacity">
+      Elasticsearch Service
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="self-managed-tab-increase-data-node-capacity"
+            id="self-managed-increase-data-node-capacity"
+            tabindex="-1">
+      Self-managed
+    </button>
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="cloud-tab-increase-data-node-capacity"
+       aria-labelledby="cloud-increase-data-node-capacity">
+++++
+
+include::increase-data-node-capacity.asciidoc[tag=cloud]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="self-managed-tab-increase-data-node-capacity"
+       aria-labelledby="self-managed-increase-data-node-capacity"
+       hidden="">
+++++
+
+include::increase-data-node-capacity.asciidoc[tag=self-managed]
+
+++++
+  </div>
+</div>
+++++

+ 110 - 0
docs/reference/tab-widgets/troubleshooting/disk/increase-data-node-capacity.asciidoc

@@ -0,0 +1,110 @@
+// tag::cloud[]
+In order to increase the disk capacity of the data nodes in your cluster:
+
+. Log in to the {ess-console}[{ecloud} console].
++
+. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
+name of your deployment.
++
+. If autoscaling is available but not enabled, please enable it. You can do this by clicking the button
+`Enable autoscaling` on a banner like the one below:
++
+[role="screenshot"]
+image::images/troubleshooting/disk/autoscaling_banner.png[Autoscaling banner,align="center"]
++
+Or you can go to `Actions > Edit deployment`, check the checkbox `Autoscale` and click `save` at the bottom of the page.
++
+[role="screenshot"]
+image::images/troubleshooting/disk/enable_autoscaling.png[Enabling autoscaling,align="center"]
+
+. If autoscaling has succeeded, the cluster should return to `healthy` status. If the cluster is still out of disk space,
+please check if autoscaling has reached its limits. You will be notified about this by the following banner:
++
+[role="screenshot"]
+image::images/troubleshooting/disk/autoscaling_limits_banner.png[Autoscaling banner,align="center"]
++
+or you can go to `Actions > Edit deployment` and look for the label `LIMIT REACHED` as shown below:
++
+[role="screenshot"]
+image::images/troubleshooting/disk/reached_autoscaling_limits.png[Autoscaling limits reached,align="center"]
++
+If you are seeing the banner, click `Update autoscaling settings` to go to the `Edit` page. Otherwise, you are already
+on the `Edit` page; click `Edit settings` to increase the autoscaling limits. After you perform the change, click `save`
+at the bottom of the page.
+
+// end::cloud[]
+
+// tag::self-managed[]
+In order to increase the data node capacity in your cluster, you will need to calculate the amount of extra disk space
+needed.
+
+. First, retrieve the relevant disk thresholds that will indicate how much space should be available. The
+relevant thresholds are the <<cluster-routing-watermark-high, high watermark>> for all the tiers apart from the frozen
+one and the <<cluster-routing-flood-stage-frozen, frozen flood stage watermark>> for the frozen tier. The following
+example demonstrates disk shortage in the hot tier, so we will only retrieve the high watermark:
++
+[source,console]
+----
+GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "defaults": {
+    "cluster": {
+      "routing": {
+        "allocation": {
+          "disk": {
+            "watermark": {
+              "high": "90%",
+              "high.max_headroom": "150GB"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+----
+// TEST[skip:illustration purposes only]
++
+The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
+more than 150GB available; read more on how this threshold works <<cluster-routing-watermark-high, here>>.
+
+. The next step is to find out the current disk usage; this will indicate how much extra space is needed. For simplicity,
+our example has one node, but you can apply the same steps to every node that is over the relevant threshold.
++
+[source,console]
+----
+GET _cat/allocation?v&s=disk.avail&h=node,disk.percent,disk.avail,disk.total,disk.used,disk.indices,shards
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+node                disk.percent disk.avail disk.total disk.used disk.indices shards
+instance-0000000000           91     4.6gb       35gb    31.1gb       29.9gb    111
+----
+// TEST[skip:illustration purposes only]
+
+. The high watermark configuration indicates that the disk usage needs to drop below 90%. To achieve this, two
+things are possible:
+- to add an extra data node to the cluster (this requires that you have more than one shard in your cluster), or
+- to extend the disk space of the current node by approximately 20%, so that its disk usage drops comfortably below
+the threshold and the node will not run out of space soon.
+
+. In the case of adding another data node, the cluster will not recover immediately. It might take some time to
+relocate some shards to the new node. You can check the progress here:
++
+[source,console]
+----
+GET /_cat/shards?v&h=state,node&s=state
+----
++
+If the response shows shards in the `RELOCATING` state, it means that shards are still moving. Wait until all shards
+transition to `STARTED` or until the disk health indicator turns `green`.
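+
+Alternatively, you can watch the overall progress through the cluster health API, which reports the number of shards
+that are still relocating. This is only a convenience check; the counts you see depend on your cluster.
+
+[source,console]
+----
+GET _cluster/health?filter_path=status,relocating_shards,unassigned_shards
+----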
+// end::self-managed[]

+ 40 - 0
docs/reference/tab-widgets/troubleshooting/disk/increase-master-node-capacity-widget.asciidoc

@@ -0,0 +1,40 @@
+++++
+<div class="tabs" data-tab-group="host">
+  <div role="tablist" aria-label="Increase master node capacity">
+    <button role="tab"
+            aria-selected="true"
+            aria-controls="cloud-tab-increase-master-node-capacity"
+            id="cloud-increase-data-node-capacity">
+      Elasticsearch Service
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="self-managed-tab-increase-master-node-capacity"
+            id="self-managed-increase-master-node-capacity"
+            tabindex="-1">
+      Self-managed
+    </button>
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="cloud-tab-increase-master-node-capacity"
+       aria-labelledby="cloud-increase-master-node-capacity">
+++++
+
+include::increase-master-node-capacity.asciidoc[tag=cloud]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="self-managed-tab-increase-master-node-capacity"
+       aria-labelledby="self-managed-increase-master-node-capacity"
+       hidden="">
+++++
+
+include::increase-master-node-capacity.asciidoc[tag=self-managed]
+
+++++
+  </div>
+</div>
+++++

+ 89 - 0
docs/reference/tab-widgets/troubleshooting/disk/increase-master-node-capacity.asciidoc

@@ -0,0 +1,89 @@
+// tag::cloud[]
+
+. Log in to the {ess-console}[{ecloud} console].
++
+. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
+name of your deployment.
++
+. Go to `Actions > Edit deployment` and then go to the `Master instances` section:
++
+[role="screenshot"]
+image::images/troubleshooting/disk/increase-disk-capacity-master-node.png[Increase disk capacity of master nodes,align="center"]
+
+. Choose a larger capacity configuration than the pre-selected one from the drop-down menu and click `save`. Wait for
+the plan to be applied; the problem should then be resolved.
+
+// end::cloud[]
+
+// tag::self-managed[]
+In order to increase the disk capacity of a master node, you will need to replace *all* the master nodes with
+master nodes of higher disk capacity.
+
+. First, retrieve the disk threshold that will indicate how much disk space is needed. The relevant threshold is
+the <<cluster-routing-watermark-high, high watermark>> and can be retrieved via the following command:
++
+[source,console]
+----
+GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "defaults": {
+    "cluster": {
+      "routing": {
+        "allocation": {
+          "disk": {
+            "watermark": {
+              "high": "90%",
+              "high.max_headroom": "150GB"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+----
+// TEST[skip:illustration purposes only]
++
+The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
+more than 150GB available; read more on how this threshold works <<cluster-routing-watermark-high, here>>.
+
+. The next step is to find out the current disk usage; this will allow you to calculate how much extra space is needed.
+In the following example, we show only the master nodes for readability purposes:
++
+[source,console]
+----
+GET /_cat/nodes?v&h=name,master,node.role,disk.used_percent,disk.used,disk.avail,disk.total
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+name                master node.role disk.used_percent disk.used disk.avail disk.total
+instance-0000000000 *      m                    85.31    3.4gb     500mb       4gb
+instance-0000000001 *      m                    50.02    2.1gb     1.9gb       4gb
+instance-0000000002 *      m                    50.02    1.9gb     2.1gb       4gb
+----
+// TEST[skip:illustration purposes only]
+
+. The desired situation is to drop the disk usage below the relevant threshold, in our example 90%. Consider adding
+some padding, so it will not go over the threshold soon. If you have multiple master nodes, you need to ensure that *all*
+master nodes will have this capacity. Assuming you have the new nodes ready, follow the next three steps for every
+master node.
+
+. Bring down one of the master nodes.
+. Start up one of the new master nodes and wait for it to join the cluster. You can check this via:
++
+[source,console]
+----
+GET /_cat/nodes?v&h=name,master,node.role,disk.used_percent,disk.used,disk.avail,disk.total
+----
++
+. Only after you have confirmed that your cluster is back to its initial number of master nodes, for example with the
+check shown below, move on to the next node, until all the initial master nodes have been replaced.
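+
+One convenient way to confirm the number of nodes is the cluster health API; the exact node count to expect depends on
+your topology.
+
+[source,console]
+----
+GET _cluster/health?filter_path=status,number_of_nodes
+----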
+// end::self-managed[]

+ 40 - 0
docs/reference/tab-widgets/troubleshooting/disk/increase-other-node-capacity-widget.asciidoc

@@ -0,0 +1,40 @@
+++++
+<div class="tabs" data-tab-group="host">
+  <div role="tablist" aria-label="Increase other node capacity">
+    <button role="tab"
+            aria-selected="true"
+            aria-controls="cloud-tab-increase-other-node-capacity"
+            id="cloud-increase-data-node-capacity">
+      Elasticsearch Service
+    </button>
+    <button role="tab"
+            aria-selected="false"
+            aria-controls="self-managed-tab-increase-other-node-capacity"
+            id="self-managed-increase-other-node-capacity"
+            tabindex="-1">
+      Self-managed
+    </button>
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="cloud-tab-increase-other-node-capacity"
+       aria-labelledby="cloud-increase-other-node-capacity">
+++++
+
+include::increase-other-node-capacity.asciidoc[tag=cloud]
+
+++++
+  </div>
+  <div tabindex="0"
+       role="tabpanel"
+       id="self-managed-tab-increase-other-node-capacity"
+       aria-labelledby="self-managed-increase-other-node-capacity"
+       hidden="">
+++++
+
+include::increase-other-node-capacity.asciidoc[tag=self-managed]
+
+++++
+  </div>
+</div>
+++++

+ 94 - 0
docs/reference/tab-widgets/troubleshooting/disk/increase-other-node-capacity.asciidoc

@@ -0,0 +1,94 @@
+// tag::cloud[]
+
+. Log in to the {ess-console}[{ecloud} console].
++
+. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
+name of your deployment.
++
+. Go to `Actions > Edit deployment` and then go to the `Coordinating instances` or the `Machine Learning instances`
+section depending on the roles listed in the diagnosis:
++
+[role="screenshot"]
+image::images/troubleshooting/disk/increase-disk-capacity-other-node.png[Increase disk capacity of other nodes,align="center"]
+
+. Choose a larger capacity configuration than the pre-selected one from the drop-down menu and click `save`. Wait for
+the plan to be applied; the problem should then be resolved.
+
+// end::cloud[]
+
+// tag::self-managed[]
+In order to increase the disk capacity of any other node, you will need to replace the instance that has run out of
+space with one of higher disk capacity.
+
+. First, retrieve the disk threshold that will indicate how much disk space is needed. The relevant threshold is
+the <<cluster-routing-watermark-high, high watermark>> and can be retrieved via the following command:
++
+[source,console]
+----
+GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+{
+  "defaults": {
+    "cluster": {
+      "routing": {
+        "allocation": {
+          "disk": {
+            "watermark": {
+              "high": "90%",
+              "high.max_headroom": "150GB"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+----
+// TEST[skip:illustration purposes only]
++
+The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
+more than 150GB available; read more on how this threshold works <<cluster-routing-watermark-high, here>>.
+
+. The next step is to find out the current disk usage; this will allow you to calculate how much extra space is needed.
+In the following example, we show only a machine learning node for readability purposes:
++
+[source,console]
+----
+GET /_cat/nodes?v&h=name,node.role,disk.used_percent,disk.used,disk.avail,disk.total
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+name                node.role disk.used_percent disk.used disk.avail disk.total
+instance-0000000000     l                 85.31    3.4gb     500mb       4gb
+----
+// TEST[skip:illustration purposes only]
+
+. The desired situation is to drop the disk usage below the relevant threshold, in our example 90%. Consider adding
+some padding, so it will not go over the threshold soon. Assuming you have the new node ready, add this node to the
+cluster.
+
+. Verify that the new node has joined the cluster:
++
+[source,console]
+----
+GET /_cat/nodes?v&h=name,node.role,disk.used_percent,disk.used,disk.avail,disk.total
+----
++
+The response will look like this:
++
+[source,console-result]
+----
+name                node.role disk.used_percent disk.used disk.avail disk.total
+instance-0000000000     l                 85.31    3.4gb     500mb       4gb
+instance-0000000001     l                 41.31    3.4gb     4.5gb       8gb
+----
+// TEST[skip:illustration purposes only]
+. Now you can remove the instance that has run out of disk space.
+// end::self-managed[]

+ 13 - 0
docs/reference/troubleshooting.asciidoc

@@ -31,6 +31,13 @@ fix problems that an {es} deployment might encounter.
 * <<start-ilm,Start index lifecycle management>>
 * <<start-slm,Start snapshot lifecycle management>>
 
+[discrete]
+[[troubleshooting-capacity]]
+=== Capacity
+* <<fix-data-node-out-of-disk, Fix data nodes out of disk>>
+* <<fix-master-node-out-of-disk, Fix master nodes out of disk>>
+* <<fix-other-node-out-of-disk, Fix other role nodes out of disk>>
+
 [discrete]
 [[troubleshooting-snapshot]]
 === Snapshot and restore
@@ -90,6 +97,12 @@ include::troubleshooting/data/increase-cluster-shard-limit.asciidoc[]
 
 include::troubleshooting/corruption-issues.asciidoc[]
 
+include::troubleshooting/disk/fix-data-node-out-of-disk.asciidoc[]
+
+include::troubleshooting/disk/fix-master-node-out-of-disk.asciidoc[]
+
+include::troubleshooting/disk/fix-other-node-out-of-disk.asciidoc[]
+
 include::troubleshooting/data/start-ilm.asciidoc[]
 
 include::troubleshooting/data/start-slm.asciidoc[]

+ 23 - 0
docs/reference/troubleshooting/disk/fix-data-node-out-of-disk.asciidoc

@@ -0,0 +1,23 @@
+[[fix-data-node-out-of-disk]]
+== Fix data nodes out of disk
+
+{es} uses data nodes to distribute your data inside the cluster. If one or more of these nodes are running
+out of space, {es} takes action to redistribute your data across the nodes so all nodes have enough available
+disk space. If {es} cannot free up enough space on a node, then you will need to intervene in one
+of two ways:
+
+. <<increase-capacity-data-node, Increase the disk capacity of your cluster>>
+. <<decrease-disk-usage-data-node, Reduce the disk usage by decreasing your data volume>>
+
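+The <<health-api, health API>> disk indicator reports which of these situations applies to your cluster. On recent
+versions you can inspect it directly as shown below; the exact endpoint may differ in older releases.
+
+[source,console]
+----
+GET _health_report/disk
+----
+// TEST[skip:illustration purposes only]
+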
+[[increase-capacity-data-node]]
+=== Increase the disk capacity of data nodes
+include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-data-node-capacity-widget.asciidoc[]
+
+[[decrease-disk-usage-data-node]]
+=== Decrease the disk usage of data nodes
+In order to decrease the disk usage in your cluster without losing any data, you can try reducing the replicas of indices.
+
+NOTE: Reducing the replicas of an index can potentially reduce search throughput and data redundancy. However, it
+can quickly give the cluster breathing room until a more permanent solution is in place.
+
+include::{es-repo-dir}/tab-widgets/troubleshooting/disk/decrease-data-node-disk-usage-widget.asciidoc[]

+ 8 - 0
docs/reference/troubleshooting/disk/fix-master-node-out-of-disk.asciidoc

@@ -0,0 +1,8 @@
+[[fix-master-node-out-of-disk]]
+== Fix master nodes out of disk
+
+{es} uses master nodes to coordinate the cluster. If the master or any master-eligible nodes are running
+out of space, you need to ensure that they have enough disk space to function. If the <<health-api, health API>>
+reports that your master node is out of space, you need to increase the disk capacity of your master nodes.
+
+include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-master-node-capacity-widget.asciidoc[]

+ 9 - 0
docs/reference/troubleshooting/disk/fix-other-node-out-of-disk.asciidoc

@@ -0,0 +1,9 @@
+[[fix-other-node-out-of-disk]]
+== Fix other role nodes out of disk
+
+{es} can use dedicated nodes for functions other than storing data or coordinating the cluster,
+for example machine learning. If one or more of these nodes are running out of space, you need to ensure that they have
+enough disk space to function. If the <<health-api, health API>> reports that a node that is not a master and does not
+contain data is out of space, you need to increase the disk capacity of this node.
+
+include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-other-node-capacity-widget.asciidoc[]