@@ -0,0 +1,274 @@
+[[hotspotting]]
+=== Hot spotting
+++++
+<titleabbrev>Hot spotting</titleabbrev>
+++++
+:keywords: hot-spotting, hotspot, hot-spot, hot spot, hotspots, hotspotting
+
+Computer link:{wikipedia}/Hot_spot_(computer_programming)[hot spotting]
+may occur in {es} when resource utilizations are unevenly distributed across
+<<modules-node,nodes>>. Temporary spikes are not usually considered problematic, but
+ongoing, significantly uneven utilization may lead to cluster bottlenecks
+and should be reviewed.
+
+[discrete]
+[[detect]]
+==== Detect hot spotting
+
+Hot spotting most commonly surfaces as significantly elevated
+resource utilization (of `disk.percent`, `heap.percent`, or `cpu`) among a
+subset of nodes as reported via <<cat-nodes,cat nodes>>. Individual spikes aren't
+necessarily problematic, but if utilization repeatedly spikes or consistently remains
+high over time (for example longer than 30 seconds), the resource may be experiencing problematic
+hot spotting.
+
+For example, let's showcase two separate plausible issues using cat nodes:
+
+[source,console]
+----
+GET _cat/nodes?v&s=master,name&h=name,master,node.role,heap.percent,disk.used_percent,cpu
+----
+
+Suppose the same output was returned by two polls taken five minutes apart:
+
+[source,console-result]
+----
+name   master node.role heap.percent disk.used_percent cpu
+node_1 *      hirstm    24           20                95
+node_2 -      hirstm    23           18                18
+node_3 -      hirstmv   25           90                10
+----
+// TEST[skip:illustrative response only]
+
+Here we see two significantly divergent utilizations: the master node is at
+`cpu: 95` and a hot node is at `disk.used_percent: 90%`. This indicates hot
+spotting is occurring on these two nodes, though not necessarily from the same
+root cause.
+
+[discrete]
+[[causes]]
+==== Causes
+
+Historically, clusters experience hot spotting mainly as an effect of hardware,
+shard distributions, and/or task load. We'll review these sequentially, in
+order of the scope of their potential impact.
+
+[discrete]
+[[causes-hardware]]
+===== Hardware
+
+Here are some common improper hardware setups which may contribute to hot
+spotting (a couple of quick checks are sketched after this list):
+
+* Resources are allocated non-uniformly. For example, one hot node might be
+given half the CPU of its peers. {es} expects all nodes on a
+<<data-tiers,data tier>> to share the same hardware profiles or
+specifications.
+
+* Resources are consumed by another service on the host, including other
+{es} nodes. Refer to our <<dedicated-host,dedicated host>> recommendation.
+
+* Resources experience different network or disk throughputs. For example, one
+node's I/O throughput may be lower than that of its peers. Refer to
+<<tune-for-indexing-speed,Use faster hardware>> for more information.
+
+* A JVM has been configured with a heap larger than 31GB. Refer to <<set-jvm-heap-size>>
+for more information.
+
+* Problematic nodes uniquely report <<setup-configuration-memory,memory swapping>>.
+
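+As a quick spot check for several of these points, you can compare configured
+heap and memory across nodes and look for swap usage. This is a minimal sketch;
+the chosen columns and filters are only one way to slice the data:
+
+[source,console]
+----
+GET _cat/nodes?v&h=name,node.role,heap.max,ram.max
+
+GET _nodes/stats/os?human&filter_path=nodes.*.name,nodes.*.os.swap
+----
+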
+[discrete]
+[[causes-shards]]
+===== Shard distributions
+
+{es} indices are divided into one or more link:{wikipedia}/Shard_(database_architecture)[shards]
+which can sometimes be poorly distributed. {es} accounts for this by <<modules-cluster,balancing shard counts>>
+across data nodes. As link:{blog-ref}whats-new-elasticsearch-kibana-cloud-8-6-0[introduced in version 8.6],
+{es} by default also enables <<modules-cluster,desired balancing>> to account for ingest load.
+A node may still experience hot spotting either due to write-heavy indices or
+due to the overall number of shards it's hosting.
+
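+If you want to see which balancing factors are in play, you can inspect the
+`cluster.routing.allocation.balance.*` settings, including defaults. This is
+only an illustrative check; the exact set of settings varies by version:
+
+[source,console]
+----
+GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.balance
+----
+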
+[discrete]
+[[causes-shards-nodes]]
+====== Node level
+
+You can check for shard balancing via <<cat-allocation,cat allocation>>, though as of version
+8.6, <<modules-cluster,desired balancing>> no longer aims to balance shard
+counts alone, so counts may not be perfectly even. Note that both methods may
+temporarily show problematic imbalance during <<cluster-fault-detection,cluster stability issues>>.
+
+For example, let's showcase two separate plausible issues using cat allocation:
+
+[source,console]
+----
+GET _cat/allocation?v&s=node&h=node,shards,disk.percent,disk.indices,disk.used
+----
+
+Which could return:
+
+[source,console-result]
+----
+node   shards disk.percent disk.indices disk.used
+node_1 446    19           154.8gb      173.1gb
+node_2 31     52           44.6gb       372.7gb
+node_3 445    43           271.5gb      289.4gb
+----
+// TEST[skip:illustrative response only]
+
+Here we see two significantly divergent situations. `node_2` has recently
+restarted, so it has a much lower number of shards than all other nodes. This
+is also why its `disk.indices` is much smaller than its `disk.used` while its
+shards are still recovering, as seen via <<cat-recovery,cat recovery>>. While `node_2`'s shard
+count is low, it may become a write hot spot due to ongoing <<ilm-rollover,ILM
+rollovers>>. This is a common root cause of write hot spots covered in the next
+section.
+
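+To confirm whether a low shard count is just recovery still in progress, you can
+list the active recoveries. A minimal sketch (the column selection is only
+illustrative):
+
+[source,console]
+----
+GET _cat/recovery?v=true&active_only=true&h=index,shard,source_node,target_node,stage,bytes_percent,time
+----
+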
+The second situation is that `node_3` has a higher `disk.percent` than `node_1`,
+even though they hold roughly the same number of shards. This occurs when either
+shards are not evenly sized (refer to <<shard-size-recommendation>>) or when
+there are a lot of empty indices.
+
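+To tell these two apart, you can sort shards by store size and look for outliers
+or near-empty shards. A minimal sketch:
+
+[source,console]
+----
+GET _cat/shards?v=true&s=store:desc&h=index,shard,prirep,store,docs,node
+----
+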
+Cluster rebalancing based on desired balance does much of the heavy lifting
+of keeping nodes from hot spotting. It can be limited by either nodes hitting
+<<disk-based-shard-allocation,watermarks>> (refer to <<fix-watermark-errors,fixing disk watermark errors>>) or by a
+write-heavy index having far fewer total shards than there are nodes receiving its writes.
+
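+To check how close nodes are to those limits, you can review the current
+watermark settings alongside the cat allocation output above. An illustrative
+check:
+
+[source,console]
+----
+GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk.watermark
+----
+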
+You can confirm hot spotted nodes via <<cluster-nodes-stats,the nodes stats API>>,
+ideally polling twice over time and comparing only the differences between the
+two samples, rather than polling once and getting cumulative stats covering the
+node's full <<cluster-nodes-usage,node uptime>>. For example, to check all nodes'
+indexing stats:
+
+[source,console]
+----
+GET _nodes/stats?human&filter_path=nodes.*.name,nodes.*.indices.indexing
+----
+
+[discrete]
+[[causes-shards-index]]
+====== Index level
+
+Hot spotted nodes frequently surface via <<cat-thread-pool,cat thread pool>>'s
+`write` and `search` queue backups. For example:
+
+[source,console]
+----
+GET _cat/thread_pool/write,search?v=true&s=n,nn&h=n,nn,q,a,r,c
+----
+
+Which could return:
+
+[source,console-result]
+----
+n      nn     q   a r c
+search node_1 3   1 0 1287
+search node_2 0   2 0 1159
+search node_3 0   1 0 1302
+write  node_1 100 3 0 4259
+write  node_2 0   4 0 980
+write  node_3 1   5 0 8714
+----
+// TEST[skip:illustrative response only]
+
+Here you can see two significantly divergent situations. First, `node_1` has a
+severely backed up write queue compared to other nodes. Second, `node_3` shows a
+completed write count (`c`) at least double that of any other node. These are both
+probably due to either poorly distributed write-heavy indices, or to multiple
+write-heavy indices allocated to the same node. Since primary and replica writes
+require roughly the same amount of cluster work, we usually recommend setting
+<<total-shards-per-node,`index.routing.allocation.total_shards_per_node`>> to
+force indices to spread out, after aligning each index's shard count with the
+total node count.
+
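+For example, a minimal sketch of applying this to a hypothetical write-heavy
+index (the index name and limit are placeholders to adjust for your shard and
+node counts):
+
+[source,console]
+----
+PUT my-write-heavy-index/_settings
+{
+  "index.routing.allocation.total_shards_per_node": 2
+}
+----
+
+Take care not to set this so low that shards cannot be assigned if a node
+leaves the cluster.
+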
+We normally recommend that write-heavy indices have sufficient primary
+`number_of_shards` and replica `number_of_replicas` to spread evenly across the
+indexing nodes. Alternatively, you can <<cluster-reroute,reroute>> shards to
+quieter nodes to relieve the nodes experiencing write hot spotting.
+
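+As a one-off relief valve, a single-shard reroute might look like the following
+sketch, where the index, shard number, and node names are placeholders:
+
+[source,console]
+----
+POST _cluster/reroute
+{
+  "commands": [
+    {
+      "move": {
+        "index": "my-write-heavy-index", "shard": 0,
+        "from_node": "node_1", "to_node": "node_2"
+      }
+    }
+  ]
+}
+----
+
+Keep in mind the balancer may move such shards again later unless the underlying
+distribution issue is also addressed.
+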
+If it's not obvious which indices are problematic, you can introspect further via
+<<indices-stats,the index stats API>> by running:
+
+[source,console]
+----
+GET _stats?level=shards&human&expand_wildcards=all&filter_path=indices.*.total.indexing.index_total
+----
+
+For more advanced analysis, you can poll for shard-level stats, which lets you
+compare combined index-level and node-level stats. This analysis doesn't account
+for node restarts and/or shard rerouting, but serves as an overview:
+
+[source,console]
+----
+GET _stats/indexing,search?level=shards&human&expand_wildcards=all
+----
+
+You can, for example, use the link:https://stedolan.github.io/jq[third-party JQ tool]
+to process the output saved as `indices_stats.json`:
+
+[source,sh]
+----
+# flatten the per-shard stats into an array with one object per shard copy,
+# keyed by node, index, and shard, with average indexing/search latencies in ms
+cat indices_stats.json | jq -rc ['.indices|to_entries[]|.key as $i|.value.shards|to_entries[]|.key as $s|.value[]|{node:.routing.node[:4], index:$i, shard:$s, primary:.routing.primary, size:.store.size, total_indexing:.indexing.index_total, time_indexing:.indexing.index_time_in_millis, total_query:.search.query_total, time_query:.search.query_time_in_millis } | .+{ avg_indexing: (if .total_indexing>0 then (.time_indexing/.total_indexing|round) else 0 end), avg_search: (if .total_query>0 then (.time_query/.total_query|round) else 0 end) }'] > shard_stats.json
+
+# show top written-to shard simplified stats which contain their index and node references
+cat shard_stats.json | jq -rc 'sort_by(-.avg_indexing)[]' | head
+----
+
+[discrete]
+[[causes-tasks]]
+===== Task loads
+
+Shard distribution problems will most likely surface as task load, as seen
+above in the <<cat-thread-pool,cat thread pool>> example. It is also
+possible for tasks to hot spot a node either because individual tasks are
+qualitatively expensive or because the overall volume of traffic is too high.
+
+For example, if <<cat-thread-pool,cat thread pool>> reported a high
+queue on the `warmer` <<modules-threadpool,thread pool>>, you would
+look up the affected node's <<cluster-nodes-hot-threads,hot threads>>.
+Let's say it reported `warmer` threads at `100% cpu` related to
+`GlobalOrdinalsBuilder`. This would let you know to inspect
+<<eager-global-ordinals,field data's global ordinals>>.
+
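+For example, to capture hot threads on the affected node (here `node_1` is a
+placeholder node name):
+
+[source,console]
+----
+GET _nodes/node_1/hot_threads?threads=5&interval=500ms
+----
+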
+Alternatively, let's say <<cat-nodes,cat nodes>> shows a hot spotted master node
+and <<cat-thread-pool,cat thread pool>> shows general queuing across nodes.
+This would suggest the master node is overwhelmed. To resolve
+this, first ensure a <<high-availability-cluster-small-clusters,hardware high availability>>
+setup and then look to ephemeral causes. In this example,
+<<cluster-nodes-hot-threads,the nodes hot threads API>> reports multiple threads in
+`other`, which indicates they're waiting on or blocked by either garbage collection
+or I/O.
+
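+To check the garbage collection side of that, you can pull the elected master's
+JVM stats, for example:
+
+[source,console]
+----
+GET _nodes/_master/stats/jvm?human&filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc
+----
+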
+For either of these example situations, a good way to confirm the problematic tasks
+is to look at the longest running non-continuous (designated `[c]`) tasks via
+<<cat-tasks,cat task management>>. This can be supplemented by checking the longest
+running cluster sync tasks via <<cat-pending-tasks,cat pending tasks>>. As a
+third example:
+
+[source,console]
+----
+GET _cat/tasks?v&s=time:desc&h=type,action,running_time,node,cancellable
+----
+
+This could return:
+
+[source,console-result]
+----
+type   action                running_time node   cancellable
+direct indices:data/read/eql 10m          node_1 true
+...
+----
+// TEST[skip:illustrative response only]
+
+This surfaces a problematic <<eql-search-api,EQL query>>. We can gain
+further insight into it via <<tasks,the task management API>>. Its response
+contains a `description` that reports this query:
+
+[source,eql]
+----
+indices[winlogbeat-*,logs-window*], sequence by winlog.computer_name with maxspan=1m\n\n[authentication where host.os.type == "windows" and event.action:"logged-in" and\n event.outcome == "success" and process.name == "svchost.exe" ] by winlog.event_data.TargetLogonId
+----
+
+This lets you know which indices to check (`winlogbeat-*,logs-window*`), as well
+as the <<eql-search-api,EQL search>> request body. Most likely this is
+link:{security-guide}/es-overview.html[SIEM related].
+You can combine this with <<enable-audit-logging,audit logging>> as needed to
+trace the request source.
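+
+If you need to chase such a task further, you can pull its full details from
+<<tasks,the task management API>> and, since it reports as cancellable, cancel
+it. A sketch, where the task ID is a placeholder:
+
+[source,console]
+----
+GET _tasks?detailed=true&actions=indices:data/read/eql
+
+POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
+----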