[role="xpack"]
[testenv="basic"]
[[es-monitoring-collectors]]
== Collectors

Collectors, as their name implies, collect things. Each collector runs once per
collection interval to obtain data from the public APIs in {es} and {xpack}
that it chooses to monitor. When the data collection is finished, the data is
handed in bulk to the <<es-monitoring-exporters,exporters>> to be sent to the
monitoring clusters. Regardless of the number of exporters, each collector only
runs once per collection interval.

There is only one collector per data type gathered. In other words, every
monitoring document comes from a single collector; documents are never merged
from multiple collectors. {monitoring} for {es} currently has a few collectors
because the goal is to minimize overlap between them for optimal performance.

Each collector can create zero or more monitoring documents. For example, the
`index_stats` collector collects all index statistics at the same time to avoid
many unnecessary calls.

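To illustrate, the data the `index_stats` collector gathers is roughly what a
single indices stats request returns for every index at once. This is an
illustrative sketch only; the collector uses the equivalent internal API rather
than issuing a REST request:

[source,console]
----
GET /_stats
----
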
[options="header"]
|=======================
| Collector | Data Types | Description

| Cluster Stats | `cluster_stats`
| Gathers details about the cluster state, including parts of the actual
cluster state (for example, `GET /_cluster/state`) and statistics about it (for
example, `GET /_cluster/stats`). This produces a single document type. In
versions prior to X-Pack 5.5, this was actually three separate collectors that
resulted in three separate types: `cluster_stats`, `cluster_state`, and
`cluster_info`. In 5.5 and later, all three are combined into `cluster_stats`.
+
This only runs on the _elected_ master node and the data collected
(`cluster_stats`) largely controls the UI. When this data is not present, it
indicates either a misconfiguration on the elected master node, timeouts
related to the collection of the data, or issues with storing the data. Only a
single document is produced per collection.

| Index Stats | `indices_stats`, `index_stats`
| Gathers details about the indices in the cluster, both in summary and
individually. This creates many documents that represent parts of the index
statistics output (for example, `GET /_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of indices -- and therefore time to gather them -- resulting in
timeouts. One summary `indices_stats` document is produced per collection and
one `index_stats` document is produced per index, per collection.

| Index Recovery | `index_recovery`
| Gathers details about index recovery in the cluster. Index recovery
represents the assignment of _shards_ at the cluster level. If an index is not
recovered, it is not usable. This also corresponds to shard restoration via
snapshots.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of shards -- and therefore time to gather them -- resulting in
timeouts. By default, this creates a single document that contains all
recoveries, which can be quite large, but it gives the most accurate picture of
recovery in the production cluster.

| Shards | `shards`
| Gathers details about all _allocated_ shards for all indices, in particular
including the node to which each shard is allocated.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The collector uses the local cluster state to get the
routing table, so unlike most other collectors it avoids network timeout
issues. Each shard is represented by a separate monitoring document.

| Jobs | `job_stats`
| Gathers details about all machine learning job statistics (for example,
`GET /_xpack/ml/anomaly_detectors/_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. However, for the master node to be able to perform the
collection, the master node must have `xpack.ml.enabled` set to `true` (its
default) and a license level that supports {ml}.

| Node Stats | `node_stats`
| Gathers details about the running node, such as memory utilization and CPU
usage (for example, `GET /_nodes/_local/stats`).
+
This runs on _every_ node with {monitoring} enabled. One common failure mode is
a timeout of the node stats request caused by too many segment files: the
collector spends so long waiting for the file system stats to be calculated
that the request eventually times out. A single `node_stats` document is
created per collection. This is collected per node to help discover issues with
nodes communicating with each other, but not with the monitoring cluster (for
example, intermittent network issues or memory pressure).
|=======================

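For reference, the REST endpoints that roughly correspond to these collectors
can be queried directly. This is an illustrative sketch only; the collectors
use equivalent internal APIs rather than issuing REST requests, and the exact
{ml} endpoint varies by version:

[source,console]
----
# Cluster Stats collector
GET /_cluster/stats
# Index Stats collector
GET /_stats
# Index Recovery collector
GET /_recovery
# Shards collector (reads the routing table from the local cluster state)
GET /_cluster/state/routing_table
# Jobs collector
GET /_xpack/ml/anomaly_detectors/_stats
# Node Stats collector
GET /_nodes/_local/stats
----
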
{monitoring} uses a single-threaded scheduler to run the collection of {es}
monitoring data by all of the appropriate collectors on each node. This
scheduler is managed locally by each node and its interval is controlled by the
`xpack.monitoring.collection.interval` setting, which defaults to 10 seconds
(`10s`) and can be set at either the node or cluster level.

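For example, assuming a version in which this setting can be updated
dynamically (otherwise it must be set in `elasticsearch.yml` on each node), the
interval can be changed cluster-wide via the cluster settings API. The `30s`
value here is arbitrary:

[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.interval": "30s"
  }
}
----
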
Fundamentally, each collector works on the same principle. Per collection
interval, each collector is checked to see whether it should run, and then the
appropriate collectors run. The failure of an individual collector does not
impact any other collector.

Once collection has completed, all of the monitoring data is passed to the
exporters, which route it to the monitoring clusters.

If gaps exist in the monitoring charts in {kib}, it is typically because either
a collector failed or the monitoring cluster did not receive the data (for
example, because it was being restarted). If a collector fails, a logged error
should exist on the node that attempted to perform the collection.

NOTE: Collection is currently done serially, rather than in parallel, to avoid
extra overhead on the elected master node. The downside to this approach is
that collectors might observe a different version of the cluster state within
the same collection period. In practice, this does not make a significant
difference and running the collectors in parallel would not prevent such a
possibility.

For more information about the configuration options for the collectors, see
<<monitoring-collection-settings>>.

[float]
[[es-monitoring-stack]]
=== Collecting data from across the Elastic Stack

{monitoring} in {es} also receives monitoring data from other parts of the
Elastic Stack. In this way, it serves as an unscheduled monitoring data
collector for the stack.

By default, data collection is disabled. {es} monitoring data is not collected
and all monitoring data from other sources, such as {kib}, Beats, and Logstash,
is ignored. You must set `xpack.monitoring.collection.enabled` to `true` to
enable the collection of monitoring data. See <<monitoring-settings>>.

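For example, in versions where this setting is dynamic, collection can be
enabled with the cluster settings API:

[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}
----
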
Once data is received, it is forwarded to the exporters to be routed to the
monitoring cluster like all monitoring data.

WARNING: Because this stack-level "collector" lives outside of the collection
interval of {monitoring} for {es}, it is not impacted by the
`xpack.monitoring.collection.interval` setting. Therefore, data is passed to
the exporters whenever it is received. This behavior can result in indices for
{kib}, Logstash, or Beats being created somewhat unexpectedly.

As the monitoring data is collected and processed, some production cluster
metadata is added to incoming documents. This metadata enables {kib} to link
the monitoring data to the appropriate cluster. If this linkage is unimportant
to the infrastructure that you're monitoring, it might be simpler to configure
Logstash and Beats to report monitoring data directly to the monitoring
cluster. This scenario also prevents the production cluster from adding extra
overhead related to monitoring data, which can be very useful when there are a
large number of Logstash nodes or Beats.

For more information about typical monitoring architectures, see
{xpack-ref}/how-monitoring-works.html[How Monitoring Works].