[[search-aggregations-bucket-variablewidthhistogram-aggregation]]
=== Variable width histogram aggregation
++++
<titleabbrev>Variable width histogram</titleabbrev>
++++

This is a multi-bucket aggregation similar to <<search-aggregations-bucket-histogram-aggregation>>.
However, the width of each bucket is not specified. Rather, a target number of buckets is provided and bucket intervals
are dynamically determined based on the document distribution. This is done using a simple one-pass document clustering algorithm
that aims to obtain low distances between bucket centroids. Unlike other multi-bucket aggregations, the intervals will not
necessarily have a uniform width.

TIP: The number of buckets returned will always be less than or equal to the target number.

Requesting a target of 2 buckets:
[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 2
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:sales]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "prices": {
      "buckets": [
        {
          "min": 10.0,
          "key": 30.0,
          "max": 50.0,
          "doc_count": 2
        },
        {
          "min": 150.0,
          "key": 185.0,
          "max": 200.0,
          "doc_count": 5
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

IMPORTANT: This aggregation cannot currently be nested under any aggregation that collects from more than a single bucket.

==== Clustering Algorithm

Each shard fetches the first `initial_buffer` documents and stores them in memory. Once the buffer is full, these documents
are sorted and linearly separated into `3/4 * shard_size` buckets.
Next, each remaining document is either collected into the nearest bucket, or placed into a new bucket if it is distant
from all the existing ones. At most `shard_size` total buckets are created.

In the reduce step, the coordinating node sorts the buckets from all shards by their centroids. Then, the two buckets
with the nearest centroids are repeatedly merged until the target number of buckets is achieved.
This merging procedure is a form of {wikipedia}/Hierarchical_clustering[agglomerative hierarchical clustering].

TIP: A shard can return fewer than `shard_size` buckets, but it cannot return more.
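
To make the two phases concrete, here is a minimal Python sketch of the shard-level
clustering and the reduce-step merging. It illustrates the shape of the algorithm, not
Elasticsearch's actual implementation: in particular, the `threshold` distance test is a
stand-in for the implementation's internal criterion for deciding that a value is distant
from every existing bucket.

[source,python]
--------------------------------------------------
from dataclasses import dataclass

@dataclass
class Bucket:
    centroid: float
    doc_count: int
    min: float
    max: float

def make_bucket(sorted_values):
    return Bucket(sum(sorted_values) / len(sorted_values),
                  len(sorted_values), sorted_values[0], sorted_values[-1])

def add_doc(bucket, value):
    # Fold one more document into the running centroid and bounds.
    bucket.doc_count += 1
    bucket.centroid += (value - bucket.centroid) / bucket.doc_count
    bucket.min = min(bucket.min, value)
    bucket.max = max(bucket.max, value)

def shard_collect(values, shard_size, initial_buffer, threshold):
    # Phase 1: buffer the first `initial_buffer` documents, sort them,
    # and separate them linearly into 3/4 * shard_size buckets.
    buffered = sorted(values[:initial_buffer])
    num_initial = max(1, 3 * shard_size // 4)
    step = max(1, -(-len(buffered) // num_initial))  # ceiling division
    buckets = [make_bucket(buffered[i:i + step])
               for i in range(0, len(buffered), step)]
    # Phase 2: stream each remaining document into the nearest bucket,
    # or open a new bucket for a distant value, capped at shard_size.
    for value in values[initial_buffer:]:
        nearest = min(buckets, key=lambda b: abs(b.centroid - value))
        if abs(nearest.centroid - value) > threshold and len(buckets) < shard_size:
            buckets.append(Bucket(value, 1, value, value))
        else:
            add_doc(nearest, value)
    return buckets

def merge(a, b):
    n = a.doc_count + b.doc_count
    return Bucket((a.centroid * a.doc_count + b.centroid * b.doc_count) / n,
                  n, min(a.min, b.min), max(a.max, b.max))

def reduce_buckets(shard_buckets, target):
    # Reduce step: sort all shard buckets by centroid, then repeatedly
    # merge the adjacent pair with the nearest centroids until only
    # `target` buckets remain (agglomerative hierarchical clustering).
    buckets = sorted(shard_buckets, key=lambda b: b.centroid)
    while len(buckets) > target:
        i = min(range(len(buckets) - 1),
                key=lambda j: buckets[j + 1].centroid - buckets[j].centroid)
        buckets[i:i + 2] = [merge(buckets[i], buckets[i + 1])]
    return buckets
--------------------------------------------------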

==== Shard size

The `shard_size` parameter specifies the number of buckets that the coordinating node will request from each shard.
A higher `shard_size` leads each shard to produce smaller buckets. This reduces the likelihood of buckets overlapping
after the reduction step. Increasing the `shard_size` will improve the accuracy of the histogram, but it will
also make it more expensive to compute the final result because bigger priority queues will have to be managed on a
shard level, and the data transfers between the nodes and the client will be larger.

TIP: Parameters `buckets`, `shard_size`, and `initial_buffer` are optional. By default, `buckets = 10`, `shard_size = buckets * 50`, and `initial_buffer = min(10 * shard_size, 50000)`.
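
For example, all three parameters can be set explicitly. The values below simply spell out
the defaults implied by a target of 5 buckets (`shard_size = 5 * 50 = 250` and
`initial_buffer = min(10 * 250, 50000) = 2500`):

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 5,
        "shard_size": 250,
        "initial_buffer": 2500
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:sales]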

==== Initial Buffer

The `initial_buffer` parameter can be used to specify the number of individual documents that will be stored in memory
on a shard before the initial bucketing algorithm is run. Bucket distribution is determined using this sample
of `initial_buffer` documents. So, although a higher `initial_buffer` will use more memory, it will lead to more representative
clusters.

==== Bucket bounds are approximate

During the reduce step, the coordinating node continuously merges the two buckets with the nearest centroids. If two buckets have
overlapping bounds but distant centroids, then it is possible that they will not be merged. Because of this, after
reduction the maximum value in some interval (`max`) might be greater than the minimum value in the subsequent
bucket (`min`). To reduce the impact of this error, when such an overlap occurs the bound between these intervals is adjusted to be `(max + min) / 2`.
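
Continuing the Python sketch above, the adjustment amounts to clipping both bounds to
their midpoint:

[source,python]
--------------------------------------------------
def fix_overlap(left, right):
    # If two adjacent reduced buckets overlap, move both bounds to the
    # midpoint: e.g. left.max = 110 and right.min = 90 both become 100.
    if left.max > right.min:
        boundary = (left.max + right.min) / 2
        left.max = boundary
        right.min = boundary
--------------------------------------------------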

TIP: Bucket bounds are very sensitive to outliers.