- [[search-aggregations-bucket-histogram-aggregation]]
- === Histogram aggregation
- ++++
- <titleabbrev>Histogram</titleabbrev>
- ++++
- A multi-bucket, values-source-based aggregation that can be applied to numeric values or numeric range values extracted
- from the documents. It dynamically builds fixed-size (a.k.a. interval) buckets over the values. For example, if the
- documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with
- interval `5` (in the case of price it may represent $5). When the aggregation executes, the price field of every document
- is evaluated and rounded down to its closest bucket. For example, if the price is `32` and the bucket size
- is `5`, then the rounding yields `30`, and the document "falls" into the bucket that is associated with the
- key `30`.
- To make this more formal, here is the rounding function that is used:
- [source,java]
- --------------------------------------------------
- bucket_key = Math.floor((value - offset) / interval) * interval + offset
- --------------------------------------------------
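- As a minimal sketch, the rounding function above can be tried outside Elasticsearch in a few lines of Python. The helper name `bucket_key` is hypothetical, and the argument checks simply mirror the stated constraints on `interval` and `offset`; this is not the server's implementation.

```python
import math


def bucket_key(value, interval, offset=0):
    """Round `value` down to the key of the bucket that contains it."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not 0 <= offset < interval:
        raise ValueError("offset must be in [0, interval)")
    return math.floor((value - offset) / interval) * interval + offset


print(bucket_key(32, 5))     # 30
print(bucket_key(32, 5, 3))  # 28
print(bucket_key(-3, 5))     # -5 (floor rounds toward negative infinity)
```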
- For range values, a document can fall into multiple buckets. The first bucket is computed from the lower
- bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same
- way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
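- The range case can be sketched the same way. The function name `range_bucket_keys` is hypothetical, and the sketch assumes integer bounds and an integer `interval` so that stepping from the first key to the last is exact.

```python
import math


def range_bucket_keys(lower, upper, interval, offset=0):
    """Return the keys of every bucket that a [lower, upper] range value is counted in."""
    first = math.floor((lower - offset) / interval) * interval + offset
    last = math.floor((upper - offset) / interval) * interval + offset
    keys = []
    key = first
    while key <= last:  # the first bucket, the final bucket, and all in between
        keys.append(key)
        key += interval
    return keys


print(range_bucket_keys(50, 150, 50))  # [50, 100, 150]
print(range_bucket_keys(12, 14, 5))    # [10]
```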
- The `interval` must be a positive decimal, while the `offset` must be a decimal in `[0, interval)`
- (a decimal greater than or equal to `0` and less than `interval`).
- The following snippet "buckets" the products based on their `price` by interval of `50`:
- [source,console,id=histogram-aggregation-example]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "prices": {
- "histogram": {
- "field": "price",
- "interval": 50
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- And the following may be the response:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "prices": {
- "buckets": [
- {
- "key": 0.0,
- "doc_count": 1
- },
- {
- "key": 50.0,
- "doc_count": 1
- },
- {
- "key": 100.0,
- "doc_count": 0
- },
- {
- "key": 150.0,
- "doc_count": 2
- },
- {
- "key": 200.0,
- "doc_count": 3
- }
- ]
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- ==== Minimum document count
- The response above shows that no documents have a price that falls within the range of `[100, 150)`. By default the
- response will fill gaps in the histogram with empty buckets. It is possible to change that and request buckets with
- a higher minimum count thanks to the `min_doc_count` setting:
- [source,console,id=histogram-aggregation-min-doc-count-example]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "prices": {
- "histogram": {
- "field": "price",
- "interval": 50,
- "min_doc_count": 1
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- Response:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "prices": {
- "buckets": [
- {
- "key": 0.0,
- "doc_count": 1
- },
- {
- "key": 50.0,
- "doc_count": 1
- },
- {
- "key": 150.0,
- "doc_count": 2
- },
- {
- "key": 200.0,
- "doc_count": 3
- }
- ]
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
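- The effect of `min_doc_count` on the two responses above can be reproduced with a small sketch. It starts from the non-empty bucket counts shown in the responses; the helper name `apply_min_doc_count` is hypothetical and the code only illustrates the gap-filling and filtering steps, assuming integer keys.

```python
def apply_min_doc_count(counts, interval, min_doc_count=0):
    """Fill gaps between the smallest and largest key, then drop sparse buckets."""
    if not counts:
        return []
    keys = range(min(counts), max(counts) + interval, interval)
    buckets = [(key, counts.get(key, 0)) for key in keys]
    return [(key, n) for key, n in buckets if n >= min_doc_count]


# Non-empty buckets taken from the responses above.
counts = {0: 1, 50: 1, 150: 2, 200: 3}
print(apply_min_doc_count(counts, 50))     # includes the empty bucket (100, 0)
print(apply_min_doc_count(counts, 50, 1))  # the empty bucket is gone
```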
- [[search-aggregations-bucket-histogram-aggregation-extended-bounds]]
- By default the `histogram` returns all the buckets within the range of the data itself, that is, the documents with
- the smallest values (on which the histogram is computed) will determine the min bucket (the bucket with the smallest key) and the
- documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when
- requesting empty buckets, this causes confusion, specifically when the data is also filtered.
- To understand why, let's look at an example:
- Let's say you're filtering your request to get all docs with values between `0` and `500`, and in addition you'd like
- to slice the data per price using a histogram with an interval of `50`. You also specify `"min_doc_count" : 0` as you'd
- like to get all buckets, even the empty ones. If it happens that all products (documents) have prices higher than `100`,
- the first bucket you'll get will be the one with `100` as its key. This is confusing, as you'd often also like
- to get the buckets between `0` and `100`.
- With the `extended_bounds` setting, you can "force" the histogram aggregation to start building buckets on a specific
- `min` value and also keep on building buckets up to a `max` value (even if there are no documents anymore). Using
- `extended_bounds` only makes sense when `min_doc_count` is 0 (the empty buckets will never be returned if `min_doc_count`
- is greater than 0).
- Note that (as the name suggests) `extended_bounds` is **not** filtering buckets. This means that if the `extended_bounds.min` is higher
- than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the
- same goes for the `extended_bounds.max` and the last bucket). For filtering buckets, one should nest the histogram aggregation
- under a range `filter` aggregation with the appropriate `from`/`to` settings.
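- A rough sketch of this behaviour, assuming `min_doc_count` is 0 and integer values: the bounds only widen the range of returned keys and never drop a bucket that contains documents. The function name `histogram_with_extended_bounds` and the sample prices are hypothetical.

```python
import math
from collections import Counter


def histogram_with_extended_bounds(values, interval, extended_bounds=None):
    """Bucket integer values and fill gaps, optionally widening the key range."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    keys = list(counts)
    if extended_bounds is not None:
        # The bounds add candidate keys; they do not remove any.
        keys += [math.floor(b / interval) * interval for b in extended_bounds]
    if not keys:
        return []
    return [(k, counts[k]) for k in range(min(keys), max(keys) + interval, interval)]


prices = [150, 175, 320]  # hypothetical prices, all higher than 100
print(histogram_with_extended_bounds(prices, 50))             # starts at key 150
print(histogram_with_extended_bounds(prices, 50, (0, 500)))   # starts at 0, ends at 500
print(histogram_with_extended_bounds(prices, 50, (200, 250))) # same as without bounds
```

- The last call shows that bounds lying inside the data range have no effect, which is why `extended_bounds` cannot be used to filter buckets.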
- Example:
- [source,console,id=histogram-aggregation-extended-bounds-example]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "query": {
- "constant_score": { "filter": { "range": { "price": { "to": "500" } } } }
- },
- "aggs": {
- "prices": {
- "histogram": {
- "field": "price",
- "interval": 50,
- "extended_bounds": {
- "min": 0,
- "max": 500
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- When aggregating ranges, buckets are based on the values of the returned documents. This means the response may include
- buckets outside of a query's range. For example, if your query looks for values greater than 100, and you have a document with a range
- covering 50 to 150, and an interval of 50, that document will land in 3 buckets: 50, 100, and 150. In general, it's
- best to think of the query and aggregation steps as independent: the query selects a set of documents, and then the
- aggregation buckets those documents without regard to how they were selected.
- See <<search-aggregations-bucket-range-field-note,note on bucketing range
- fields>> for more information and an example.
- [[search-aggregations-bucket-histogram-aggregation-hard-bounds]]
- The `hard_bounds` setting is a counterpart of `extended_bounds` and can limit the range of buckets in the histogram. It is
- particularly useful in the case of open <<range, data ranges>> that can result in a very large number of buckets.
- Example:
- [source,console,id=histogram-aggregation-hard-bounds-example]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "query": {
- "constant_score": { "filter": { "range": { "price": { "to": "500" } } } }
- },
- "aggs": {
- "prices": {
- "histogram": {
- "field": "price",
- "interval": 50,
- "hard_bounds": {
- "min": 100,
- "max": 200
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- In this example, even though the range specified in the query is up to 500, the histogram will only have 2 buckets, starting at 100 and 150.
- All other buckets will be omitted even if documents that should go to these buckets are present in the results.
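- The limiting effect can be sketched as follows. The comparison used here (a bucket is kept when `min <= key < max`) is an assumption inferred from the example above, which keeps only the buckets starting at 100 and 150; the function name and the sample prices are hypothetical.

```python
import math
from collections import Counter


def histogram_with_hard_bounds(values, interval, hard_min, hard_max):
    """Count values per bucket, discarding buckets whose key is outside the bounds."""
    counts = Counter()
    for v in values:
        key = math.floor(v / interval) * interval
        # Assumption: keep a bucket only when hard_min <= key < hard_max.
        if hard_min <= key < hard_max:
            counts[key] += 1
    return dict(sorted(counts.items()))


prices = [10, 50, 120, 150, 175, 200]  # hypothetical prices
print(histogram_with_hard_bounds(prices, 50, 100, 200))  # {100: 1, 150: 2}
```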
- ==== Order
- By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled using
- the `order` setting. Supports the same `order` functionality as the <<search-aggregations-bucket-terms-aggregation-order,`Terms Aggregation`>>.
- ==== Offset
- By default the bucket keys start with 0 and then continue in evenly spaced steps
- of `interval`, e.g. if the interval is `10`, the first three buckets (assuming
- there is data inside them) will be `[0, 10)`, `[10, 20)`, `[20, 30)`. The bucket
- boundaries can be shifted by using the `offset` option.
- This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval `10` will result in
- two buckets with 5 documents each. If an additional offset `5` is used, there will be only one single bucket `[5, 15)` containing all the 10
- documents.
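- The example with 10 documents can be checked with a short sketch that applies the rounding function to each value and counts the resulting keys. The helper name `bucket_counts` is hypothetical.

```python
import math
from collections import Counter


def bucket_counts(values, interval, offset=0):
    """Map each value to its bucket key and count the documents per key."""
    keys = (math.floor((v - offset) / interval) * interval + offset for v in values)
    return dict(sorted(Counter(keys).items()))


values = range(5, 15)  # ten documents with values 5, 6, ..., 14
print(bucket_counts(values, 10))     # {0: 5, 10: 5}
print(bucket_counts(values, 10, 5))  # {5: 10}
```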
- ==== Response Format
- By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash
- keyed by the bucket keys instead:
- [source,console,id=histogram-aggregation-keyed-example]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "prices": {
- "histogram": {
- "field": "price",
- "interval": 50,
- "keyed": true
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- Response:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "prices": {
- "buckets": {
- "0.0": {
- "key": 0.0,
- "doc_count": 1
- },
- "50.0": {
- "key": 50.0,
- "doc_count": 1
- },
- "100.0": {
- "key": 100.0,
- "doc_count": 0
- },
- "150.0": {
- "key": 150.0,
- "doc_count": 2
- },
- "200.0": {
- "key": 200.0,
- "doc_count": 3
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- ==== Missing value
- The `missing` parameter defines how documents that are missing a value should be treated.
- By default they will be ignored but it is also possible to treat them as if they
- had a value.
- [source,console,id=histogram-aggregation-missing-value-example]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "quantity": {
- "histogram": {
- "field": "quantity",
- "interval": 10,
- "missing": 0 <1>
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- <1> Documents without a value in the `quantity` field will fall into the same bucket as documents that have the value `0`.
- [[search-aggregations-bucket-histogram-aggregation-histogram-fields]]
- ==== Histogram fields
- Running a histogram aggregation over histogram fields computes the total number of counts for each interval.
- For example, consider executing a histogram aggregation against the following index, which stores pre-aggregated histograms
- with latency metrics (in milliseconds) for different networks:
- [source,console]
- --------------------------------------------------
- PUT metrics_index/_doc/1
- {
- "network.name" : "net-1",
- "latency_histo" : {
- "values" : [1, 3, 8, 12, 15],
- "counts" : [3, 7, 23, 12, 6]
- }
- }
- PUT metrics_index/_doc/2
- {
- "network.name" : "net-2",
- "latency_histo" : {
- "values" : [1, 6, 8, 12, 14],
- "counts" : [8, 17, 8, 7, 6]
- }
- }
- POST /metrics_index/_search?size=0
- {
- "aggs": {
- "latency_buckets": {
- "histogram": {
- "field": "latency_histo",
- "interval": 5
- }
- }
- }
- }
- --------------------------------------------------
- The `histogram` aggregation will sum the counts of each interval computed based on the `values` and
- return the following output:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "latency_buckets": {
- "buckets": [
- {
- "key": 0.0,
- "doc_count": 18
- },
- {
- "key": 5.0,
- "doc_count": 48
- },
- {
- "key": 10.0,
- "doc_count": 25
- },
- {
- "key": 15.0,
- "doc_count": 6
- }
- ]
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[skip:test not setup]
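- The totals in this response can be verified by hand or with a short sketch that sums the `counts` of every `values` entry falling into the same interval. The function name `histogram_over_histogram_fields` is hypothetical; the data is copied from the two documents indexed above.

```python
import math
from collections import Counter


def histogram_over_histogram_fields(docs, interval):
    """Sum the pre-aggregated counts of all values that share a bucket key."""
    totals = Counter()
    for doc in docs:
        for value, count in zip(doc["values"], doc["counts"]):
            totals[math.floor(value / interval) * interval] += count
    return dict(sorted(totals.items()))


docs = [
    {"values": [1, 3, 8, 12, 15], "counts": [3, 7, 23, 12, 6]},  # net-1
    {"values": [1, 6, 8, 12, 14], "counts": [8, 17, 8, 7, 6]},   # net-2
]
print(histogram_over_histogram_fields(docs, 5))  # {0: 18, 5: 48, 10: 25, 15: 6}
```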
- [IMPORTANT]
- ========
- The histogram aggregation is a bucket aggregation, which partitions documents into buckets rather than calculating metrics over fields like
- metrics aggregations do. Each bucket represents a collection of documents which sub-aggregations can run on.
- On the other hand, a histogram field is a pre-aggregated field representing multiple values inside a single field:
- buckets of numerical data and a count of items/documents for each bucket. This mismatch between the histogram aggregation's expected input
- (raw documents) and the histogram field (which provides summary information) limits the outcome of the aggregation
- to only the doc counts for each bucket.
- **Consequently, when executing a histogram aggregation over a histogram field, no sub-aggregations are allowed.**
- ========
- Also, when running a histogram aggregation over a histogram field, the `missing` parameter is not supported.