123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349 |
- [[search-aggregations-bucket-histogram-aggregation]]
- === Histogram Aggregation
- A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
- It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
- that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5`
- (in case of price it may represent $5). When the aggregation executes, the price field of every document will be
- evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5`
- then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`.
- To make this more formal, here is the rounding function that is used:
- [source,java]
- --------------------------------------------------
- rem = value % interval
- if (rem < 0) {
- rem += interval
- }
- bucket_key = value - rem
- --------------------------------------------------
- From the rounding function above it can be seen that the intervals themselves **must** be integers.
- WARNING: Currently, values are cast to integers before being bucketed, which
- might cause negative floating-point values to fall into the wrong bucket. For
- instance, `-4.5` with an interval of `2` would be cast to `-4`, and so would
- end up in the `-4 <= val < -2` bucket instead of the `-6 <= val < -4` bucket.
- The following snippet "buckets" the products based on their `price` by interval of `50`:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50
- }
- }
- }
- }
- --------------------------------------------------
- And the following may be the response:
- [source,js]
- --------------------------------------------------
- {
- "aggregations": {
- "prices" : {
- "buckets": [
- {
- "key": 0,
- "doc_count": 2
- },
- {
- "key": 50,
- "doc_count": 4
- },
- {
- "key": 100,
- "doc_count": 0
- },
- {
- "key": 150,
- "doc_count": 3
- }
- ]
- }
- }
- }
- --------------------------------------------------
- ==== Minimum document count
- The response above show that no documents has a price that falls within the range of `[100 - 150)`. By default the
- response will fill gaps in the histogram with empty buckets. It is possible change that and request buckets with
- a higher minimum count thanks to the `min_doc_count` setting:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50,
- "min_doc_count" : 1
- }
- }
- }
- }
- --------------------------------------------------
- Response:
- [source,js]
- --------------------------------------------------
- {
- "aggregations": {
- "prices" : {
- "buckets": [
- {
- "key": 0,
- "doc_count": 2
- },
- {
- "key": 50,
- "doc_count": 4
- },
- {
- "key": 150,
- "doc_count": 3
- }
- ]
- }
- }
- }
- --------------------------------------------------
- [[search-aggregations-bucket-histogram-aggregation-extended-bounds]]
- By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with
- the smallest values (on which with histogram) will determine the min bucket (the bucket with the smallest key) and the
- documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when when
- requesting empty buckets, this causes a confusion, specifically, when the data is also filtered.
- To understand why, let's look at an example:
- Lets say the you're filtering your request to get all docs with values between `0` and `500`, in addition you'd like
- to slice the data per price using a histogram with an interval of `50`. You also specify `"min_doc_count" : 0` as you'd
- like to get all buckets even the empty ones. If it happens that all products (documents) have prices higher than `100`,
- the first bucket you'll get will be the one with `100` as its key. This is confusing, as many times, you'd also like
- to get those buckets between `0 - 100`.
- With `extended_bounds` setting, you now can "force" the histogram aggregation to start building buckets on a specific
- `min` values and also keep on building buckets up to a `max` value (even if there are no documents anymore). Using
- `extended_bounds` only makes sense when `min_doc_count` is 0 (the empty buckets will never be returned if `min_doc_count`
- is greater than 0).
- Note that (as the name suggest) `extended_bounds` is **not** filtering buckets. Meaning, if the `extended_bounds.min` is higher
- than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the
- same goes for the `extended_bounds.max` and the last bucket). For filtering buckets, one should nest the histogram aggregation
- under a range `filter` aggregation with the appropriate `from`/`to` settings.
- Example:
- [source,js]
- --------------------------------------------------
- {
- "query" : {
- "constant_score" : { "filter": { "range" : { "price" : { "to" : "500" } } } }
- },
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50,
- "extended_bounds" : {
- "min" : 0,
- "max" : 500
- }
- }
- }
- }
- }
- --------------------------------------------------
- ==== Order
- By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled
- using the `order` setting.
- Ordering the buckets by their key - descending:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50,
- "order" : { "_key" : "desc" }
- }
- }
- }
- }
- --------------------------------------------------
- Ordering the buckets by their `doc_count` - ascending:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50,
- "order" : { "_count" : "asc" }
- }
- }
- }
- }
- --------------------------------------------------
- If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50,
- "order" : { "price_stats.min" : "asc" } <1>
- },
- "aggs" : {
- "price_stats" : { "stats" : {} } <2>
- }
- }
- }
- }
- --------------------------------------------------
- <1> The `{ "price_stats.min" : asc" }` will sort the buckets based on `min` value of their `price_stats` sub-aggregation.
- <2> There is no need to configure the `price` field for the `price_stats` aggregation as it will inherit it by default from its parent histogram aggregation.
- It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long
- as the aggregations path are of a single-bucket type, where the last aggregation in the path may either by a single-bucket
- one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. `doc_count`),
- in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of
- a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
- The path must be defined in the following form:
- --------------------------------------------------
- AGG_SEPARATOR := '>'
- METRIC_SEPARATOR := '.'
- AGG_NAME := <the name of the aggregation>
- METRIC := <the name of the metric (in case of multi-value metrics aggregation)>
- PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
- --------------------------------------------------
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50,
- "order" : { "promoted_products>rating_stats.avg" : "desc" } <1>
- },
- "aggs" : {
- "promoted_products" : {
- "filter" : { "term" : { "promoted" : true }},
- "aggs" : {
- "rating_stats" : { "stats" : { "field" : "rating" }}
- }
- }
- }
- }
- }
- }
- --------------------------------------------------
- The above will sort the buckets based on the avg rating among the promoted products
- ==== Offset
- By default the bucket keys start with 0 and then continue in even spaced steps of `interval`, e.g. if the interval is 10 the first buckets
- (assuming there is data inside them) will be [0 - 9], [10-19], [20-29]. The bucket boundaries can be shifted by using the `offset` option.
- This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval `10` will result in
- two buckets with 5 documents each. If an additional offset `5` is used, there will be only one single bucket [5-14] containing all the 10
- documents.
- ==== Response Format
- By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash
- instead keyed by the buckets keys:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "prices" : {
- "histogram" : {
- "field" : "price",
- "interval" : 50,
- "keyed" : true
- }
- }
- }
- }
- --------------------------------------------------
- Response:
- [source,js]
- --------------------------------------------------
- {
- "aggregations": {
- "prices": {
- "buckets": {
- "0": {
- "key": 0,
- "doc_count": 2
- },
- "50": {
- "key": 50,
- "doc_count": 4
- },
- "150": {
- "key": 150,
- "doc_count": 3
- }
- }
- }
- }
- }
- --------------------------------------------------
- ==== Missing value
- The `missing` parameter defines how documents that are missing a value should be treated.
- By default they will be ignored but it is also possible to treat them as if they
- had a value.
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "quantity" : {
- "histogram" : {
- "field" : "quantity",
- "interval": 10,
- "missing": 0 <1>
- }
- }
- }
- }
- --------------------------------------------------
- <1> Documents without a value in the `quantity` field will fall into the same bucket as documents that have the value `0`.
|