123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190 |
- [[search-aggregations-bucket-range-field-note]]
- === Subtleties of bucketing range fields
- ==== Documents are counted for each bucket they land in
- Since a range represents multiple values, running a bucket aggregation over a
- range field can result in the same document landing in multiple buckets. This
- can lead to surprising behavior, such as the sum of bucket counts being higher
- than the number of matched documents. For example, consider the following
- index:
- [source, console]
- --------------------------------------------------
- PUT range_index
- {
- "settings": {
- "number_of_shards": 2
- },
- "mappings": {
- "properties": {
- "expected_attendees": {
- "type": "integer_range"
- },
- "time_frame": {
- "type": "date_range",
- "format": "yyyy-MM-dd||epoch_millis"
- }
- }
- }
- }
- PUT range_index/_doc/1?refresh
- {
- "expected_attendees" : {
- "gte" : 10,
- "lte" : 20
- },
- "time_frame" : {
- "gte" : "2019-10-28",
- "lte" : "2019-11-04"
- }
- }
- --------------------------------------------------
- // TESTSETUP
- The range is wider than the interval in the following aggregation, and thus the
- document will land in multiple buckets.
- [source, console,id=range-field-aggregation-example]
- --------------------------------------------------
- POST /range_index/_search?size=0
- {
- "aggs": {
- "range_histo": {
- "histogram": {
- "field": "expected_attendees",
- "interval": 5
- }
- }
- }
- }
- --------------------------------------------------
- Since the interval is `5` (and the offset is `0` by default), we expect buckets `10`,
- `15`, and `20`. Our range document will fall in all three of these buckets.
- [source, console-result]
- --------------------------------------------------
- {
- ...
- "aggregations" : {
- "range_histo" : {
- "buckets" : [
- {
- "key" : 10.0,
- "doc_count" : 1
- },
- {
- "key" : 15.0,
- "doc_count" : 1
- },
- {
- "key" : 20.0,
- "doc_count" : 1
- }
- ]
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- A document cannot exist partially in a bucket; For example, the above document
- cannot count as one-third in each of the above three buckets. In this example,
- since the document's range landed in multiple buckets, the full value of that
- document would also be counted in any sub-aggregations for each bucket as well.
- ==== Query bounds are not aggregation filters
- Another unexpected behavior can arise when a query is used to filter on the
- field being aggregated. In this case, a document could match the query but
- still have one or both of the endpoints of the range outside the query.
- Consider the following aggregation on the above document:
- [source, console,id=range-field-aggregation-query-bounds-example]
- --------------------------------------------------
- POST /range_index/_search?size=0
- {
- "query": {
- "range": {
- "time_frame": {
- "gte": "2019-11-01",
- "format": "yyyy-MM-dd"
- }
- }
- },
- "aggs": {
- "november_data": {
- "date_histogram": {
- "field": "time_frame",
- "calendar_interval": "day",
- "format": "yyyy-MM-dd"
- }
- }
- }
- }
- --------------------------------------------------
- Even though the query only considers days in November, the aggregation
- generates 8 buckets (4 in October, 4 in November) because the aggregation is
- calculated over the ranges of all matching documents.
- [source, console-result]
- --------------------------------------------------
- {
- ...
- "aggregations" : {
- "november_data" : {
- "buckets" : [
- {
- "key_as_string" : "2019-10-28",
- "key" : 1572220800000,
- "doc_count" : 1
- },
- {
- "key_as_string" : "2019-10-29",
- "key" : 1572307200000,
- "doc_count" : 1
- },
- {
- "key_as_string" : "2019-10-30",
- "key" : 1572393600000,
- "doc_count" : 1
- },
- {
- "key_as_string" : "2019-10-31",
- "key" : 1572480000000,
- "doc_count" : 1
- },
- {
- "key_as_string" : "2019-11-01",
- "key" : 1572566400000,
- "doc_count" : 1
- },
- {
- "key_as_string" : "2019-11-02",
- "key" : 1572652800000,
- "doc_count" : 1
- },
- {
- "key_as_string" : "2019-11-03",
- "key" : 1572739200000,
- "doc_count" : 1
- },
- {
- "key_as_string" : "2019-11-04",
- "key" : 1572825600000,
- "doc_count" : 1
- }
- ]
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- Depending on the use case, a `CONTAINS` query could limit the documents to only
- those that fall entirely in the queried range. In this example, the one
- document would not be included and the aggregation would be empty. Filtering
- the buckets after the aggregation is also an option, for use cases where the
- document should be counted but the out of bounds data can be safely ignored.
|