Documentation notes for Range field histograms (#46890)

Mark Tozzi 6 years ago
parent
commit
57a679fbbb

+ 1 - 0
docs/reference/aggregations/bucket.asciidoc

@@ -67,3 +67,4 @@ include::bucket/significanttext-aggregation.asciidoc[]
 
 include::bucket/terms-aggregation.asciidoc[]
 
+include::bucket/range-field-note.asciidoc[]

+ 1 - 1
docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc

@@ -3,7 +3,7 @@
 
 This multi-bucket aggregation is similar to the normal
 <<search-aggregations-bucket-histogram-aggregation,histogram>>, but it can
-only be used with date values. Because dates are represented internally in 
+only be used with date or date range values. Because dates are represented internally in 
 Elasticsearch as long values, it is possible, but not as accurate, to use the
 normal `histogram` on dates as well. The main difference in the two APIs is
 that here the interval can be specified using date/time expressions. Time-based

+ 19 - 6
docs/reference/aggregations/bucket/histogram-aggregation.asciidoc

@@ -1,12 +1,13 @@
 [[search-aggregations-bucket-histogram-aggregation]]
 === Histogram Aggregation
 
-A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
-It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
-that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5`
-(in case of price it may represent $5). When the aggregation executes, the price field of every document will be
-evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5`
-then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`.
+A multi-bucket values source based aggregation that can be applied on numeric values or numeric range values extracted
+from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the
+documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with
+interval `5` (in case of price it may represent $5). When the aggregation executes, the price field of every document
+will be evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size
+is `5` then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the
+key `30`.
 To make this more formal, here is the rounding function that is used:
 
 [source,java]
@@ -14,6 +15,10 @@ To make this more formal, here is the rounding function that is used:
 bucket_key = Math.floor((value - offset) / interval) * interval + offset
 --------------------------------------------------
 
+For range values, a document can fall into multiple buckets. The first bucket is computed from the lower
+bound of the range in the same way as a bucket for a single value is computed.  The final bucket is computed in the same
+way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
+
 The `interval` must be a positive decimal, while the `offset` must be a decimal in `[0, interval)`
 (a decimal greater than or equal to `0` and less than `interval`)
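
The range rule added in this hunk can be sketched in Java, the language the docs already use for the rounding function. This is an illustration only, not Elasticsearch's implementation: the class `RangeBuckets` and its method names are hypothetical, and the sketch assumes a range that is inclusive at both ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Elasticsearch.
class RangeBuckets {
    // The rounding function quoted in the histogram docs for a single value.
    static double bucketKey(double value, double interval, double offset) {
        return Math.floor((value - offset) / interval) * interval + offset;
    }

    // Every bucket key that a range [lower, upper] is counted in.
    static List<Double> bucketKeys(double lower, double upper, double interval, double offset) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (lower > upper) {
            throw new IllegalArgumentException("lower must not exceed upper");
        }
        double first = bucketKey(lower, interval, offset); // bucket of the lower bound
        double last = bucketKey(upper, interval, offset);  // bucket of the upper bound
        long steps = Math.round((last - first) / interval);
        List<Double> keys = new ArrayList<>();
        for (long i = 0; i <= steps; i++) {
            keys.add(first + i * interval); // first, last, and everything in between
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(bucketKeys(10, 20, 5, 0));   // prints [10.0, 15.0, 20.0]
        System.out.println(bucketKeys(50, 150, 50, 0)); // prints [50.0, 100.0, 150.0]
    }
}
```

The two calls in `main` reproduce the examples used in this change: a range of 10 to 20 with interval `5`, and a range of 50 to 150 with interval `50`, each landing in three buckets.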
 
@@ -175,6 +180,14 @@ POST /sales/_search?size=0
 --------------------------------------------------
 // TEST[setup:sales]
 
+When aggregating ranges, buckets are based on the values of the returned documents.  This means the response may include
+buckets outside of a query's range. For example, if your query looks for values greater than 100, and you have a range
+covering 50 to 150, and an interval of 50, that document will land in 3 buckets - 50, 100, and 150. In general, it's
+best to think of the query and aggregation steps as independent - the query selects a set of documents, and then the
+aggregation buckets those documents without regard to how they were selected.
+See <<search-aggregations-bucket-range-field-note,note on bucketing range
+fields>> for more information and an example.
+
 ==== Order
 
 By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled using

+ 181 - 0
docs/reference/aggregations/bucket/range-field-note.asciidoc

@@ -0,0 +1,181 @@
+[[search-aggregations-bucket-range-field-note]]
+=== Subtleties of bucketing range fields
+
+==== Documents are counted for each bucket they land in
+
+Since a range represents multiple values, running a bucket aggregation over a
+range field can result in the same document landing in multiple buckets. This
+can lead to surprising behavior, such as the sum of bucket counts being higher
+than the number of matched documents.  For example, consider the following
+index: 
+[source, console]
+--------------------------------------------------
+PUT range_index
+{
+  "settings": {
+    "number_of_shards": 2
+  },
+  "mappings": {
+    "properties": {
+      "expected_attendees": {
+        "type": "integer_range"
+      },
+      "time_frame": {
+        "type": "date_range",
+        "format": "yyyy-MM-dd||epoch_millis"
+      }
+    }
+  }
+}
+
+PUT range_index/_doc/1?refresh
+{
+  "expected_attendees" : {
+    "gte" : 10,
+    "lte" : 20
+  },
+  "time_frame" : {
+    "gte" : "2019-10-28",
+    "lte" : "2019-11-04"
+  }
+}
+--------------------------------------------------
+// TESTSETUP
+
+The range is wider than the interval in the following aggregation, and thus the
+document will land in multiple buckets.
+
+[source, console]
+--------------------------------------------------
+POST /range_index/_search?size=0
+{
+    "aggs" : {
+        "range_histo" : {
+            "histogram" : {
+                "field" : "expected_attendees",
+                "interval" : 5
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+Since the interval is `5` (and the offset is `0` by default), we expect buckets `10`,
+`15`, and `20`. Our range document will fall in all three of these buckets.
+
+[source, console-result]
+--------------------------------------------------
+{
+  ...
+  "aggregations" : {
+    "range_histo" : {
+      "buckets" : [
+        {
+          "key" : 10.0,
+          "doc_count" : 1
+        },
+        {
+          "key" : 15.0,
+          "doc_count" : 1
+        },
+        {
+          "key" : 20.0,
+          "doc_count" : 1
+        }
+      ]
+    }
+  }
+}
+--------------------------------------------------
+// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
+
+A document cannot exist partially in a bucket; for example, the above document
+cannot count as one-third in each of the above three buckets. In this example,
+since the document's range landed in multiple buckets, the full value of that
+document is also counted in any sub-aggregations for each bucket.
+
+==== Query bounds are not aggregation filters
+
+Another unexpected behavior can arise when a query is used to filter on the
+field being aggregated. In this case, a document could match the query but
+still have one or both of the endpoints of the range outside the query.
+Consider the following aggregation on the above document:
+
+[source, console]
+--------------------------------------------------
+POST /range_index/_search?size=0
+{
+    "query": {
+      "range": {
+        "time_frame": {
+          "gte": "2019-11-01",
+          "format": "yyyy-MM-dd"
+        }
+      }
+    }, 
+    "aggs" : {
+        "november_data" : {
+            "date_histogram" : {
+                "field" : "time_frame",
+                "calendar_interval" : "day"
+              }
+        }
+    }
+}
+--------------------------------------------------
+
+Even though the query only considers days in November, the aggregation
+generates 8 buckets (4 in October, 4 in November) because the aggregation is
+calculated over the ranges of all matching documents.
+
+[source, console-result]
+--------------------------------------------------
+{
+  ...
+  "aggregations" : {
+    "november_data" : {
+      "buckets" : [
+        {
+          "key" : 1572220800000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572307200000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572393600000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572480000000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572566400000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572652800000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572739200000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572825600000,
+          "doc_count" : 1
+        }
+      ]
+    }
+  }
+}
+--------------------------------------------------
+// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
+
+Depending on the use case, a range query with `relation` set to `WITHIN` could
+limit the documents to only those whose ranges fall entirely within the queried
+range.  In this example, the one document would not be included and the
+aggregation would be empty.  Filtering the buckets after the aggregation is
+also an option, for use cases where the document should be counted but the
+out-of-bounds data can be safely ignored.
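
Filtering the buckets after the aggregation can be done on the client side. The Java sketch below is one possible approach under stated assumptions, not an Elasticsearch API: the class `BucketPostFilter` and its methods are hypothetical, bucket keys are taken to be UTC-midnight epoch milliseconds as in the response above, and the keys are regenerated locally instead of being parsed from a real response.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical client-side helper, not part of Elasticsearch.
class BucketPostFilter {
    // Epoch milliseconds at 00:00 UTC of the given day.
    static long utcMidnightMillis(LocalDate day) {
        return day.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    // Daily bucket keys that a date range [from, to] lands in (both ends inclusive).
    static List<Long> dailyKeys(LocalDate from, LocalDate to) {
        List<Long> keys = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            keys.add(utcMidnightMillis(d));
        }
        return keys;
    }

    // Keep only the buckets whose key is on or after the query's lower bound.
    static List<Long> keepFrom(List<Long> keys, LocalDate lowerBound) {
        long min = utcMidnightMillis(lowerBound);
        return keys.stream().filter(k -> k >= min).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> all = dailyKeys(LocalDate.of(2019, 10, 28), LocalDate.of(2019, 11, 4));
        List<Long> november = keepFrom(all, LocalDate.of(2019, 11, 1));
        // prints "8 buckets before, 4 after"
        System.out.println(all.size() + " buckets before, " + november.size() + " after");
    }
}
```

The document is still counted once in each remaining bucket; only the October buckets, which lie outside the query bound, are dropped.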