- [[search-aggregations-pipeline]]
- == Pipeline Aggregations
- coming[2.0.0]
- experimental[]
- Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding
- information to the output tree. There are many different types of pipeline aggregation, each computing different information from
- other aggregations, but these types can be broken down into two families:
- _Parent_::
- A family of pipeline aggregations that is provided with the output of its parent aggregation and is able
- to compute new buckets or new aggregations to add to existing buckets.
- _Sibling_::
- Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a
- new aggregation which will be at the same level as the sibling aggregation.
- Pipeline aggregations can reference the aggregations they need to perform their computation by using the `buckets_path`
- parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the
- <<bucket-path-syntax, `buckets_path` Syntax>> section below.
- Pipeline aggregations cannot have sub-aggregations, but depending on its type a pipeline aggregation can reference another
- pipeline aggregation in its `buckets_path`, allowing pipeline aggregations to be chained. For example, you can chain together
- two derivatives to calculate the second derivative (i.e. a derivative of a derivative).
- NOTE: Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation
- will be included in the final output.
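- The following sketch shows one way such a chain might look. The aggregation names (`sales_per_month`, `sales`, `sales_deriv`,
- `sales_2nd_deriv`) and the `date` and `price` fields are illustrative assumptions, not names required by the API:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "sales_per_month" : {
-             "date_histogram" : {
-                 "field" : "date",
-                 "interval" : "month"
-             },
-             "aggs": {
-                 "sales": {
-                     "sum": { "field": "price" }
-                 },
-                 "sales_deriv": {
-                     "derivative": { "buckets_path": "sales" } <1>
-                 },
-                 "sales_2nd_deriv": {
-                     "derivative": { "buckets_path": "sales_deriv" } <2>
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
- <1> The first derivative is computed from the `sales` metric
- <2> The second derivative references the first derivative rather than a metric
- Because pipeline aggregations only add to the output, the output of both `sales_deriv` and `sales_2nd_deriv` is included in the response.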
- [[bucket-path-syntax]]
- [float]
- === `buckets_path` Syntax
- Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the `buckets_path`
- parameter, which follows a specific format:
- --------------------------------------------------
- AGG_SEPARATOR := '>'
- METRIC_SEPARATOR := '.'
- AGG_NAME := <the name of the aggregation>
- METRIC := <the name of the metric (in case of multi-value metrics aggregation)>
- PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
- --------------------------------------------------
- For example, the path `"my_bucket>my_stats.avg"` points to the `avg` value in the `"my_stats"` metric, which is
- contained in the `"my_bucket"` bucket aggregation.
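- As a hedged illustration of the full syntax, the sketch below assumes that `my_bucket` is a `date_histogram` and `my_stats` is a
- `stats` aggregation. The field names `timestamp` and `price` and the aggregation name `max_of_avgs` are hypothetical. A
- `max_bucket` aggregation placed next to `my_bucket` uses both separators to reach the `avg` value of each bucket:
- [source,js]
- --------------------------------------------------
- {
-     "aggs": {
-         "my_bucket": {
-             "date_histogram": {
-                 "field": "timestamp",
-                 "interval": "day"
-             },
-             "aggs": {
-                 "my_stats": {
-                     "stats": { "field": "price" } <1>
-                 }
-             }
-         },
-         "max_of_avgs": {
-             "max_bucket": { "buckets_path": "my_bucket>my_stats.avg" } <2>
-         }
-     }
- }
- --------------------------------------------------
- <1> `stats` is a multi-value metric, so a `METRIC` suffix is needed to select one of its values
- <2> `>` steps from `my_bucket` into `my_stats`, and `.avg` selects the `avg` value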
- Paths are relative to the position of the pipeline aggregation; they are not absolute paths, and a path cannot go back "up" the
- aggregation tree. For example, this moving average is embedded inside a `date_histogram` and refers to a "sibling"
- metric `"the_sum"`:
- [source,js]
- --------------------------------------------------
- {
-     "my_date_histo":{
-         "date_histogram":{
-             "field":"timestamp",
-             "interval":"day"
-         },
-         "aggs":{
-             "the_sum":{
-                 "sum":{ "field": "lemmings" } <1>
-             },
-             "the_movavg":{
-                 "moving_avg":{ "buckets_path": "the_sum" } <2>
-             }
-         }
-     }
- }
- --------------------------------------------------
- <1> The metric is called `"the_sum"`
- <2> The `buckets_path` refers to the metric via a relative path `"the_sum"`
- `buckets_path` is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets
- instead of embedded "inside" them. For example, the `max_bucket` aggregation uses the `buckets_path` to specify
- a metric embedded inside a sibling aggregation:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "sales_per_month" : {
-             "date_histogram" : {
-                 "field" : "date",
-                 "interval" : "month"
-             },
-             "aggs": {
-                 "sales": {
-                     "sum": {
-                         "field": "price"
-                     }
-                 }
-             }
-         },
-         "max_monthly_sales": {
-             "max_bucket": {
-                 "buckets_path": "sales_per_month>sales" <1>
-             }
-         }
-     }
- }
- --------------------------------------------------
- <1> `buckets_path` instructs this `max_bucket` aggregation that we want the maximum value of the `sales` aggregation in the
- `sales_per_month` date histogram.
- [float]
- ==== Special Paths
- Instead of pathing to a metric, `buckets_path` can use a special `"_count"` path. This instructs
- the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document
- count of each bucket, instead of a specific metric:
- [source,js]
- --------------------------------------------------
- {
-     "my_date_histo":{
-         "date_histogram":{
-             "field":"timestamp",
-             "interval":"day"
-         },
-         "aggs":{
-             "the_movavg":{
-                 "moving_avg":{ "buckets_path": "_count" } <1>
-             }
-         }
-     }
- }
- --------------------------------------------------
- <1> By using `_count` instead of a metric name, we can calculate the moving average of document counts in the histogram
- [[gap-policy]]
- [float]
- === Dealing with gaps in the data
- Data in the real world is often noisy and sometimes contains *gaps* -- places where data simply doesn't exist. This can
- occur for a variety of reasons, the most common being:
- * Documents falling into a bucket do not contain a required field
- * There are no documents matching the query for one or more buckets
- * The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value
- Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the
- first value because there is no previous value, a HoltWinters moving average needs "warmup" data to begin calculating, etc.).
- Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
- data is encountered. All pipeline aggregations accept the `gap_policy` parameter. There are currently two gap policies
- to choose from:
- _skip_::
- This option treats missing data as if the bucket does not exist. It will skip the bucket and continue
- calculating using the next available value.
- _insert_zeros_::
- This option will replace missing values with a zero (`0`) and pipeline aggregation computation will
- proceed as normal.
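- As a sketch, a gap policy could be set on a `derivative` embedded in a daily histogram. The aggregation names `the_avg` and
- `the_deriv` are hypothetical, the `timestamp` and `lemmings` fields are assumed, and `insert_zeros` is chosen only for illustration:
- [source,js]
- --------------------------------------------------
- {
-     "my_date_histo":{
-         "date_histogram":{
-             "field":"timestamp",
-             "interval":"day"
-         },
-         "aggs":{
-             "the_avg":{
-                 "avg":{ "field": "lemmings" }
-             },
-             "the_deriv":{
-                 "derivative":{
-                     "buckets_path": "the_avg",
-                     "gap_policy": "insert_zeros" <1>
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
- <1> A bucket with no `the_avg` value is treated as `0` rather than skipped. Using `skip` here would ignore that bucket instead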
- include::pipeline/avg-bucket-aggregation.asciidoc[]
- include::pipeline/derivative-aggregation.asciidoc[]
- include::pipeline/max-bucket-aggregation.asciidoc[]
- include::pipeline/min-bucket-aggregation.asciidoc[]
- include::pipeline/sum-bucket-aggregation.asciidoc[]
- include::pipeline/movavg-aggregation.asciidoc[]
- include::pipeline/cumulative-sum-aggregation.asciidoc[]
- include::pipeline/bucket-script-aggregation.asciidoc[]
- include::pipeline/bucket-selector-aggregation.asciidoc[]