123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288 |
- [[search-aggregations-pipeline]]
- == Pipeline Aggregations
- Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding
- information to the output tree. There are many different types of pipeline aggregation, each computing different information from
- other aggregations, but these types can be broken down into two families:
- _Parent_::
- A family of pipeline aggregations that is provided with the output of its parent aggregation and is able
- to compute new buckets or new aggregations to add to existing buckets.
- _Sibling_::
- Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a
- new aggregation which will be at the same level as the sibling aggregation.
- Pipeline aggregations can reference the aggregations they need to perform their computation by using the `buckets_path`
- parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the
- <<buckets-path-syntax, `buckets_path` Syntax>> section below.
- Pipeline aggregations cannot have sub-aggregations but depending on the type it can reference another pipeline in the `buckets_path`
- allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative
- (i.e. a derivative of a derivative).
- NOTE: Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation
- will be included in the final output.
- [[buckets-path-syntax]]
- [float]
- === `buckets_path` Syntax
- Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the `buckets_path`
- parameter, which follows a specific format:
- // https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
- [source,ebnf]
- --------------------------------------------------
- AGG_SEPARATOR = `>` ;
- METRIC_SEPARATOR = `.` ;
- AGG_NAME = <the name of the aggregation> ;
- METRIC = <the name of the metric (in case of multi-value metrics aggregation)> ;
- MULTIBUCKET_KEY = `[<KEY_NAME>]`
- PATH = <AGG_NAME><MULTIBUCKET_KEY>? (<AGG_SEPARATOR>, <AGG_NAME> )* ( <METRIC_SEPARATOR>, <METRIC> ) ;
- --------------------------------------------------
- For example, the path `"my_bucket>my_stats.avg"` will path to the `avg` value in the `"my_stats"` metric, which is
- contained in the `"my_bucket"` bucket aggregation.
- Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the
- aggregation tree. For example, this derivative is embedded inside a date_histogram and refers to a "sibling"
- metric `"the_sum"`:
- [source,console,id=buckets-path-example]
- --------------------------------------------------
- POST /_search
- {
- "aggs": {
- "my_date_histo":{
- "date_histogram":{
- "field":"timestamp",
- "calendar_interval":"day"
- },
- "aggs":{
- "the_sum":{
- "sum":{ "field": "lemmings" } <1>
- },
- "the_deriv":{
- "derivative":{ "buckets_path": "the_sum" } <2>
- }
- }
- }
- }
- }
- --------------------------------------------------
- <1> The metric is called `"the_sum"`
- <2> The `buckets_path` refers to the metric via a relative path `"the_sum"`
- `buckets_path` is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets
- instead of embedded "inside" them. For example, the `max_bucket` aggregation uses the `buckets_path` to specify
- a metric embedded inside a sibling aggregation:
- [source,console,id=buckets-path-sibling-example]
- --------------------------------------------------
- POST /_search
- {
- "aggs" : {
- "sales_per_month" : {
- "date_histogram" : {
- "field" : "date",
- "calendar_interval" : "month"
- },
- "aggs": {
- "sales": {
- "sum": {
- "field": "price"
- }
- }
- }
- },
- "max_monthly_sales": {
- "max_bucket": {
- "buckets_path": "sales_per_month>sales" <1>
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- <1> `buckets_path` instructs this max_bucket aggregation that we want the maximum value of the `sales` aggregation in the
- `sales_per_month` date histogram.
- If a Sibling pipeline agg references a multi-bucket aggregation, such as a `terms` agg, it also has the option to
- select specific keys from the multi-bucket. For example, a `bucket_script` could select two specific buckets (via
- their bucket keys) to perform the calculation:
- [source,console,id=buckets-path-specific-bucket-example]
- --------------------------------------------------
- POST /_search
- {
- "aggs" : {
- "sales_per_month" : {
- "date_histogram" : {
- "field" : "date",
- "calendar_interval" : "month"
- },
- "aggs": {
- "sale_type": {
- "terms": {
- "field": "type"
- },
- "aggs": {
- "sales": {
- "sum": {
- "field": "price"
- }
- }
- }
- },
- "hat_vs_bag_ratio": {
- "bucket_script": {
- "buckets_path": {
- "hats": "sale_type['hat']>sales", <1>
- "bags": "sale_type['bag']>sales" <1>
- },
- "script": "params.hats / params.bags"
- }
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- <1> `buckets_path` selects the hats and bags buckets (via `['hat']`/`['bag']``) to use in the script specifically,
- instead of fetching all the buckets from `sale_type` aggregation
- [float]
- === Special Paths
- Instead of pathing to a metric, `buckets_path` can use a special `"_count"` path. This instructs
- the pipeline aggregation to use the document count as its input. For example, a derivative can be calculated
- on the document count of each bucket, instead of a specific metric:
- [source,console,id=buckets-path-count-example]
- --------------------------------------------------
- POST /_search
- {
- "aggs": {
- "my_date_histo": {
- "date_histogram": {
- "field":"timestamp",
- "calendar_interval":"day"
- },
- "aggs": {
- "the_deriv": {
- "derivative": { "buckets_path": "_count" } <1>
- }
- }
- }
- }
- }
- --------------------------------------------------
- <1> By using `_count` instead of a metric name, we can calculate the derivative of document counts in the histogram
- The `buckets_path` can also use `"_bucket_count"` and path to a multi-bucket aggregation to use the number of buckets
- returned by that aggregation in the pipeline aggregation instead of a metric. For example, a `bucket_selector` can be
- used here to filter out buckets which contain no buckets for an inner terms aggregation:
- [source,console,id=buckets-path-bucket-count-example]
- --------------------------------------------------
- POST /sales/_search
- {
- "size": 0,
- "aggs": {
- "histo": {
- "date_histogram": {
- "field": "date",
- "calendar_interval": "day"
- },
- "aggs": {
- "categories": {
- "terms": {
- "field": "category"
- }
- },
- "min_bucket_selector": {
- "bucket_selector": {
- "buckets_path": {
- "count": "categories._bucket_count" <1>
- },
- "script": {
- "source": "params.count != 0"
- }
- }
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- <1> By using `_bucket_count` instead of a metric name, we can filter out `histo` buckets where they contain no buckets
- for the `categories` aggregation
- [[dots-in-agg-names]]
- [float]
- === Dealing with dots in agg names
- An alternate syntax is supported to cope with aggregations or metrics which
- have dots in the name, such as the ++99.9++th
- <<search-aggregations-metrics-percentile-aggregation,percentile>>. This metric
- may be referred to as:
- [source,js]
- ---------------
- "buckets_path": "my_percentile[99.9]"
- ---------------
- // NOTCONSOLE
- [[gap-policy]]
- [float]
- === Dealing with gaps in the data
- Data in the real world is often noisy and sometimes contains *gaps* -- places where data simply doesn't exist. This can
- occur for a variety of reasons, the most common being:
- * Documents falling into a bucket do not contain a required field
- * There are no documents matching the query for one or more buckets
- * The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value.
- Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the
- first value because there is no previous value, HoltWinters moving average need "warmup" data to begin calculating, etc)
- Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
- data is encountered. All pipeline aggregations accept the `gap_policy` parameter. There are currently two gap policies
- to choose from:
- _skip_::
- This option treats missing data as if the bucket does not exist. It will skip the bucket and continue
- calculating using the next available value.
- _insert_zeros_::
- This option will replace missing values with a zero (`0`) and pipeline aggregation computation will
- proceed as normal.
- include::pipeline/avg-bucket-aggregation.asciidoc[]
- include::pipeline/derivative-aggregation.asciidoc[]
- include::pipeline/max-bucket-aggregation.asciidoc[]
- include::pipeline/min-bucket-aggregation.asciidoc[]
- include::pipeline/sum-bucket-aggregation.asciidoc[]
- include::pipeline/stats-bucket-aggregation.asciidoc[]
- include::pipeline/extended-stats-bucket-aggregation.asciidoc[]
- include::pipeline/percentiles-bucket-aggregation.asciidoc[]
- include::pipeline/movavg-aggregation.asciidoc[]
- include::pipeline/movfn-aggregation.asciidoc[]
- include::pipeline/cumulative-sum-aggregation.asciidoc[]
- include::pipeline/cumulative-cardinality-aggregation.asciidoc[]
- include::pipeline/bucket-script-aggregation.asciidoc[]
- include::pipeline/bucket-selector-aggregation.asciidoc[]
- include::pipeline/bucket-sort-aggregation.asciidoc[]
- include::pipeline/serial-diff-aggregation.asciidoc[]
|