- [[search-aggregations-bucket-composite-aggregation]]
- === Composite aggregation
- A multi-bucket aggregation that creates composite buckets from different sources.
- Unlike the other `multi-bucket` aggregations, the `composite` aggregation can be used
- to paginate **all** buckets from a multi-level aggregation efficiently. This aggregation
- provides a way to stream **all** buckets of a specific aggregation, similar to what
- <<request-body-search-scroll, scroll>> does for documents.
- The composite buckets are built from the combinations of the
- values extracted or created for each document, and each combination is treated as
- a composite bucket.
- //////////////////////////
- [source,js]
- --------------------------------------------------
- PUT /sales
- {
- "mappings": {
- "properties": {
- "product": {
- "type": "keyword"
- },
- "timestamp": {
- "type": "date"
- },
- "price": {
- "type": "long"
- },
- "shop": {
- "type": "keyword"
- },
- "nested": {
- "type": "nested",
- "properties": {
- "product": {
- "type": "keyword"
- },
- "timestamp": {
- "type": "date"
- },
- "price": {
- "type": "long"
- },
- "shop": {
- "type": "keyword"
- }
- }
- }
- }
- }
- }
- POST /sales/_bulk?refresh
- {"index":{"_id":0}}
- {"product": "mad max", "price": "20", "timestamp": "2017-05-09T14:35"}
- {"index":{"_id":1}}
- {"product": "mad max", "price": "25", "timestamp": "2017-05-09T12:35"}
- {"index":{"_id":2}}
- {"product": "rocky", "price": "10", "timestamp": "2017-05-08T09:10"}
- {"index":{"_id":3}}
- {"product": "mad max", "price": "27", "timestamp": "2017-05-10T07:07"}
- {"index":{"_id":4}}
- {"product": "apocalypse now", "price": "10", "timestamp": "2017-05-11T08:35"}
- -------------------------------------------------
- // NOTCONSOLE
- // TESTSETUP
- //////////////////////////
- For instance, the following document:
- [source,js]
- --------------------------------------------------
- {
- "keyword": ["foo", "bar"],
- "number": [23, 65, 76]
- }
- --------------------------------------------------
- // NOTCONSOLE
- \... creates the following composite buckets when `keyword` and `number` are used as value sources
- for the aggregation:
- [source,js]
- --------------------------------------------------
- { "keyword": "foo", "number": 23 }
- { "keyword": "foo", "number": 65 }
- { "keyword": "foo", "number": 76 }
- { "keyword": "bar", "number": 23 }
- { "keyword": "bar", "number": 65 }
- { "keyword": "bar", "number": 76 }
- --------------------------------------------------
- // NOTCONSOLE
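The combination rule above can be sketched outside Elasticsearch as a Cartesian product. This is only an illustration of how the six keys arise from the example document; the helper name `composite_keys` is hypothetical and is not part of any Elasticsearch API.

```python
from itertools import product


def composite_keys(doc, sources):
    """Return every combination of the document's values for the given sources."""
    # Treat a single value as a one-element list so it combines like an array.
    value_lists = [
        doc[name] if isinstance(doc[name], list) else [doc[name]]
        for name in sources
    ]
    # The Cartesian product yields one key per combination, in source order.
    return [dict(zip(sources, combo)) for combo in product(*value_lists)]


doc = {"keyword": ["foo", "bar"], "number": [23, 65, 76]}
keys = composite_keys(doc, ["keyword", "number"])
for key in keys:
    print(key)
```

Two values for `keyword` and three for `number` give 2 x 3 = 6 composite keys, one per line of the listing above.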
- ==== Values source
- The `sources` parameter controls the sources that should be used to build the composite buckets.
- The order in which the `sources` are defined is important because it also controls the order
- in which the keys are returned.
- The name given to each source must be unique.
- There are four different types of value sources:
- [[_terms]]
- ===== Terms
- The `terms` value source is equivalent to a simple `terms` aggregation.
- The values are extracted from a field or a script exactly like the `terms` aggregation.
- Example:
- [source,console,id=composite-aggregation-terms-field-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "product": { "terms" : { "field": "product" } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- Like the `terms` aggregation, it is also possible to use a script to create the values for the composite buckets:
- [source,console,id=composite-aggregation-terms-script-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- {
- "product": {
- "terms" : {
- "script" : {
- "source": "doc['product'].value",
- "lang": "painless"
- }
- }
- }
- }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- [[_histogram]]
- ===== Histogram
- The `histogram` value source can be applied to numeric values to build fixed-size
- intervals over the values. The `interval` parameter defines how the numeric values should be
- transformed. For instance, an `interval` set to 5 rounds each numeric value down to the start of its interval:
- a value of `101` is translated to `100`, which is the key for the interval between 100 and 105.
- Example:
- [source,console,id=composite-aggregation-histogram-field-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "histo": { "histogram" : { "field": "price", "interval": 5 } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- The values are built from a numeric field or a script that returns numerical values:
- [source,console,id=composite-aggregation-histogram-script-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- {
- "histo": {
- "histogram" : {
- "interval": 5,
- "script" : {
- "source": "doc['price'].value",
- "lang": "painless"
- }
- }
- }
- }
- ]
- }
- }
- }
- }
- --------------------------------------------------
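The rounding performed by the `histogram` source can be sketched as follows. This is a minimal illustration that assumes the key is the value rounded down to a multiple of `interval` (consistent with the `101` to `100` example above); the function name `histogram_key` is hypothetical.

```python
import math


def histogram_key(value, interval):
    """Round a numeric value down to the start of its fixed-size interval."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return math.floor(value / interval) * interval


print(histogram_key(101, 5))
print(histogram_key(27, 5))
```

With `interval` set to 5, the sample prices 20, 25 and 27 would fall into the buckets keyed 20, 25 and 25 under this assumption.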
- [[_date_histogram]]
- ===== Date histogram
- The `date_histogram` is similar to the `histogram` value source except that the interval
- is specified by a date/time expression:
- [source,console,id=composite-aggregation-datehistogram-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "date": { "date_histogram" : { "field": "timestamp", "calendar_interval": "1d" } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- The example above creates an interval per day and translates each `timestamp` value to the start of its interval.
- Available expressions for interval: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second`.
- Time values can also be specified via abbreviations supported by <<time-units,time units>> parsing.
- Note that fractional time values are not supported, but you can address this by shifting to another
- time unit (e.g., `1.5h` could instead be specified as `90m`).
- *Format*
- Internally, a date is represented as a 64-bit number holding a timestamp in milliseconds since the epoch.
- These timestamps are returned as the bucket keys. It is possible to return a formatted date string instead by
- specifying a pattern with the `format` parameter:
- [source,console,id=composite-aggregation-datehistogram-format-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- {
- "date": {
- "date_histogram" : {
- "field": "timestamp",
- "calendar_interval": "1d",
- "format": "yyyy-MM-dd" <1>
- }
- }
- }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- <1> Supports expressive date <<date-format-pattern,format pattern>>
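When `format` is not set, a client receives the raw millisecond keys and may want to format them itself. The sketch below shows one way to do that in Python; `format_bucket_key` is a hypothetical helper, and note that it takes a Python `strftime` pattern (`%Y-%m-%d`), not the Java-style `yyyy-MM-dd` pattern used by the `format` parameter.

```python
from datetime import datetime, timezone


def format_bucket_key(epoch_millis, pattern="%Y-%m-%d"):
    """Format a millisecond epoch key as a UTC date string."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(epoch_millis, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=millis * 1000).strftime(pattern)


print(format_bucket_key(1494201600000))  # prints 2017-05-08
```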
- *Time Zone*
- Date-times are stored in Elasticsearch in UTC. By default, all bucketing and
- rounding is also done in UTC. The `time_zone` parameter can be used to indicate
- that bucketing should use a different time zone.
- Time zones may either be specified as an ISO 8601 UTC offset (e.g. `+01:00` or
- `-08:00`) or as a timezone id, an identifier used in the TZ database like
- `America/Los_Angeles`.
- *Offset*
- include::datehistogram-aggregation.asciidoc[tag=offset-explanation]
- [source,console,id=composite-aggregation-datehistogram-offset-example]
- ----
- PUT my_index/_doc/1?refresh
- {
- "date": "2015-10-01T05:30:00Z"
- }
- PUT my_index/_doc/2?refresh
- {
- "date": "2015-10-01T06:30:00Z"
- }
- GET my_index/_search?size=0
- {
- "aggs": {
- "my_buckets": {
- "composite" : {
- "sources" : [
- {
- "date": {
- "date_histogram" : {
- "field": "date",
- "calendar_interval": "day",
- "offset": "+6h",
- "format": "iso8601"
- }
- }
- }
- ]
- }
- }
- }
- }
- ----
- include::datehistogram-aggregation.asciidoc[tag=offset-result-intro]
- [source,console-result]
- ----
- {
- ...
- "aggregations": {
- "my_buckets": {
- "after_key": { "date": "2015-10-01T06:00:00.000Z" },
- "buckets": [
- {
- "key": { "date": "2015-09-30T06:00:00.000Z" },
- "doc_count": 1
- },
- {
- "key": { "date": "2015-10-01T06:00:00.000Z" },
- "doc_count": 1
- }
- ]
- }
- }
- }
- ----
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- include::datehistogram-aggregation.asciidoc[tag=offset-note]
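The effect of `offset` on daily buckets can be reproduced with plain date arithmetic. The sketch below is an approximation for UTC day buckets only (it ignores time zones and calendar-aware rounding), and `day_bucket_start` is a hypothetical helper rather than anything Elasticsearch exposes.

```python
from datetime import datetime, timedelta, timezone

DAY = timedelta(days=1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_bucket_start(moment, offset=timedelta(0)):
    """Return the start of the UTC day bucket containing `moment`, shifted by `offset`."""
    # Shift back by the offset, round down to a whole day, then shift forward again.
    shifted = moment - offset - EPOCH
    return EPOCH + (shifted // DAY) * DAY + offset


first = datetime(2015, 10, 1, 5, 30, tzinfo=timezone.utc)
second = datetime(2015, 10, 1, 6, 30, tzinfo=timezone.utc)
print(day_bucket_start(first, timedelta(hours=6)).isoformat())
print(day_bucket_start(second, timedelta(hours=6)).isoformat())
```

With a six-hour offset the two documents from the example land in different buckets, starting at `2015-09-30T06:00` and `2015-10-01T06:00` respectively; without the offset both would fall into the bucket starting at `2015-10-01T00:00`.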
- [[_geotile_grid]]
- ===== GeoTile grid
- The `geotile_grid` value source works on `geo_point` fields and groups points into buckets that represent
- cells in a grid. The resulting grid can be sparse and only contains cells
- that have matching data. Each cell corresponds to a
- https://en.wikipedia.org/wiki/Tiled_web_map[map tile] as used by many online map
- sites. Each cell is labeled using a "{zoom}/{x}/{y}" format, where zoom is equal
- to the user-specified precision.
- [source,console,id=composite-aggregation-geotilegrid-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "tile": { "geotile_grid" : { "field": "location", "precision": 8 } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
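To make the `{zoom}/{x}/{y}` labels more concrete, the sketch below applies the standard Web Mercator tile formula used by tiled web maps. It is an assumption that this matches the exact rounding Elasticsearch applies at tile edges, and `geotile_key` is a hypothetical helper name.

```python
import math


def geotile_key(lat, lon, precision):
    """Return the "{zoom}/{x}/{y}" label of the Web Mercator tile containing a point."""
    tiles = 1 << precision  # number of tiles along each axis at this zoom
    x = int((lon + 180.0) / 360.0 * tiles)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * tiles)
    # Clamp so points on the eastern or southern edge stay inside the grid.
    x = min(max(x, 0), tiles - 1)
    y = min(max(y, 0), tiles - 1)
    return f"{precision}/{x}/{y}"


print(geotile_key(52.37, 4.9, 8))
```

Each additional level of precision doubles the number of tiles along each axis, which is why high precisions produce very small cells.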
- *Precision*
- The highest-precision geotile of length 29 produces cells that cover
- less than 10cm by 10cm of land. This precision is uniquely suited for composite aggregations as each
- tile does not have to be generated and loaded in memory.
- See https://wiki.openstreetmap.org/wiki/Zoom_levels[Zoom level documentation]
- on how precision (zoom) correlates to size on the ground. Precision for this
- aggregation can be between 0 and 29, inclusive.
- *Bounding box filtering*
- The geotile source can optionally be constrained to a specific geo bounding box, which reduces
- the range of tiles used. These bounds are useful when only a specific part of a geographical area needs high
- precision tiling.
- [source,console,id=composite-aggregation-geotilegrid-boundingbox-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- {
- "tile": {
- "geotile_grid" : {
- "field" : "location",
- "precision" : 22,
- "bounds": {
- "top_left" : "52.4, 4.9",
- "bottom_right" : "52.3, 5.0"
- }
- }
- }
- }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- ===== Mixing different values source
- The `sources` parameter accepts an array of value sources.
- It is possible to mix different value sources to create composite buckets.
- For example:
- [source,console,id=composite-aggregation-mixing-sources-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
- { "product": { "terms": {"field": "product" } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- This will create composite buckets from the values created by two value sources, a `date_histogram` and a `terms`.
- Each bucket is composed of two values, one for each value source defined in the aggregation.
- Any combination of source types is allowed, and the order in the array is preserved
- in the composite buckets.
- [source,console,id=composite-aggregation-mixing-three-sources-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "shop": { "terms": {"field": "shop" } } },
- { "product": { "terms": { "field": "product" } } },
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- ==== Order
- By default the composite buckets are sorted by their natural ordering. Values are sorted
- in ascending order. When multiple value sources are requested, the ordering is done per value
- source: the first value of the composite bucket is compared to the first value of the other composite bucket, and if they are equal the
- next values in the composite bucket are used for tie-breaking. This means that the composite bucket
- `[foo, 100]` is considered smaller than `[foobar, 0]` because `foo` is considered smaller than `foobar`.
- It is possible to define the direction of the sort for each value source by setting `order` to `asc` (default value)
- or `desc` (descending order) directly in the value source definition.
- For example:
- [source,console,id=composite-aggregation-order-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
- { "product": { "terms": {"field": "product", "order": "asc" } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- \... will sort the composite buckets in descending order when comparing values from the `date_histogram` source
- and in ascending order when comparing values from the `terms` source.
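The per-source comparison described in this section can be sketched as a comparator that walks the key values in source order and applies each source's direction. This is an illustration of the ordering rule only; `make_key_comparator` is a hypothetical helper.

```python
from functools import cmp_to_key


def make_key_comparator(orders):
    """Build a comparator for composite keys given one 'asc'/'desc' per source."""
    def compare(left, right):
        for a, b, order in zip(left, right, orders):
            if a == b:
                continue  # tie on this source: fall through to the next one
            result = -1 if a < b else 1
            return result if order == "asc" else -result
        return 0
    return compare


keys = [("foobar", 0), ("foo", 100), ("foo", 7)]
ascending = sorted(keys, key=cmp_to_key(make_key_comparator(["asc", "asc"])))
mixed = sorted(keys, key=cmp_to_key(make_key_comparator(["desc", "asc"])))
print(ascending)
print(mixed)
```

In the all-ascending case `("foo", 100)` sorts before `("foobar", 0)` because the first source decides; the second value only matters between the two `foo` keys.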
- ==== Missing bucket
- By default documents without a value for a given source are ignored.
- It is possible to include them in the response by setting `missing_bucket` to
- `true` (defaults to `false`):
- [source,console,id=composite-aggregation-missing-bucket-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "product_name": { "terms" : { "field": "product", "missing_bucket": true } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- In the example above the source `product_name` will emit an explicit `null` value
- for documents without a value for the field `product`.
- The `order` specified in the source dictates whether the `null` values should rank
- first (ascending order, `asc`) or last (descending order, `desc`).
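The placement of `null` keys can be pictured with a small sort over the values of a single source. This sketch assumes string values and uses Python's `None` to stand in for the `null` key; `sort_source_values` is a hypothetical helper.

```python
def sort_source_values(values, order="asc"):
    """Sort values for one source, ranking None first for 'asc' and last for 'desc'."""
    # The tuple key ranks None below every real value without comparing None to a string.
    ranked = sorted(values, key=lambda v: (v is not None, v if v is not None else ""))
    return ranked if order == "asc" else ranked[::-1]


print(sort_source_values(["rocky", None, "mad max"], "asc"))
print(sort_source_values(["rocky", None, "mad max"], "desc"))
```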
- ==== Size
- The `size` parameter can be set to define how many composite buckets should be returned.
- Each composite bucket is considered as a single bucket, so setting a size of 10 will return the
- first 10 composite buckets created from the value sources.
- The response contains the values for each composite bucket in an array containing the values extracted
- from each value source.
- ==== Pagination
- If the number of composite buckets is too high (or unknown) to be returned in a single response,
- it is possible to split the retrieval into multiple requests.
- Since the composite buckets are flat by nature, the requested `size` is exactly the number of composite buckets
- that will be returned in the response (assuming that there are at least `size` composite buckets to return).
- If all composite buckets should be retrieved, it is preferable to use a small size (`100` or `1000` for instance)
- and then use the `after` parameter to retrieve the next results.
- For example:
- [source,console,id=composite-aggregation-after-key-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "size": 2,
- "sources" : [
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
- { "product": { "terms": {"field": "product" } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[s/_search/_search\?filter_path=aggregations/]
- \... returns:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "my_buckets": {
- "after_key": {
- "date": 1494288000000,
- "product": "mad max"
- },
- "buckets": [
- {
- "key": {
- "date": 1494201600000,
- "product": "rocky"
- },
- "doc_count": 1
- },
- {
- "key": {
- "date": 1494288000000,
- "product": "mad max"
- },
- "doc_count": 2
- }
- ]
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\.//]
- To get the next set of buckets, resend the same aggregation with the `after`
- parameter set to the `after_key` value returned in the response.
- For example, this request uses the `after_key` value provided in the previous response:
- [source,console,id=composite-aggregation-after-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "size": 2,
- "sources" : [
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
- { "product": { "terms": {"field": "product" } } }
- ],
- "after": { "date": 1494288000000, "product": "mad max" } <1>
- }
- }
- }
- }
- --------------------------------------------------
- <1> Should restrict the aggregation to buckets that sort **after** the provided values.
- NOTE: The `after_key` is *usually* the key to the last bucket returned in
- the response, but that isn't guaranteed. Always use the returned `after_key` instead
- of deriving it from the buckets.
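The pagination steps above can be sketched as a client-side loop. This is a minimal sketch under stated assumptions: `search` is a hypothetical callable that sends a request body and returns the parsed JSON response (for example a thin wrapper around an HTTP client), and the loop assumes the final page either omits `after_key` or contains no buckets.

```python
import copy


def iter_composite_buckets(search, body, agg_name):
    """Yield every bucket of a composite aggregation by following `after_key`.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    """
    body = copy.deepcopy(body)  # do not mutate the caller's request
    composite = body["aggs"][agg_name]["composite"]
    while True:
        result = search(body)["aggregations"][agg_name]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if after_key is None or not result["buckets"]:
            break  # no more pages
        # Resend the same aggregation, resuming after the returned key.
        composite["after"] = after_key
```

The loop always takes `after` from the returned `after_key` and never from the last bucket, in line with the note above. Because the sources and their order are left untouched between requests, every page is computed with the same sort.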
- ==== Early termination
- For optimal performance the <<index-modules-index-sorting,index sort>> should be set on the index so that it matches
- part or all of the source order in the composite aggregation.
- For instance, the following index sort:
- [source,console]
- --------------------------------------------------
- PUT twitter
- {
- "settings" : {
- "index" : {
- "sort.field" : ["username", "timestamp"], <1>
- "sort.order" : ["asc", "desc"] <2>
- }
- },
- "mappings": {
- "properties": {
- "username": {
- "type": "keyword",
- "doc_values": true
- },
- "timestamp": {
- "type": "date"
- }
- }
- }
- }
- --------------------------------------------------
- <1> This index is sorted by `username` first then by `timestamp`.
- <2> ... in ascending order for the `username` field and in descending order for the `timestamp` field.
- \... could be used to optimize these composite aggregations:
- [source,console]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "user_name": { "terms" : { "field": "username" } } } <1>
- ]
- }
- }
- }
- }
- --------------------------------------------------
- <1> The `username` field is a prefix of the index sort and the order matches (`asc`).
- [source,console]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "user_name": { "terms" : { "field": "username" } } }, <1>
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } } <2>
- ]
- }
- }
- }
- }
- --------------------------------------------------
- <1> The `username` field is a prefix of the index sort and the order matches (`asc`).
- <2> The `timestamp` field is the next field in the index sort and the order also matches (`desc`).
- To optimize early termination, it is advised to set `track_total_hits` in the request
- to `false`. The number of total hits that match the request can be retrieved on the first request,
- and it would be costly to compute this number on every page:
- [source,console]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "track_total_hits": false,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "user_name": { "terms" : { "field": "username" } } },
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
- ]
- }
- }
- }
- }
- --------------------------------------------------
- Note that the order of the sources is important: in the example above, switching the `user_name` source with the `timestamp` source
- would deactivate the sort optimization since this configuration wouldn't match the index sort specification.
- If the order of the sources does not matter for your use case, you can follow these simple guidelines:
- * Put the fields with the highest cardinality first.
- * Make sure that the order of the fields matches the order of the index sort.
- * Put multi-valued fields last since they cannot be used for early termination.
- WARNING: <<index-modules-index-sorting,index sort>> can slow down indexing. It is very important to test index sorting
- with your specific use case and dataset to ensure that it matches your requirements. If it doesn't, note that `composite`
- aggregations will also try to terminate early on non-sorted indices if the query matches all documents (`match_all` query).
- ==== Sub-aggregations
- Like any `multi-bucket` aggregation, the `composite` aggregation can hold sub-aggregations.
- These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by this
- parent aggregation.
- For instance, the following example computes the average value of a field
- per composite bucket:
- [source,console,id=composite-aggregation-subaggregations-example]
- --------------------------------------------------
- GET /_search
- {
- "size": 0,
- "aggs" : {
- "my_buckets": {
- "composite" : {
- "sources" : [
- { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
- { "product": { "terms": {"field": "product" } } }
- ]
- },
- "aggregations": {
- "the_avg": {
- "avg": { "field": "price" }
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[s/_search/_search\?filter_path=aggregations/]
- \... returns:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "my_buckets": {
- "after_key": {
- "date": 1494201600000,
- "product": "rocky"
- },
- "buckets": [
- {
- "key": {
- "date": 1494460800000,
- "product": "apocalypse now"
- },
- "doc_count": 1,
- "the_avg": {
- "value": 10.0
- }
- },
- {
- "key": {
- "date": 1494374400000,
- "product": "mad max"
- },
- "doc_count": 1,
- "the_avg": {
- "value": 27.0
- }
- },
- {
- "key": {
- "date": 1494288000000,
- "product" : "mad max"
- },
- "doc_count": 2,
- "the_avg": {
- "value": 22.5
- }
- },
- {
- "key": {
- "date": 1494201600000,
- "product": "rocky"
- },
- "doc_count": 1,
- "the_avg": {
- "value": 10.0
- }
- }
- ]
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\.//]
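The numbers in this response can be checked by hand. The sketch below groups five sample sales by day and product and averages their prices; the `SALES` tuples are assumed to mirror the test data indexed for this page, and the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Assumed sample data: (product, price, timestamp) for five sales documents.
SALES = [
    ("mad max", 20, "2017-05-09T14:35"),
    ("mad max", 25, "2017-05-09T12:35"),
    ("rocky", 10, "2017-05-08T09:10"),
    ("mad max", 27, "2017-05-10T07:07"),
    ("apocalypse now", 10, "2017-05-11T08:35"),
]


def day_key(timestamp):
    """Return the UTC start of day, in epoch milliseconds, for an ISO timestamp."""
    day = datetime.fromisoformat(timestamp).replace(
        hour=0, minute=0, second=0, microsecond=0, tzinfo=timezone.utc
    )
    return int(day.timestamp() * 1000)


def average_price_per_bucket(sales):
    """Group by (day, product) and compute doc_count and average price."""
    prices = defaultdict(list)
    for product, price, timestamp in sales:
        prices[(day_key(timestamp), product)].append(price)
    # Sort by day descending, then product ascending, like the request above.
    ordered = sorted(prices, key=lambda key: key[1])
    ordered.sort(key=lambda key: key[0], reverse=True)
    return [
        {"key": {"date": day, "product": product},
         "doc_count": len(prices[(day, product)]),
         "the_avg": sum(prices[(day, product)]) / len(prices[(day, product)])}
        for day, product in ordered
    ]


for bucket in average_price_per_bucket(SALES):
    print(bucket)
```

Under these assumptions the two `mad max` sales on 2017-05-09 (prices 20 and 25) share one bucket with an average of 22.5, matching the third bucket in the response.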
- ==== Pipeline aggregations
- The composite aggregation is not currently compatible with pipeline aggregations, nor does it make sense in most cases.
- For example, due to the paging nature of composite aggregations, a single logical partition (one day for example) might be spread
- over multiple pages. Since pipeline aggregations are purely post-processing on the final list of buckets,
- running something like a derivative on a composite page could lead to inaccurate results as it only takes into
- account a "partial" result on that page.
- Pipeline aggregations that are self-contained to a single bucket (such as `bucket_selector`) might be supported in the future.
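The paging problem described above can be illustrated with a toy derivative. The values in `daily_totals` are hypothetical, and `derivative` is a simplified stand-in for a derivative pipeline aggregation that only sees the buckets it is given.

```python
def derivative(values):
    """First difference between consecutive values in a list of buckets."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


daily_totals = [10, 45, 27, 10]  # hypothetical per-day metric, in bucket order
pages = [daily_totals[:2], daily_totals[2:]]  # the same buckets split across two pages

full = derivative(daily_totals)
per_page = [delta for page in pages for delta in derivative(page)]
print(full)      # prints [35, -18, -17]
print(per_page)  # prints [35, -17]
```

The change between the last bucket of the first page and the first bucket of the second page (`-18`) is missing from the per-page result, which is the kind of partial answer the text warns about.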