- [[search-aggregations-bucket-terms-aggregation]]
- === Terms
- A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
- Example:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : { "field" : "gender" }
-         }
-     }
- }
- --------------------------------------------------
- Response:
- [source,js]
- --------------------------------------------------
- {
-     ...
-     "aggregations" : {
-         "genders" : {
-             "buckets" : [
-                 {
-                     "key" : "male",
-                     "doc_count" : 10
-                 },
-                 {
-                     "key" : "female",
-                     "doc_count" : 10
-                 }
-             ]
-         }
-     }
- }
- --------------------------------------------------
- By default, the `terms` aggregation will return the buckets for the top ten terms ordered by the `doc_count`. One can
- change this default behaviour by setting the `size` parameter.
- ==== Size & Shard Size
- The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
- default, the node coordinating the search process will request each shard to provide its own top `size` term buckets
- and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
- This means that if the number of unique terms is greater than `size`, the returned list may be slightly inaccurate
- (the term counts may be slightly off, and a term that should have been in the top `size` buckets may not be
- returned at all).
- The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to
- compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
- transfers between the nodes and the client).
- The `shard_size` parameter can be used to minimize the extra work that comes with a bigger requested `size`. When defined,
- it determines how many terms the coordinating node will request from each shard. Once all the shards have responded, the
- coordinating node reduces them to a final result based on the `size` parameter. This way,
- one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
- the client.
- NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will
- override it and reset it to be equal to `size`.
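- As a minimal sketch of how the two parameters combine, the following request asks each shard for its top 50 terms
- while returning only the top 5 buckets to the client. The `tags` field and both numbers are illustrative assumptions,
- not recommended values:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "tags" : {
-             "terms" : {
-                 "field" : "tags",
-                 "size" : 5,
-                 "shard_size" : 50
-             }
-         }
-     }
- }
- --------------------------------------------------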
- added[1.1.0] It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this
- on high-cardinality fields, as this will kill both your CPU (since terms need to be returned sorted) and your network.
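- For a field known to have only a handful of distinct values, such a request might look like the following sketch
- (the `gender` field is reused from the example at the top of this page purely for illustration):
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : {
-                 "field" : "gender",
-                 "size" : 0
-             }
-         }
-     }
- }
- --------------------------------------------------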
- ==== Order
- The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by
- their `doc_count` descending. It is also possible to change this behaviour as follows:
- Ordering the buckets by their `doc_count` in an ascending manner:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : {
-                 "field" : "gender",
-                 "order" : { "_count" : "asc" }
-             }
-         }
-     }
- }
- --------------------------------------------------
- Ordering the buckets alphabetically by their terms in an ascending manner:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : {
-                 "field" : "gender",
-                 "order" : { "_term" : "asc" }
-             }
-         }
-     }
- }
- --------------------------------------------------
- Ordering the buckets by a single-value metrics sub-aggregation (identified by the aggregation name):
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : {
-                 "field" : "gender",
-                 "order" : { "avg_height" : "desc" }
-             },
-             "aggs" : {
-                 "avg_height" : { "avg" : { "field" : "height" } }
-             }
-         }
-     }
- }
- --------------------------------------------------
- Ordering the buckets by a multi-value metrics sub-aggregation (identified by the aggregation name):
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : {
-                 "field" : "gender",
-                 "order" : { "height_stats.avg" : "desc" }
-             },
-             "aggs" : {
-                 "height_stats" : { "stats" : { "field" : "height" } }
-             }
-         }
-     }
- }
- --------------------------------------------------
- ==== Minimum document count
- It is possible to only return terms that match at least a configured number of hits using the `min_doc_count` option:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "tags" : {
-             "terms" : {
-                 "field" : "tag",
-                 "min_doc_count": 10
-             }
-         }
-     }
- }
- --------------------------------------------------
- The above aggregation would only return tags which have been found in 10 hits or more. The default value is `1`.
- NOTE: Setting `min_doc_count`=`0` will also return buckets for terms that didn't match any hit. However, some of
- the returned terms which have a document count of zero might only belong to deleted documents, so there is
- no guarantee that a `match_all` query would find a positive document count for those terms.
- WARNING: When NOT sorting on `doc_count` descending, high values of `min_doc_count` may return a number of buckets
- which is less than `size` because not enough data was gathered from the shards. Missing buckets can be
- brought back by increasing `shard_size`.
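- As a hedged illustration of the warning above, the following sketch orders the buckets alphabetically, keeps only
- terms with at least 10 hits, and raises `shard_size` so that each shard contributes more candidate terms. The values
- `20` and `200` are arbitrary assumptions and would need tuning against a real index:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "tags" : {
-             "terms" : {
-                 "field" : "tag",
-                 "order" : { "_term" : "asc" },
-                 "min_doc_count" : 10,
-                 "size" : 20,
-                 "shard_size" : 200
-             }
-         }
-     }
- }
- --------------------------------------------------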
- ==== Script
- Generating the terms using a script:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : {
-                 "script" : "doc['gender'].value"
-             }
-         }
-     }
- }
- --------------------------------------------------
- ==== Value Script
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "genders" : {
-             "terms" : {
-                 "field" : "gender",
-                 "script" : "'Gender: ' +_value"
-             }
-         }
-     }
- }
- --------------------------------------------------
- ==== Filtering Values
- It is possible to filter the values for which buckets will be created. This can be done using the `include` and
- `exclude` parameters which are based on regular expressions.
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "tags" : {
-             "terms" : {
-                 "field" : "tags",
-                 "include" : ".*sport.*",
-                 "exclude" : "water_.*"
-             }
-         }
-     }
- }
- --------------------------------------------------
- In the above example, buckets will be created for all the tags that have the word `sport` in them, except those starting
- with `water_` (so the tag `water_sports` will not be aggregated). The `include` regular expression will determine what
- values are "allowed" to be aggregated, while the `exclude` determines the values that should not be aggregated. When
- both are defined, the `exclude` has precedence, meaning, the `include` is evaluated first and only then the `exclude`.
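- Either parameter can also be used on its own. A minimal sketch using only `exclude`, assuming the same `tags` field
- as in the example above, so that every tag except those starting with `water_` gets a bucket:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "tags" : {
-             "terms" : {
-                 "field" : "tags",
-                 "exclude" : "water_.*"
-             }
-         }
-     }
- }
- --------------------------------------------------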
- The regular expressions are based on the Java(TM) http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html[Pattern],
- and as such, it is also possible to pass in flags that will determine how the compiled regular expression will work:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "tags" : {
-             "terms" : {
-                 "field" : "tags",
-                 "include" : {
-                     "pattern" : ".*sport.*",
-                     "flags" : "CANON_EQ|CASE_INSENSITIVE" <1>
-                 },
-                 "exclude" : {
-                     "pattern" : "water_.*",
-                     "flags" : "CANON_EQ|CASE_INSENSITIVE"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
- <1> the flags are concatenated using the `|` character as a separator
- The possible flags that can be used are:
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CANON_EQ[`CANON_EQ`],
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE[`CASE_INSENSITIVE`],
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#COMMENTS[`COMMENTS`],
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#DOTALL[`DOTALL`],
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#LITERAL[`LITERAL`],
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#MULTILINE[`MULTILINE`],
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CASE[`UNICODE_CASE`],
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS[`UNICODE_CHARACTER_CLASS`] and
- http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNIX_LINES[`UNIX_LINES`]
- ==== Execution hint
- There are two mechanisms by which terms aggregations can be executed: either by using field values directly in order to aggregate
- data per-bucket (`map`), or by using ordinals of the field values instead of the values themselves (`ordinals`). Although the
- latter execution mode can be expected to be slightly faster, it is only available for use when the underlying data source exposes
- those terms ordinals. Moreover, it may actually be slower if most field values are unique. Elasticsearch tries to have sensible
- defaults when it comes to the execution mode that should be used, but in case you know that one execution mode may perform better
- than the other one, you have the ability to "hint" it to Elasticsearch:
- [source,js]
- --------------------------------------------------
- {
-     "aggs" : {
-         "tags" : {
-             "terms" : {
-                 "field" : "tags",
-                 "execution_hint": "map" <1>
-             }
-         }
-     }
- }
- --------------------------------------------------
- <1> the possible values are `map` and `ordinals`
- Please note that Elasticsearch will ignore this execution hint if it is not applicable.