123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157 |
- [[search-aggregations-metrics-cardinality-aggregation]]
- === Cardinality Aggregation
- A `single-value` metrics aggregation that calculates an approximate count of
- distinct values. Values can be extracted either from specific fields in the
- document or generated by a script.
- Assume you are indexing books and would like to count the unique authors that
- match a query:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "author_count" : {
- "cardinality" : {
- "field" : "author"
- }
- }
- }
- }
- --------------------------------------------------
- ==== Precision control
- This aggregation also supports the `precision_threshold` and `rehash` options:
- experimental[The `precision_threshold` and `rehash` options are specific to the current internal implementation of the `cardinality` agg, which may change in the future]
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "author_count" : {
- "cardinality" : {
- "field" : "author_hash",
- "precision_threshold": 100, <1>
- "rehash": false <2>
- }
- }
- }
- }
- --------------------------------------------------
- <1> The `precision_threshold` options allows to trade memory for accuracy, and
- defines a unique count below which counts are expected to be close to
- accurate. Above this value, counts might become a bit more fuzzy. The maximum
- supported value is 40000, thresholds above this number will have the same
- effect as a threshold of 40000.
- Default value depends on the number of parent aggregations that multiple
- create buckets (such as terms or histograms).
- <2> If you computed a hash on client-side, stored it into your documents and want
- Elasticsearch to use them to compute counts using this hash function without
- rehashing values, it is possible to specify `rehash: false`. Default value is
- `true`. Please note that the hash must be indexed as a long when `rehash` is
- false.
- ==== Counts are approximate
- Computing exact counts requires loading values into a hash set and returning its
- size. This doesn't scale when working on high-cardinality sets and/or large
- values as the required memory usage and the need to communicate those
- per-shard sets between nodes would utilize too many resources of the cluster.
- This `cardinality` aggregation is based on the
- http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
- algorithm, which counts based on the hashes of the values with some interesting
- properties:
- * configurable precision, which decides on how to trade memory for accuracy,
- * excellent accuracy on low-cardinality sets,
- * fixed memory usage: no matter if there are tens or billions of unique values,
- memory usage only depends on the configured precision.
- For a precision threshold of `c`, the implementation that we are using requires
- about `c * 8` bytes.
- The following chart shows how the error varies before and after the threshold:
- image:images/cardinality_error.png[]
- For all 3 thresholds, counts have been accurate up to the configured threshold
- (although not guaranteed, this is likely to be the case). Please also note that
- even with a threshold as low as 100, the error remains under 5%, even when
- counting millions of items.
- ==== Pre-computed hashes
- If you don't want Elasticsearch to re-compute hashes on every run of this
- aggregation, it is possible to use pre-computed hashes, either by computing a
- hash on client-side, indexing it and specifying `rehash: false`, or by using
- the special `murmur3` field mapper, typically in the context of a `multi-field`
- in the mapping:
- [source,js]
- --------------------------------------------------
- {
- "author": {
- "type": "string",
- "fields": {
- "hash": {
- "type": "murmur3"
- }
- }
- }
- }
- --------------------------------------------------
- With such a mapping, Elasticsearch is going to compute hashes of the `author`
- field at indexing time and store them in the `author.hash` field. This
- way, unique counts can be computed using the cardinality aggregation by only
- loading the hashes into memory, not the values of the `author` field, and
- without computing hashes on the fly:
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "author_count" : {
- "cardinality" : {
- "field" : "author.hash"
- }
- }
- }
- }
- --------------------------------------------------
- NOTE: `rehash` is automatically set to `false` when computing unique counts on
- a `murmur3` field.
- NOTE: Pre-computing hashes is usually only useful on very large and/or
- high-cardinality fields as it saves CPU and memory. However, on numeric
- fields, hashing is very fast and storing the original values requires as much
- or less memory than storing the hashes. This is also true on low-cardinality
- string fields, especially given that those have an optimization in order to
- make sure that hashes are computed at most once per unique value per segment.
- ==== Script
- The `cardinality` metric supports scripting, with a noticeable performance hit
- however since hashes need to be computed on the fly.
- [source,js]
- --------------------------------------------------
- {
- "aggs" : {
- "author_count" : {
- "cardinality" : {
- "script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
- }
- }
- }
- }
- --------------------------------------------------
- TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory.
|