[[search-aggregations-metrics-cardinality-aggregation]]
=== Cardinality Aggregation

A `single-value` metrics aggregation that calculates an approximate count of
distinct values. Values can be extracted either from specific fields in the
document or generated by a script.

Assume you are indexing books and would like to count the unique authors that
match a query:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author"
            }
        }
    }
}
--------------------------------------------------

==== Precision control

This aggregation also supports the `precision_threshold` option:

experimental[The `precision_threshold` option is specific to the current internal implementation of the `cardinality` agg, which may change in the future]

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100 <1>
            }
        }
    }
}
--------------------------------------------------
<1> The `precision_threshold` option allows you to trade memory for accuracy, and
defines a unique count below which counts are expected to be close to
accurate. Above this value, counts might become a bit more fuzzy. The maximum
supported value is 40000; thresholds above this number will have the same
effect as a threshold of 40000.
The default value depends on the number of parent aggregations that create
multiple buckets (such as terms or histograms).

==== Counts are approximate

Computing exact counts requires loading values into a hash set and returning its
size. This doesn't scale when working on high-cardinality sets and/or large
values, as the required memory usage and the need to communicate those
per-shard sets between nodes would utilize too many resources of the cluster.

This `cardinality` aggregation is based on the
http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:

* configurable precision, which decides on how to trade memory for accuracy,

* excellent accuracy on low-cardinality sets,

* fixed memory usage: no matter if there are tens or billions of unique values,
memory usage only depends on the configured precision.

For a precision threshold of `c`, the implementation that we are using requires
about `c * 8` bytes.
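For instance, the maximum threshold of 40000 works out to roughly 320KB per
bucket, while a threshold of 1000 needs only about 8KB.
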
The following chart shows how the error varies before and after the threshold:

image:images/cardinality_error.png[]

For all 3 thresholds, counts have been accurate up to the configured threshold
(although not guaranteed, this is likely to be the case). Please also note that
even with a threshold as low as 100, the error remains under 5% when counting
millions of items.

==== Pre-computed hashes

On string fields that have a high cardinality, it might be faster to store the
hash of your field values in your index and then run the cardinality aggregation
on this field. This can either be done by providing hash values on the client
side or by letting Elasticsearch compute hash values for you by using the
{plugins}/mapper-murmur3.html[`mapper-murmur3`] plugin.

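As a rough sketch of the plugin-based approach (the `book` type and `hash`
sub-field names here are illustrative, and the plugin must be installed first),
a mapping like the following indexes a murmur3 hash of each `author` value
alongside the original:

[source,js]
--------------------------------------------------
{
    "mappings" : {
        "book" : {
            "properties" : {
                "author" : {
                    "type" : "string",
                    "fields" : {
                        "hash" : {
                            "type" : "murmur3" <1>
                        }
                    }
                }
            }
        }
    }
}
--------------------------------------------------
<1> The `murmur3` field type is provided by the `mapper-murmur3` plugin and stores a hash of the parent field's values at index time.

The `cardinality` aggregation can then be run on `author.hash` instead of
`author`.
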
NOTE: Pre-computing hashes is usually only useful on very large and/or
high-cardinality fields as it saves CPU and memory. However, on numeric
fields, hashing is very fast and storing the original values requires no more
memory than storing the hashes. This is also true on low-cardinality
string fields, especially given that those have an optimization in order to
make sure that hashes are computed at most once per unique value per segment.

==== Script

The `cardinality` metric supports scripting, though with a noticeable
performance hit since hashes need to be computed on the fly.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
            }
        }
    }
}
--------------------------------------------------

This will interpret the `script` parameter as an `inline` script with the default script language and no script parameters. To use a file script use the following syntax:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "first_name_field": "author.first_name",
                        "last_name_field": "author.last_name"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

TIP: For indexed scripts replace the `file` parameter with an `id` parameter.

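As a sketch of that variant, assuming a script has already been stored under
the id `my_script`, the request body would look like this:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "first_name_field": "author.first_name",
                        "last_name_field": "author.last_name"
                    }
                }
            }
        }
    }
}
--------------------------------------------------
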
==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tag_cardinality" : {
            "cardinality" : {
                "field" : "tag",
                "missing": "N/A" <1>
            }
        }
    }
}
--------------------------------------------------
<1> Documents without a value in the `tag` field will fall into the same bucket as documents that have the value `N/A`.
|