- [[search-aggregations-metrics-cardinality-aggregation]]
- === Cardinality aggregation
- ++++
- <titleabbrev>Cardinality</titleabbrev>
- ++++
- A `single-value` metrics aggregation that calculates an approximate count of
- distinct values. Values can either be extracted from specific fields in the
- document or generated by a script.
- Assume you are indexing store sales and would like to count the number of unique products sold that match a query:
- [source,console]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "type_count": {
- "cardinality": {
- "field": "type"
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- Response:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "type_count": {
- "value": 3
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- ==== Precision control
- This aggregation also supports the `precision_threshold` option:
- [source,console]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "type_count": {
- "cardinality": {
- "field": "type",
- "precision_threshold": 100 <1>
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- <1> The `precision_threshold` option lets you trade memory for accuracy. It
- defines a unique count below which counts are expected to be close to
- accurate. Above this value, counts might become less accurate. The maximum
- supported value is 40000; thresholds above this number have the same
- effect as a threshold of 40000. The default value is `3000`.
- ==== Counts are approximate
- Computing exact counts requires loading values into a hash set and returning its
- size. This doesn't scale to high-cardinality sets or large values, because the
- memory required and the need to send those per-shard sets between nodes would
- consume too many cluster resources.
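To make the cost of exact counting concrete, here is a minimal Java sketch of the hash-set approach described above. It is an illustration only, not how Elasticsearch computes anything: the class name, the method names, and the sample values are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExactDistinctCount {
    // Each shard would have to build a set holding every distinct value it sees,
    // so memory grows with the number (and size) of distinct values.
    static Set<String> shardSet(List<String> shardValues) {
        return new HashSet<>(shardValues);
    }

    // The coordinating node would have to receive every per-shard set in full
    // and merge them before it can return the size.
    static long exactCardinality(List<Set<String>> perShardSets) {
        Set<String> merged = new HashSet<>();
        for (Set<String> shard : perShardSets) {
            merged.addAll(shard);
        }
        return merged.size();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Set<String> shard0 = shardSet(List.of("hat", "t-shirt", "hat"));
        Set<String> shard1 = shardSet(List.of("bag", "hat"));
        System.out.println(exactCardinality(List.of(shard0, shard1))); // prints 3
    }
}
```

Both the per-shard sets and the merged set grow linearly with the number of distinct values, which is the scaling problem the approximate approach avoids.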
- This `cardinality` aggregation is based on the
- https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
- algorithm, which counts based on the hashes of the values with some interesting
- properties:
- * configurable precision, which decides on how to trade memory for accuracy,
- * excellent accuracy on low-cardinality sets,
- * fixed memory usage: no matter if there are tens or billions of unique values,
- memory usage only depends on the configured precision.
- For a precision threshold of `c`, the implementation that we are using requires
- about `c * 8` bytes.
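As a rough worked example of the `c * 8` rule, the following Java sketch estimates memory for a few thresholds. The class and method names are hypothetical, and capping at 40000 is an assumption drawn from the statement that higher thresholds behave like 40000.

```java
class PrecisionMemoryEstimate {
    // Maximum supported precision_threshold, as stated in this section.
    static final long MAX_PRECISION_THRESHOLD = 40_000L;

    // Approximate bytes for one cardinality aggregation, using the "c * 8" rule.
    // Assumption: thresholds above the maximum cost the same as the maximum.
    static long approxBytes(long precisionThreshold) {
        long effective = Math.min(precisionThreshold, MAX_PRECISION_THRESHOLD);
        return effective * 8;
    }

    public static void main(String[] args) {
        for (long c : new long[] {100, 3_000, 40_000, 100_000}) {
            System.out.println("threshold=" + c + " -> about " + approxBytes(c) + " bytes");
        }
    }
}
```

Under these assumptions, the default threshold of 3000 costs about 24000 bytes, and the maximum of 40000 costs about 320000 bytes.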
- The following chart shows how the error varies before and after the threshold:
- ////
- To generate this chart use this gnuplot script:
- [source,gnuplot]
- -------
- #!/usr/bin/gnuplot
- reset
- set terminal png size 1000,400
- set xlabel "Actual cardinality"
- set logscale x
- set ylabel "Relative error (%)"
- set yrange [0:8]
- set title "Cardinality error"
- set grid
- set style data lines
- plot "test.dat" using 1:2 title "threshold=100", \
- "" using 1:3 title "threshold=1000", \
- "" using 1:4 title "threshold=10000"
- #
- -------
- and generate data in a 'test.dat' file using the below Java code:
- [source,java]
- -------
- private static double error(HyperLogLogPlusPlus h, long expected) {
- double actual = h.cardinality(0);
- return Math.abs(expected - actual) / expected;
- }
- public static void main(String[] args) {
- HyperLogLogPlusPlus h100 = new HyperLogLogPlusPlus(precisionFromThreshold(100), BigArrays.NON_RECYCLING_INSTANCE, 1);
- HyperLogLogPlusPlus h1000 = new HyperLogLogPlusPlus(precisionFromThreshold(1000), BigArrays.NON_RECYCLING_INSTANCE, 1);
- HyperLogLogPlusPlus h10000 = new HyperLogLogPlusPlus(precisionFromThreshold(10000), BigArrays.NON_RECYCLING_INSTANCE, 1);
- int next = 100;
- int step = 10;
- for (int i = 1; i <= 10000000; ++i) {
- long h = BitMixer.mix64(i);
- h100.collect(0, h);
- h1000.collect(0, h);
- h10000.collect(0, h);
- if (i == next) {
- System.out.println(i + " " + error(h100, i)*100 + " " + error(h1000, i)*100 + " " + error(h10000, i)*100);
- next += step;
- if (next >= 100 * step) {
- step *= 10;
- }
- }
- }
- }
- -------
- ////
- image:images/cardinality_error.png[]
- For all 3 thresholds, counts have been accurate up to the configured threshold.
- Although not guaranteed, this is likely to be the case. Accuracy in practice depends
- on the dataset in question. In general, most datasets show consistently good
- accuracy. Also note that even with a threshold as low as 100, the error
- remains very low (1-6% as seen in the above graph) even when counting millions of items.
- The HyperLogLog++ algorithm depends on the leading zeros of hashed
- values, so the exact distribution of hashes in a dataset can affect the
- accuracy of the cardinality estimate.
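The role of leading zeros can be illustrated with a deliberately simplified Java sketch. It is not the HyperLogLog++ implementation used by Elasticsearch: it tracks a single maximum instead of many registers, the class and method names are hypothetical, and `mix64` is the MurmurHash3 64-bit finalizer used here as a stand-in hash function.

```java
class LeadingZeroSketch {
    // Stand-in hash: the 64-bit finalizer from MurmurHash3 (an assumption for
    // this sketch, not necessarily the hash Elasticsearch applies).
    static long mix64(long z) {
        z ^= z >>> 33;
        z *= 0xff51afd7ed558ccdL;
        z ^= z >>> 33;
        z *= 0xc4ceb9fe1a85ec53L;
        z ^= z >>> 33;
        return z;
    }

    // Number of zero bits before the first one bit of the hash.
    static int leadingZeros(long hash) {
        return Long.numberOfLeadingZeros(hash);
    }

    // Longest run of leading zeros seen over all hashed values.
    // Duplicates hash to the same value, so they never change the result.
    static int maxLeadingZeros(long[] values) {
        int max = 0;
        for (long v : values) {
            max = Math.max(max, leadingZeros(mix64(v)));
        }
        return max;
    }

    public static void main(String[] args) {
        long[] values = new long[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i + 1;
        }
        System.out.println("max leading zeros: " + maxLeadingZeros(values));
    }
}
```

A uniformly distributed hash starts with at least `k` zero bits with probability `2^-k`, so a long run of leading zeros suggests that many distinct values were seen. The estimate relies on hashes being well distributed, which is why the distribution of hashes in a dataset affects accuracy.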
- ==== Pre-computed hashes
- On string fields that have a high cardinality, it might be faster to store the
- hash of your field values in your index and then run the cardinality aggregation
- on this field. You can do this either by providing hash values from the client side
- or by letting Elasticsearch compute hash values for you by using the
- {plugins}/mapper-murmur3.html[`mapper-murmur3`] plugin.
- NOTE: Pre-computing hashes is usually only useful on very large and/or
- high-cardinality fields as it saves CPU and memory. However, on numeric
- fields, hashing is very fast and storing the original values requires as much
- or less memory than storing the hashes. This is also true on low-cardinality
- string fields, especially given that those have an optimization in order to
- make sure that hashes are computed at most once per unique value per segment.
- ==== Script
- The `cardinality` metric supports scripting, but with a noticeable performance
- hit because hashes need to be computed on the fly.
- [source,console]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "type_promoted_count": {
- "cardinality": {
- "script": {
- "lang": "painless",
- "source": "doc['type'].value + ' ' + doc['promoted'].value"
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- This interprets the `script` parameter as an `inline` script in the `painless` script language with no script parameters. To use a stored script, use the following syntax:
- [source,console]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "type_promoted_count": {
- "cardinality": {
- "script": {
- "id": "my_script",
- "params": {
- "type_field": "type",
- "promoted_field": "promoted"
- }
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[skip:no script]
- ==== Missing value
- The `missing` parameter defines how documents that are missing a value should be treated.
- By default, they are ignored, but it is also possible to treat them as if they
- had a value.
- [source,console]
- --------------------------------------------------
- POST /sales/_search?size=0
- {
- "aggs": {
- "tag_cardinality": {
- "cardinality": {
- "field": "tag",
- "missing": "N/A" <1>
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:sales]
- <1> Documents without a value in the `tag` field will fall into the same bucket as documents that have the value `N/A`.
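The effect of `missing` on the count can be sketched in Java as follows. This is an exact-count illustration with hypothetical names, assuming single-valued documents where `null` stands for a document without a value. It is not the aggregation's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MissingValueCount {
    // fieldValues holds one entry per document; null means the field is absent.
    // missing is the substitute value, or null when no `missing` is configured.
    static long cardinality(List<String> fieldValues, String missing) {
        Set<String> distinct = new HashSet<>();
        for (String value : fieldValues) {
            if (value != null) {
                distinct.add(value);
            } else if (missing != null) {
                // Documents without a value are counted as if they held `missing`.
                distinct.add(missing);
            }
            // Otherwise the document is ignored, which is the default behavior.
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("a", null, "b", null);
        System.out.println(cardinality(tags, null));  // prints 2
        System.out.println(cardinality(tags, "N/A")); // prints 3
    }
}
```

If some documents already hold the literal value `N/A`, substituting it for missing values adds nothing to the count, because both groups fall into the same bucket.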