lqb
/
elasticsearch
mirror of https://gitee.com/mirrors/elasticsearch.git


			
				
					
						
						
							12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485
							[discrete]
[[esql-agg-count-distinct]]
=== `COUNT_DISTINCT`

*Syntax*

[source,esql]
----
COUNT_DISTINCT(expression[, precision_threshold])
----

*Parameters*

`expression`::
Expression that outputs the values on which to perform a distinct count.

`precision_threshold`::
Precision threshold. Refer to <<esql-agg-count-distinct-approximate>>. The
maximum supported value is 40000. Thresholds above this number will have the
same effect as a threshold of 40000. The default value is 3000.

*Description*

Returns the approximate number of distinct values.

*Supported types*

Can take any field type as input.

*Examples*

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]
|===

With the optional second parameter to configure the precision threshold:

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]
|===

The expression can use inline functions. This example splits a string into
multiple values using the `SPLIT` function and counts the unique values:

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExpression]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExpression-result]
|===

[discrete]
[[esql-agg-count-distinct-approximate]]
==== Counts are approximate

Computing exact counts requires loading values into a set and returning its
size. This doesn't scale when working on high-cardinality sets and/or large
values as the required memory usage and the need to communicate those
per-shard sets between nodes would utilize too many resources of the cluster.

This `COUNT_DISTINCT` function is based on the
https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:

include::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]

The `COUNT_DISTINCT` function takes an optional second parameter to configure
the precision threshold. The precision_threshold options allows to trade memory
for accuracy, and defines a unique count below which counts are expected to be
close to accurate. Above this value, counts might become a bit more fuzzy. The
maximum supported value is 40000, thresholds above this number will have the
same effect as a threshold of 40000. The default value is `3000`.