[[search-aggregations-metrics-percentile-aggregation]]
=== Percentiles Aggregation

A `multi-value` metrics aggregation that calculates one or more percentiles
over numeric values extracted from the aggregated documents. These values
can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
of the observed values.

Percentiles are often used to find outliers. In normal distributions, the
0.13th and 99.87th percentiles represent three standard deviations from the
mean. Any data which falls outside three standard deviations is often considered
an anomaly.
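
As a quick sanity check on that rule of thumb, the cumulative probability of a standard
normal distribution at +/-3 standard deviations lands almost exactly on the 0.13th and
99.87th percentiles. A minimal sketch using Python's standard library (purely
illustrative, unrelated to Elasticsearch itself):

[source,python]
--------------------------------------------------
from statistics import NormalDist

norm = NormalDist()  # standard normal: mean 0, standard deviation 1

lower = norm.cdf(-3.0)  # ~0.00135 -> the 0.13th percentile
upper = norm.cdf(3.0)   # ~0.99865 -> the 99.87th percentile

print(f"{lower * 100:.2f}th / {upper * 100:.2f}th percentiles")
# 0.13th / 99.87th percentiles
--------------------------------------------------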

When a range of percentiles is retrieved, it can be used to estimate the
data distribution and determine whether the data is skewed, bimodal, etc.

Assume your data consists of website load times. The average and median
load times are not overly useful to an administrator. The max may be interesting,
but it can be easily skewed by a single slow response.

Let's look at a range of percentiles representing load time:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time" <1>
            }
        }
    }
}
--------------------------------------------------
<1> The field `load_time` must be a numeric field

By default, the `percentile` metric will generate a range of
percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations": {
        "load_time_outlier": {
            "values" : {
                "1.0": 15,
                "5.0": 20,
                "25.0": 23,
                "50.0": 25,
                "75.0": 29,
                "95.0": 60,
                "99.0": 150
            }
        }
    }
}
--------------------------------------------------

As you can see, the aggregation will return a calculated value for each percentile
in the default range. If we assume response times are in milliseconds, it is
immediately obvious that the webpage normally loads in 15-30ms, but occasionally
spikes to 60-150ms.

Often, administrators are only interested in outliers -- the extreme percentiles.
We can specify just the percents we are interested in (requested percentiles
must be values between 0 and 100 inclusive):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9] <1>
            }
        }
    }
}
--------------------------------------------------
<1> Use the `percents` parameter to specify particular percentiles to calculate
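
Once the response comes back, the requested percentiles can be read straight out of the
`values` map shown in the earlier response example. Below is a minimal sketch of consuming
that structure in Python; the response fragment and the 200ms threshold are made-up values
purely for illustration:

[source,python]
--------------------------------------------------
# Hypothetical response fragment, shaped like the "values" map shown earlier
response = {
    "aggregations": {
        "load_time_outlier": {
            "values": {"95.0": 60, "99.0": 150, "99.9": 210}
        }
    }
}

percentiles = response["aggregations"]["load_time_outlier"]["values"]

# Flag any requested percentile whose load time exceeds an assumed 200ms SLA
SLA_MS = 200
for percent, value in sorted(percentiles.items(), key=lambda kv: float(kv[0])):
    status = "over SLA" if value > SLA_MS else "ok"
    print(f"p{percent}: {value}ms ({status})")
--------------------------------------------------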

==== Script

The percentile metric supports scripting. For example, if our load times
are in milliseconds but we want percentiles calculated in seconds, we could use
a script to convert them on-the-fly:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : "doc['load_time'].value / timeUnit", <1>
                "params" : {
                    "timeUnit" : 1000 <2>
                }
            }
        }
    }
}
--------------------------------------------------
<1> The `field` parameter is replaced with a `script` parameter, which uses the
script to generate the values on which the percentiles are calculated
<2> Scripting supports parameterized input just like any other script

TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory.

[[search-aggregations-metrics-percentile-aggregation-approximation]]
==== Percentiles are (usually) approximate

There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the 50th
percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
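
As a concrete sketch of that naive approach (plain Python, shown only to illustrate the
idea -- it is not how Elasticsearch computes percentiles):

[source,python]
--------------------------------------------------
def naive_percentile(values, percent):
    """Exact percentile: sort every value, then index into the sorted array.

    The sorted array keeps every observation in memory, so memory grows
    linearly with the number of values.
    """
    ordered = sorted(values)
    index = int(len(ordered) * percent / 100)      # e.g. 50 -> middle of the array
    return ordered[min(index, len(ordered) - 1)]   # clamp so 100 stays in bounds

load_times = [15, 20, 23, 25, 29, 60, 150]
print(naive_percentile(load_times, 50))   # 25
print(naive_percentile(load_times, 99))   # 150
--------------------------------------------------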

Clearly, the naive implementation does not scale -- the sorted array grows
linearly with the number of values in your dataset. To calculate percentiles
across potentially billions of values in an Elasticsearch cluster, _approximate_
percentiles are calculated.

The algorithm used by the `percentile` metric is called TDigest (introduced by
Ted Dunning in
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).

When using this metric, there are a few guidelines to keep in mind:

- Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%)
are more accurate than less extreme percentiles, such as the median (see the sketch
after this list)
- For small sets of values, percentiles are highly accurate (and potentially
100% accurate if the data is small enough)
- As the quantity of values in a bucket grows, the algorithm begins to approximate
the percentiles. It is effectively trading accuracy for memory savings. The
exact level of inaccuracy is difficult to generalize, since it depends on your
data distribution and volume of data being aggregated
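
To make the first guideline concrete, `q(1-q)` is largest at the median and shrinks quickly
toward the tails, so the error bound is tightest exactly where outlier analysis looks. A
quick illustration (the numbers below are only the proportionality term, not absolute error
figures):

[source,python]
--------------------------------------------------
# q(1-q) peaks at the median (q=0.5) and falls off for extreme percentiles
for q in (0.50, 0.75, 0.95, 0.99, 0.999):
    print(f"q={q:>5}: q(1-q) = {q * (1 - q):.6f}")

# q=  0.5: q(1-q) = 0.250000
# q= 0.75: q(1-q) = 0.187500
# q= 0.95: q(1-q) = 0.047500
# q= 0.99: q(1-q) = 0.009900
# q=0.999: q(1-q) = 0.000999
--------------------------------------------------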

The following chart shows the relative error on a uniform distribution depending
on the number of collected values and the requested percentile:

image:images/percentiles_error.png[]

It shows how precision is better for extreme percentiles. The reason why error diminishes
for large numbers of values is that the law of large numbers makes the distribution of
values more and more uniform and the t-digest tree can do a better job at summarizing
it. It would not be the case on more skewed distributions.

[[search-aggregations-metrics-percentile-aggregation-compression]]
==== Compression

experimental[The `compression` parameter is specific to the current internal implementation of percentiles, and may change in the future]

Approximate algorithms must balance memory utilization with estimation accuracy.
This balance can be controlled using a `compression` parameter:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "compression" : 200 <1>
            }
        }
    }
}
--------------------------------------------------
<1> Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
more nodes available, the higher the accuracy (and the larger the memory footprint),
proportional to the volume of data. The `compression` parameter limits the maximum
number of nodes to `20 * compression`.

Therefore, by increasing the compression value, you can increase the accuracy of
your percentiles at the cost of more memory. Larger compression values also
make the algorithm slower since the underlying tree data structure grows in size,
resulting in more expensive operations. The default compression value is
`100`.
- A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount
- of data which arrives sorted and in-order) the default settings will produce a
- TDigest roughly 64KB in size. In practice data tends to be more random and
- the TDigest will use less memory.
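
Putting those two figures together, the worst-case footprint for a given `compression`
setting works out as follows. This is only a back-of-the-envelope sketch based on the
`20 * compression` node limit and the ~32 bytes per node quoted above:

[source,python]
--------------------------------------------------
BYTES_PER_NODE = 32          # rough per-node cost quoted above
NODES_PER_COMPRESSION = 20   # TDigest caps nodes at 20 * compression

def worst_case_tdigest_bytes(compression):
    return NODES_PER_COMPRESSION * compression * BYTES_PER_NODE

for compression in (100, 200, 500):
    kb = worst_case_tdigest_bytes(compression) / 1000  # decimal KB, matching the ~64KB above
    print(f"compression={compression}: ~{kb:.0f}KB worst case")

# compression=100: ~64KB worst case
# compression=200: ~128KB worst case
# compression=500: ~320KB worst case
--------------------------------------------------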
|