
[[search-aggregations-metrics-percentile-rank-aggregation]]
=== Percentile ranks aggregation
++++
<titleabbrev>Percentile ranks</titleabbrev>
++++

A `multi-value` metrics aggregation that calculates one or more percentile ranks
over numeric values extracted from the aggregated documents. These values can be
generated by a provided script or extracted from specific numeric or
<<histogram,histogram fields>> in the documents.

[NOTE]
==================================================
Please see <<search-aggregations-metrics-percentile-aggregation-approximation>>
and <<search-aggregations-metrics-percentile-aggregation-compression>> for advice
regarding approximation and memory use of the percentile ranks aggregation.
==================================================

Percentile ranks show the percentage of observed values which are below a certain
value. For example, if a value is greater than or equal to 95% of the observed values
it is said to be at the 95th percentile rank.
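As a rough illustration of what the metric means, an exact percentile rank over a small in-memory sample can be computed like this (a hypothetical sketch with made-up sample data; Elasticsearch itself uses the approximate algorithms noted above):

```python
# Hypothetical illustration: the exact percentile rank of a target value is
# the percentage of observed values less than or equal to that target.
def percentile_rank(values, target):
    return 100.0 * sum(1 for v in values if v <= target) / len(values)

load_times_ms = [310, 420, 450, 480, 490, 505, 510, 550, 610, 720]
print(percentile_rank(load_times_ms, 500))  # 50.0
print(percentile_rank(load_times_ms, 600))  # 80.0
```
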
Assume your data consists of website load times. You may have a service agreement that
95% of page loads complete within 500ms and 99% of page loads complete within 600ms.

Let's look at a range of percentiles representing load time:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time", <1>
        "values": [ 500, 600 ]
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> The field `load_time` must be a numeric field
The response will look like this:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "load_time_ranks": {
      "values": {
        "500.0": 90.01,
        "600.0": 100.0
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/"500.0": 90.01/"500.0": 55.00000000000001/]
// TESTRESPONSE[s/"600.0": 100.0/"600.0": 64.0/]

From this information you can determine that you are hitting the 99% load time target but not quite
hitting the 95% load time target.
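Reading the response programmatically, the check against the hypothetical service agreement might look like this (a sketch; the `ranks` dict simply mirrors the example response above):

```python
# Hypothetical check of the example response against the service agreement:
# 95% of loads within 500ms, 99% of loads within 600ms.
ranks = {"500.0": 90.01, "600.0": 100.0}  # values from the aggregation response

meets_500ms_target = ranks["500.0"] >= 95.0  # only 90.01% were <= 500ms
meets_600ms_target = ranks["600.0"] >= 99.0  # 100% were <= 600ms
print(meets_500ms_target, meets_600ms_target)  # False True
```
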
==== Keyed Response

By default the `keyed` flag is set to `true`, which associates a unique string key with each bucket and returns the ranges as a hash rather than an array. Setting the `keyed` flag to `false` will disable this behavior:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "keyed": false
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]
Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "load_time_ranks": {
      "values": [
        {
          "key": 500.0,
          "value": 90.01
        },
        {
          "key": 600.0,
          "value": 100.0
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/"value": 90.01/"value": 55.00000000000001/]
// TESTRESPONSE[s/"value": 100.0/"value": 64.0/]
==== Script

The percentile rank metric supports scripting. For example, if our load times
are in milliseconds but we want to specify values in seconds, we could use
a script to convert them on-the-fly:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "values": [ 500, 600 ],
        "script": {
          "lang": "painless",
          "source": "doc['load_time'].value / params.timeUnit", <1>
          "params": {
            "timeUnit": 1000 <2>
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> The `field` parameter is replaced with a `script` parameter, which uses the
script to generate the values on which the percentile ranks are calculated
<2> Scripting supports parameterized input just like any other script

This will interpret the `script` parameter as an `inline` script with the `painless` script language and no script parameters. To use a stored script use the following syntax:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "values": [ 500, 600 ],
        "script": {
          "id": "my_script",
          "params": {
            "field": "load_time"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency,stored_example_script]
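For comparison, the per-document conversion that the painless script performs can be sketched client-side (a hypothetical illustration of the arithmetic only, not how Elasticsearch executes scripts):

```python
# Hypothetical sketch of the script's conversion: each load time is divided
# by a time unit before percentile ranks are computed over the results.
def convert(load_times_ms, time_unit=1000):
    return [t / time_unit for t in load_times_ms]

print(convert([500, 1500, 2500]))  # [0.5, 1.5, 2.5]
```
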
==== HDR Histogram

NOTE: This setting exposes the internal implementation of HDR Histogram and the syntax may change in the future.

https://github.com/HdrHistogram/HdrHistogram[HDR Histogram] (High Dynamic Range Histogram) is an alternative implementation
that can be useful when calculating percentile ranks for latency measurements, as it can be faster than the t-digest implementation
with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a
number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000
microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to
1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).
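The resolution guarantee described above can be sketched numerically (a simplified illustration of the worst-case error bound, not the library's actual bucketing code):

```python
# Simplified sketch: with n significant digits, the worst-case error when
# recording a value is roughly value / 10**n.
def worst_case_resolution(value, significant_digits=3):
    return value / 10 ** significant_digits

# Values in microseconds, 3 significant digits:
print(worst_case_resolution(1_000))          # 1.0 -> 1 microsecond at 1 ms
print(worst_case_resolution(3_600_000_000))  # 3600000.0 -> 3.6 s at 1 hour
```
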
The HDR Histogram can be used by specifying the `hdr` object in the request:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "hdr": { <1>
          "number_of_significant_value_digits": 3 <2>
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> The `hdr` object indicates that HDR Histogram should be used to calculate the percentiles, and specific settings for this algorithm can be specified inside the object
<2> `number_of_significant_value_digits` specifies the resolution of values for the histogram in number of significant digits

The HDR Histogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use
the HDR Histogram if the range of values is unknown, as this could lead to high memory usage.
==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "missing": 10 <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> Documents without a value in the `load_time` field will fall into the same bucket as documents that have the value `10`.
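The substitution that `missing` performs can be illustrated with a small sketch (hypothetical data; Elasticsearch applies this per document at aggregation time):

```python
# Hypothetical illustration: documents without a load_time value are ranked
# as if they had the value configured in `missing` (10 in the request above).
raw_load_times = [500, None, 320, None, 610]
effective = [v if v is not None else 10 for v in raw_load_times]
print(effective)  # [500, 10, 320, 10, 610]
```
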