[[search-aggregations-metrics-percentile-rank-aggregation]]
=== Percentile Ranks Aggregation

A `multi-value` metrics aggregation that calculates one or more percentile ranks
over numeric values extracted from the aggregated documents. These values can be
generated by a provided script or extracted from specific numeric or
<<histogram,histogram fields>> in the documents.

[NOTE]
==================================================
Please see <<search-aggregations-metrics-percentile-aggregation-approximation>>
and <<search-aggregations-metrics-percentile-aggregation-compression>> for advice
regarding approximation and memory use of the percentile ranks aggregation.
==================================================

Percentile ranks show the percentage of observed values which are below a certain
value. For example, if a value is greater than or equal to 95% of the observed values
it is said to be at the 95th percentile rank.
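
To make the definition concrete, an exact percentile rank can be computed directly. Note that Elasticsearch uses an approximate algorithm (t-digest by default), so real results may differ slightly from the exact value; the data below is hypothetical. A minimal sketch in Python:

```python
def percentile_rank(values, v):
    """Exact percentile rank: the percentage of observed values
    that are less than or equal to v."""
    return 100.0 * sum(1 for x in values if x <= v) / len(values)

# Hypothetical load times in milliseconds
load_times = [210, 300, 350, 400, 450, 480, 490, 500, 550, 610]
print(percentile_rank(load_times, 500))  # 80.0 -- 8 of 10 values are <= 500
```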

Assume your data consists of website load times. You may have a service agreement that
95% of page loads complete within 500ms and 99% of page loads complete within 600ms.

Let's look at a range of percentiles representing load time:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time", <1>
        "values": [ 500, 600 ]
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> The field `load_time` must be a numeric field

The response will look like this:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "load_time_ranks": {
      "values": {
        "500.0": 90.01,
        "600.0": 100.0
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/"500.0": 90.01/"500.0": 55.00000000000001/]
// TESTRESPONSE[s/"600.0": 100.0/"600.0": 64.0/]

From this information you can determine you are hitting the 99% load time target but not quite
hitting the 95% load time target.
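
That conclusion can be checked mechanically by comparing each returned rank against its target percentage. A small Python sketch using the values from the example response above:

```python
# Percentile ranks from the example response, keyed by load-time value (ms)
ranks = {500: 90.01, 600: 100.0}

# Service agreement targets: 95% of loads within 500ms, 99% within 600ms
targets = {500: 95.0, 600: 99.0}

met = {value: ranks[value] >= target for value, target in targets.items()}
print(met)  # {500: False, 600: True} -- the 600ms/99% target is met
```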

==== Keyed Response

By default the `keyed` flag is set to `true`, which associates a unique string key with each bucket and returns the ranges as a hash rather than an array. Setting the `keyed` flag to `false` will disable this behavior:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "keyed": false
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "load_time_ranks": {
      "values": [
        {
          "key": 500.0,
          "value": 90.01
        },
        {
          "key": 600.0,
          "value": 100.0
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/"value": 90.01/"value": 55.00000000000001/]
// TESTRESPONSE[s/"value": 100.0/"value": 64.0/]

==== Script

The percentile rank metric supports scripting. For example, if our load times
are in milliseconds but we want to specify values in seconds, we could use
a script to convert them on-the-fly:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "values": [ 500, 600 ],
        "script": {
          "lang": "painless",
          "source": "doc['load_time'].value / params.timeUnit", <1>
          "params": {
            "timeUnit": 1000 <2>
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> The `field` parameter is replaced with a `script` parameter, which uses the
script to generate the values on which the percentile ranks are calculated
<2> Scripting supports parameterized input just like any other script

This will interpret the `script` parameter as an `inline` script with the `painless` script language and no script parameters. To use a stored script use the following syntax:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "values": [ 500, 600 ],
        "script": {
          "id": "my_script",
          "params": {
            "field": "load_time"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency,stored_example_script]

==== HDR Histogram

NOTE: This setting exposes the internal implementation of HDR Histogram and the syntax may change in the future.

https://github.com/HdrHistogram/HdrHistogram[HDR Histogram] (High Dynamic Range Histogram) is an alternative implementation
that can be useful when calculating percentile ranks for latency measurements as it can be faster than the t-digest implementation
with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a
number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000
microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to
1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).
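
The resolution figures above can be sanity-checked with quick arithmetic: with `n` significant digits, the worst-case error at a given value is roughly that value divided by 10^n. This is an illustrative simplification, not the library's exact power-of-two bucketing scheme:

```python
def worst_case_resolution(value, significant_digits):
    """Rough worst-case resolution of an HDR Histogram at a given value:
    the value divided by 10**significant_digits. An illustrative
    simplification of the library's actual bucketing."""
    return value / 10 ** significant_digits

one_hour_us = 3_600_000_000  # 1 hour in microseconds
print(worst_case_resolution(one_hour_us, 3))  # 3600000.0 us, i.e. 3.6 seconds
```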

The HDR Histogram can be used by specifying the `hdr` object in the request:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "hdr": { <1>
          "number_of_significant_value_digits": 3 <2>
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> `hdr` object indicates that HDR Histogram should be used to calculate the percentiles and specific settings for this algorithm can be specified inside the object
<2> `number_of_significant_value_digits` specifies the resolution of values for the histogram in number of significant digits

The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use
the HDRHistogram if the range of values is unknown as this could lead to high memory usage.

==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "missing": 10 <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> Documents without a value in the `load_time` field will fall into the same bucket as documents that have the value `10`.
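
The effect of `missing` can be illustrated with the exact percentile-rank arithmetic (hypothetical data; Elasticsearch's actual results are approximate):

```python
def percentile_rank(values, v):
    """Exact percentile rank: percentage of observed values <= v."""
    return 100.0 * sum(1 for x in values if x <= v) / len(values)

# Hypothetical documents; None marks a document missing the load_time field
load_times = [450, 520, None, 680]

# With "missing": 10, absent values are treated as if they were 10
with_missing = [x if x is not None else 10 for x in load_times]
print(percentile_rank(with_missing, 500))  # 50.0 -- 10 and 450 are <= 500
```

By default the `None` document would simply be dropped from the calculation, so the same query without `missing` would consider only three values.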