[role="xpack"]
[testenv="basic"]
[[search-aggregations-metrics-boxplot-aggregation]]
=== Boxplot Aggregation

A `boxplot` metrics aggregation computes a box plot summary of numeric values extracted from the aggregated documents.
These values can be generated by a provided script or extracted from specific numeric or
<<histogram,histogram fields>> in the documents.

The `boxplot` aggregation returns the essential information for drawing a https://en.wikipedia.org/wiki/Box_plot[box plot]: the minimum, maximum,
median, first quartile (25th percentile), and third quartile (75th percentile) values.
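
As a rough illustration of what these five values mean, the following Python sketch computes them exactly from a small in-memory list. It is not how Elasticsearch computes them: the `quantile` and `boxplot_summary` helpers, the sample data, and the linear-interpolation rule are all assumptions made for this example.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted, non-empty list."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def boxplot_summary(values):
    """Return the five values a box plot needs, using the same keys as the aggregation."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "q1": quantile(ordered, 0.25),  # first quartile (25th percentile)
        "q2": quantile(ordered, 0.50),  # median
        "q3": quantile(ordered, 0.75),  # third quartile (75th percentile)
    }


# Hypothetical sample data, not taken from the `latency` index.
summary = boxplot_summary([5, 1, 4, 2, 3])
```

Different quantile definitions interpolate differently, so exact values from another tool may not match this sketch on small inputs.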

==== Syntax

A `boxplot` aggregation looks like this in isolation:

[source,js]
--------------------------------------------------
{
  "boxplot": {
    "field": "load_time"
  }
}
--------------------------------------------------
// NOTCONSOLE

Let's look at a boxplot representing load time:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_boxplot": {
      "boxplot": {
        "field": "load_time" <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]
<1> The field `load_time` must be a numeric field

The response will look like this:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "load_time_boxplot": {
      "min": 0.0,
      "max": 990.0,
      "q1": 165.0,
      "q2": 445.0,
      "q3": 725.0
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
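
A client would typically read these values out of the parsed response body. The Python sketch below assumes the response has already been decoded into a dictionary, and it leaves out the elided top-level keys. The derived `iqr` value is an illustration of how the numbers map onto a box plot, not something the aggregation returns.

```python
# Parsed response body, reduced to the part shown in the example response.
response = {
    "aggregations": {
        "load_time_boxplot": {
            "min": 0.0,
            "max": 990.0,
            "q1": 165.0,
            "q2": 445.0,
            "q3": 725.0,
        }
    }
}

# The aggregation result is keyed by the name given in the request.
box = response["aggregations"]["load_time_boxplot"]

median = box["q2"]                      # line drawn inside the box
iqr = box["q3"] - box["q1"]             # height of the box (interquartile range)
value_range = (box["min"], box["max"])  # overall extent of the data
```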

==== Script

The boxplot metric supports scripting. For example, if our load times
are in milliseconds but we want values calculated in seconds, we could use
a script to convert them on the fly:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_boxplot": {
      "boxplot": {
        "script": {
          "lang": "painless",
          "source": "doc['load_time'].value / params.timeUnit", <1>
          "params": {
            "timeUnit": 1000 <2>
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]
<1> The `field` parameter is replaced with a `script` parameter, which uses the
script to generate the values on which the boxplot is calculated
<2> Scripting supports parameterized input just like any other script
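
Conceptually, the script runs once per document and the boxplot is then computed over the script's results instead of the raw field values. The Python sketch below mimics that per-value conversion. The `apply_script` helper and the sample values are hypothetical, and Python's `/` is true division, so the sketch does not model Painless numeric typing.

```python
def apply_script(value, params):
    """Python stand-in for the expression doc['load_time'].value / params.timeUnit."""
    return value / params["timeUnit"]


load_times_ms = [250, 500, 1000, 2000]  # hypothetical load times in milliseconds
params = {"timeUnit": 1000}             # same parameter name as in the request

# Values the aggregation would see after the conversion to seconds.
load_times_s = [apply_script(value, params) for value in load_times_ms]
```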

In the request above, the `script` parameter is interpreted as an `inline` script written in the `painless` script language. To use a
stored script instead, use the following syntax:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_boxplot": {
      "boxplot": {
        "script": {
          "id": "my_script",
          "params": {
            "field": "load_time"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency,stored_example_script]

[[search-aggregations-metrics-boxplot-aggregation-approximation]]
==== Boxplot values are (usually) approximate

The algorithm used by the `boxplot` metric is called TDigest (introduced by
Ted Dunning in
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).

[WARNING]
====
Like other percentile aggregations, boxplot is
https://en.wikipedia.org/wiki/Nondeterministic_algorithm[non-deterministic].
This means you can get slightly different results using the same data.
====

[[search-aggregations-metrics-boxplot-aggregation-compression]]
==== Compression

Approximate algorithms must balance memory utilization with estimation accuracy.
This balance can be controlled using a `compression` parameter:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_boxplot": {
      "boxplot": {
        "field": "load_time",
        "compression": 200 <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]
<1> Compression controls memory usage and approximation error

include::percentile-aggregation.asciidoc[tags=t-digest]
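
To illustrate the memory versus accuracy trade-off only, the Python sketch below uses a deliberately simple stand-in, not TDigest: it groups the sorted values into at most `max_centroids` equal-sized groups and answers quantile queries from the group means. The function names, the grouping rule, and the sample data are assumptions for this example. Here `max_centroids` plays a role only loosely analogous to `compression`.

```python
import statistics


def build_centroids(values, max_centroids):
    """Summarize a non-empty list as at most max_centroids (mean, count) pairs."""
    ordered = sorted(values)
    size = -(-len(ordered) // max_centroids)  # ceiling division: values per group
    centroids = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        centroids.append((sum(chunk) / len(chunk), len(chunk)))
    return centroids


def estimate_quantile(centroids, q):
    """Return the mean of the centroid that covers the q-th fraction of the data."""
    total = sum(count for _, count in centroids)
    target = q * total
    seen = 0
    for mean, count in centroids:
        seen += count
        if seen >= target:
            return mean
    return centroids[-1][0]


values = list(range(0, 1000, 10))  # hypothetical data: 0, 10, ..., 990

# Exact first quartile, for comparison with the two estimates.
exact_q1 = statistics.quantiles(values, n=4, method="inclusive")[0]

coarse_q1 = estimate_quantile(build_centroids(values, 5), 0.25)   # little memory
fine_q1 = estimate_quantile(build_centroids(values, 50), 0.25)    # more memory
```

Keeping more groups costs more memory but brings the estimate closer to the exact quartile, which is the same kind of trade-off the `compression` parameter controls.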

==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default they are ignored, but it is also possible to treat them as if they
had a value.

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "grade_boxplot": {
      "boxplot": {
        "field": "grade",
        "missing": 10 <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]
<1> Documents without a value in the `grade` field will be treated as if they had the value `10`.
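
The difference between the default behavior and `missing` can be sketched in Python as follows. The documents are hypothetical and the list comprehensions only mimic which values would feed the boxplot; they are not how Elasticsearch implements the parameter.

```python
# Hypothetical documents; the third one has no `grade` value.
docs = [{"grade": 7}, {"grade": 9}, {}, {"grade": 4}]
missing = 10  # same value as the `missing` parameter in the request

# Default behavior: documents without the field are ignored.
values_default = [doc["grade"] for doc in docs if "grade" in doc]

# With `missing`: those documents contribute the substitute value instead.
values_with_missing = [doc.get("grade", missing) for doc in docs]
```

In this sample the substitute value changes the maximum from 9 to 10, so the choice of `missing` can shift the reported boxplot values.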