[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler aggregation
++++
<titleabbrev>Sampler</titleabbrev>
++++

A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.

.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms`

Example:

A query on StackOverflow data for the popular term `javascript` OR the rarer term
`kibana` will match many documents - most of them missing the word Kibana. To focus
the `significant_terms` aggregation on top-scoring documents that are more likely to match
the most interesting parts of our query, we use a sample.

[source,console,id=sampler-aggregation-example]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "exclude": [ "kibana", "javascript" ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "sample": {
      "doc_count": 200, <1>
      "keywords": {
        "doc_count": 200,
        "bg_count": 650,
        "buckets": [
          {
            "key": "elasticsearch",
            "doc_count": 150,
            "score": 1.078125,
            "bg_count": 200
          },
          {
            "key": "logstash",
            "doc_count": 50,
            "score": 0.5625,
            "bg_count": 50
          }
        ]
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

<1> 200 documents were sampled in total. The cost of performing the nested `significant_terms` aggregation was
therefore limited rather than unbounded.

Without the `sampler` aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.

[source,console,id=sampler-aggregation-no-sampler-example]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "low_quality_keywords": {
      "significant_terms": {
        "field": "tags",
        "size": 3,
        "exclude": [ "kibana", "javascript" ]
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "low_quality_keywords": {
      "doc_count": 600,
      "bg_count": 650,
      "buckets": [
        {
          "key": "angular",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "jquery",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "logstash",
          "doc_count": 50,
          "score": 0.0069,
          "bg_count": 50
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
// TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]

==== shard_size

The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
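
As an illustrative sketch (reusing the `stackoverflow` index and `tags` field from the examples above), raising the per-shard sample to 500 documents looks like this:

[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 500
      },
      "aggs": {
        "keywords": {
          "significant_terms": { "field": "tags" }
        }
      }
    }
  }
}
--------------------------------------------------

Because the limit is applied per shard, the total number of documents in the merged sample can be up to `shard_size` multiplied by the number of shards that serve the request.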

==== Limitations

[[sampler-breadth-first-nested-agg]]
===== Cannot be nested under `breadth_first` aggregations

Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation which has the `collect_mode` switched from the default `depth_first` mode to `breadth_first`, as this discards scores.
In this situation an error will be thrown.
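
For illustration, a request shaped roughly like the following sketch (again reusing the `stackoverflow` index; not one of the tested examples above) nests a `sampler` under a `breadth_first` `terms` aggregation and is therefore rejected:

[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "aggs": {
    "tags": {
      "terms": {
        "field": "tags",
        "collect_mode": "breadth_first" <1>
      },
      "aggs": {
        "sample": {
          "sampler": {
            "shard_size": 200
          },
          "aggs": {
            "keywords": {
              "significant_terms": { "field": "tags" }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> Switching the parent `terms` aggregation to `breadth_first` discards the relevance scores the nested `sampler` relies on, so this request returns an error.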