[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

A filtering aggregation used to limit the processing of any sub-aggregations to a sample of the top-scoring documents.

.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results using only samples, e.g. `significant_terms`

Example:

A query on StackOverflow data for the popular term `javascript` OR the rarer term
`kibana` will match many documents, most of them missing the word Kibana. To focus
the `significant_terms` aggregation on top-scoring documents that are more likely to match
the most interesting parts of our query, we use a sample.

[source,console,id=sampler-aggregation-example]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "exclude": ["kibana", "javascript"]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "sample": {
      "doc_count": 200, <1>
      "keywords": {
        "doc_count": 200,
        "bg_count": 650,
        "buckets": [
          {
            "key": "elasticsearch",
            "doc_count": 150,
            "score": 1.078125,
            "bg_count": 200
          },
          {
            "key": "logstash",
            "doc_count": 50,
            "score": 0.5625,
            "bg_count": 50
          }
        ]
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

<1> 200 documents were sampled in total. The cost of performing the nested `significant_terms` aggregation was
therefore limited rather than unbounded.

Without the `sampler` aggregation, the request query considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.

[source,console,id=sampler-aggregation-no-sampler-example]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "low_quality_keywords": {
      "significant_terms": {
        "field": "tags",
        "size": 3,
        "exclude": ["kibana", "javascript"]
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "low_quality_keywords": {
      "doc_count": 600,
      "bg_count": 650,
      "buckets": [
        {
          "key": "angular",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "jquery",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "logstash",
          "doc_count": 50,
          "score": 0.0069,
          "bg_count": 50
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
// TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]
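
==== shard_size

The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
Because the limit applies per shard, the total number of sampled documents can be up to `shard_size` multiplied by the number of shards searched.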
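
As a rough illustration, the request below reuses the earlier `stackoverflow` example but leaves out `shard_size`, so each shard would be expected to contribute at most the default 100 top-scoring documents to the sample. This is a sketch rather than a verified snippet; the aggregation names `sample` and `keywords` are simply the labels used earlier on this page.

[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {}, <1>
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "exclude": ["kibana", "javascript"]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

<1> No `shard_size` is given, so the default of 100 documents per shard is assumed to apply.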

==== Limitations

[[sampler-breadth-first-nested-agg]]
===== Cannot be nested under `breadth_first` aggregations

Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation whose `collect_mode` has been switched from the default `depth_first` mode to `breadth_first`, as this discards scores.
In this situation an error will be thrown.
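
For illustration, a request shaped like the following would be expected to be rejected, because the `sampler` sits under a `terms` aggregation that uses `breadth_first` collection. The aggregation names `by_tag` and `sample` are hypothetical, and the exact error message is not reproduced here since this page does not state it.

[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "by_tag": {
      "terms": {
        "field": "tags",
        "collect_mode": "breadth_first" <1>
      },
      "aggs": {
        "sample": {
          "sampler": { <2>
            "shard_size": 200
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrates a request that is expected to fail]

<1> Switching from the default `depth_first` mode discards the scores that the sampler needs.
<2> Moving the `sampler` out from under the `breadth_first` `terms` aggregation, or leaving `collect_mode` at its default, should avoid the error.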