- [[search-aggregations-bucket-sampler-aggregation]]
- === Sampler Aggregation
- A filtering aggregation that limits the processing of any sub-aggregations to a sample of the top-scoring documents.
- .Example use cases:
- * Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
- * Reducing the running cost of aggregations that can produce useful results from a sample alone, e.g. `significant_terms`
-
- Example:
- A query on StackOverflow data for the popular term `javascript` OR the rarer term
- `kibana` will match many documents, most of which do not contain the word Kibana. To focus
- the `significant_terms` aggregation on the top-scoring documents, which are more likely to match
- the most interesting parts of our query, we use a sample.
- [source,js]
- --------------------------------------------------
- POST /stackoverflow/_search?size=0
- {
-     "query": {
-         "query_string": {
-             "query": "tags:kibana OR tags:javascript"
-         }
-     },
-     "aggs": {
-         "sample": {
-             "sampler": {
-                 "shard_size": 200
-             },
-             "aggs": {
-                 "keywords": {
-                     "significant_terms": {
-                         "field": "tags",
-                         "exclude": ["kibana", "javascript"]
-                     }
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[setup:stackoverflow]
- Response:
- [source,js]
- --------------------------------------------------
- {
-     ...
-     "aggregations": {
-         "sample": {
-             "doc_count": 200,<1>
-             "keywords": {
-                 "doc_count": 200,
-                 "bg_count": 650,
-                 "buckets": [
-                     {
-                         "key": "elasticsearch",
-                         "doc_count": 150,
-                         "score": 1.078125,
-                         "bg_count": 200
-                     },
-                     {
-                         "key": "logstash",
-                         "doc_count": 50,
-                         "score": 0.5625,
-                         "bg_count": 50
-                     }
-                 ]
-             }
-         }
-     }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- <1> 200 documents were sampled in total. The cost of performing the nested `significant_terms` aggregation was
- therefore limited rather than unbounded.
- Without the `sampler` aggregation, the request considers the full "long tail" of low-quality matches and therefore identifies
- less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.
- [source,js]
- --------------------------------------------------
- POST /stackoverflow/_search?size=0
- {
-     "query": {
-         "query_string": {
-             "query": "tags:kibana OR tags:javascript"
-         }
-     },
-     "aggs": {
-         "low_quality_keywords": {
-             "significant_terms": {
-                 "field": "tags",
-                 "size": 3,
-                 "exclude": ["kibana", "javascript"]
-             }
-         }
-     }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[setup:stackoverflow]
- Response:
- [source,js]
- --------------------------------------------------
- {
-     ...
-     "aggregations": {
-         "low_quality_keywords": {
-             "doc_count": 600,
-             "bg_count": 650,
-             "buckets": [
-                 {
-                     "key": "angular",
-                     "doc_count": 200,
-                     "score": 0.02777,
-                     "bg_count": 200
-                 },
-                 {
-                     "key": "jquery",
-                     "doc_count": 200,
-                     "score": 0.02777,
-                     "bg_count": 200
-                 },
-                 {
-                     "key": "logstash",
-                     "doc_count": 50,
-                     "score": 0.0069,
-                     "bg_count": 50
-                 }
-             ]
-         }
-     }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- // TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
- // TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]
- ==== shard_size
- The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
- The default value is 100.
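- Because the limit applies on each shard, the total number of sampled documents can be as high as `shard_size` multiplied by the number of shards searched. As a minimal sketch, assuming the same `stackoverflow` index used in the examples above, the following request restricts the sample to 50 documents per shard. The value 50 and the aggregation names `sample` and `keywords` are illustrative choices, not recommendations.
- [source,js]
- --------------------------------------------------
- POST /stackoverflow/_search?size=0
- {
-     "query": {
-         "query_string": {
-             "query": "tags:kibana OR tags:javascript"
-         }
-     },
-     "aggs": {
-         "sample": {
-             "sampler": {
-                 "shard_size": 50
-             },
-             "aggs": {
-                 "keywords": {
-                     "significant_terms": {
-                         "field": "tags"
-                     }
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[setup:stackoverflow]
- A smaller `shard_size` lowers the cost of the nested aggregations but leaves fewer documents from which to compute significance, so the useful range depends on the data.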
- ==== Limitations
- ===== Cannot be nested under `breadth_first` aggregations
- Being a quality-based filter, the `sampler` aggregation needs access to the relevance score produced for each document.
- It therefore cannot be nested under a `terms` aggregation whose `collect_mode` has been switched from the default `depth_first` mode to `breadth_first`, because that mode discards scores.
- In this situation an error is thrown.
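- To make the restriction concrete, the following sketch shows the shape of a request that would hit this limitation. It assumes the same `stackoverflow` index as the earlier examples. The aggregation names `tags` and `sample` are hypothetical, and the exact error returned is not reproduced here.
- [source,js]
- --------------------------------------------------
- POST /stackoverflow/_search?size=0
- {
-     "query": {
-         "query_string": {
-             "query": "tags:kibana OR tags:javascript"
-         }
-     },
-     "aggs": {
-         "tags": {
-             "terms": {
-                 "field": "tags",
-                 "collect_mode": "breadth_first"
-             },
-             "aggs": {
-                 "sample": {
-                     "sampler": {
-                         "shard_size": 200
-                     }
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------
- // NOTCONSOLE
- Leaving `collect_mode` at its default, or moving the `sampler` aggregation above the `terms` aggregation so that it sees the scores first, avoids the problem.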