@@ -6,16 +6,16 @@ The `diversified_sampler` aggregation adds the ability to limit the number of ma
NOTE: Any good market researcher will tell you that when working with samples of data it is important
that the sample represents a healthy variety of opinions rather than being skewed by any single voice.
-The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography,
-a large spike in a timeline or an over-active forum spammer).
+The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography,
+a large spike in a timeline or an over-active forum spammer).
.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Removing bias from analytics by ensuring fair representation of content from different sources
* Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms`
-
-A choice of `field` or `script` setting is used to provide values used for de-duplication and the `max_docs_per_value` setting controls the maximum
+
+A choice of `field` or `script` setting is used to provide values used for de-duplication and the `max_docs_per_value` setting controls the maximum
number of documents collected on any one shard which share a common value. The default setting for `max_docs_per_value` is 1.
The aggregation will throw an error if the choice of `field` or `script` produces multiple values for a single document (de-duplication using multi-valued fields is not supported due to efficiency concerns).
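
Reviewer note: the interaction between `shard_size` and `max_docs_per_value` described in this hunk can be illustrated with a small model. The following is a minimal Python sketch under stated assumptions, not Elasticsearch's actual collector: it assumes documents on one shard are considered in descending score order and that each document carries exactly one de-duplication value. The function name `diversified_sample` and the sample documents are hypothetical.

```python
from collections import Counter

def diversified_sample(docs, key, shard_size, max_docs_per_value=1):
    """Pick up to shard_size top-scoring docs, keeping at most
    max_docs_per_value docs that share the same value for `key`.
    Simplified single-shard model for illustration only."""
    picked = []
    seen = Counter()  # how many docs were kept per de-duplication value
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        if len(picked) >= shard_size:
            break
        value = doc[key]
        if seen[value] >= max_docs_per_value:
            continue  # this value already has its quota; skip the doc
        seen[value] += 1
        picked.append(doc)
    return picked

docs = [
    {"id": 1, "author": "alice", "score": 9.0},
    {"id": 2, "author": "alice", "score": 8.5},
    {"id": 3, "author": "bob", "score": 7.0},
    {"id": 4, "author": "alice", "score": 6.0},
    {"id": 5, "author": "carol", "score": 5.0},
]

# With the default of one doc per value, "alice" contributes only her best match.
print([d["id"] for d in diversified_sample(docs, "author", shard_size=3)])
# Allowing two docs per value lets "alice" fill two of the three slots.
print([d["id"] for d in diversified_sample(docs, "author", shard_size=3, max_docs_per_value=2)])
```

In this model the first call keeps documents 1, 3 and 5, and the second keeps 1, 2 and 3, which is the bias-versus-relevance trade-off the setting controls.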
@@ -39,7 +39,7 @@ POST /stackoverflow/_search?size=0
"my_unbiased_sample": {
"diversified_sampler": {
"shard_size": 200,
- "field" : "author"
+ "field" : "author"
},
"aggs": {
"keywords": {
@@ -89,7 +89,7 @@ Response:
==== Scripted example:
-In this scenario we might want to diversify on a combination of field values. We can use a `script` to produce a hash of the
+In this scenario we might want to diversify on a combination of field values. We can use a `script` to produce a hash of the
multiple values in a tags field to ensure we don't have a sample that consists of the same repeated combinations of tags.
[source,js]
@@ -109,7 +109,7 @@ POST /stackoverflow/_search?size=0
"script" : {
"lang": "painless",
"source": "doc['tags'].values.hashCode()"
- }
+ }
},
"aggs": {
"keywords": {
@@ -150,7 +150,7 @@ Response:
"doc_count": 3,
"score": 1.34,
"bg_count": 200
- },
+ }
]
}
}
@@ -175,11 +175,11 @@ The default setting is "1".
The optional `execution_hint` setting can influence the management of the values used for de-duplication.
Each option will hold up to `shard_size` values in memory while performing de-duplication but the type of value held can be controlled as follows:
-
+
- hold field values directly (`map`)
- hold ordinals of the field as determined by the Lucene index (`global_ordinals`)
- hold hashes of the field values - with potential for hash collisions (`bytes_hash`)
-
+
The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not.
The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions.
Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.
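
Reviewer note: for readers who want to try the hints described above, here is a minimal Python sketch that builds a request body carrying an `execution_hint`. The helper name `diversified_sampler_request` and the `KNOWN_HINTS` set are hypothetical; the only assumptions about Elasticsearch are that the hint sits alongside `shard_size` and `field` inside the `diversified_sampler` object and that the three names listed above are the accepted values. The helper rejects misspelled hint names locally, which is separate from Elasticsearch ignoring a valid hint that does not apply.

```python
import json

# The three hint names listed in the documentation text above.
KNOWN_HINTS = {"map", "global_ordinals", "bytes_hash"}

def diversified_sampler_request(field, shard_size, execution_hint=None):
    """Build a search body with a diversified_sampler aggregation.
    Hypothetical helper for illustration; not part of any client library."""
    sampler = {"shard_size": shard_size, "field": field}
    if execution_hint is not None:
        if execution_hint not in KNOWN_HINTS:
            raise ValueError(f"unknown execution_hint: {execution_hint!r}")
        sampler["execution_hint"] = execution_hint
    return {"aggs": {"my_unbiased_sample": {"diversified_sampler": sampler}}}

body = diversified_sampler_request("author", shard_size=200, execution_hint="map")
print(json.dumps(body, indent=2))
```

Because there is no backward compatibility guarantee on these hints, leaving `execution_hint` unset and relying on the default is the safer choice unless profiling shows a benefit.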