@@ -292,7 +292,7 @@ If the number of unique terms is greater than `size`, the returned list can be s
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
size buckets was not returned).

-coming[1.2.0] If set to `0`, the `size` will be set to `Integer.MAX_VALUE`.
+added[1.2.0] If set to `0`, the `size` will be set to `Integer.MAX_VALUE`.

To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard
using a heuristic based on the number of shards. To take manual control of this setting the `shard_size` parameter
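
To make the `size` behaviour touched by the hunk above concrete, here is a minimal sketch of a terms aggregation request; the aggregation name and the `tags` field are assumptions for illustration and are not part of the patch.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "size" : 10 <1>
            }
        }
    }
}
--------------------------------------------------
<1> Return the top 10 terms; as of 1.2.0, a value of `0` is equivalent to `Integer.MAX_VALUE` and returns buckets for every unique term.
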
@@ -305,7 +305,7 @@ a consolidated review by the reducing node before the final selection. Obviously
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If `shard_size` is set to -1 (the default) then `shard_size` will be automatically estimated based on the number of shards and the `size` parameter.

-coming[1.2.0] If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`.
+added[1.2.0] If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`.

NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will
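
The `shard_size` setting in the hunk above can be illustrated with a similar sketch (again assuming a `tags` field); each shard would return its top 100 candidate terms, from which the reducing node selects the final top 10.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "size" : 10,
                "shard_size" : 100 <1>
            }
        }
    }
}
--------------------------------------------------
<1> A larger `shard_size` improves accuracy at the cost of extra network traffic and RAM; as of 1.2.0, `0` is equivalent to `Integer.MAX_VALUE`.
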
@@ -336,7 +336,7 @@ The above aggregation would only return tags which have been found in 10 hits or
Terms that score highly will be collected on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global term frequencies available. The decision whether a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not the global frequencies of the word. The `min_doc_count` criterion is only applied after merging the local term statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about whether the term will actually reach the required `min_doc_count`. This might cause many (globally) high-frequency terms to be missing in the final result if low-frequency but high-scoring terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.

-coming[1.2.0] `shard_min_doc_count` parameter
+added[1.2.0] `shard_min_doc_count` parameter

The parameter `shard_min_doc_count` regulates the _certainty_ a shard has, with respect to the `min_doc_count`, about whether a term should actually be added to the candidate list. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` by default and has no effect unless you explicitly set it.
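
A minimal sketch of how the new `shard_min_doc_count` parameter might be combined with `min_doc_count`; the `tags` field and the value `5` are assumptions for illustration, not part of the patch.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "min_doc_count" : 10,
                "shard_min_doc_count" : 5 <1>
            }
        }
    }
}
--------------------------------------------------
<1> Terms that do not reach the `shard_min_doc_count` on a shard are dropped from that shard's candidate list; the global `min_doc_count` of 10 is applied only after the shard results are merged.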