|
@@ -54,19 +54,19 @@ size buckets was not returned). If set to `0`, the `size` will be set to `Intege
|
|
|
|
|
|
==== Document counts are approximate
|
|
|
|
|
|
-As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always
|
|
|
-accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are
|
|
|
+As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always
|
|
|
+accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are
|
|
|
combined to give a final view. Consider the following scenario:
|
|
|
|
|
|
-A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with
|
|
|
-3 shards. In this case each shard is asked to give its top 5 terms.
|
|
|
+A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with
|
|
|
+3 shards. In this case each shard is asked to give its top 5 terms.
|
|
|
|
|
|
[source,js]
|
|
|
--------------------------------------------------
|
|
|
{
|
|
|
"aggs" : {
|
|
|
"products" : {
|
|
|
- "terms" : {
|
|
|
+ "terms" : {
|
|
|
"field" : "product",
|
|
|
"size" : 5
|
|
|
}
|
|
@@ -75,23 +75,23 @@ A request is made to obtain the top 5 terms in the field product, ordered by des
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
|
|
|
-The terms for each of the three shards are shown below with their
|
|
|
+The terms for each of the three shards are shown below with their
|
|
|
respective document counts in brackets:
|
|
|
|
|
|
[width="100%",cols="^2,^2,^2,^2",options="header"]
|
|
|
|=========================================================
|
|
|
| | Shard A | Shard B | Shard C
|
|
|
|
|
|
-| 1 | Product A (25) | Product A (30) | Product A (45)
|
|
|
-| 2 | Product B (18) | Product B (25) | Product C (44)
|
|
|
-| 3 | Product C (6) | Product F (17) | Product Z (36)
|
|
|
-| 4 | Product D (3) | Product Z (16) | Product G (30)
|
|
|
-| 5 | Product E (2) | Product G (15) | Product E (29)
|
|
|
-| 6 | Product F (2) | Product H (14) | Product H (28)
|
|
|
-| 7 | Product G (2) | Product I (10) | Product Q (2)
|
|
|
-| 8 | Product H (2) | Product Q (6) | Product D (1)
|
|
|
-| 9 | Product I (1) | Product J (8) |
|
|
|
-| 10 | Product J (1) | Product C (4) |
|
|
|
+| 1 | Product A (25) | Product A (30) | Product A (45)
|
|
|
+| 2 | Product B (18) | Product B (25) | Product C (44)
|
|
|
+| 3 | Product C (6) | Product F (17) | Product Z (36)
|
|
|
+| 4 | Product D (3) | Product Z (16) | Product G (30)
|
|
|
+| 5 | Product E (2) | Product G (15) | Product E (29)
|
|
|
+| 6 | Product F (2) | Product H (14) | Product H (28)
|
|
|
+| 7 | Product G (2) | Product I (10) | Product Q (2)
|
|
|
+| 8 | Product H (2) | Product Q (6) | Product D (1)
|
|
|
+| 9 | Product I (1) | Product J (8) |
|
|
|
+| 10 | Product J (1) | Product C (4) |
|
|
|
|
|
|
|=========================================================
|
|
|
|
|
@@ -102,41 +102,41 @@ The shards will return their top 5 terms so the results from the shards will be:
|
|
|
|=========================================================
|
|
|
| | Shard A | Shard B | Shard C
|
|
|
|
|
|
-| 1 | Product A (25) | Product A (30) | Product A (45)
|
|
|
-| 2 | Product B (18) | Product B (25) | Product C (44)
|
|
|
-| 3 | Product C (6) | Product F (17) | Product Z (36)
|
|
|
-| 4 | Product D (3) | Product Z (16) | Product G (30)
|
|
|
-| 5 | Product E (2) | Product G (15) | Product E (29)
|
|
|
+| 1 | Product A (25) | Product A (30) | Product A (45)
|
|
|
+| 2 | Product B (18) | Product B (25) | Product C (44)
|
|
|
+| 3 | Product C (6) | Product F (17) | Product Z (36)
|
|
|
+| 4 | Product D (3) | Product Z (16) | Product G (30)
|
|
|
+| 5 | Product E (2) | Product G (15) | Product E (29)
|
|
|
|
|
|
|=========================================================
|
|
|
|
|
|
-Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces
|
|
|
+Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces
|
|
|
the following:
|
|
|
|
|
|
[width="40%",cols="^2,^2"]
|
|
|
|=========================================================
|
|
|
|
|
|
-| 1 | Product A (100)
|
|
|
-| 2 | Product Z (52)
|
|
|
-| 3 | Product C (50)
|
|
|
-| 4 | Product G (45)
|
|
|
-| 5 | Product B (43)
|
|
|
+| 1 | Product A (100)
|
|
|
+| 2 | Product Z (52)
|
|
|
+| 3 | Product C (50)
|
|
|
+| 4 | Product G (45)
|
|
|
+| 5 | Product B (43)
|
|
|
|
|
|
|=========================================================
|
|
|
|
|
|
-Because Product A was returned from all shards we know that its document count value is accurate. Product C was only
|
|
|
-returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on
|
|
|
-shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also
|
|
|
-returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of
|
|
|
-combining the results to produce the final list of terms, that there is an error in the document count for Product C and
|
|
|
-not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of
|
|
|
+Because Product A was returned from all shards, we know that its document count value is accurate. Product C was only
|
|
|
+returned by shards A and C, so its document count is shown as 50, but this is not an accurate count. Product C exists on
|
|
|
+shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also
|
|
|
+returned by only two shards, but the third shard does not contain the term. There is no way of knowing, at the point of
|
|
|
+combining the results to produce the final list of terms, that there is an error in the document count for Product C and
|
|
|
+not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of
|
|
|
terms because it did not make it into the top five terms on any of the shards.
|
|
|
|
|
|
==== Shard Size
|
|
|
|
|
|
The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to
|
|
|
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
|
|
|
-transfers between the nodes and the client).
|
|
|
+transfers between the nodes and the client).
|
|
|
|
|
|
The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined,
|
|
|
it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the
|
|
@@ -148,17 +148,15 @@ the client. If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`.
|
|
|
NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will
|
|
|
override it and reset it to be equal to `size`.
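+
+As a rough sketch of how the two parameters combine (the values here are illustrative, not recommendations), a request
+could set `shard_size` above `size` to trade a little extra shard-level work for better accuracy:
+
+[source,js]
+--------------------------------------------------
+{
+    "aggs" : {
+        "products" : {
+            "terms" : {
+                "field" : "product",
+                "size" : 5,
+                "shard_size" : 10
+            }
+        }
+    }
+}
+--------------------------------------------------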
|
|
|
|
|
|
-added[1.1.0] It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this
|
|
|
+It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this
|
|
|
-on high-cardinality fields as this will kill both your CPU since terms need to be return sorted, and your network.
+on high-cardinality fields, as this will kill both your CPU (terms need to be returned sorted) and your network.
|
|
|
|
|
|
==== Calculating Document Count Error
|
|
|
|
|
|
-coming[1.4.0]
|
|
|
-
|
|
|
-There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as
|
|
|
-a whole which represents the maximum potential document count for a term which did not make it into the final list of
|
|
|
-terms. This is calculated as the sum of the document count from the last term returned from each shard .For the example
|
|
|
-given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned
|
|
|
+There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as
|
|
|
+a whole which represents the maximum potential document count for a term which did not make it into the final list of
|
|
|
+terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example
|
|
|
+given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned
|
|
|
could have the 4th highest document count.
|
|
|
|
|
|
[source,js]
|
|
@@ -185,13 +183,13 @@ could have the 4th highest document count.
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
|
|
|
-The second error value can be enabled by setting the `show_term_doc_count_error` parameter to true. This shows an error value
|
|
|
-for each term returned by the aggregation which represents the 'worst case' error in the document count and can be useful when
|
|
|
-deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for the last term returned
|
|
|
-by all shards which did not return the term. In the example above the error in the document count for Product C would be 15 as
|
|
|
-Shard B was the only shard not to return the term and the document count of the last termit did return was 15. The actual document
|
|
|
-count of Product C was 54 so the document count was only actually off by 4 even though the worst case was that it would be off by
|
|
|
-15. Product A, however has an error of 0 for its document count, since every shard returned it we can be confident that the count
|
|
|
+The second error value can be enabled by setting the `show_term_doc_count_error` parameter to true. This shows an error value
|
|
|
+for each term returned by the aggregation which represents the 'worst case' error in the document count and can be useful when
|
|
|
+deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for the last term returned
|
|
|
+by all shards which did not return the term. In the example above the error in the document count for Product C would be 15 as
|
|
|
+Shard B was the only shard not to return the term, and the document count of the last term it did return was 15. The actual document
|
|
|
+count of Product C was 54, so the document count was actually only off by 4, even though the worst case was that it would be off by
|
|
|
+15. Product A, however, has an error of 0 for its document count; since every shard returned it, we can be confident that the count
|
|
|
returned is accurate.
|
|
|
|
|
|
[source,js]
|
|
@@ -220,10 +218,10 @@ returned is accurate.
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
|
|
|
-These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is
|
|
|
-ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard
|
|
|
-does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the
|
|
|
-aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be
|
|
|
+These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is
|
|
|
+ordered by the terms' values themselves (either ascending or descending) there is no error in the document count, since if a shard
|
|
|
+does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the
|
|
|
+aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be
|
|
|
determined and is given a value of -1 to indicate this.
|
|
|
|
|
|
==== Order
|
|
@@ -342,8 +340,6 @@ PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR
|
|
|
|
|
|
The above will sort the countries buckets based on the average height among the female population.
|
|
|
|
|
|
-coming[1.4.0]
|
|
|
-
|
|
|
Multiple criteria can be used to order the buckets by providing an array of order criteria such as the following:
|
|
|
|
|
|
[source,js]
|
|
@@ -368,13 +364,13 @@ Multiple criteria can be used to order the buckets by providing an array of orde
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
|
|
|
-The above will sort the countries buckets based on the average height among the female population and then by
|
|
|
+The above will sort the countries buckets based on the average height among the female population and then by
|
|
|
their `doc_count` in descending order.
|
|
|
|
|
|
-NOTE: In the event that two buckets share the same values for all order criteria the bucket's term value is used as a
|
|
|
+NOTE: In the event that two buckets share the same values for all order criteria the bucket's term value is used as a
|
|
|
tie-breaker in ascending alphabetical order to prevent non-deterministic ordering of buckets.
|
|
|
|
|
|
-==== Minimum document count
|
|
|
+==== Minimum document count
|
|
|
|
|
|
It is possible to only return terms that match more than a configured number of hits using the `min_doc_count` option:
|
|
|
|
|
@@ -397,7 +393,7 @@ The above aggregation would only return tags which have been found in 10 hits or
|
|
|
|
|
|
Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about if the term will actually reach the required `min_doc_count`. This might cause many (globally) high frequent terms to be missing in the final result if low frequent terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
|
|
|
|
|
|
-added[1.2.0] `shard_min_doc_count` parameter
|
|
|
+`shard_min_doc_count` parameter
|
|
|
|
|
|
The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent terms and you are not interested in those (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local counts. `shard_min_doc_count` is set to `0` per default and has no effect unless you explicitly set it.
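+
+A hedged example combining the two thresholds (the field name and values are illustrative only): terms must appear at
+least 5 times on a shard to become candidates there, and at least 10 times globally to be returned:
+
+[source,js]
+--------------------------------------------------
+{
+    "aggs" : {
+        "tags" : {
+            "terms" : {
+                "field" : "tags",
+                "min_doc_count" : 10,
+                "shard_min_doc_count" : 5
+            }
+        }
+    }
+}
+--------------------------------------------------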
|
|
|
|
|
@@ -530,7 +526,7 @@ strings that represent the terms as they are found in the index:
|
|
|
}
|
|
|
}
|
|
|
}
|
|
|
---------------------------------------------------
|
|
|
+--------------------------------------------------
|
|
|
|
|
|
==== Multi-field terms aggregation
|
|
|
|
|
@@ -561,12 +557,12 @@ this single field, which will benefit from the global ordinals optimization.
|
|
|
|
|
|
==== Collect mode
|
|
|
|
|
|
-added[1.3.0] Deferring calculation of child aggregations
|
|
|
+Deferring calculation of child aggregations
|
|
|
|
|
|
For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation
|
|
|
of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree
|
|
|
are expanded in one depth-first pass and only then any pruning occurs. In some rare scenarios this can be very wasteful and can hit memory constraints.
|
|
|
-An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars:
|
|
|
+An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars:
|
|
|
|
|
|
[source,js]
|
|
|
--------------------------------------------------
|
|
@@ -590,11 +586,11 @@ An example problem scenario is querying a movie database for the 10 most popular
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
|
|
|
-Even though the number of movies may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets
|
|
|
-during calculation - a single movie will produce n² buckets where n is the number of actors. The sane option would be to first determine
|
|
|
+Even though the number of movies may be comparatively small and we want only 50 result buckets, there is a combinatorial explosion of buckets
|
|
|
+during calculation - a single movie will produce n² buckets where n is the number of actors. The sane option would be to first determine
|
|
|
the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the `breadth_first` collection
|
|
|
mode as opposed to the default `depth_first` mode:
|
|
|
-
|
|
|
+
|
|
|
[source,js]
|
|
|
--------------------------------------------------
|
|
|
{
|
|
@@ -620,24 +616,20 @@ mode as opposed to the default `depth_first` mode:
|
|
|
|
|
|
|
|
|
When using `breadth_first` mode the set of documents that fall into the uppermost buckets are
|
|
|
-cached for subsequent replay so there is a memory overhead in doing this which is linear with the number of matching documents.
|
|
|
+cached for subsequent replay, so there is a memory overhead in doing this which is linear with the number of matching documents.
|
|
|
In most requests the volume of buckets generated is smaller than the number of documents that fall into them so the default `depth_first`
|
|
|
-collection mode is normally the best bet but occasionally the `breadth_first` strategy can be significantly more efficient. Currently
|
|
|
+collection mode is normally the best bet, but occasionally the `breadth_first` strategy can be significantly more efficient. Currently
|
|
|
elasticsearch will always use the `depth_first` collect_mode unless explicitly instructed to use `breadth_first` as in the above example.
|
|
|
Note that the `order` parameter can still be used to refer to data from a child aggregation when using the `breadth_first` setting - the parent
|
|
|
aggregation understands that this child aggregation will need to be called first before any of the other child aggregations.
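+
+A sketch of that combination, assuming a hypothetical `max_play_count` metric sub-aggregation that the parent `terms`
+aggregation orders by:
+
+[source,js]
+--------------------------------------------------
+{
+    "aggs" : {
+        "actors" : {
+            "terms" : {
+                "field" : "actors",
+                "size" : 10,
+                "collect_mode" : "breadth_first",
+                "order" : { "max_play_count" : "desc" }
+            },
+            "aggs" : {
+                "max_play_count" : { "max" : { "field" : "play_count" } }
+            }
+        }
+    }
+}
+--------------------------------------------------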
|
|
|
|
|
|
WARNING: It is not possible to nest aggregations such as `top_hits` which require access to match score information under an aggregation that uses
|
|
|
the `breadth_first` collection mode. This is because this would require a RAM buffer to hold the float score value for every document and
|
|
|
-this would typically be too costly in terms of RAM.
|
|
|
+this would typically be too costly in terms of RAM.
|
|
|
|
|
|
[[search-aggregations-bucket-terms-aggregation-execution-hint]]
|
|
|
==== Execution hint
|
|
|
|
|
|
-added[1.2.0] Added the `global_ordinals`, `global_ordinals_hash` and `global_ordinals_low_cardinality` execution modes
|
|
|
-
|
|
|
-deprecated[1.3.0] Removed the `ordinals` execution mode
|
|
|
-
|
|
|
There are different mechanisms by which terms aggregations can be executed:
|
|
|
|
|
|
- by using field values directly in order to aggregate data per-bucket (`map`)
|