- [role="xpack"]
- [[search-aggregations-metrics-string-stats-aggregation]]
- === String stats aggregation
- ++++
- <titleabbrev>String stats</titleabbrev>
- ++++
- A `multi-value` metrics aggregation that computes statistics over string values extracted from the aggregated documents.
- These values can be retrieved from specific `keyword` fields.
- The string stats aggregation returns the following results:
- * `count` - The number of non-empty fields counted.
- * `min_length` - The length of the shortest term.
- * `max_length` - The length of the longest term.
- * `avg_length` - The average length computed over all terms.
- * `entropy` - The {wikipedia}/Entropy_(information_theory)[Shannon Entropy] value computed over all terms collected by
- the aggregation. Shannon entropy quantifies the amount of information contained in the field. It is useful for
- measuring properties of a data set such as diversity, similarity, and randomness.
- For example:
- [source,console]
- --------------------------------------------------
- POST /my-index-000001/_search?size=0
- {
- "aggs": {
- "message_stats": { "string_stats": { "field": "message.keyword" } }
- }
- }
- --------------------------------------------------
- // TEST[setup:messages]
- The above aggregation computes the string statistics for the `message.keyword` field in all documents. The aggregation type
- is `string_stats` and the `field` parameter defines the field on which the stats are computed.
- The above will return the following:
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "message_stats": {
- "count": 5,
- "min_length": 24,
- "max_length": 30,
- "avg_length": 28.8,
- "entropy": 3.94617750050791
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- The name of the aggregation (`message_stats` above) also serves as the key by which the aggregation result can be retrieved from
- the returned response.
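As a rough illustration of how the results listed above relate to the collected terms, the following Python sketch recomputes them client-side. It is not the aggregation's implementation. The helper name `string_stats` is hypothetical, lengths are counted in Python code points, and the entropy uses a base-2 logarithm over per-character probabilities, which is an assumption that happens to reproduce the `entropy` value in the sample response.

```python
import math
from collections import Counter


def string_stats(terms):
    """Hypothetical helper: recompute string_stats-style results for a list of terms."""
    # Only non-empty terms are counted.
    terms = [t for t in terms if t]
    if not terms:
        # Assumption: with nothing to measure, only the count is meaningful.
        return {"count": 0, "min_length": None, "max_length": None,
                "avg_length": None, "entropy": 0.0}

    lengths = [len(t) for t in terms]
    total_chars = sum(lengths)

    # Count every character across all terms, then turn counts into probabilities.
    char_counts = Counter("".join(terms))
    entropy = -sum(
        (n / total_chars) * math.log2(n / total_chars)
        for n in char_counts.values()
    )

    return {
        "count": len(terms),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": total_chars / len(terms),
        "entropy": entropy,
    }


stats = string_stats(["aa", "bb", ""])
print(stats)
```

In this toy input the empty string is skipped, and the two remaining terms contain two equally likely characters, so the entropy is exactly one bit.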
- ==== Character distribution
- The computation of the Shannon Entropy value is based on the probability of each character appearing in all terms collected
- by the aggregation. To view the probability distribution for all characters, set the `show_distribution` parameter (default: `false`) to `true`.
- [source,console]
- --------------------------------------------------
- POST /my-index-000001/_search?size=0
- {
- "aggs": {
- "message_stats": {
- "string_stats": {
- "field": "message.keyword",
- "show_distribution": true <1>
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:messages]
- <1> Set the `show_distribution` parameter to `true`, so that the probability distribution for all characters is returned in the results.
- [source,console-result]
- --------------------------------------------------
- {
- ...
- "aggregations": {
- "message_stats": {
- "count": 5,
- "min_length": 24,
- "max_length": 30,
- "avg_length": 28.8,
- "entropy": 3.94617750050791,
- "distribution": {
- " ": 0.1527777777777778,
- "e": 0.14583333333333334,
- "s": 0.09722222222222222,
- "m": 0.08333333333333333,
- "t": 0.0763888888888889,
- "h": 0.0625,
- "a": 0.041666666666666664,
- "i": 0.041666666666666664,
- "r": 0.041666666666666664,
- "g": 0.034722222222222224,
- "n": 0.034722222222222224,
- "o": 0.034722222222222224,
- "u": 0.034722222222222224,
- "b": 0.027777777777777776,
- "w": 0.027777777777777776,
- "c": 0.013888888888888888,
- "E": 0.006944444444444444,
- "l": 0.006944444444444444,
- "1": 0.006944444444444444,
- "2": 0.006944444444444444,
- "3": 0.006944444444444444,
- "4": 0.006944444444444444,
- "y": 0.006944444444444444
- }
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- The `distribution` object shows the probability of each character appearing in all terms. The characters are sorted by descending probability.
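To see how `entropy` follows from the `distribution` object, the sketch below copies the probabilities from the response above and sums `-p * log2(p)` over them. The base-2 logarithm and the function name `entropy_from_distribution` are assumptions made for illustration; the page itself does not state the logarithm base.

```python
import math

# Probabilities copied from the `distribution` object in the response above.
distribution = {
    " ": 0.1527777777777778, "e": 0.14583333333333334,
    "s": 0.09722222222222222, "m": 0.08333333333333333,
    "t": 0.0763888888888889, "h": 0.0625,
    "a": 0.041666666666666664, "i": 0.041666666666666664,
    "r": 0.041666666666666664, "g": 0.034722222222222224,
    "n": 0.034722222222222224, "o": 0.034722222222222224,
    "u": 0.034722222222222224, "b": 0.027777777777777776,
    "w": 0.027777777777777776, "c": 0.013888888888888888,
    "E": 0.006944444444444444, "l": 0.006944444444444444,
    "1": 0.006944444444444444, "2": 0.006944444444444444,
    "3": 0.006944444444444444, "4": 0.006944444444444444,
    "y": 0.006944444444444444,
}


def entropy_from_distribution(dist):
    """Shannon entropy in bits (assumed base 2) of a character distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


total = sum(distribution.values())   # the probabilities should sum to about 1
entropy = entropy_from_distribution(distribution)
print(total)
print(entropy)                       # close to the `entropy` value in the response
```

Each probability is a multiple of 1/144, which is consistent with 5 terms averaging 28.8 characters (144 characters in total).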
- ==== Script
- If you need to get the `string_stats` for something more complex than a single
- field, run the aggregation on a <<runtime,runtime field>>.
- [source,console]
- ----
- POST /my-index-000001/_search
- {
- "size": 0,
- "runtime_mappings": {
- "message_and_context": {
- "type": "keyword",
- "script": """
- emit(doc['message.keyword'].value + ' ' + doc['context.keyword'].value)
- """
- }
- },
- "aggs": {
- "message_stats": {
- "string_stats": { "field": "message_and_context" }
- }
- }
- }
- ----
- // TEST[setup:messages]
- // TEST[s/_search/_search?filter_path=aggregations/]
- ////
- [source,console-result]
- ----
- {
- "aggregations": {
- "message_stats": {
- "count": 5,
- "min_length": 28,
- "max_length": 34,
- "avg_length": 32.8,
- "entropy": 3.9797778402765784
- }
- }
- }
- ----
- ////
- ==== Missing value
- The `missing` parameter defines how documents that are missing a value should be treated.
- By default, these documents are ignored, but they can also be treated as if they had a value.
- [source,console]
- --------------------------------------------------
- POST /my-index-000001/_search?size=0
- {
- "aggs": {
- "message_stats": {
- "string_stats": {
- "field": "message.keyword",
- "missing": "[empty message]" <1>
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:messages]
- <1> Documents without a value in the `message.keyword` field will be treated as documents that have the value `[empty message]`.
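The effect of `missing` can be pictured with a small client-side sketch. This only models the behavior described above and is not how the aggregation is implemented. The function `collect_terms` and the sample documents are hypothetical.

```python
def collect_terms(docs, field, missing=None):
    """Hypothetical helper: gather the values an aggregation would see for `field`."""
    terms = []
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue          # default: documents without a value are ignored
            value = missing       # otherwise they are treated as having this value
        terms.append(value)
    return terms


# Hypothetical documents: the second one has no value for the field.
docs = [{"message.keyword": "some message"}, {"other": "x"}]

ignored = collect_terms(docs, "message.keyword")
substituted = collect_terms(docs, "message.keyword", missing="[empty message]")
print(len(ignored), len(substituted))
```

With a substitute value, the document without a message contributes one more term, so `count` and the length statistics change accordingly.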