[role="xpack"]
[testenv="basic"]
[[search-aggregations-metrics-string-stats-aggregation]]
=== String Stats Aggregation

A `multi-value` metrics aggregation that computes statistics over string values extracted from the aggregated documents.
These values can either be retrieved from specific `keyword` fields in the documents or generated by a provided script.

The string stats aggregation returns the following results:

* `count` - The number of non-empty fields counted.
* `min_length` - The length of the shortest term.
* `max_length` - The length of the longest term.
* `avg_length` - The average length computed over all terms.
* `entropy` - The https://en.wikipedia.org/wiki/Entropy_(information_theory)[Shannon Entropy] value computed over all terms collected by
the aggregation. Shannon entropy quantifies the amount of information contained in the field. It is useful for
measuring properties of a data set such as diversity, similarity, and randomness.

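
The results listed above can be approximated client-side with a minimal Python sketch. It is not the server implementation: the function name `string_stats` is hypothetical, the base-2 logarithm is an assumption, empty strings are skipped based on the wording of `count`, and lengths are measured with Python's `len`, which may differ from the server's notion of length for some non-ASCII text.

```python
import math
from collections import Counter


def string_stats(values):
    """Compute count, length statistics, and character entropy for strings."""
    # Assumption: empty strings are skipped, per "non-empty fields counted".
    terms = [v for v in values if v]
    if not terms:
        # Arbitrary choice for this sketch when there is nothing to count.
        return {"count": 0, "min_length": 0, "max_length": 0,
                "avg_length": 0.0, "entropy": 0.0}

    lengths = [len(t) for t in terms]
    # Character frequencies across all terms combined.
    char_counts = Counter("".join(terms))
    total_chars = sum(char_counts.values())
    # Shannon entropy over characters; base 2 (bits) is an assumption.
    entropy = -sum((n / total_chars) * math.log2(n / total_chars)
                   for n in char_counts.values())
    return {
        "count": len(terms),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(terms),
        "entropy": entropy,
    }


stats = string_stats(["ab", "abcd", ""])
print(stats)
```

Here the empty string is dropped, so the two remaining terms give a `count` of 2, lengths between 2 and 4, and an entropy computed over the six characters `a`, `b`, `a`, `b`, `c`, `d`.
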
Assuming the data consists of Twitter messages:

[source,console]
--------------------------------------------------
POST /twitter/_search?size=0
{
  "aggs" : {
    "message_stats" : { "string_stats" : { "field" : "message.keyword" } }
  }
}
--------------------------------------------------
// TEST[setup:twitter]

The above aggregation computes string statistics over the `message.keyword` field of all documents. The aggregation type
is `string_stats`, and the `field` parameter names the document field the stats are computed on.

The request above returns the following:

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "message_stats" : {
      "count" : 5,
      "min_length" : 24,
      "max_length" : 30,
      "avg_length" : 28.8,
      "entropy" : 3.94617750050791
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

The name of the aggregation (`message_stats` above) also serves as the key by which the aggregation result can be retrieved from
the returned response.

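
As a small illustration of retrieving the result by name, the following Python sketch parses an abridged copy of the response shown earlier and looks up the `message_stats` key. The variable names are illustrative only, and the elided top-level fields (`took`, `hits`, and so on) are simply left out.

```python
import json

# Abridged response body, copied from the example response in this section.
raw = """
{
  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791
    }
  }
}
"""

response = json.loads(raw)
# The aggregation name chosen in the request is the key in the response.
message_stats = response["aggregations"]["message_stats"]
print(message_stats["avg_length"])
```
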
==== Character distribution

The computation of the Shannon Entropy value is based on the probability of each character appearing in all terms collected
by the aggregation. To view the probability distribution for all characters, set the `show_distribution` parameter (default: `false`) to `true`.

[source,console]
--------------------------------------------------
POST /twitter/_search?size=0
{
  "aggs" : {
    "message_stats" : {
      "string_stats" : {
        "field" : "message.keyword",
        "show_distribution": true  <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:twitter]

<1> Set the `show_distribution` parameter to `true` so that the probability distribution for all characters is returned in the results.

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "message_stats" : {
      "count" : 5,
      "min_length" : 24,
      "max_length" : 30,
      "avg_length" : 28.8,
      "entropy" : 3.94617750050791,
      "distribution" : {
        " " : 0.1527777777777778,
        "e" : 0.14583333333333334,
        "s" : 0.09722222222222222,
        "m" : 0.08333333333333333,
        "t" : 0.0763888888888889,
        "h" : 0.0625,
        "a" : 0.041666666666666664,
        "i" : 0.041666666666666664,
        "r" : 0.041666666666666664,
        "g" : 0.034722222222222224,
        "n" : 0.034722222222222224,
        "o" : 0.034722222222222224,
        "u" : 0.034722222222222224,
        "b" : 0.027777777777777776,
        "w" : 0.027777777777777776,
        "c" : 0.013888888888888888,
        "E" : 0.006944444444444444,
        "l" : 0.006944444444444444,
        "1" : 0.006944444444444444,
        "2" : 0.006944444444444444,
        "3" : 0.006944444444444444,
        "4" : 0.006944444444444444,
        "y" : 0.006944444444444444
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

The `distribution` object shows the probability of each character appearing in all terms. The characters are sorted by descending probability.

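
To make the link between `distribution` and `entropy` concrete, the sketch below recomputes the entropy from the probabilities in the example response. Using a base-2 logarithm is an assumption, but it reproduces the reported value of `3.94617750050791` to within rounding, which suggests the result is expressed in bits. The variable names are illustrative.

```python
import math

# Character probabilities copied from the `distribution` object above.
distribution = {
    " ": 0.1527777777777778, "e": 0.14583333333333334,
    "s": 0.09722222222222222, "m": 0.08333333333333333,
    "t": 0.0763888888888889, "h": 0.0625,
    "a": 0.041666666666666664, "i": 0.041666666666666664,
    "r": 0.041666666666666664, "g": 0.034722222222222224,
    "n": 0.034722222222222224, "o": 0.034722222222222224,
    "u": 0.034722222222222224, "b": 0.027777777777777776,
    "w": 0.027777777777777776, "c": 0.013888888888888888,
    "E": 0.006944444444444444, "l": 0.006944444444444444,
    "1": 0.006944444444444444, "2": 0.006944444444444444,
    "3": 0.006944444444444444, "4": 0.006944444444444444,
    "y": 0.006944444444444444,
}

# The probabilities should sum to 1 (up to floating-point rounding).
total = sum(distribution.values())

# Shannon entropy: H = -sum(p * log2(p)) over all characters.
entropy = -sum(p * math.log2(p) for p in distribution.values())

print(round(total, 6), round(entropy, 6))
```
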
==== Script

To compute the message string stats from a script:

[source,console]
--------------------------------------------------
POST /twitter/_search?size=0
{
  "aggs" : {
    "message_stats" : {
      "string_stats" : {
        "script" : {
          "lang": "painless",
          "source": "doc['message.keyword'].value"
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:twitter]

This will interpret the `script` parameter as an `inline` script with the `painless` script language and no script parameters.
To use a stored script, use the following syntax:

[source,console]
--------------------------------------------------
POST /twitter/_search?size=0
{
  "aggs" : {
    "message_stats" : {
      "string_stats" : {
        "script" : {
          "id": "my_script",
          "params" : {
            "field" : "message.keyword"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:twitter,stored_example_script]

===== Value Script

We can use a value script to modify the message (for example, by adding a prefix) and compute the stats over the modified values:

[source,console]
--------------------------------------------------
POST /twitter/_search?size=0
{
  "aggs" : {
    "message_stats" : {
      "string_stats" : {
        "field" : "message.keyword",
        "script" : {
          "lang": "painless",
          "source": "params.prefix + _value",
          "params" : {
            "prefix" : "Message: "
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:twitter]

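
The effect of such a value script on the length statistics can be sketched in Python. The sample messages below are hypothetical, not the actual test data. Because the prefix `Message: ` is 9 characters long, one would expect `min_length`, `max_length`, and `avg_length` to each grow by 9 while `count` stays the same; `entropy` would also change, since the prefix alters the overall character mix.

```python
prefix = "Message: "  # same value as params.prefix in the request above

# Hypothetical sample values, not the actual test data.
messages = ["first message", "second one"]

# Equivalent of the value script `params.prefix + _value`, per value.
prefixed = [prefix + m for m in messages]

before = [len(m) for m in messages]
after = [len(m) for m in prefixed]
shift = len(prefix)

print(before, after, shift)
```
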
==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default they are ignored, but it is also possible to treat them as if they had a value.

[source,console]
--------------------------------------------------
POST /twitter/_search?size=0
{
  "aggs" : {
    "message_stats" : {
      "string_stats" : {
        "field" : "message.keyword",
        "missing": "[empty message]"  <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:twitter]

<1> Documents without a value in the `message` field will be treated as documents that have the value `[empty message]`.

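
The difference between the default behavior and a `missing` value can be illustrated with a short Python sketch. The documents and variable names are hypothetical; the sketch only mimics the substitution described above and is not how the server processes documents.

```python
MISSING = "[empty message]"  # same value as the `missing` parameter above

# Hypothetical documents; the second one has no `message` field.
docs = [{"message": "hello"}, {"user": "kimchy"}, {"message": "bye"}]

# With `missing` set: documents without a value get the substitute.
values = [doc.get("message", MISSING) for doc in docs]

# Default behavior: documents without a value are ignored.
ignored = [doc["message"] for doc in docs if "message" in doc]

print(len(values), len(ignored))
```

With the substitute in place, all three documents contribute a value, and the 15-character placeholder participates in the length and entropy statistics like any other term.
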