[[search-aggregations-bucket-frequent-item-sets-aggregation]]
=== Frequent item sets aggregation
++++
<titleabbrev>Frequent item sets</titleabbrev>
++++

A bucket aggregation that finds frequent item sets. It is a form of association
rule mining that identifies items that often occur together. Items that are
frequently purchased together or log events that tend to co-occur are examples
of frequent item sets. Finding frequent item sets helps to discover
relationships between different data points (items).

The aggregation reports closed item sets. A frequent item set is called closed
if no superset exists with the same ratio of documents (also known as its
<<frequent-item-sets-minimum-support,support value>>). For example, consider the
following two candidates for a frequent item set, which have the same support
value:

1. `apple, orange, banana`
2. `apple, orange, banana, tomato`

Only the second item set (`apple, orange, banana, tomato`) is returned, and the
first set, which is a subset of the second one, is skipped. Both item sets
might be returned if their support values are different.
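
The closed item set rule described above can be illustrated with a small Python sketch. This is not how the aggregation is implemented; the function name `closed_item_sets` and the document counts are hypothetical, and equal document counts stand in for equal support values.

```python
from typing import Dict, FrozenSet, List


def closed_item_sets(doc_counts: Dict[FrozenSet[str], int]) -> List[FrozenSet[str]]:
    """Keep only item sets that have no proper superset with the same count."""
    closed = []
    for items, count in doc_counts.items():
        has_equal_superset = any(
            items < other and other_count == count  # `<` is "proper subset"
            for other, other_count in doc_counts.items()
        )
        if not has_equal_superset:
            closed.append(items)
    return closed


# Hypothetical candidates: item set -> number of documents containing it.
candidates = {
    frozenset({"apple", "orange", "banana"}): 30,
    frozenset({"apple", "orange", "banana", "tomato"}): 30,
    frozenset({"apple", "orange"}): 50,
}
result = closed_item_sets(candidates)
```

In this sketch the three-item set is dropped because its four-item superset has the same count, while `apple, orange` is kept because its count differs from that of its supersets.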

The runtime of the aggregation depends on the data and the provided parameters,
and the aggregation might take a significant amount of time to complete. For
this reason, it is recommended to use <<async-search,async search>> to run your
requests asynchronously.

==== Syntax

A `frequent_item_sets` aggregation looks like this in isolation:

[source,js]
--------------------------------------------------
"frequent_item_sets": {
  "minimum_set_size": 3,
  "fields": [
    {"field": "my_field_1"},
    {"field": "my_field_2"}
  ]
}
--------------------------------------------------
// NOTCONSOLE

.`frequent_item_sets` Parameters
|===
|Parameter Name |Description |Required |Default Value
|`fields` |(array) Fields to analyze. |Required |
|`minimum_set_size` |(integer) The <<frequent-item-sets-minimum-set-size,minimum size>> of one item set. |Optional |`1`
|`minimum_support` |(double) The <<frequent-item-sets-minimum-support,minimum support>> of one item set. |Optional |`0.1`
|`size` |(integer) The number of top item sets to return. |Optional |`10`
|`filter` |(object) Query that filters documents from the analysis. |Optional |`match_all`
|===

[discrete]
[[frequent-item-sets-fields]]
==== Fields

Supported field types for the analyzed fields are keyword, numeric, ip, date,
and arrays of these types. You can also add runtime fields to your analyzed
fields.

If the combined cardinality of the analyzed fields is high, the aggregation
might require a significant amount of system resources.

You can filter the values for each field by using the `include` and `exclude`
parameters. The parameters can be regular expression strings or arrays of
strings of exact terms. The filtered values are removed from the analysis and
therefore reduce the runtime. If both `include` and `exclude` are defined,
`exclude` takes precedence: `include` is evaluated first, and `exclude` is then
applied to the values that remain.
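
The evaluation order of `include` and `exclude` can be sketched in Python as follows. The helper names `matches` and `filter_values` are hypothetical, and the sketch assumes that a regular expression must match the whole value. Python's `re` syntax is only a stand-in here and differs from the regular expression syntax that {es} uses.

```python
import re
from typing import List, Optional, Union

Rule = Union[str, List[str]]  # a regular expression string or exact terms


def matches(value: str, rule: Rule) -> bool:
    """Assumption of this sketch: a string rule must match the whole value."""
    if isinstance(rule, str):
        return re.fullmatch(rule, value) is not None
    return value in rule


def filter_values(values: List[str],
                  include: Optional[Rule] = None,
                  exclude: Optional[Rule] = None) -> List[str]:
    kept = []
    for value in values:
        # `include` is evaluated first ...
        if include is not None and not matches(value, include):
            continue
        # ... then `exclude` removes values even if they were included.
        if exclude is not None and matches(value, exclude):
            continue
        kept.append(value)
    return kept


cities = ["New York", "Cairo", "other", "Newcastle"]
kept = filter_values(cities, include="New.*", exclude=["Newcastle"])
```

In this sketch `Newcastle` passes the `include` pattern but is still removed, because `exclude` takes precedence.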

[discrete]
[[frequent-item-sets-minimum-set-size]]
==== Minimum set size

The minimum set size is the minimum number of items a set needs to contain. A
value of 1 returns the frequency of single items. Only item sets that contain at
least `minimum_set_size` items are returned. For example, the item set
`orange, banana, apple` is returned only if the minimum set size is 3 or lower.

[discrete]
[[frequent-item-sets-minimum-support]]
==== Minimum support

The minimum support value is the ratio of documents that an item set must exist
in to be considered "frequent". It is a normalized value between 0 and 1,
calculated by dividing the number of documents containing the item set by the
total number of documents.

For example, if a given item set is contained in five documents and the total
number of documents is 20, then the support of the item set is 5/20 = 0.25.
Therefore, this set is returned only if the minimum support is 0.25 or lower.
Because a higher minimum support prunes more items, the calculation is less
resource intensive. The `minimum_support` parameter therefore affects both the
required memory and the runtime of the aggregation.
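
The support calculation from the example above can be reproduced with a short Python sketch. The `support` function and the sample documents are hypothetical and only illustrate the arithmetic, not the aggregation's internals.

```python
from typing import List, Set


def support(item_set: Set[str], documents: List[Set[str]]) -> float:
    """Fraction of documents that contain every item of `item_set`."""
    if not documents:
        return 0.0
    matching = sum(1 for doc in documents if item_set <= doc)
    return matching / len(documents)


# 20 hypothetical documents, 5 of which contain both items.
documents = [{"apple", "orange"}] * 5 + [{"banana"}] * 15

value = support({"apple", "orange"}, documents)  # 5 / 20
minimum_support = 0.25
is_frequent = value >= minimum_support
```

With a `minimum_support` of 0.25 the set still qualifies. Any higher threshold would drop it.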

[discrete]
[[frequent-item-sets-size]]
==== Size

This parameter defines the maximum number of item sets to return. The result
contains the top-k item sets, that is, the item sets with the highest support
values. This parameter has a significant effect on the required memory and the
runtime of the aggregation.
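
How `minimum_set_size`, `minimum_support`, and `size` narrow down the result can be sketched in Python as a filter followed by a top-k selection. The `select` helper and the candidate values are hypothetical. The real aggregation also reports only closed item sets and may apply these limits while mining rather than afterwards.

```python
from typing import List, NamedTuple, Tuple


class ItemSet(NamedTuple):
    items: Tuple[str, ...]
    support: float


def select(candidates: List[ItemSet],
           minimum_set_size: int = 1,
           minimum_support: float = 0.1,
           size: int = 10) -> List[ItemSet]:
    """Drop sets that are too small or too rare, then keep the top `size`."""
    eligible = [
        c for c in candidates
        if len(c.items) >= minimum_set_size and c.support >= minimum_support
    ]
    eligible.sort(key=lambda c: c.support, reverse=True)
    return eligible[:size]


# Hypothetical candidates for illustration only.
candidates = [
    ItemSet(("apple",), 0.6),
    ItemSet(("apple", "orange"), 0.4),
    ItemSet(("apple", "orange", "banana"), 0.25),
    ItemSet(("apple", "tomato", "kiwi"), 0.05),
    ItemSet(("orange", "banana", "kiwi"), 0.3),
]
top = select(candidates, minimum_set_size=3, minimum_support=0.1, size=1)
```

Only two candidates have at least three items and enough support, and `size=1` keeps the one with the higher support value.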

[discrete]
[[frequent-item-sets-filter]]
==== Filter

A query to filter documents to use as part of the analysis. Documents that
don't match the filter are ignored when generating the item sets, but they
still count when calculating the support of an item set.

Use the filter if you want to narrow the item set analysis to fields of interest.
Use a top-level query to filter the data set.
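
The difference between the aggregation-level `filter` and a top-level query comes down to which documents count towards the denominator of the support value. The following Python sketch illustrates this with hypothetical data. The function names and the document layout are assumptions made for this example.

```python
from typing import Callable, List, Set


def support_with_filter(docs: List[dict], item_set: Set[str],
                        matches: Callable[[dict], bool]) -> float:
    """Aggregation-level filter: only matching documents contribute to the
    item set, but every document counts in the denominator."""
    hits = sum(1 for d in docs if matches(d) and item_set <= d["items"])
    return hits / len(docs) if docs else 0.0


def support_with_query(docs: List[dict], item_set: Set[str],
                       matches: Callable[[dict], bool]) -> float:
    """Top-level query: non-matching documents are removed before the
    analysis, so they are missing from the denominator as well."""
    subset = [d for d in docs if matches(d)]
    hits = sum(1 for d in subset if item_set <= d["items"])
    return hits / len(subset) if subset else 0.0


# 10 hypothetical purchases: 4 in Europe, 2 of which contain the item set.
docs = (
    [{"continent": "Europe", "items": {"shoes", "clothing"}}] * 2
    + [{"continent": "Europe", "items": {"shoes"}}] * 2
    + [{"continent": "Asia", "items": {"shoes", "clothing"}}] * 6
)


def in_europe(doc: dict) -> bool:
    return doc["continent"] == "Europe"


item_set = {"shoes", "clothing"}
filtered = support_with_filter(docs, item_set, in_europe)  # 2 / 10
queried = support_with_query(docs, item_set, in_europe)    # 2 / 4
```

The same two European purchases yield a lower support with the filter than with the top-level query, because the filter keeps all ten documents in the denominator.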

[discrete]
[[frequent-item-sets-example]]
==== Examples

In the following examples, we use the e-commerce {kib} sample data set.

[discrete]
==== Aggregation with two analyzed fields and an `exclude` parameter

In the first example, the goal is to find out, based on transaction data, (1)
which product categories customers frequently purchase from together and (2)
from which cities they make those purchases. We want to exclude results
where location information is not available (where the city name is `other`).
Finally, we are interested in sets with three or more items, and want to see the
three frequent item sets with the highest support.

Note that we use the <<async-search,async search>> endpoint in this first
example.

[source,console]
-------------------------------------------------
POST /kibana_sample_data_ecommerce/_async_search
{
  "size":0,
  "aggs":{
    "my_agg":{
      "frequent_item_sets":{
        "minimum_set_size":3,
        "fields":[
          {
            "field":"category.keyword"
          },
          {
            "field":"geoip.city_name",
            "exclude":"other"
          }
        ],
        "size":3
      }
    }
  }
}
-------------------------------------------------
// TEST[skip:setup kibana sample data]

The response of the API call above contains an identifier (`id`) of the async
search request. You can use the identifier to retrieve the search results:

[source,console]
-------------------------------------------------
GET /_async_search/<id>
-------------------------------------------------
// TEST[skip:setup kibana sample data]

The API returns a response similar to the following one:

[source,console-result]
-------------------------------------------------
(...)
"aggregations" : {
  "my_agg" : {
    "buckets" : [ <1>
      {
        "key" : { <2>
          "category.keyword" : [
            "Women's Clothing",
            "Women's Shoes"
          ],
          "geoip.city_name" : [
            "New York"
          ]
        },
        "doc_count" : 217, <3>
        "support" : 0.04641711229946524 <4>
      },
      {
        "key" : {
          "category.keyword" : [
            "Women's Clothing",
            "Women's Accessories"
          ],
          "geoip.city_name" : [
            "New York"
          ]
        },
        "doc_count" : 135,
        "support" : 0.028877005347593583
      },
      {
        "key" : {
          "category.keyword" : [
            "Men's Clothing",
            "Men's Shoes"
          ],
          "geoip.city_name" : [
            "Cairo"
          ]
        },
        "doc_count" : 123,
        "support" : 0.026310160427807486
      }
    ],
    (...)
  }
}
-------------------------------------------------
// TEST[skip:setup kibana sample data]
<1> The array of returned item sets.
<2> The `key` object contains one item set. In this case, it consists of two
values of the `category.keyword` field and one value of the `geoip.city_name`
field.
<3> The number of documents that contain the item set.
<4> The support value of the item set. It is calculated by dividing the number
of documents containing the item set by the total number of documents.

The response shows that the categories customers most frequently purchase from
together are `Women's Clothing` and `Women's Shoes`, and that customers from New
York tend to buy items from these categories together. In other words,
customers who buy products labelled `Women's Clothing` are more likely to also
buy products from the `Women's Shoes` category, and customers from New York most
likely buy products from these categories together. The item set with the second
highest support is `Women's Clothing` and `Women's Accessories`, with customers
mostly from New York. Finally, the item set with the third highest support is
`Men's Clothing` and `Men's Shoes`, with customers mostly from Cairo.

[discrete]
==== Aggregation with two analyzed fields and a filter

We take the first example, but want to narrow the item sets to places in Europe.
For that, we add a filter, and this time, we don't use the `exclude` parameter:

[source,console]
-------------------------------------------------
POST /kibana_sample_data_ecommerce/_async_search
{
  "size": 0,
  "aggs": {
    "my_agg": {
      "frequent_item_sets": {
        "minimum_set_size": 3,
        "fields": [
          { "field": "category.keyword" },
          { "field": "geoip.city_name" }
        ],
        "size": 3,
        "filter": {
          "term": {
            "geoip.continent_name": "Europe"
          }
        }
      }
    }
  }
}
-------------------------------------------------
// TEST[skip:setup kibana sample data]

The result only shows item sets that are created from documents matching the
filter, namely purchases in Europe. With `filter`, the calculated `support`
still takes all purchases into account. That's different from specifying a query
at the top level, in which case `support` is calculated only from purchases in
Europe.

[discrete]
==== Analyzing numeric values by using a runtime field

The frequent item sets aggregation enables you to bucket numeric values by using
<<runtime,runtime fields>>. The next example demonstrates how to use a script to
add a runtime field called `price_range` to your documents, which is
calculated from the taxful total price of the individual transactions. The
runtime field can then be used in the frequent item sets aggregation as a field
to analyze.

[source,console]
-------------------------------------------------
GET kibana_sample_data_ecommerce/_search
{
  "runtime_mappings": {
    "price_range": {
      "type": "keyword",
      "script": {
        "source": """
          def bucket_start = (long) Math.floor(doc['taxful_total_price'].value / 50) * 50;
          def bucket_end = bucket_start + 50;
          emit(bucket_start.toString() + "-" + bucket_end.toString());
        """
      }
    }
  },
  "size": 0,
  "aggs": {
    "my_agg": {
      "frequent_item_sets": {
        "minimum_set_size": 4,
        "fields": [
          {
            "field": "category.keyword"
          },
          {
            "field": "price_range"
          },
          {
            "field": "geoip.city_name"
          }
        ],
        "size": 3
      }
    }
  }
}
-------------------------------------------------
// TEST[skip:setup kibana sample data]
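
The bucketing arithmetic of the runtime field script can be checked outside {es} with a small Python equivalent. The `price_range` function below is a hypothetical re-implementation for illustration, not part of any {es} API. Its `width` parameter generalizes the fixed value of 50 used in the script.

```python
import math


def price_range(taxful_total_price: float, width: int = 50) -> str:
    """Map a price to a label such as "50-100", mirroring the script's logic."""
    bucket_start = int(math.floor(taxful_total_price / width)) * width
    bucket_end = bucket_start + width
    return f"{bucket_start}-{bucket_end}"


labels = [price_range(p) for p in (49.99, 50.0, 74.99, 100.0)]
```

Each bucket includes its lower bound and excludes its upper bound, so a price of exactly 50 falls into `50-100` rather than `0-50`.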

The API returns a response similar to the following one:

[source,console-result]
-------------------------------------------------
(...)
"aggregations" : {
  "my_agg" : {
    "buckets" : [
      {
        "key" : {
          "category.keyword" : [
            "Women's Clothing",
            "Women's Shoes"
          ],
          "price_range" : [
            "50-100"
          ],
          "geoip.city_name" : [
            "New York"
          ]
        },
        "doc_count" : 100,
        "support" : 0.0213903743315508
      },
      {
        "key" : {
          "category.keyword" : [
            "Women's Clothing",
            "Women's Shoes"
          ],
          "price_range" : [
            "50-100"
          ],
          "geoip.city_name" : [
            "Dubai"
          ]
        },
        "doc_count" : 59,
        "support" : 0.012620320855614974
      },
      {
        "key" : {
          "category.keyword" : [
            "Men's Clothing",
            "Men's Shoes"
          ],
          "price_range" : [
            "50-100"
          ],
          "geoip.city_name" : [
            "Marrakesh"
          ]
        },
        "doc_count" : 53,
        "support" : 0.011336898395721925
      }
    ],
    (...)
  }
}
-------------------------------------------------
// TEST[skip:setup kibana sample data]

The response shows the categories that customers purchase from most frequently
together, the location of the customers who tend to buy items from these
categories, and the most frequent price ranges of these purchases.