- [[search-aggregations-bucket-frequent-item-sets-aggregation]]
- === Frequent item sets aggregation
- ++++
- <titleabbrev>Frequent item sets</titleabbrev>
- ++++
- A bucket aggregation which finds frequent item sets. It is a form of association
- rules mining that identifies items that often occur together. Items that are
- frequently purchased together or log events that tend to co-occur are examples
- of frequent item sets. Finding frequent item sets helps to discover
- relationships between different data points (items).
- The aggregation reports closed item sets. A frequent item set is called closed
- if no superset exists with the same ratio of documents (also known as its
- <<frequent-item-sets-minimum-support,support value>>). For example, consider the
- following two candidates for a frequent item set, which have the same support value:
- 1. `apple, orange, banana`
- 2. `apple, orange, banana, tomato`.
- Only the second item set (`apple, orange, banana, tomato`) is returned, and the
- first set – which is a subset of the second one – is skipped. Both item sets
- might be returned if their support values are different.
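The closedness rule above can be illustrated with a minimal Python sketch. It is not how the aggregation is implemented. The function name `closed_item_sets` and the support values are hypothetical and chosen only for illustration.

```python
def closed_item_sets(supports):
    """Keep only item sets that have no proper superset with the same support.

    `supports` maps a frozenset of items to its support value.
    """
    closed = {}
    for items, support in supports.items():
        # `items < other` is True only when `other` is a proper superset.
        has_equal_superset = any(
            items < other and other_support == support
            for other, other_support in supports.items()
        )
        if not has_equal_superset:
            closed[items] = support
    return closed


# Hypothetical support values, not taken from any real data set.
candidates = {
    frozenset({"apple", "orange", "banana"}): 0.2,
    frozenset({"apple", "orange", "banana", "tomato"}): 0.2,
    frozenset({"apple", "orange"}): 0.35,
}
result = closed_item_sets(candidates)
```

In this sketch, `apple, orange, banana` is dropped because its superset has the same support, while `apple, orange` is kept because its support differs from that of its supersets.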
- The runtime of the aggregation depends on the data and the provided parameters.
- It might take a significant time for the aggregation to complete. For this
- reason, it is recommended to use <<async-search,async search>> to run your
- requests asynchronously.
- ==== Syntax
- A `frequent_item_sets` aggregation looks like this in isolation:
- [source,js]
- --------------------------------------------------
- "frequent_item_sets": {
- "minimum_set_size": 3,
- "fields": [
- {"field": "my_field_1"},
- {"field": "my_field_2"}
- ]
- }
- --------------------------------------------------
- // NOTCONSOLE
- .`frequent_item_sets` Parameters
- |===
- |Parameter Name |Description |Required |Default Value
- |`fields` |(array) Fields to analyze. | Required |
- |`minimum_set_size` | (integer) The <<frequent-item-sets-minimum-set-size,minimum size>> of one item set. | Optional | `1`
- |`minimum_support` | (double) The <<frequent-item-sets-minimum-support,minimum support>> of one item set. | Optional | `0.1`
- |`size` | (integer) The number of top item sets to return. | Optional | `10`
- |`filter` | (object) Query that filters documents from the analysis. | Optional | `match_all`
- |===
- [discrete]
- [[frequent-item-sets-fields]]
- ==== Fields
- Supported field types for the analyzed fields are keyword, numeric, ip, date,
- and arrays of these types. You can also add runtime fields to your analyzed
- fields.
- If the combined cardinality of the analyzed fields is high, the aggregation
- might require a significant amount of system resources.
- You can filter the values for each field by using the `include` and `exclude`
- parameters. The parameters can be regular expression strings or arrays of
- strings of exact terms. The filtered values are removed from the analysis and
- therefore reduce the runtime. If both `include` and `exclude` are defined,
- `exclude` takes precedence: `include` is evaluated first, and `exclude` is then
- applied to the values that remain.
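The evaluation order described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper `filter_values` is hypothetical, a string rule is assumed to be a regular expression that must match the whole value, and Python's `re` syntax is not identical to the regular expression syntax Elasticsearch uses.

```python
import re


def filter_values(values, include=None, exclude=None):
    """Apply `include` first, then `exclude`, to a list of field values.

    A rule is either a regular expression string (assumed here to match the
    whole value) or a list of exact terms.
    """
    def matches(value, rule):
        if isinstance(rule, str):
            return re.fullmatch(rule, value) is not None
        return value in rule

    # Step 1: keep only values accepted by `include` (if given).
    kept = [v for v in values if include is None or matches(v, include)]
    # Step 2: drop values rejected by `exclude` (if given).
    return [v for v in kept if exclude is None or not matches(v, exclude)]


cities = ["New York", "Cairo", "other"]
without_other = filter_values(cities, exclude="other")
both_rules = filter_values(cities, include=["Cairo", "other"], exclude=["other"])
```

Because `exclude` runs last, a value that appears in both rules (`other` in the second call) is removed.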
- [discrete]
- [[frequent-item-sets-minimum-set-size]]
- ==== Minimum set size
- The minimum set size is the minimum number of items the set needs to contain. A
- value of 1 returns the frequency of single items. Only item sets that contain at
- least `minimum_set_size` items are returned. For example, the item set
- `orange, banana, apple` is returned only if the minimum set size is 3 or
- lower.
- [discrete]
- [[frequent-item-sets-minimum-support]]
- ==== Minimum support
- The minimum support value is the ratio of documents that an item set must exist
- in to be considered "frequent". In particular, it is a normalized value between
- 0 and 1. It is calculated by dividing the number of documents containing the
- item set by the total number of documents.
- For example, if a given item set is contained by five documents and the total
- number of documents is 20, then the support of the item set is 5/20 = 0.25.
- Therefore, this set is returned only if the minimum support is 0.25 or lower.
- As a higher minimum support prunes more items, the calculation is less resource
- intensive. The `minimum_support` parameter has an effect on the required memory
- and the runtime of the aggregation.
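The support calculation described in this section can be reproduced with a short Python sketch. The function name `support` and the toy documents are hypothetical. The numbers mirror the 5/20 example above.

```python
def support(item_set, documents):
    """Return the ratio of documents that contain every item of `item_set`."""
    wanted = set(item_set)
    matching = sum(1 for doc in documents if wanted <= set(doc))
    return matching / len(documents)


# 20 toy documents, 5 of which contain both `apple` and `orange`.
documents = [["apple", "orange"]] * 5 + [["banana"]] * 15

value = support(["apple", "orange"], documents)

# The set counts as frequent only if its support reaches `minimum_support`.
minimum_support = 0.25
is_frequent = value >= minimum_support
```

With these inputs the support is 0.25, so the item set would be dropped as soon as `minimum_support` is raised above 0.25.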
- [discrete]
- [[frequent-item-sets-size]]
- ==== Size
- This parameter defines the maximum number of item sets to return. The result
- contains the top-k item sets, that is, the item sets with the highest support
- values. This parameter has a significant effect on the required memory and the
- runtime of the aggregation.
- [discrete]
- [[frequent-item-sets-filter]]
- ==== Filter
- A query to filter documents to use as part of the analysis. Documents that
- don't match the filter are ignored when generating the item sets, but they
- still count when calculating the support of an item set.
- Use the filter if you want to narrow the item set analysis to fields of interest.
- Use a top-level query to filter the data set.
- [discrete]
- [[frequent-item-sets-example]]
- ==== Examples
- In the following examples, we use the e-commerce {kib} sample data set.
- [discrete]
- ==== Aggregation with two analyzed fields and an `exclude` parameter
- In the first example, the goal is to find out, based on transaction data, (1)
- which product categories customers frequently purchase from together and (2)
- from which cities they make those purchases. We want to exclude results
- where location information is not available (where the city name is `other`).
- Finally, we are interested in sets with three or more items, and want to see the
- three frequent item sets with the highest support.
- Note that we use the <<async-search,async search>> endpoint in this first
- example.
- [source,console]
- -------------------------------------------------
- POST /kibana_sample_data_ecommerce/_async_search
- {
- "size":0,
- "aggs":{
- "my_agg":{
- "frequent_item_sets":{
- "minimum_set_size":3,
- "fields":[
- {
- "field":"category.keyword"
- },
- {
- "field":"geoip.city_name",
- "exclude":"other"
- }
- ],
- "size":3
- }
- }
- }
- }
- -------------------------------------------------
- // TEST[skip:setup kibana sample data]
- The response of the API call above contains an identifier (`id`) of the async
- search request. You can use the identifier to retrieve the search results:
- [source,console]
- -------------------------------------------------
- GET /_async_search/<id>
- -------------------------------------------------
- // TEST[skip:setup kibana sample data]
- The API returns a response similar to the following one:
- [source,console-result]
- -------------------------------------------------
- (...)
- "aggregations" : {
- "my_agg" : {
- "buckets" : [ <1>
- {
- "key" : { <2>
- "category.keyword" : [
- "Women's Clothing",
- "Women's Shoes"
- ],
- "geoip.city_name" : [
- "New York"
- ]
- },
- "doc_count" : 217, <3>
- "support" : 0.04641711229946524 <4>
- },
- {
- "key" : {
- "category.keyword" : [
- "Women's Clothing",
- "Women's Accessories"
- ],
- "geoip.city_name" : [
- "New York"
- ]
- },
- "doc_count" : 135,
- "support" : 0.028877005347593583
- },
- {
- "key" : {
- "category.keyword" : [
- "Men's Clothing",
- "Men's Shoes"
- ],
- "geoip.city_name" : [
- "Cairo"
- ]
- },
- "doc_count" : 123,
- "support" : 0.026310160427807486
- }
- ],
- (...)
- }
- }
- -------------------------------------------------
- // TEST[skip:setup kibana sample data]
- <1> The array of returned item sets.
- <2> The `key` object contains one item set. In this case, it consists of two
- values of the `category.keyword` field and one value of the `geoip.city_name`
- field.
- <3> The number of documents that contain the item set.
- <4> The support value of the item set. It is calculated by dividing the number
- of documents containing the item set by the total number of documents.
- The response shows that the categories customers most frequently purchase from
- together are `Women's Clothing` and `Women's Shoes`, and that customers from New
- York tend to buy items from these two categories together. In other words,
- customers who buy products labelled `Women's Clothing` are more likely to also
- buy products from the `Women's Shoes` category, and this combination is most
- common among customers from New York. The item set with the second
- highest support is `Women's Clothing` and `Women's Accessories` with customers
- mostly from New York. Finally, the item set with the third highest support is
- `Men's Clothing` and `Men's Shoes` with customers mostly from Cairo.
- [discrete]
- ==== Aggregation with two analyzed fields and a filter
- We take the first example, but want to narrow the item sets to places in Europe.
- For that, we add a filter, and this time, we don't use the `exclude` parameter:
- [source,console]
- -------------------------------------------------
- POST /kibana_sample_data_ecommerce/_async_search
- {
- "size": 0,
- "aggs": {
- "my_agg": {
- "frequent_item_sets": {
- "minimum_set_size": 3,
- "fields": [
- { "field": "category.keyword" },
- { "field": "geoip.city_name" }
- ],
- "size": 3,
- "filter": {
- "term": {
- "geoip.continent_name": "Europe"
- }
- }
- }
- }
- }
- }
- -------------------------------------------------
- // TEST[skip:setup kibana sample data]
- The result only shows item sets that are created from documents matching the
- filter, namely purchases in Europe. With `filter`, the calculated `support`
- still takes all purchases into account. That's different from specifying a query
- at the top level, in which case `support` is calculated only from purchases in
- Europe.
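The difference between the two denominators can be made concrete with a small Python sketch. It only models the behaviour described above and is not the aggregation's implementation. The function names and the toy purchases are hypothetical.

```python
def support_with_filter(item_set, documents, predicate):
    """Model of `filter`: only matching documents contribute item sets,
    but the support is still divided by the total number of documents."""
    matching = sum(
        1 for doc in documents if predicate(doc) and item_set <= doc["items"]
    )
    return matching / len(documents)


def support_with_query(item_set, documents, predicate):
    """Model of a top-level query: non-matching documents are removed first,
    so the support is divided by the number of matching documents only."""
    subset = [doc for doc in documents if predicate(doc)]
    matching = sum(1 for doc in subset if item_set <= doc["items"])
    return matching / len(subset)


item_set = {"Women's Clothing", "Women's Shoes"}
other = {"Men's Clothing"}

# 10 toy purchases: 4 in Europe (2 contain the item set), 6 elsewhere.
purchases = (
    [{"continent": "Europe", "items": item_set}] * 2
    + [{"continent": "Europe", "items": other}] * 2
    + [{"continent": "Asia", "items": item_set}] * 3
    + [{"continent": "Asia", "items": other}] * 3
)


def in_europe(doc):
    return doc["continent"] == "Europe"


filtered = support_with_filter(item_set, purchases, in_europe)
queried = support_with_query(item_set, purchases, in_europe)
```

In this toy data, the same two European purchases yield a support of 2/10 with `filter` and 2/4 with a top-level query.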
- [discrete]
- ==== Analyzing numeric values by using a runtime field
- The frequent item sets aggregation enables you to bucket numeric values by using
- <<runtime,runtime fields>>. The next example demonstrates how to use a script to
- add a runtime field to your documents called `price_range`, which is
- calculated from the taxful total price of the individual transactions. The
- runtime field can then be used in the frequent item sets aggregation as a field
- to analyze.
- [source,console]
- -------------------------------------------------
- GET kibana_sample_data_ecommerce/_search
- {
- "runtime_mappings": {
- "price_range": {
- "type": "keyword",
- "script": {
- "source": """
- def bucket_start = (long) Math.floor(doc['taxful_total_price'].value / 50) * 50;
- def bucket_end = bucket_start + 50;
- emit(bucket_start.toString() + "-" + bucket_end.toString());
- """
- }
- }
- },
- "size": 0,
- "aggs": {
- "my_agg": {
- "frequent_item_sets": {
- "minimum_set_size": 4,
- "fields": [
- {
- "field": "category.keyword"
- },
- {
- "field": "price_range"
- },
- {
- "field": "geoip.city_name"
- }
- ],
- "size": 3
- }
- }
- }
- }
- -------------------------------------------------
- // TEST[skip:setup kibana sample data]
- The API returns a response similar to the following one:
- [source,console-result]
- -------------------------------------------------
- (...)
- "aggregations" : {
- "my_agg" : {
- "buckets" : [
- {
- "key" : {
- "category.keyword" : [
- "Women's Clothing",
- "Women's Shoes"
- ],
- "price_range" : [
- "50-100"
- ],
- "geoip.city_name" : [
- "New York"
- ]
- },
- "doc_count" : 100,
- "support" : 0.0213903743315508
- },
- {
- "key" : {
- "category.keyword" : [
- "Women's Clothing",
- "Women's Shoes"
- ],
- "price_range" : [
- "50-100"
- ],
- "geoip.city_name" : [
- "Dubai"
- ]
- },
- "doc_count" : 59,
- "support" : 0.012620320855614974
- },
- {
- "key" : {
- "category.keyword" : [
- "Men's Clothing",
- "Men's Shoes"
- ],
- "price_range" : [
- "50-100"
- ],
- "geoip.city_name" : [
- "Marrakesh"
- ]
- },
- "doc_count" : 53,
- "support" : 0.011336898395721925
- }
- ],
- (...)
- }
- }
- -------------------------------------------------
- // TEST[skip:setup kibana sample data]
- The response shows the categories that customers purchase from most frequently
- together, the location of the customers who tend to buy items from these
- categories, and the most frequent price ranges of these purchases.
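To see which labels the `price_range` runtime field produces, the bucketing arithmetic of the script in the request above can be mirrored in Python. This is a sketch of the arithmetic only, not Painless. The function name and the `width` parameter are hypothetical additions (the request hard-codes 50).

```python
import math


def price_range(taxful_total_price, width=50):
    """Return a label such as "50-100" for the bucket containing the price."""
    # Round the price down to the nearest multiple of `width`.
    bucket_start = math.floor(taxful_total_price / width) * width
    bucket_end = bucket_start + width
    return f"{bucket_start}-{bucket_end}"


labels = [price_range(p) for p in (49.99, 75.5, 100.0)]
```

Each bucket includes its lower bound and excludes its upper bound, so a price of exactly 100 falls into `100-150` rather than `50-100`.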