[role="xpack"]
[testenv="basic"]
[[transform-examples]]
=== {transform-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++

beta[]

These examples demonstrate how to use {transforms} to derive useful
insights from your data. All the examples use one of the
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
step-by-step example, see <<ecommerce-transforms>>.

* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>

[[example-best-customers]]
==== Finding your best customers

In this example, we use the eCommerce orders sample dataset to find the
customers who spent the most in our hypothetical webshop. Let's transform the
data such that the destination index contains the number of orders, the total
price of the orders, the average price per order, the average number of unique
products per order, and the total number of unique products for each customer.

[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : { <1>
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { <2>
      "user": { "terms": { "field": "user" }},
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<2> Two `group_by` fields have been selected. This means the {dataframe} will
contain a unique row per `user` and `customer_id` combination. Within this
dataset both of these fields are unique. Including both in the {dataframe}
gives more context to the final results.

NOTE: In the example above, condensed JSON formatting is used for easier
readability of the pivot object.

The preview {transforms} API enables you to see the layout of the
{dataframe} in advance, populated with some sample values. For example:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "total_order_amt" : 3946.9765625,
      "order_count" : 59.0,
      "total_unique_products" : 116.0,
      "avg_unique_products_per_order" : 2.0,
      "customer_id" : "10",
      "user" : "recip",
      "avg_amt_per_order" : 66.89790783898304
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE

This {dataframe} makes it easier to answer questions such as:

* Which customers spend the most?
* Which customers spend the most per order?
* Which customers order most often?
* Which customers ordered the least number of different products?

It's possible to answer these questions using aggregations alone; however,
{dataframes} let us persist this data as a customer-centric index. This
enables us to analyze data at scale and gives us more flexibility to explore
and navigate data from a customer-centric perspective. In some cases, it can
even make creating visualizations much simpler.
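
If the preview looks the way you expect, you can go on to create and start the
{transform}. The request below is a minimal sketch that reuses the
configuration from the preview request; the transform ID
`ecommerce_customer_transform` is just an illustrative name, and the
destination index is typically created for you when the {transform} starts.

[source,console]
----------------------------------
PUT _data_frame/transforms/ecommerce_customer_transform
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest": {
    "index": "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": {
      "user": { "terms": { "field": "user" }},
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}

POST _data_frame/transforms/ecommerce_customer_transform/_start
----------------------------------
// TEST[skip:setup kibana sample data]

After the {transform} has processed its checkpoint, a simple search on
`sample_ecommerce_orders_by_customer` sorted by `total_order_amt` shows which
customers spend the most.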

[[example-airline]]
==== Finding air carriers with the most delays

In this example, we use the Flights sample dataset to find out which air
carrier had the most delays. First, we filter the source data such that it
excludes all the cancelled flights by using a query filter. Then we transform
the data to contain the number of flights, the sum of delayed minutes, and the
sum of the flight minutes by air carrier. Finally, we use a
{ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_script`]
to determine what percentage of the flight time was actually spent in delays.

[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { <1>
      "bool": {
        "filter": [
          { "term": { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { <3>
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { <4>
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> Filter the source data to select only flights that were not cancelled.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `Carrier` field, which contains the airline name.
<4> This `bucket_script` performs calculations on the results that are returned
by the aggregation. In this particular example, it calculates what percentage of
travel time was taken up by delays.

The preview shows you that the new index would contain data like this for each
carrier:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "carrier" : "ES-Air",
      "flights_count" : 2802.0,
      "flight_mins_total" : 1436927.5130677223,
      "delay_time_percentage" : 9.335543983955839,
      "delay_mins_total" : 134145.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE
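
As a quick sanity check, plugging the `ES-Air` values above into the
`bucket_script` gives `134145.0 / 1436927.513 * 100 ≈ 9.34`, which matches the
`delay_time_percentage` shown in the preview.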

This {dataframe} makes it easier to answer questions such as:

* Which air carrier has the most delays as a percentage of flight time?

NOTE: This data is fictional and does not reflect actual delays
or flight stats for any of the featured destination or origin airports.
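
Once a {transform} with this configuration has been created and started
(following the same pattern as the create request sketched in the previous
example), a plain search against the destination index answers that question
directly. The request below is illustrative; it assumes
`sample_flight_delays_by_carrier` has been populated and that
`delay_time_percentage` was mapped as a numeric field when the index was
created.

[source,console]
----------------------------------
GET sample_flight_delays_by_carrier/_search
{
  "size": 3,
  "sort": [
    { "delay_time_percentage": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]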

[[example-clientips]]
==== Finding suspicious client IPs by using scripted metrics

With {transforms}, you can use
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[scripted
metric aggregations] on your data. These aggregations are flexible and make
it possible to perform very complex processing. Let's use scripted metrics to
identify suspicious client IPs in the web log sample dataset.

We transform the data such that the new index contains the sum of bytes and the
number of distinct URLs, agents, incoming requests by location, and geographic
destinations for each client IP. We also use a scripted metric aggregation to
count the specific types of HTTP responses that each client IP receives.
Ultimately, the example below transforms web log data into an entity-centric
index where the entity is `clientip`.

[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_logs",
    "query": { <1>
      "range" : {
        "timestamp" : {
          "gte" : "now-30d/d"
        }
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_weblogs_by_clientip"
  },
  "pivot": {
    "group_by": { <3>
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "responses.counts": { <4>
        "scripted_metric": {
          "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
          "map_script": """
            def code = doc['response.keyword'].value;
            if (code.startsWith('5') || code.startsWith('4')) {
              state.responses.error += 1 ;
            } else if(code.startsWith('2')) {
              state.responses.success += 1;
            } else {
              state.responses.other += 1;
            }
            """,
          "combine_script": "state.responses",
          "reduce_script": """
            def counts = ['error': 0L, 'success': 0L, 'other': 0L];
            for (responses in states) {
              counts.error += responses['error'];
              counts.success += responses['success'];
              counts.other += responses['other'];
            }
            return counts;
            """
        }
      },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { <5>
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> This range query limits the {transform} to documents that are within the last
30 days at the point in time the {transform} checkpoint is processed.
For batch {dataframes} this occurs once.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `clientip` field.
<4> This `scripted_metric` performs a distributed operation on the web log data
to count specific types of HTTP responses (error, success, and other).
<5> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.

The preview shows you that the new index would contain data like this for each
client IP:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "geo" : {
        "src_dc" : 12.0,
        "dest_dc" : 9.0
      },
      "clientip" : "0.72.176.46",
      "agent_dc" : 3.0,
      "responses" : {
        "total" : 14.0,
        "counts" : {
          "other" : 0,
          "success" : 14,
          "error" : 0
        }
      },
      "bytes_sum" : 74808.0,
      "timestamp" : {
        "duration_ms" : 4.919943239E9,
        "min" : "2019-06-17T07:51:57.333Z",
        "max" : "2019-08-13T06:31:00.572Z"
      },
      "url_dc" : 11.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE

This {dataframe} makes it easier to answer questions such as:

* Which client IPs are transferring the most data?
* Which client IPs are interacting with a high number of different URLs?
* Which client IPs have high error rates?
* Which client IPs are interacting with a high number of destination countries?
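
As in the previous examples, once a {transform} with this configuration has
been created and started, you can query the destination index directly. The
request below is a small illustrative sketch that assumes
`sample_weblogs_by_clientip` has been populated; it finds the client IPs that
transferred the most data:

[source,console]
----------------------------------
GET sample_weblogs_by_clientip/_search
{
  "size": 5,
  "sort": [
    { "bytes_sum": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]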
|