[role="xpack"]
[testenv="basic"]
[[transform-examples]]
= {transform-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++

These examples demonstrate how to use {transforms} to derive useful insights 
from your data. All the examples use one of the 
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed, 
step-by-step example, see <<ecommerce-transforms>>.

* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>
* <<example-last-log>>


[[example-best-customers]]
== Finding your best customers

This example uses the eCommerce orders sample data set to find the customers who 
spent the most in a hypothetical webshop. Let's transform the data such that the 
destination index contains the number of orders, the total price of the orders, 
the average price per order, the average number of unique products per order, 
and the total number of unique products ordered for each customer.

[source,console]
----------------------------------
POST _transform/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : { <1>
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { <2>
      "user": { "terms": { "field": "user" }}, 
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> The destination index for the {transform}. It is ignored by `_preview`.
<2> Two `group_by` fields are selected. This means the {transform} contains a 
unique row per `user` and `customer_id` combination. Within this data set, both 
of these fields are unique. Including both in the {transform} gives more 
context to the final results.

NOTE: In the example above, condensed JSON formatting is used for easier 
readability of the pivot object.

The preview {transforms} API enables you to see the layout of the {transform} in 
advance, populated with some sample values. For example:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "total_order_amt" : 3946.9765625,
      "order_count" : 59.0,
      "total_unique_products" : 116.0,
      "avg_unique_products_per_order" : 2.0,
      "customer_id" : "10",
      "user" : "recip",
      "avg_amt_per_order" : 66.89790783898304
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE


This {transform} makes it easier to answer questions such as:

* Which customers spend the most?

* Which customers spend the most per order?

* Which customers order most often?

* Which customers ordered the least number of different products?

You can answer these questions using aggregations alone. However, {transforms} 
allow us to persist this data as a customer-centric index. This enables us to 
analyze data at scale and gives more flexibility to explore and navigate data 
from a customer-centric perspective. In some cases, it can even make creating 
visualizations much simpler.
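
As a minimal sketch of the first question, the following search lists the five 
customers with the highest total spend. It assumes that you created and ran a 
{transform} with the previewed configuration, so that the 
`sample_ecommerce_orders_by_customer` index exists; the preview alone does not 
create the destination index. The `size` value is an arbitrary choice for 
illustration.

[source,console]
----------------------------------
GET sample_ecommerce_orders_by_customer/_search
{
  "size": 5,
  "sort": [
    { "total_order_amt": { "order": "desc" } }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]

Sorting on `avg_amt_per_order` or `order_count` instead addresses the second and 
third questions in the same way.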


[[example-airline]]
== Finding air carriers with the most delays

This example uses the Flights sample data set to find out which air carrier 
had the most delays. First, filter the source data such that it excludes all 
the cancelled flights by using a query filter. Then transform the data to 
contain the number of flights, the sum of delayed minutes, and the sum of the 
flight minutes by air carrier. Finally, use a 
<<search-aggregations-pipeline-bucket-script-aggregation,`bucket_script`>>
to determine what percentage of the flight time was taken up by delays.

[source,console]
----------------------------------
POST _transform/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { <1>
      "bool": {
        "filter": [
          { "term":  { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { <3>
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { <4>
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> Filter the source data to select only flights that are not cancelled.
<2> The destination index for the {transform}. It is ignored by `_preview`.
<3> The data is grouped by the `Carrier` field which contains the airline name.
<4> This `bucket_script` performs calculations on the results that are returned 
by the aggregation. In this particular example, it calculates what percentage of 
travel time was taken up by delays.

The preview shows you that the new index would contain data like this for each 
carrier:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "carrier" : "ES-Air",
      "flights_count" : 2802.0,
      "flight_mins_total" : 1436927.5130677223,
      "delay_time_percentage" : 9.335543983955839,
      "delay_mins_total" : 134145.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE

This {transform} makes it easier to answer questions such as:

* Which air carrier has the most delays as a percentage of flight time?

NOTE: This data is fictional and does not reflect actual delays or flight stats 
for any of the featured destination or origin airports.
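
As a sketch of how the question above could be answered, the following search 
ranks the carriers by `delay_time_percentage`. It assumes that a {transform} 
with the previewed configuration has been created and run, so that the 
`sample_flight_delays_by_carrier` index exists. The `_source` filter is 
optional and only trims the response.

[source,console]
----------------------------------
GET sample_flight_delays_by_carrier/_search
{
  "_source": [ "carrier", "delay_time_percentage" ],
  "sort": [
    { "delay_time_percentage": { "order": "desc" } }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]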


[[example-clientips]]
== Finding suspicious client IPs

This example uses the web log sample data set to identify suspicious client IPs. 
It transforms the data such that the new index contains the sum of bytes and the 
number of distinct URLs, agents, incoming requests by location, and geographic 
destinations for each client IP. It also uses filter aggregations to count the 
specific types of HTTP responses that each client IP receives. Ultimately, the 
example below transforms web log data into an entity-centric index where the 
entity is `clientip`.

[source,console]
----------------------------------
PUT _transform/suspicious_client_ips
{
  "source": {
    "index": "kibana_sample_data_logs"
  },
  "dest" : { <1>
    "index" : "sample_weblogs_by_clientip"
  },
  "sync" : { <2>
    "time": {
      "field": "timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {  <3>
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "success" : { <4>
         "filter": { 
            "term": { "response" : "200"}} 
        },
      "error404" : {
         "filter": { 
            "term": { "response" : "404"}}
        },
      "error503" : {
         "filter": { 
            "term": { "response" : "503"}}
        },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { <5>
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> The destination index for the {transform}.
<2> Configures the {transform} to run continuously. It uses the `timestamp` 
field to synchronize the source and destination indices. The worst case 
ingestion delay is 60 seconds.
<3> The data is grouped by the `clientip` field.
<4> Filter aggregation that counts the occurrences of successful (`200`) 
responses in the `response` field. The following two aggregations (`error404` 
and `error503`) count the error responses by error codes.
<5> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.


After you create the {transform}, you must start it:

[source,console]
----------------------------------
POST _transform/suspicious_client_ips/_start
----------------------------------
// TEST[skip:setup kibana sample data]
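
To confirm that the {transform} is running, you can optionally check its 
statistics. The request below is a sketch of such a check; the response 
reports the state of the {transform} and processing counters, and its exact 
contents depend on your version.

[source,console]
----------------------------------
GET _transform/suspicious_client_ips/_stats
----------------------------------
// TEST[skip:setup kibana sample data]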


Shortly thereafter, the first results should be available in the destination
index:

[source,console]
----------------------------------
GET sample_weblogs_by_clientip/_search
----------------------------------
// TEST[skip:setup kibana sample data]


The search result shows you data like this for each client IP:

[source,js]
----------------------------------
    "hits" : [
      {
        "_index" : "sample_weblogs_by_clientip",
        "_id" : "MOeHH_cUL5urmartKj-b5UQAAAAAAAAA",
        "_score" : 1.0,
        "_source" : {
          "geo" : {
            "src_dc" : 2.0,
            "dest_dc" : 2.0
          },
          "success" : 2,
          "error404" : 0,
          "error503" : 0,
          "clientip" : "0.72.176.46",
          "agent_dc" : 2.0,
          "bytes_sum" : 4422.0,
          "responses" : {
            "total" : 2.0
          },
          "url_dc" : 2.0,
          "timestamp" : {
            "duration_ms" : 5.2191698E8,
            "min" : "2020-03-16T07:51:57.333Z",
            "max" : "2020-03-22T08:50:34.313Z"
          }
        }
      }
    ]
----------------------------------
// NOTCONSOLE

NOTE: Like other {kib} sample data sets, the web log sample data set contains
timestamps relative to when you installed it, including timestamps in the 
future. The {ctransform} will pick up the data points once they are in the past. 
If you installed the web log sample data set some time ago, you can uninstall and 
reinstall it and the timestamps will change.


This {transform} makes it easier to answer questions such as:

* Which client IPs are transferring the most data?

* Which client IPs are interacting with a high number of different URLs?

* Which client IPs have high error rates?

* Which client IPs are interacting with a high number of destination countries?
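
As one possible starting point for the error-related question, the following 
search returns client IPs that received at least one `404` response, ordered by 
the amount of data transferred. The threshold of `1` and the choice of sort 
field are assumptions made for illustration; adjust them to what counts as 
suspicious in your environment.

[source,console]
----------------------------------
GET sample_weblogs_by_clientip/_search
{
  "query": {
    "range": { "error404": { "gte": 1 } }
  },
  "sort": [
    { "bytes_sum": { "order": "desc" } }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]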


[[example-last-log]]
== Finding the last log event for each IP address

This example uses the web log sample data set to find the last log from an IP 
address. Let's use the `latest` type of {transform} in continuous mode. It 
copies the most recent document for each unique key from the source index to the 
destination index and updates the destination index as new data comes into the 
source index.

Pick the `clientip` field as the unique key; the data is grouped by this field. 
Select `timestamp` as the date field that sorts the data chronologically. For 
continuous mode, specify a date field that is used to identify new documents, 
and an interval between checks for changes in the source index.

Let's assume that we're interested in retaining documents only for IP addresses 
that appeared recently in the log. You can define a retention policy and specify 
a date field that is used to calculate the age of a document. This example uses 
the same date field that is used to sort the data. Then set the maximum age of a 
document; documents that are older than the value you set will be removed from 
the destination index.

This {transform} creates the destination index that contains the most recent log 
document for each client IP. As the {transform} runs in continuous mode, the 
destination index will be updated as new data comes into the source index. 
Finally, every document that is older than 30 days will be removed from the 
destination index due to the applied retention policy.

[source,console]
----------------------------------
PUT _transform/last-log-from-clientip
{
  "source": {
    "index": [
      "kibana_sample_data_logs"
    ]
  },
  "latest": {
    "unique_key": [ <1>
      "clientip"
    ],
    "sort": "timestamp" <2>
  },
  "frequency": "1m", <3>
  "dest": {
    "index": "last-log-from-clientip"
  },
  "sync": { <4>
    "time": {
      "field": "timestamp",
      "delay": "60s"
    }
  },
  "retention_policy": { <5>
    "time": {
      "field": "timestamp",
      "max_age": "30d"
    }
  },
  "settings": {
    "max_page_search_size": 500
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> Specifies the field for grouping the data.
<2> Specifies the date field that is used for sorting the data.
<3> Sets the interval for the {transform} to check for changes in the source 
index.
<4> Contains the time field and delay settings used to synchronize the source 
and destination indices.
<5> Specifies the retention policy for the {transform}. Documents that are older 
than the configured value will be removed from the destination index.


After you create the {transform}, start it:

[source,console]
----------------------------------
POST _transform/last-log-from-clientip/_start
----------------------------------
// TEST[skip:setup kibana sample data]


After the {transform} processes the data, search the destination index:

[source,console]
----------------------------------
GET last-log-from-clientip/_search
----------------------------------
// TEST[skip:setup kibana sample data]


The search result shows you data like this for each client IP:

[source,js]
----------------------------------
{
  "_index" : "last-log-from-clientip",
  "_id" : "MOeHH_cUL5urmartKj-b5UQAAAAAAAAA",
  "_score" : 1.0,
  "_source" : {
    "referer" : "http://twitter.com/error/don-lind",
    "request" : "/elasticsearch",
    "agent" : "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
    "extension" : "",
    "memory" : null,
    "ip" : "0.72.176.46",
    "index" : "kibana_sample_data_logs",
    "message" : "0.72.176.46 - - [2018-09-18T06:31:00.572Z] \"GET /elasticsearch HTTP/1.1\" 200 7065 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\"",
    "url" : "https://www.elastic.co/downloads/elasticsearch",
    "tags" : [
      "success",
      "info"
    ],
    "geo" : {
      "srcdest" : "IN:PH",
      "src" : "IN",
      "coordinates" : {
        "lon" : -124.1127917,
        "lat" : 40.80338889
      },
      "dest" : "PH"
    },
    "utc_time" : "2021-05-04T06:31:00.572Z",
    "bytes" : 7065,
    "machine" : {
      "os" : "ios",
      "ram" : 12884901888
    },
    "response" : 200,
    "clientip" : "0.72.176.46",
    "host" : "www.elastic.co",
    "event" : {
      "dataset" : "sample_web_logs"
    },
    "phpmemory" : null,
    "timestamp" : "2021-05-04T06:31:00.572Z"
  }
}
----------------------------------
// NOTCONSOLE

This {transform} makes it easier to answer questions such as:

* What was the most recent log event associated with a specific IP address?
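
As a sketch of such a lookup, the following search retrieves the most recent 
log document for a single client IP. The IP address is taken from the sample 
result above and is only an illustration; the values in your copy of the sample 
data will differ. A `match` query is used here on the assumption that it works 
regardless of how the `clientip` field is mapped in the destination index.

[source,console]
----------------------------------
GET last-log-from-clientip/_search
{
  "query": {
    "match": { "clientip": "0.72.176.46" }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]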