[role="xpack"]
[testenv="basic"]
[[transform-examples]]
= {transform-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++

These examples demonstrate how to use {transforms} to derive useful
insights from your data. All the examples use one of the
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
step-by-step example, see
<<ecommerce-transforms>>.

* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>

[[example-best-customers]]
== Finding your best customers

In this example, we use the eCommerce orders sample dataset to find the
customers who spent the most in our hypothetical webshop. Let's transform the
data such that the destination index contains the number of orders, the total
price of the orders, the average price per order, the average number of unique
products per order, and the total number of unique products ordered for each
customer.

[source,console]
----------------------------------
POST _transform/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : { <1>
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { <2>
      "user": { "terms": { "field": "user" }},
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> This is the destination index for the {transform}. It is ignored by
`_preview`.
<2> Two `group_by` fields have been selected. This means the {transform} will
contain a unique row per `user` and `customer_id` combination. Within this
dataset, both fields are unique. Including both in the {transform} gives more
context to the final results.

NOTE: In the example above, condensed JSON formatting has been used for easier
readability of the pivot object.

The preview {transforms} API enables you to see the layout of the
{transform} in advance, populated with some sample values. For example:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "total_order_amt" : 3946.9765625,
      "order_count" : 59.0,
      "total_unique_products" : 116.0,
      "avg_unique_products_per_order" : 2.0,
      "customer_id" : "10",
      "user" : "recip",
      "avg_amt_per_order" : 66.89790783898304
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE

This {transform} makes it easier to answer questions such as:

* Which customers spend the most?
* Which customers spend the most per order?
* Which customers order most often?
* Which customers ordered the least number of different products?

It's possible to answer these questions using aggregations alone; however,
{transforms} allow us to persist this data as a customer-centric index. This
enables us to analyze data at scale and gives more flexibility to explore and
navigate data from a customer-centric perspective. In some cases, it can even
make creating visualizations much simpler.
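
For instance, once the {transform} has run, the first question becomes a simple
search against the destination index. This is just a sketch; it assumes the
`sample_ecommerce_orders_by_customer` index defined above has been created and
populated:

[source,console]
----------------------------------
GET sample_ecommerce_orders_by_customer/_search
{
  "size": 5,
  "sort": [ { "total_order_amt": "desc" } ]
}
----------------------------------
// TEST[skip:setup kibana sample data]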

[[example-airline]]
== Finding air carriers with the most delays

In this example, we use the Flights sample dataset to find out which air carrier
had the most delays. First, we filter the source data such that it excludes all
the cancelled flights by using a query filter. Then we transform the data to
contain the distinct number of flights, the sum of delayed minutes, and the sum
of the flight minutes by air carrier. Finally, we use a
<<search-aggregations-pipeline-bucket-script-aggregation,`bucket_script`>>
to determine what percentage of the flight time was actually spent delayed.

[source,console]
----------------------------------
POST _transform/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { <1>
      "bool": {
        "filter": [
          { "term": { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { <3>
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { <4>
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> Filter the source data to select only flights that were not cancelled.
<2> This is the destination index for the {transform}. It is ignored by
`_preview`.
<3> The data is grouped by the `Carrier` field, which contains the airline name.
<4> This `bucket_script` performs calculations on the results that are returned
by the aggregation. In this particular example, it calculates what percentage of
travel time was taken up by delays.

The preview shows you that the new index would contain data like this for each
carrier:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "carrier" : "ES-Air",
      "flights_count" : 2802.0,
      "flight_mins_total" : 1436927.5130677223,
      "delay_time_percentage" : 9.335543983955839,
      "delay_mins_total" : 134145.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE

This {transform} makes it easier to answer questions such as:

* Which air carrier has the most delays as a percentage of flight time?
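
Once the destination index exists, a search sorted on the calculated percentage
answers this question directly. This sketch assumes the
`sample_flight_delays_by_carrier` index defined above has been created and
populated:

[source,console]
----------------------------------
GET sample_flight_delays_by_carrier/_search
{
  "sort": [ { "delay_time_percentage": "desc" } ]
}
----------------------------------
// TEST[skip:setup kibana sample data]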

NOTE: This data is fictional and does not reflect actual delays
or flight stats for any of the featured destination or origin airports.

[[example-clientips]]
== Finding suspicious client IPs

In this example, we use the web log sample dataset to identify suspicious client
IPs. We transform the data such that the new index contains the sum of bytes and
the number of distinct URLs, agents, incoming requests by location, and
geographic destinations for each client IP. We also use filter aggregations to
count the specific types of HTTP responses that each client IP receives.
Ultimately, the example below transforms web log data into an entity-centric
index where the entity is `clientip`.

[source,console]
----------------------------------
PUT _transform/suspicious_client_ips
{
  "source": {
    "index": "kibana_sample_data_logs"
  },
  "dest" : { <1>
    "index" : "sample_weblogs_by_clientip"
  },
  "sync" : { <2>
    "time": {
      "field": "timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": { <3>
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "success" : { <4>
        "filter": {
          "term": { "response" : "200" }
        }
      },
      "error404" : {
        "filter": {
          "term": { "response" : "404" }
        }
      },
      "error503" : {
        "filter": {
          "term": { "response" : "503" }
        }
      },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { <5>
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> This is the destination index for the {transform}.
<2> Configures the {transform} to run continuously. It uses the `timestamp`
field to synchronize the source and destination indices. The worst case
ingestion delay is 60 seconds.
<3> The data is grouped by the `clientip` field.
<4> Filter aggregation that counts the occurrences of successful (`200`)
responses in the `response` field. The following two aggregations (`error404`
and `error503`) count the error responses by error code.
<5> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.

After you create the {transform}, you must start it:

[source,console]
----------------------------------
POST _transform/suspicious_client_ips/_start
----------------------------------
// TEST[skip:setup kibana sample data]
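
To check on the state and progress of the running {transform}, you can use the
get {transform} statistics API. This is just a quick health check; the request
below uses the {transform} ID created earlier:

[source,console]
----------------------------------
GET _transform/suspicious_client_ips/_stats
----------------------------------
// TEST[skip:setup kibana sample data]

The response includes the current `state` of the {transform} and counters such
as the number of documents that have been processed.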

Shortly thereafter, the first results should be available in the destination
index:

[source,console]
----------------------------------
GET sample_weblogs_by_clientip/_search
----------------------------------
// TEST[skip:setup kibana sample data]

The search result shows you data like this for each client IP:

[source,js]
----------------------------------
"hits" : [
  {
    "_index" : "sample_weblogs_by_clientip",
    "_id" : "MOeHH_cUL5urmartKj-b5UQAAAAAAAAA",
    "_score" : 1.0,
    "_source" : {
      "geo" : {
        "src_dc" : 2.0,
        "dest_dc" : 2.0
      },
      "success" : 2,
      "error404" : 0,
      "error503" : 0,
      "clientip" : "0.72.176.46",
      "agent_dc" : 2.0,
      "bytes_sum" : 4422.0,
      "responses" : {
        "total" : 2.0
      },
      "url_dc" : 2.0,
      "timestamp" : {
        "duration_ms" : 5.2191698E8,
        "min" : "2020-03-16T07:51:57.333Z",
        "max" : "2020-03-22T08:50:34.313Z"
      }
    }
  }
]
----------------------------------
// NOTCONSOLE

NOTE: Like other {kib} sample datasets, the web log sample dataset contains
timestamps relative to when you installed it, including timestamps in the
future. The {ctransform} will pick up the data points once they are in the past.
If you installed the web log sample dataset some time ago, you can uninstall and
reinstall it and the timestamps will change.

This {transform} makes it easier to answer questions such as:

* Which client IPs are transferring the most data?
* Which client IPs are interacting with a high number of different URLs?
* Which client IPs have high error rates?
* Which client IPs are interacting with a high number of destination countries?
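
For example, to surface client IPs that received server errors, sorted by the
amount of data they transferred, you can query the destination index. This
sketch assumes the `sample_weblogs_by_clientip` index has been populated by the
{transform}:

[source,console]
----------------------------------
GET sample_weblogs_by_clientip/_search
{
  "query": {
    "range": { "error503": { "gt": 0 } }
  },
  "sort": [ { "bytes_sum": "desc" } ]
}
----------------------------------
// TEST[skip:setup kibana sample data]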