[role="xpack"]
[testenv="basic"]
[[dataframe-examples]]
== {transform-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++

beta[]

These examples demonstrate how to use {transforms} to derive useful
insights from your data. All the examples use one of the
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
step-by-step example, see
<<ecommerce-dataframes,Transforming your data with {dataframes}>>.

* <<ecommerce-dataframes>>
* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>

include::ecommerce-example.asciidoc[]
[[example-best-customers]]
=== Finding your best customers

In this example, we use the eCommerce orders sample dataset to find the
customers who spent the most in our hypothetical webshop. Let's transform the
data such that the destination index contains the number of orders, the total
price of the orders, the average price per order, the average number of unique
products per order, and the total number of unique products for each customer.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : { <1>
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { <2>
      "user": { "terms": { "field": "user" }},
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<2> Two `group_by` fields are selected. This means the {dataframe} contains a
unique row per `user` and `customer_id` combination. Within this dataset, both
of these fields are unique. Including both in the {dataframe} gives more
context to the final results.

NOTE: In the example above, condensed JSON formatting is used for easier
readability of the pivot object.

The preview {transforms} API enables you to see the layout of the
{dataframe} in advance, populated with some sample values. For example:
[source,js]
----------------------------------
{
  "preview" : [
    {
      "total_order_amt" : 3946.9765625,
      "order_count" : 59.0,
      "total_unique_products" : 116.0,
      "avg_unique_products_per_order" : 2.0,
      "customer_id" : "10",
      "user" : "recip",
      "avg_amt_per_order" : 66.89790783898304
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:

* Which customers spend the most?
* Which customers spend the most per order?
* Which customers order most often?
* Which customers ordered the least number of different products?

It's possible to answer these questions using aggregations alone; however,
{dataframes} allow us to persist this data as a customer centric index. This
enables us to analyze data at scale and gives more flexibility to explore and
navigate data from a customer centric perspective. In some cases, it can even
make creating visualizations much simpler.
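
For example, because `_preview` never writes any documents, these questions can
only be answered against the destination index after the {transform} itself has
been created and started. As a sketch, assuming `sample_ecommerce_orders_by_customer`
has been populated, the top spenders are then one sorted search away:

[source,console]
----------------------------------
GET sample_ecommerce_orders_by_customer/_search
{
  "size": 3,
  "sort": [
    { "total_order_amt": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]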
[[example-airline]]
=== Finding air carriers with the most delays

In this example, we use the Flights sample dataset to find out which air
carrier had the most delays. First, we filter the source data such that it
excludes all the cancelled flights by using a query filter. Then we transform
the data to contain the distinct number of flights, the sum of delayed minutes,
and the sum of the flight minutes by air carrier. Finally, we use a
{ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_script`]
to determine what percentage of the flight time was actually spent in delays.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { <1>
      "bool": {
        "filter": [
          { "term": { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { <3>
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { <4>
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> Filter the source data to select only flights that were not cancelled.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `Carrier` field, which contains the airline
name.
<4> This `bucket_script` performs calculations on the results that are returned
by the aggregation. In this particular example, it calculates what percentage
of travel time was taken up by delays.

The preview shows you that the new index would contain data like this for each
carrier:
[source,js]
----------------------------------
{
  "preview" : [
    {
      "carrier" : "ES-Air",
      "flights_count" : 2802.0,
      "flight_mins_total" : 1436927.5130677223,
      "delay_time_percentage" : 9.335543983955839,
      "delay_mins_total" : 134145.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:

* Which air carrier has the most delays as a percentage of flight time?

NOTE: This data is fictional and does not reflect actual delays or flight
stats for any of the featured destination or origin airports.
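
As with the previous example, the question can be answered directly from the
destination index once the {transform} has run. A minimal sketch, assuming
`sample_flight_delays_by_carrier` has been populated:

[source,console]
----------------------------------
GET sample_flight_delays_by_carrier/_search
{
  "size": 3,
  "sort": [
    { "delay_time_percentage": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]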
[[example-clientips]]
=== Finding suspicious client IPs by using scripted metrics

With {transforms}, you can use
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[scripted
metric aggregations] on your data. These aggregations are flexible and make
it possible to perform very complex processing. Let's use scripted metrics to
identify suspicious client IPs in the web log sample dataset.

We transform the data such that the new index contains the sum of bytes and the
number of distinct URLs, agents, incoming requests by location, and geographic
destinations for each client IP. We also use a scripted field to count the
specific types of HTTP responses that each client IP receives. Ultimately, the
example below transforms web log data into an entity centric index where the
entity is `clientip`.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_logs",
    "query": { <1>
      "range" : {
        "timestamp" : {
          "gte" : "now-30d/d"
        }
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_weblogs_by_clientip"
  },
  "pivot": {
    "group_by": { <3>
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "responses.counts": { <4>
        "scripted_metric": {
          "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
          "map_script": """
            def code = doc['response.keyword'].value;
            if (code.startsWith('5') || code.startsWith('4')) {
              state.responses.error += 1;
            } else if(code.startsWith('2')) {
              state.responses.success += 1;
            } else {
              state.responses.other += 1;
            }
            """,
          "combine_script": "state.responses",
          "reduce_script": """
            def counts = ['error': 0L, 'success': 0L, 'other': 0L];
            for (responses in states) {
              counts.error += responses['error'];
              counts.success += responses['success'];
              counts.other += responses['other'];
            }
            return counts;
            """
        }
      },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { <5>
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> This range query limits the {transform} to documents that are within the
last 30 days at the point in time the {transform} checkpoint is processed. For
batch {dataframes} this occurs once.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `clientip` field.
<4> This `scripted_metric` performs a distributed operation on the web log data
to count specific types of HTTP responses (error, success, and other).
<5> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.

The preview shows you that the new index would contain data like this for each
client IP:
[source,js]
----------------------------------
{
  "preview" : [
    {
      "geo" : {
        "src_dc" : 12.0,
        "dest_dc" : 9.0
      },
      "clientip" : "0.72.176.46",
      "agent_dc" : 3.0,
      "responses" : {
        "total" : 14.0,
        "counts" : {
          "other" : 0,
          "success" : 14,
          "error" : 0
        }
      },
      "bytes_sum" : 74808.0,
      "timestamp" : {
        "duration_ms" : 4.919943239E9,
        "min" : "2019-06-17T07:51:57.333Z",
        "max" : "2019-08-13T06:31:00.572Z"
      },
      "url_dc" : 11.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:

* Which client IPs are transferring the most data?
* Which client IPs are interacting with a high number of different URLs?
* Which client IPs have high error rates?
* Which client IPs are interacting with a high number of destination countries?
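
For instance, once the {transform} has run, the error-rate question can be
answered with a search on the destination index. The sketch below assumes
`sample_weblogs_by_clientip` has been populated and that the dynamically mapped
`responses.counts.error` field is numeric:

[source,console]
----------------------------------
GET sample_weblogs_by_clientip/_search
{
  "query": {
    "range": {
      "responses.counts.error": { "gte": 1 }
    }
  },
  "sort": [
    { "responses.counts.error": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]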