[[common-script-uses]]
== Common scripting use cases

You can write a script to do almost anything, and sometimes, that's
the trouble. It's challenging to know what's possible with scripts,
so the following examples address common use cases where scripts are
really helpful.

* <<scripting-field-extraction,Field extraction>>

[[scripting-field-extraction]]
=== Field extraction

The goal of field extraction is simple: you have fields in your data with a bunch of
information, but you only want to extract pieces and parts.

There are two options at your disposal:

* <<grok,Grok>> is a regular expression dialect that supports aliased
expressions that you can reuse. Because Grok sits on top of regular expressions
(regex), any regular expressions are valid in grok as well.
* <<dissect-processor,Dissect>> extracts structured fields out of text, using
delimiters to define the matching pattern. Unlike grok, dissect doesn't use regular
expressions.

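As a rough illustration of what "aliased expressions" means, the following
standalone Python sketch expands `%{ALIAS:field}` references into named regular
expression groups. It runs outside {es} and is not how grok is implemented: the
`ALIASES` table and the `grok_to_regex` helper are hypothetical, and the two
aliases are simplified stand-ins for entries in the real grok pattern library.

```python
import re

# Hypothetical alias table; the real grok pattern library is much larger
# and its definitions are more permissive than these simplified ones.
ALIASES = {
    "INT": r"\d+",
    "IPV4": r"(?:\d{1,3}\.){3}\d{1,3}",
}

def grok_to_regex(pattern):
    """Expand %{ALIAS:field} references into named regex groups."""
    def expand(match):
        alias, field = match.group(1), match.group(2)
        return "(?P<" + field + ">" + ALIASES[alias] + ")"
    # Everything outside %{...} is left untouched, so plain regex still works.
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, pattern))

line = '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
regex = grok_to_regex(r"^%{IPV4:clientip} .* %{INT:response} %{INT:size}$")
fields = regex.match(line).groupdict()
print(fields)
```

The `INT` alias is reused for two different fields, which is the kind of reuse
that makes grok patterns shorter than the equivalent hand-written regex.
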
Regex is incredibly powerful but can be complicated. If you don't need the
power of regular expressions, use dissect patterns, which are simple and
often faster than grok patterns. Paying special attention to the parts of the string
you want to discard will help build successful dissect patterns.

Let's start with a simple example by adding the `@timestamp` and `message`
fields to the `my-index` mapping as indexed fields. To remain flexible, use
`wildcard` as the field type for `message`:

[source,console]
----
PUT /my-index/
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "format": "strict_date_optional_time||epoch_second",
        "type": "date"
      },
      "message": {
        "type": "wildcard"
      }
    }
  }
}
----

After mapping the fields you want to retrieve, index a few records from
your log data into {es}. The following request uses the <<docs-bulk,bulk API>>
to index raw log data into `my-index`. Instead of indexing all of your log
data, you can use a small sample to experiment with runtime fields.

[source,console]
----
POST /my-index/_bulk?refresh
{"index":{}}
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
----
// TEST[continued]

[discrete]
[[field-extraction-ip]]
==== Extract an IP address from a log message (Grok)

If you want to retrieve results that include `clientip`, you can add that
field as a runtime field in the mapping. The following runtime script defines a
grok pattern that extracts structured fields out of the `message` field.

The script matches on the `%{COMMONAPACHELOG}` log pattern, which understands
the structure of Apache logs. If the pattern matches, the script emits the
value matching the IP address. If the pattern doesn't match, `clientip` is
`null`, so the `clientip != null` check skips the `emit` call and the script
finishes without crashing.

[source,console]
----
PUT my-index/_mappings
{
  "runtime": {
    "http.clientip": {
      "type": "ip",
      "script": """
        String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip); <1>
      """
    }
  }
}
----
// TEST[continued]

<1> This condition ensures that the script emits a value only when the message
matches the pattern, so a message that doesn't match doesn't cause an error.

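To see why the null check matters, the following standalone Python sketch
mimics the runtime script outside {es}. It is an illustration only: the regular
expression is a simplified, hypothetical stand-in for `%{COMMONAPACHELOG}` (the
real grok pattern is more permissive), and `extract_clientip` is a made-up
helper name.

```python
import re

# Simplified stand-in for %{COMMONAPACHELOG}; an assumption for illustration,
# not the expression that grok actually expands the alias to.
COMMON_LOG = re.compile(
    r'^(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)$'
)

def extract_clientip(message):
    """Return the client IP, or None when the message is not an access log line."""
    match = COMMON_LOG.match(message)
    return match.group("clientip") if match else None

messages = [
    '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
    'not a valid apache log',
]

# Like the runtime script, emit a value only when extraction succeeded.
emitted = [ip for ip in map(extract_clientip, messages) if ip is not None]
print(emitted)
```

The second message produces no value at all, which mirrors how the sample
document containing `not a valid apache log` ends up without an
`http.clientip` value.
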
You can define a simple query to run a search for a specific IP address and
return all related fields. Use the `fields` parameter of the search API to
retrieve the `http.clientip` runtime field.

[source,console]
----
GET my-index/_search
{
  "query": {
    "match": {
      "http.clientip": "40.135.0.0"
    }
  },
  "fields" : ["http.clientip"]
}
----
// TEST[continued]
// TEST[s/_search/_search\?filter_path=hits/]

The response includes documents where the value for `http.clientip` matches
`40.135.0.0`.

[source,console-result]
----
{
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_id" : "Rq-ex3gBA_A0V6dYGLQ7",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:30:17-05:00",
          "message" : "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        "fields" : {
          "http.clientip" : [
            "40.135.0.0"
          ]
        }
      }
    ]
  }
}
----
// TESTRESPONSE[s/"_id" : "Rq-ex3gBA_A0V6dYGLQ7"/"_id": $body.hits.hits.0._id/]

[discrete]
[[field-extraction-parse]]
==== Parse a string to extract part of a field (Dissect)

Instead of matching on a log pattern like in the
<<field-extraction-ip,previous example>>, you can just define a dissect pattern
to include the parts of the string that you want to discard.

For example, the log data at the start of this section includes a `message`
field. This field contains several pieces of data:

[source,js]
----
"message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
----
// NOTCONSOLE

You can define a dissect pattern in a runtime field to extract the
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status[HTTP response code],
which is `304` in the previous example.

[source,console]
----
PUT my-index/_mappings
{
  "runtime": {
    "http.response": {
      "type": "long",
      "script": """
        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
        if (response != null) emit(Integer.parseInt(response));
      """
    }
  }
}
----
// TEST[continued]

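The following standalone Python sketch shows, in a simplified way, how a
delimiter-based pattern like the one in the script can be matched without any
regular expressions. It is not the {es} dissect implementation:
`dissect_extract` is a hypothetical helper, it supports none of the dissect
key modifiers, and it assumes that every pair of keys is separated by literal
text.

```python
def dissect_extract(pattern, text):
    """Match text against a dissect-style pattern using literal delimiters only.

    Returns a dict of key -> value, or None when the text doesn't fit.
    """
    # Split the pattern into ("lit", text) and ("key", name) parts.
    parts = []
    pos = 0
    while pos < len(pattern):
        start = pattern.find("%{", pos)
        if start == -1:
            parts.append(("lit", pattern[pos:]))
            break
        if start > pos:
            parts.append(("lit", pattern[pos:start]))
        end = pattern.index("}", start)
        parts.append(("key", pattern[start + 2:end]))
        pos = end + 1

    result = {}
    cursor = 0
    for i, (kind, value) in enumerate(parts):
        if kind == "lit":
            # Literal text must appear exactly at the current position.
            if not text.startswith(value, cursor):
                return None
            cursor += len(value)
        elif i + 1 < len(parts):
            if parts[i + 1][0] != "lit":
                raise ValueError("adjacent keys are not supported in this sketch")
            # A key captures everything up to the next literal delimiter.
            stop = text.find(parts[i + 1][1], cursor)
            if stop == -1:
                return None
            result[value] = text[cursor:stop]
            cursor = stop
        else:
            # The last key captures the rest of the text.
            result[value] = text[cursor:]
            cursor = len(text)
    return result if cursor == len(text) else None

PATTERN = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')
message = '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0'

fields = dissect_extract(PATTERN, message)
if fields is not None:
    print(int(fields["response"]))
```

Each key simply captures the text up to the next literal delimiter, so the
parts of the message you don't need (`ident`, `auth`, and so on) still have to
be named in the pattern for the keys you do want to line up.
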
You can then run a query to retrieve a specific HTTP response using the
`http.response` runtime field:

[source,console]
----
GET my-index/_search
{
  "query": {
    "match": {
      "http.response": "304"
    }
  },
  "fields" : ["http.response"]
}
----
// TEST[continued]
// TEST[s/_search/_search\?filter_path=hits/]

The response includes a single document where the HTTP response is `304`:

[source,console-result]
----
{
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_id" : "Sq-ex3gBA_A0V6dYGLQ7",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:31:22-05:00",
          "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
        },
        "fields" : {
          "http.response" : [
            304
          ]
        }
      }
    ]
  }
}
----
// TESTRESPONSE[s/"_id" : "Sq-ex3gBA_A0V6dYGLQ7"/"_id": $body.hits.hits.0._id/]

[discrete]
[[field-extraction-split]]
==== Split values in a field by a separator (Dissect)

Let's say you want to extract part of a field like in the previous example, but you
want to split on specific values. You can use a dissect pattern to extract only the
information that you want, and also return that data in a specific format.

For example, let's say you have a bunch of garbage collection (gc) log data from {es}
in this format:

[source,txt]
----
[2021-04-27T16:16:34.699+0000][82460][gc,heap,exit]   class space    used 266K, capacity 384K, committed 384K, reserved 1048576K
----
// NOTCONSOLE

You only want to extract the `used`, `capacity`, and `committed` data, along with
the associated values. Let's index a few documents containing log data to use as
an example:

[source,console]
----
POST /my-index/_bulk?refresh
{"index":{}}
{"gc": "[2021-04-27T16:16:34.699+0000][82460][gc,heap,exit] class space used 266K, capacity 384K, committed 384K, reserved 1048576K"}
{"index":{}}
{"gc": "[2021-03-24T20:27:24.184+0000][90239][gc,heap,exit] class space used 15255K, capacity 16726K, committed 16844K, reserved 1048576K"}
{"index":{}}
{"gc": "[2021-03-24T20:27:24.184+0000][90239][gc,heap,exit] Metaspace used 115409K, capacity 119541K, committed 120248K, reserved 1153024K"}
{"index":{}}
{"gc": "[2021-04-19T15:03:21.735+0000][84408][gc,heap,exit] class space used 14503K, capacity 15894K, committed 15948K, reserved 1048576K"}
{"index":{}}
{"gc": "[2021-04-19T15:03:21.735+0000][84408][gc,heap,exit] Metaspace used 107719K, capacity 111775K, committed 112724K, reserved 1146880K"}
{"index":{}}
{"gc": "[2021-04-27T16:16:34.699+0000][82460][gc,heap,exit] class space used 266K, capacity 367K, committed 384K, reserved 1048576K"}
----

Looking at the data again, there's a timestamp, some other data that you're not
interested in, and then the `used`, `capacity`, and `committed` data:

[source,txt]
----
[2021-04-27T16:16:34.699+0000][82460][gc,heap,exit] class space used 266K, capacity 384K, committed 384K, reserved 1048576K
----

You can assign variables to each part of the data in the `gc` field, and then return
only the parts that you want. Anything in curly braces preceded by a percent sign,
`%{}`, is considered a variable. For example, the variables
`[%{@timestamp}][%{code}][%{desc}]` will match the first three chunks of data, all
of which are in square brackets `[]`.

[source,txt]
----
[%{@timestamp}][%{code}][%{desc}] %{ident} used %{usize}, capacity %{csize}, committed %{comsize}, reserved %{rsize}
----

Your dissect pattern can include the terms `used`, `capacity`, and `committed` instead
of using variables, because you want to match those terms exactly. You also assign
variables to the values you want to return, such as `%{usize}`, `%{csize}`, and
`%{comsize}`. The separator in the log data is a comma, so your dissect pattern also
needs to use that separator.

Now that you have a dissect pattern, you can include it in a Painless script as part
of a runtime field. The script uses your dissect pattern to split apart the `gc`
field, and then returns exactly the information that you want as defined by the
`emit` method. Because dissect uses simple syntax, you just need to tell it exactly
what you want.

The following `emit` call returns the term `used`, a blank space, the value
from `gc.usize`, and a comma. This sequence repeats for the other data that you
want to retrieve. While this format might not be as useful in production, it provides
a lot of flexibility to experiment with and manipulate your data. In a production
setting, you might just want to use `emit(gc.usize)` and then aggregate on that value
or use it in computations.

[source,painless]
----
emit("used" + ' ' + gc.usize + ', ' + "capacity" + ' ' + gc.csize + ', ' + "committed" + ' ' + gc.comsize)
----

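You can try the same concatenation outside {es}. In the following standalone
Python sketch, `gc` is hard-coded with the values that dissect would extract
from the first sample document. The `to_kilobytes` helper is hypothetical and
not part of the original example: it shows one way a script could turn a value
such as `266K` into a number before using it in computations, assuming every
value ends in `K`.

```python
# Values dissect would extract from the first sample gc line (hard-coded here).
gc = {"usize": "266K", "csize": "384K", "comsize": "384K"}

# Same concatenation as the emit() call above.
gc_size = ("used" + " " + gc["usize"] + ", "
           + "capacity" + " " + gc["csize"] + ", "
           + "committed" + " " + gc["comsize"])
print(gc_size)

def to_kilobytes(size):
    """Convert a value such as '266K' to the integer 266 (assumes a trailing 'K')."""
    if not size.endswith("K"):
        raise ValueError("expected a size ending in 'K': " + size)
    return int(size[:-1])

print(to_kilobytes(gc["usize"]))
```
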
Putting it all together, you can create a runtime field named `gc_size` in a search
request. Using the <<search-fields-param,`fields` option>>, you can retrieve all values
for the `gc_size` runtime field. This query also includes a bucket aggregation to group
your data.

[source,console]
----
GET my-index/_search
{
  "runtime_mappings": {
    "gc_size": {
      "type": "keyword",
      "script": """
        Map gc=dissect('[%{@timestamp}][%{code}][%{desc}]  %{ident} used %{usize}, capacity %{csize}, committed %{comsize}, reserved %{rsize}').extract(doc["gc.keyword"].value);
        if (gc != null) emit("used" + ' ' + gc.usize + ', ' + "capacity" + ' ' + gc.csize + ', ' + "committed" + ' ' + gc.comsize);
      """
    }
  },
  "size": 1,
  "aggs": {
    "sizes": {
      "terms": {
        "field": "gc_size",
        "size": 10
      }
    }
  },
  "fields" : ["gc_size"]
}
----
// TEST[continued]

The response includes the data from the `gc_size` field, formatted exactly as you
defined it in the script's `emit` call:

[source,console-result]
----
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_id" : "GXx3H3kBKGE42WRNlddJ",
        "_score" : 1.0,
        "_source" : {
          "gc" : "[2021-04-27T16:16:34.699+0000][82460][gc,heap,exit] class space used 266K, capacity 384K, committed 384K, reserved 1048576K"
        },
        "fields" : {
          "gc_size" : [
            "used 266K, capacity 384K, committed 384K"
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "sizes" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "used 107719K, capacity 111775K, committed 112724K",
          "doc_count" : 1
        },
        {
          "key" : "used 115409K, capacity 119541K, committed 120248K",
          "doc_count" : 1
        },
        {
          "key" : "used 14503K, capacity 15894K, committed 15948K",
          "doc_count" : 1
        },
        {
          "key" : "used 15255K, capacity 16726K, committed 16844K",
          "doc_count" : 1
        },
        {
          "key" : "used 266K, capacity 367K, committed 384K",
          "doc_count" : 1
        },
        {
          "key" : "used 266K, capacity 384K, committed 384K",
          "doc_count" : 1
        }
      ]
    }
  }
}
----
// TESTRESPONSE[s/"took" : 2/"took": "$body.took"/]
// TESTRESPONSE[s/"_id" : "GXx3H3kBKGE42WRNlddJ"/"_id": $body.hits.hits.0._id/]

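The `sizes` buckets are counts of each distinct `gc_size` value. The following
standalone Python sketch reproduces that grouping outside {es} for the six
sample documents. It assumes the default `terms` ordering (highest document
count first, ties broken by key in ascending order) and hard-codes the emitted
strings instead of running dissect.

```python
from collections import Counter

# gc_size values emitted for the six sample documents, in index order.
gc_sizes = [
    "used 266K, capacity 384K, committed 384K",
    "used 15255K, capacity 16726K, committed 16844K",
    "used 115409K, capacity 119541K, committed 120248K",
    "used 14503K, capacity 15894K, committed 15948K",
    "used 107719K, capacity 111775K, committed 112724K",
    "used 266K, capacity 367K, committed 384K",
]

counts = Counter(gc_sizes)

# Assumed default terms ordering: highest count first, ties broken by key.
buckets = [
    {"key": key, "doc_count": count}
    for key, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
]

for bucket in buckets:
    print(bucket)
```

Because the keys are strings, they sort lexicographically rather than
numerically, which is why `used 107719K` is listed before `used 14503K`.
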