[[grok]]
=== Grokking grok
Grok is a regular expression dialect that supports reusable aliased expressions.
Grok works well with syslog logs, Apache and other web server logs, MySQL logs,
and generally any log format that is written for humans rather than for
computer consumption.

Grok sits on top of the https://github.com/kkos/oniguruma/blob/master/doc/RE[Oniguruma]
regular expression library, so any regular expression that Oniguruma accepts is
valid in grok. Grok uses this regular expression language to let you name
existing patterns and combine them into more complex patterns that match your
fields.

[[grok-syntax]]
==== Grok patterns
The {stack} ships with numerous https://github.com/elastic/elasticsearch/blob/master/libs/grok/src/main/resources/patterns/legacy/grok-patterns[predefined grok patterns]
that simplify working with grok. The syntax for reusing grok patterns
takes one of the following forms:

[%autowidth]
|===
|`%{SYNTAX}` | `%{SYNTAX:ID}` |`%{SYNTAX:ID:TYPE}`
|===

`SYNTAX`::
The name of the pattern that will match your text. For example, `NUMBER` and
`IP` are both patterns that are provided within the default patterns set. The
`NUMBER` pattern matches data like `3.44`, and the `IP` pattern matches data
like `55.3.244.1`.

`ID`::
The identifier you give to the piece of text being matched. For example, `3.44`
could be the duration of an event, so you might call it `duration`. The string
`55.3.244.1` might identify the `client` making a request.

`TYPE`::
The data type to cast your named field to. `int`, `long`, `double`,
`float`, and `boolean` are the supported types.

For example, let's say you have message data that looks like this:

[source,txt]
----
3.44 55.3.244.1
----

The first value is a number, followed by what appears to be an IP address. You
can match this text by using the following grok expression:

[source,txt]
----
%{NUMBER:duration} %{IP:client}
----

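One way to try this expression without indexing any data is to run it through
a grok processor with the simulate pipeline API. The following is only a sketch:
the `message` field name is an assumption made for this illustration, and the
`:float` suffix is added to show the optional `TYPE` cast.

[source,console]
----
POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{NUMBER:duration:float} %{IP:client}"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "3.44 55.3.244.1"
      }
    }
  ]
}
----

If the pattern matches, the simulated document should contain a `duration`
field holding the number `3.44` and a `client` field holding `55.3.244.1`.
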
[[grok-ecs]]
==== Migrating to Elastic Common Schema (ECS)
To ease migration to the {ecs-ref}[Elastic Common Schema (ECS)], a new set of
ECS-compliant patterns is available in addition to the existing patterns. The
new ECS pattern definitions capture event field names that are compliant with
the schema.

The ECS pattern set has all of the pattern definitions from the legacy set, and
is a drop-in replacement. Use the
{logstash-ref}/plugins-filters-grok.html#plugins-filters-grok-ecs_compatibility[`ecs_compatibility`]
setting to switch modes.

New features and enhancements will be added to the ECS-compliant files. The
legacy patterns may still receive backwards-compatible bug fixes.

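The link above points to the Logstash setting. In Elasticsearch, the ingest grok
processor accepts an `ecs_compatibility` option as well. Assuming your version
supports it, a sketch like the following lets you compare the field names that
the two pattern sets produce. The `message` field name and the sample log line
are used here for illustration only.

[source,console]
----
POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{COMMONAPACHELOG}"],
          "ecs_compatibility": "v1"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
      }
    }
  ]
}
----

Running the same request with `ecs_compatibility` set to `disabled` should
return the legacy field names, which makes the differences easy to spot.
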
[[grok-patterns]]
==== Use grok patterns in Painless scripts
You can incorporate predefined grok patterns into Painless scripts to extract
data. To test your script, use either the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts]
of the Painless execute API or create a runtime field that includes the script.
Runtime fields offer greater flexibility and accept multiple documents, but the
Painless execute API is a great option if you don't have write access on a
cluster where you're testing a script.

TIP: If you need help building grok patterns to match your data, use the
{kibana-ref}/xpack-grokdebugger.html[Grok Debugger] tool in {kib}.

For example, if you're working with Apache log data, you can use the
`%{COMMONAPACHELOG}` syntax, which understands the structure of Apache logs. A
sample document might look like this:

// Note to contributors that the line break in the following example is
// intentional to promote better readability in the output
[source,js]
----
"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - -
[30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
----
// NOTCONSOLE

To extract the IP address from the `message` field, you can write a Painless
script that incorporates the `%{COMMONAPACHELOG}` syntax. You can test this
script using the {painless}/painless-execute-api.html#painless-runtime-ip[`ip` field context]
of the Painless execute API, but let's use a runtime field instead.

Based on the sample document, index the `timestamp` and `message` fields. To
remain flexible, use `wildcard` as the field type for `message`:

[source,console]
----
PUT /my-index/
{
  "mappings": {
    "properties": {
      "timestamp": {
        "format": "strict_date_optional_time||epoch_second",
        "type": "date"
      },
      "message": {
        "type": "wildcard"
      }
    }
  }
}
----

Next, use the <<docs-bulk,bulk API>> to index some log data into
`my-index`.

[source,console]
----
POST /my-index/_bulk?refresh
{"index":{}}
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
----
// TEST[continued]

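If you would rather check the script before touching the index mappings, the
Painless execute API mentioned earlier is one option. The request below is a
minimal sketch that assumes the `ip_field` context and passes one of the sample
log lines as an ad hoc document. Nothing is written to `my-index`, which is
referenced only so that the script can resolve the `message` field.

[source,console]
----
POST /_scripts/painless/_execute
{
  "script": {
    "source": """
      String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
      if (clientip != null) emit(clientip);
    """
  },
  "context": "ip_field",
  "context_setup": {
    "index": "my-index",
    "document": {
      "message": "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
    }
  }
}
----

For this input, the result should contain the extracted address `40.135.0.0`.
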
[[grok-patterns-runtime]]
==== Incorporate grok patterns and scripts in runtime fields
Now you can define a runtime field in the mappings that includes your Painless
script and grok pattern. If the pattern matches (`clientip != null`), the script
emits the value of the matching IP address. If the pattern doesn't match, the
script emits nothing for that document instead of failing.

[source,console]
----
PUT my-index/_mappings
{
  "runtime": {
    "http.clientip": {
      "type": "ip",
      "script": """
        String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  }
}
----
// TEST[continued]

Alternatively, you can define the same runtime field in the context of a
search request. The runtime definition and the script are exactly the same as
those defined previously in the index mapping. Copy that definition into
the search request under the `runtime_mappings` section and include a query
that matches on the runtime field. This query returns the same results as if
you <<grok-pattern-results,defined a search query>> for the `http.clientip`
runtime field in your index mappings, but only in the context of this specific
search:

[source,console]
----
GET my-index/_search
{
  "runtime_mappings": {
    "http.clientip": {
      "type": "ip",
      "script": """
        String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  },
  "query": {
    "match": {
      "http.clientip": "40.135.0.0"
    }
  },
  "fields" : ["http.clientip"]
}
----
// TEST[continued]

[[grok-pattern-results]]
==== Return calculated results
Using the `http.clientip` runtime field, you can define a simple query to run a
search for a specific IP address and return all related fields. The
<<search-fields,`fields`>> parameter on the `_search` API works for all fields,
even those that weren't sent as part of the original `_source`:

[source,console]
----
GET my-index/_search
{
  "query": {
    "match": {
      "http.clientip": "40.135.0.0"
    }
  },
  "fields" : ["http.clientip"]
}
----
// TEST[continued]
// TEST[s/_search/_search\?filter_path=hits/]

The response includes the specific IP address indicated in your search query.
The grok pattern within the Painless script extracted this value from the
`message` field at runtime.

[source,console-result]
----
{
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_id" : "1iN2a3kBw4xTzEDqyYE0",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:30:17-05:00",
          "message" : "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        "fields" : {
          "http.clientip" : [
            "40.135.0.0"
          ]
        }
      }
    ]
  }
}
----
// TESTRESPONSE[s/"_id" : "1iN2a3kBw4xTzEDqyYE0"/"_id": $body.hits.hits.0._id/]

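
Because the script emits nothing when the pattern fails to match, the same
runtime field can also help you find log lines that grok could not parse. The
following sketch assumes that the `http.clientip` runtime field is defined in
the index mappings as shown earlier, and negates an `exists` query on it:

[source,console]
----
GET my-index/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "http.clientip"
        }
      }
    }
  },
  "fields" : ["message"]
}
----

With the sample data indexed in this example, this search would be expected to
return only the document whose `message` is `not a valid apache log`.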