- [[dissect]]
- === Dissecting data
- Dissect matches a single text field against a defined pattern. A dissect
- pattern is defined by the parts of the string you want to discard. Paying
- special attention to each part of a string helps to build successful dissect
- patterns.
- If you don't need the power of regular expressions, use dissect patterns instead
- of grok. Dissect uses a much simpler syntax than grok and is typically faster
- overall. The syntax for dissect is transparent: tell dissect what you want and
- it will return those results to you.
- [[dissect-syntax]]
- ==== Dissect patterns
- Dissect patterns consist of _variables_ and _separators_. Anything
- defined by a percent sign and curly braces `%{}` is considered a variable,
- such as `%{clientip}`. You can assign variables to any part of data in a field,
- and then return only the parts that you want. Separators are any values between
- variables, which could be spaces, dashes, or other delimiters.
- For example, let's say you have log data with a `message` field that looks like
- this:
- [source,js]
- ----
- "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
- ----
- // NOTCONSOLE
- You assign variables to each part of the data to construct a successful
- dissect pattern. Remember, tell dissect _exactly_ what you want to match on.
- The first part of the data looks like an IP address, so you
- can assign a variable like `%{clientip}`. The next two characters are dashes
- with a space on either side. You can assign a variable for each dash, or a
- single variable to represent the dashes and spaces. Next is a set of brackets
- containing a timestamp. The brackets are a separator, so you include those in
- the dissect pattern. Thus far, the data and matching dissect pattern look like
- this:
- [source,js]
- ----
- 247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] <1>
- %{clientip} %{ident} %{auth} [%{@timestamp}] <2>
- ----
- // NOTCONSOLE
- <1> The first chunks of data from the `message` field
- <2> Dissect pattern to match on the selected data chunks
- Using that same logic, you can create variables for the remaining chunks of
- data. Double quotation marks are separators, so include those in your dissect
- pattern. The pattern replaces `GET` with a `%{verb}` variable, but keeps `HTTP`
- as part of the pattern.
- [source,js]
- ----
- \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0
- "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}
- ----
- // NOTCONSOLE
- Combining the two patterns results in a dissect pattern that looks like this:
- [source,js]
- ----
- %{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{response} %{size}
- ----
- // NOTCONSOLE
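To make the split between variables and separators concrete, the following Python sketch models how a dissect-style matcher could walk a string from left to right. It is a simplified illustration, not the {es} implementation: the `dissect` function here is hypothetical, it ignores dissect key modifiers, and it assumes that neighboring variables are always separated by at least one literal character.

```python
import re

def dissect(pattern, text):
    """Simplified dissect-style matcher (illustrative only).

    Returns a dict mapping variable names to values, or None when
    the text does not fit the pattern.
    """
    # Variable names are whatever sits inside %{...}.
    names = re.findall(r"%\{([^}]*)\}", pattern)
    # Everything between the variables is a literal separator.
    separators = re.split(r"%\{[^}]*\}", pattern)

    # The text must begin with the literal that precedes the first variable.
    if not text.startswith(separators[0]):
        return None
    pos = len(separators[0])

    fields = {}
    for name, sep in zip(names, separators[1:]):
        if sep == "":
            # No trailing separator: the variable takes the rest of the text.
            end = len(text)
        else:
            end = text.find(sep, pos)
            if end == -1:
                return None  # a required separator is missing
        fields[name] = text[pos:end]
        pos = end + len(sep)

    # Leftover text means the pattern did not cover the whole string.
    return fields if pos == len(text) else None


pattern = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')
message = ('247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] '
           '"GET /images/hm_nbg.jpg HTTP/1.0" 304 0')

fields = dissect(pattern, message)
print(fields["clientip"])   # 247.37.0.0
print(fields["response"])   # 304
print(dissect(pattern, "not a valid apache log"))  # None
```

The sketch never interprets the values themselves. It only searches for the next separator, which is why the timestamp can contain a space without confusing the match: the separator that ends it is `] "`, not a single space.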
- Now that you have a dissect pattern, how do you test and use it?
- [[dissect-patterns-test]]
- ==== Test dissect patterns with Painless
- You can incorporate dissect patterns into Painless scripts to extract
- data. To test your script, use either the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts] of the Painless
- execute API or create a runtime field that includes the script. Runtime fields
- offer greater flexibility and accept multiple documents, but the Painless execute
- API is a great option if you don't have write access on a cluster where you're
- testing a script.
- For example, test your dissect pattern with the Painless execute API by
- including your Painless script and a single document that matches your data.
- Start by indexing the `message` field as a `wildcard` data type:
- [source,console]
- ----
- PUT my-index
- {
- "mappings": {
- "properties": {
- "message": {
- "type": "wildcard"
- }
- }
- }
- }
- ----
- If you want to retrieve the HTTP response code, add your dissect pattern to a
- Painless script that extracts the `response` value. To extract values from a
- field, use this function:
- [source,painless]
- ----
- .extract(doc["<field_name>"].value)?.<field_value>
- ----
- In this example, `message` is the `<field_name>` and `response` is the
- `<field_value>`:
- [source,console]
- ----
- POST /_scripts/painless/_execute
- {
- "script": {
- "source": """
- String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
- if (response != null) emit(Integer.parseInt(response)); <1>
- """
- },
- "context": "long_field", <2>
- "context_setup": {
- "index": "my-index",
- "document": { <3>
- "message": """247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0"""
- }
- }
- }
- ----
- // TEST[continued]
- <1> Runtime fields require the `emit` method to return values.
- <2> Because the response code is an integer, use the `long_field` context.
- <3> Include a sample document that matches your data.
- The result includes the HTTP response code:
- [source,console-result]
- ----
- {
- "result" : [
- 304
- ]
- }
- ----
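The null-safe `?.` operator and the `if (response != null)` guard in the script matter when a document does not match the pattern. The following Python sketch models that control flow only. The `run_script` function is hypothetical, and it assumes that `extract` yields no result (modeled here as `None`) for a non-matching string.

```python
def run_script(extracted):
    """Model of the script body: return the list of values it would emit.

    `extracted` stands in for the result of dissect(...).extract(...):
    a dict of variable names to strings, or None when the pattern did
    not match (an assumption of this sketch).
    """
    emitted = []
    # Equivalent of `?.response`: produce None instead of failing on None.
    response = None if extracted is None else extracted.get("response")
    # Equivalent of `if (response != null) emit(Integer.parseInt(response));`
    if response is not None:
        emitted.append(int(response))
    return emitted


matched = run_script({"response": "304", "size": "0"})
unmatched = run_script(None)
print(matched)    # [304]
print(unmatched)  # []
```

A document that does not match simply has no value for the field, rather than causing the script to fail.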
- [[dissect-patterns-runtime]]
- ==== Use dissect patterns and scripts in runtime fields
- If you have a functional dissect pattern, you can add it to a runtime field to
- manipulate data. Because runtime fields are evaluated at search time and don't
- require you to index fields, you can modify the script and how it functions
- without reindexing. If you already <<dissect-patterns-test,tested your dissect pattern>>
- using the Painless execute API, you can use that _exact_ Painless script in your
- runtime field.
- To start, add the `message` field as a `wildcard` type like in the previous
- section, but also add `@timestamp` as a `date` in case you want to operate on
- that field for <<common-script-uses,other use cases>>:
- [source,console]
- ----
- PUT /my-index/
- {
- "mappings": {
- "properties": {
- "@timestamp": {
- "format": "strict_date_optional_time||epoch_second",
- "type": "date"
- },
- "message": {
- "type": "wildcard"
- }
- }
- }
- }
- ----
- If you want to extract the HTTP response code using your dissect pattern, you
- can create a runtime field like `http.response`:
- [source,console]
- ----
- PUT my-index/_mappings
- {
- "runtime": {
- "http.response": {
- "type": "long",
- "script": """
- String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
- if (response != null) emit(Integer.parseInt(response));
- """
- }
- }
- }
- ----
- // TEST[continued]
- After mapping the fields you want to retrieve, index a few records from
- your log data into {es}. The following request uses the <<docs-bulk,bulk API>>
- to index raw log data into `my-index`:
- [source,console]
- ----
- POST /my-index/_bulk?refresh=true
- {"index":{}}
- {"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
- {"index":{}}
- {"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
- {"index":{}}
- {"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
- {"index":{}}
- {"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
- {"index":{}}
- {"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
- {"index":{}}
- {"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
- {"index":{}}
- {"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
- ----
- // TEST[continued]
- You can define a simple query to run a search for a specific HTTP response and
- return all related fields. Use the `fields` parameter of the search API to
- retrieve the `http.response` runtime field.
- [source,console]
- ----
- GET my-index/_search
- {
- "query": {
- "match": {
- "http.response": "304"
- }
- },
- "fields" : ["http.response"]
- }
- ----
- // TEST[continued]
- Alternatively, you can define the same runtime field in the context of a
- search request. The runtime definition and the script are exactly the same as
- those defined previously in the index mapping. Copy that definition into the
- `runtime_mappings` section of the search request and include a query that
- matches on the runtime field. This query returns the same results as the
- previous search against the `http.response` runtime field in your index
- mappings, but the field exists only in the context of this specific search:
- [source,console]
- ----
- GET my-index/_search
- {
- "runtime_mappings": {
- "http.response": {
- "type": "long",
- "script": """
- String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
- if (response != null) emit(Integer.parseInt(response));
- """
- }
- },
- "query": {
- "match": {
- "http.response": "304"
- }
- },
- "fields" : ["http.response"]
- }
- ----
- // TEST[continued]
- // TEST[s/_search/_search\?filter_path=hits/]
- [source,console-result]
- ----
- {
- "hits" : {
- "total" : {
- "value" : 1,
- "relation" : "eq"
- },
- "max_score" : 1.0,
- "hits" : [
- {
- "_index" : "my-index",
- "_id" : "D47UqXkBByC8cgZrkbOm",
- "_score" : 1.0,
- "_source" : {
- "timestamp" : "2020-04-30T14:31:22-05:00",
- "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
- },
- "fields" : {
- "http.response" : [
- 304
- ]
- }
- }
- ]
- }
- }
- ----
- // TESTRESPONSE[s/"_id" : "D47UqXkBByC8cgZrkbOm"/"_id": $body.hits.hits.0._id/]
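Because the runtime field definition is identical in the index mapping and in the search request, a client can keep it in one place and reuse it for both request bodies. The following Python sketch only builds the JSON bodies shown in this section. The variable names are illustrative, no {es} client library is used, and sending the requests is left out.

```python
import json

# Dissect pattern and Painless script, as used throughout this section.
PATTERN = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')
SCRIPT = (
    "String response=dissect('" + PATTERN + "')"
    '.extract(doc["message"].value)?.response;\n'
    "if (response != null) emit(Integer.parseInt(response));"
)

# A single definition of the runtime field.
http_response = {"type": "long", "script": SCRIPT}

# Body for `PUT my-index/_mappings`: the field becomes part of the index mapping.
mapping_body = {"runtime": {"http.response": http_response}}

# Body for `GET my-index/_search`: the field exists only for this search.
search_body = {
    "runtime_mappings": {"http.response": http_response},
    "query": {"match": {"http.response": "304"}},
    "fields": ["http.response"],
}

print(json.dumps(search_body, indent=2))
```

Keeping one definition avoids the two copies of the script drifting apart while you iterate on the pattern.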