[role="xpack"]
[[ml-configuring-transform]]
= Transforming data with script fields

If you use {dfeeds}, you can add scripts to transform your data before
it is analyzed. {dfeeds-cap} contain an optional `script_fields` property, where
you can specify scripts that evaluate custom expressions and return script
fields.

If your {dfeed} defines script fields, you can use those fields in your
{anomaly-job}. For example, you can use the script fields in the analysis
functions in one or more detectors.

* <<ml-configuring-transform1>>
* <<ml-configuring-transform2>>
* <<ml-configuring-transform3>>
* <<ml-configuring-transform4>>
* <<ml-configuring-transform5>>
* <<ml-configuring-transform6>>
* <<ml-configuring-transform7>>
* <<ml-configuring-transform8>>
* <<ml-configuring-transform9>>

The following index APIs create and add content to an index that is used in
subsequent examples:
[source,console]
----------------------------------
PUT /my-index-000001
{
  "mappings":{
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "aborted_count": {
        "type": "long"
      },
      "another_field": {
        "type": "keyword" <1>
      },
      "clientip": {
        "type": "keyword"
      },
      "coords": {
        "properties": {
          "lat": {
            "type": "keyword"
          },
          "lon": {
            "type": "keyword"
          }
        }
      },
      "error_count": {
        "type": "long"
      },
      "query": {
        "type": "keyword"
      },
      "some_field": {
        "type": "keyword"
      },
      "tokenstring1":{
        "type":"keyword"
      },
      "tokenstring2":{
        "type":"keyword"
      },
      "tokenstring3":{
        "type":"keyword"
      }
    }
  }
}

PUT /my-index-000001/_doc/1
{
  "@timestamp":"2017-03-23T13:00:00",
  "error_count":36320,
  "aborted_count":4156,
  "some_field":"JOE",
  "another_field":"SMITH ",
  "tokenstring1":"foo-bar-baz",
  "tokenstring2":"foo bar baz",
  "tokenstring3":"foo-bar-19",
  "query":"www.ml.elastic.co",
  "clientip":"123.456.78.900",
  "coords": {
    "lat" : 41.44,
    "lon":90.5
  }
}
----------------------------------
// TEST[skip:SETUP]

<1> In this example, string fields are mapped as `keyword` fields to support
aggregation. If you want both a full text (`text`) and a keyword (`keyword`)
version of the same field, use multi-fields. For more information, see
{ref}/multi-fields.html[fields].
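
As a sketch of that multi-field approach (the index name and field layout here
are illustrative only and are not used by the examples that follow), you can map
one string both ways:

[source,console]
----------------------------------
PUT /my-other-index
{
  "mappings": {
    "properties": {
      "some_field": {
        "type": "text", <1>
        "fields": {
          "keyword": {
            "type": "keyword" <2>
          }
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:illustration only]

<1> The top-level `text` version supports full-text search.
<2> The `some_field.keyword` sub-field supports aggregations and script fields.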
[[ml-configuring-transform1]]
.Example 1: Adding two numerical fields
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/test1
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"mean",
        "field_name": "total_error_count", <1>
        "detector_description": "Custom script field transformation"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test1
{
  "job_id": "test1",
  "indices": ["my-index-000001"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "total_error_count": { <2>
      "script": {
        "lang": "expression",
        "source": "doc['error_count'].value + doc['aborted_count'].value"
      }
    }
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> A script field named `total_error_count` is referenced in the detector
within the job.
<2> The script field is defined in the {dfeed}.

This `test1` {anomaly-job} contains a detector that uses a script field in a
mean analysis function. The `datafeed-test1` {dfeed} defines the script field.
It contains a script that adds two fields in the document to produce a "total"
error count.

The syntax for the `script_fields` property is identical to that used by {es}.
For more information, see
{ref}/search-fields.html#script-fields[Script fields].
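
Because the syntax is shared, you can try a script field in an ordinary search
request before attaching it to a {dfeed}. For example, the following
(illustrative) search evaluates the same expression:

[source,console]
----------------------------------
GET my-index-000001/_search
{
  "script_fields": {
    "total_error_count": {
      "script": {
        "lang": "expression",
        "source": "doc['error_count'].value + doc['aborted_count'].value"
      }
    }
  }
}
----------------------------------
// TEST[skip:illustration only]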
You can preview the contents of the {dfeed} by using the following API:

[source,console]
----------------------------------
GET _ml/datafeeds/datafeed-test1/_preview
----------------------------------
// TEST[skip:continued]

In this example, the API returns the following results, which contain a sum of
the `error_count` and `aborted_count` values:

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "total_error_count": 40476
  }
]
----------------------------------

NOTE: This example demonstrates how to use script fields, but it contains
insufficient data to generate meaningful results.

//For a full demonstration of
//how to create jobs with sample data, see <<ml-getting-started>>.

You can alternatively use {kib} to create an advanced {anomaly-job} that uses
script fields. To add the `script_fields` property to your {dfeed}, you must use
the **Edit JSON** tab. For example:

[role="screenshot"]
image::images/ml-scriptfields.jpg[Adding script fields to a {dfeed} in {kib}]
[[ml-configuring-transform-examples]]
== Common script field examples

While the possibilities are limitless, there are a number of common scenarios
where you might use script fields in your {dfeeds}.

[NOTE]
===============================
Some of these examples use regular expressions. By default, regular
expressions are disabled because they circumvent the protection that Painless
provides against long-running and memory-hungry scripts. For more information,
see {ref}/modules-scripting-painless.html[Painless scripting language].

Machine learning analysis is case sensitive. For example, "John" is considered
to be different from "john". This is one reason you might consider using scripts
that convert your strings to upper or lowercase letters.
===============================
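
If you want to run the regular expression examples below, regexes must be
enabled first. At the time of writing this is controlled by the static
`script.painless.regex.enabled` node setting; check the Painless documentation
for your version, because the accepted values can differ between releases. A
minimal sketch:

[source,yaml]
----------------------------------
# elasticsearch.yml on each node (static setting; requires a node restart)
script.painless.regex.enabled: true
----------------------------------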
[[ml-configuring-transform2]]
.Example 2: Concatenating strings
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/test2
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"low_info_content",
        "field_name":"my_script_field", <1>
        "detector_description": "Custom script field transformation"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test2
{
  "job_id": "test2",
  "indices": ["my-index-000001"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['some_field'].value + '_' + doc['another_field'].value" <2>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// TEST[skip:needs-licence]

<1> The script field has a rather generic name in this case, since it will
be used for various tests in the subsequent examples.
<2> The script field uses the plus (+) operator to concatenate strings.

The preview {dfeed} API returns the following results, which show that "JOE"
and "SMITH " have been concatenated and an underscore was added:

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "JOE_SMITH "
  }
]
----------------------------------
[[ml-configuring-transform3]]
.Example 3: Trimming strings
[source,console]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['another_field'].value.trim()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// TEST[skip:continued]

<1> This script field uses the `trim()` function to trim extra white space from a
string.

The preview {dfeed} API returns the following results, which show that "SMITH "
has been trimmed to "SMITH":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "SMITH"
  }
]
----------------------------------
[[ml-configuring-transform4]]
.Example 4: Converting strings to lowercase
[source,console]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['some_field'].value.toLowerCase()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// TEST[skip:continued]

<1> This script field uses the `toLowerCase()` function to convert a string to all
lowercase letters. Likewise, you can use the `toUpperCase()` function to convert
a string to uppercase letters.

The preview {dfeed} API returns the following results, which show that "JOE"
has been converted to "joe":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "joe"
  }
]
----------------------------------
[[ml-configuring-transform5]]
.Example 5: Converting strings to mixed case formats
[source,console]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['some_field'].value.substring(0, 1).toUpperCase() + doc['some_field'].value.substring(1).toLowerCase()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// TEST[skip:continued]

<1> This script field is a more complicated example of case manipulation. It uses
the `substring()` function to capitalize the first letter of a string and
convert the remaining characters to lowercase.

The preview {dfeed} API returns the following results, which show that "JOE"
has been converted to "Joe":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "Joe"
  }
]
----------------------------------
[[ml-configuring-transform6]]
.Example 6: Replacing tokens
[source,console]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "/\\s/.matcher(doc['tokenstring2'].value).replaceAll('_')" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// TEST[skip:continued]

<1> This script field uses regular expressions to replace white
space with underscores.

The preview {dfeed} API returns the following results, which show that
"foo bar baz" has been converted to "foo_bar_baz":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "foo_bar_baz"
  }
]
----------------------------------
[[ml-configuring-transform7]]
.Example 7: Regular expression matching and concatenation
[source,console]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "def m = /(.*)-bar-([0-9][0-9])/.matcher(doc['tokenstring3'].value); return m.find() ? m.group(1) + '_' + m.group(2) : '';" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// TEST[skip:continued]

<1> This script field looks for a specific regular expression pattern and emits the
matched groups as a concatenated string. If no match is found, it emits an empty
string.

The preview {dfeed} API returns the following results, which show that
"foo-bar-19" has been converted to "foo_19":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "foo_19"
  }
]
----------------------------------
[[ml-configuring-transform8]]
.Example 8: Splitting strings by domain name
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/test3
{
  "description":"DNS tunneling",
  "analysis_config":{
    "bucket_span": "30m",
    "influencers": ["clientip","hrd"],
    "detectors":[
      {
        "function":"high_info_content",
        "field_name": "sub",
        "over_field_name": "hrd",
        "exclude_frequent":"all"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test3
{
  "job_id": "test3",
  "indices": ["my-index-000001"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields":{
    "sub":{
      "script":"return domainSplit(doc['query'].value).get(0);"
    },
    "hrd":{
      "script":"return domainSplit(doc['query'].value).get(1);"
    }
  }
}

GET _ml/datafeeds/datafeed-test3/_preview
--------------------------------------------------
// TEST[skip:needs-licence]

If you have a single field that contains a well-formed DNS domain name, you can
use the `domainSplit()` function to split the string into its highest registered
domain and the sub-domain, which is everything to the left of the highest
registered domain. For example, the highest registered domain of
`www.ml.elastic.co` is `elastic.co` and the sub-domain is `www.ml`. The
`domainSplit()` function returns an array of two values: the first value is the
sub-domain; the second value is the highest registered domain.

The preview {dfeed} API returns the following results, which show that
"www.ml.elastic.co" has been split into "elastic.co" and "www.ml":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "clientip.keyword": "123.456.78.900",
    "hrd": "elastic.co",
    "sub": "www.ml"
  }
]
----------------------------------
[[ml-configuring-transform9]]
.Example 9: Transforming geo_point data
[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/test4
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"lat_long",
        "field_name": "my_coordinates"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test4
{
  "job_id": "test4",
  "indices": ["my-index-000001"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "my_coordinates": {
      "script": {
        "source": "doc['coords.lat'].value + ',' + doc['coords.lon'].value",
        "lang": "painless"
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test4/_preview
--------------------------------------------------
// TEST[skip:needs-licence]

In {es}, location data can be stored in `geo_point` fields but this data type is
not supported natively in {ml} analytics. This example of a script field
transforms the data into an appropriate format. For more information,
see <<ml-geo-functions>>.

The preview {dfeed} API returns the following results, which show that
`41.44` and `90.5` have been combined into "41.44,90.5":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_coordinates": "41.44,90.5"
  }
]
----------------------------------