
[role="xpack"]
[[ml-configuring-transform]]
=== Transforming data with script fields

If you use {dfeeds}, you can add scripts to transform your data before
it is analyzed. {dfeeds-cap} contain an optional `script_fields` property, where
you can specify scripts that evaluate custom expressions and return script
fields.

If your {dfeed} defines script fields, you can use those fields in your job.
For example, you can use the script fields in the analysis functions in one or
more detectors.

* <<ml-configuring-transform1>>
* <<ml-configuring-transform2>>
* <<ml-configuring-transform3>>
* <<ml-configuring-transform4>>
* <<ml-configuring-transform5>>
* <<ml-configuring-transform6>>
* <<ml-configuring-transform7>>
* <<ml-configuring-transform8>>
* <<ml-configuring-transform9>>

The following indices APIs create and add content to an index that is used in
subsequent examples:

[source,js]
----------------------------------
PUT /my_index
{
  "mappings":{
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "aborted_count": {
        "type": "long"
      },
      "another_field": {
        "type": "keyword" <1>
      },
      "clientip": {
        "type": "keyword"
      },
      "coords": {
        "properties": {
          "lat": {
            "type": "keyword"
          },
          "lon": {
            "type": "keyword"
          }
        }
      },
      "error_count": {
        "type": "long"
      },
      "query": {
        "type": "keyword"
      },
      "some_field": {
        "type": "keyword"
      },
      "tokenstring1":{
        "type":"keyword"
      },
      "tokenstring2":{
        "type":"keyword"
      },
      "tokenstring3":{
        "type":"keyword"
      }
    }
  }
}

PUT /my_index/_doc/1
{
  "@timestamp":"2017-03-23T13:00:00",
  "error_count":36320,
  "aborted_count":4156,
  "some_field":"JOE",
  "another_field":"SMITH ",
  "tokenstring1":"foo-bar-baz",
  "tokenstring2":"foo bar baz",
  "tokenstring3":"foo-bar-19",
  "query":"www.ml.elastic.co",
  "clientip":"123.456.78.900",
  "coords": {
    "lat" : 41.44,
    "lon":90.5
  }
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]

<1> In this example, string fields are mapped as `keyword` fields to support
aggregation. If you want both a full text (`text`) and a keyword (`keyword`)
version of the same field, use multi-fields. For more information, see
{ref}/multi-fields.html[fields].
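As a sketch of that multi-field approach (the index name `my_index2` here is
illustrative and not used by the later examples), a mapping that provides both
versions of the same string field might look like this:

[source,js]
----------------------------------
PUT /my_index2
{
  "mappings": {
    "properties": {
      "another_field": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]

With this mapping, `another_field` supports full text search while
`another_field.keyword` can be used in aggregations and in script fields.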
[[ml-configuring-transform1]]
.Example 1: Adding two numerical fields
[source,js]
----------------------------------
PUT _ml/anomaly_detectors/test1
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"mean",
        "field_name": "total_error_count", <1>
        "detector_description": "Custom script field transformation"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test1
{
  "job_id": "test1",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "total_error_count": { <2>
      "script": {
        "lang": "expression",
        "source": "doc['error_count'].value + doc['aborted_count'].value"
      }
    }
  }
}
----------------------------------
// CONSOLE
// TEST[skip:needs-licence]

<1> A script field named `total_error_count` is referenced in the detector
within the job.
<2> The script field is defined in the {dfeed}.

This `test1` job contains a detector that uses a script field in a mean analysis
function. The `datafeed-test1` {dfeed} defines the script field. It contains a
script that adds two fields in the document to produce a "total" error count.

The syntax for the `script_fields` property is identical to that used by {es}.
For more information, see {ref}/search-request-script-fields.html[Script Fields].

You can preview the contents of the {dfeed} by using the following API:

[source,js]
----------------------------------
GET _ml/datafeeds/datafeed-test1/_preview
----------------------------------
// CONSOLE
// TEST[skip:continued]

In this example, the API returns the following results, which contain a sum of
the `error_count` and `aborted_count` values:

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "total_error_count": 40476
  }
]
----------------------------------

NOTE: This example demonstrates how to use script fields, but it contains
insufficient data to generate meaningful results. For a full demonstration of
how to create jobs with sample data, see <<ml-getting-started>>.

You can alternatively use {kib} to create an advanced job that uses script
fields. To add the `script_fields` property to your {dfeed}, you must use the
**Edit JSON** tab. For example:

[role="screenshot"]
image::images/ml-scriptfields.jpg[Adding script fields to a {dfeed} in {kib}]

[[ml-configuring-transform-examples]]
==== Common script field examples

While the possibilities are limitless, there are a number of common scenarios
where you might use script fields in your {dfeeds}.

[NOTE]
===============================
Some of these examples use regular expressions. By default, regular
expressions are disabled because they circumvent the protection that Painless
provides against long running and memory hungry scripts. For more information,
see {ref}/modules-scripting-painless.html[Painless Scripting Language].

Machine learning analysis is case sensitive. For example, "John" is considered
to be different from "john". This is one reason you might consider using scripts
that convert your strings to upper or lowercase letters.
===============================
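To run the regular expression examples below on your own cluster, Painless
regex support must be enabled first. This is controlled by a static node
setting, which you can add to `elasticsearch.yml` (a restart of each node is
required for it to take effect):

[source,yaml]
----------------------------------
script.painless.regex.enabled: true
----------------------------------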
[[ml-configuring-transform2]]
.Example 2: Concatenating strings
[source,js]
--------------------------------------------------
PUT _ml/anomaly_detectors/test2
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"low_info_content",
        "field_name":"my_script_field", <1>
        "detector_description": "Custom script field transformation"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test2
{
  "job_id": "test2",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['some_field'].value + '_' + doc['another_field'].value" <2>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:needs-licence]

<1> The script field has a rather generic name in this case, since it will
be used for various tests in the subsequent examples.
<2> The script field uses the plus (+) operator to concatenate strings.

The preview {dfeed} API returns the following results, which show that "JOE"
and "SMITH " have been concatenated and an underscore was added:

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "JOE_SMITH "
  }
]
----------------------------------
[[ml-configuring-transform3]]
.Example 3: Trimming strings
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['another_field'].value.trim()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]

<1> This script field uses the `trim()` function to trim extra white space from a
string.

The preview {dfeed} API returns the following results, which show that "SMITH "
has been trimmed to "SMITH":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "SMITH"
  }
]
----------------------------------
[[ml-configuring-transform4]]
.Example 4: Converting strings to lowercase
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['some_field'].value.toLowerCase()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]

<1> This script field uses the `toLowerCase()` function to convert a string to all
lowercase letters. Likewise, you can use the `toUpperCase()` function to convert
a string to uppercase letters.

The preview {dfeed} API returns the following results, which show that "JOE"
has been converted to "joe":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "joe"
  }
]
----------------------------------
[[ml-configuring-transform5]]
.Example 5: Converting strings to mixed case formats
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "doc['some_field'].value.substring(0, 1).toUpperCase() + doc['some_field'].value.substring(1).toLowerCase()" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]

<1> This script field is a more complicated example of case manipulation. It uses
the `substring()` function to capitalize the first letter of a string and
converts the remaining characters to lowercase.

The preview {dfeed} API returns the following results, which show that "JOE"
has been converted to "Joe":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "Joe"
  }
]
----------------------------------
[[ml-configuring-transform6]]
.Example 6: Replacing tokens
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "/\\s/.matcher(doc['tokenstring2'].value).replaceAll('_')" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]

<1> This script field uses regular expressions to replace white
space with underscores.

The preview {dfeed} API returns the following results, which show that
"foo bar baz" has been converted to "foo_bar_baz":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "foo_bar_baz"
  }
]
----------------------------------
[[ml-configuring-transform7]]
.Example 7: Regular expression matching and concatenation
[source,js]
--------------------------------------------------
POST _ml/datafeeds/datafeed-test2/_update
{
  "script_fields": {
    "my_script_field": {
      "script": {
        "lang": "painless",
        "source": "def m = /(.*)-bar-([0-9][0-9])/.matcher(doc['tokenstring3'].value); return m.find() ? m.group(1) + '_' + m.group(2) : '';" <1>
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test2/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:continued]

<1> This script field looks for a specific regular expression pattern and emits the
matched groups as a concatenated string. If no match is found, it emits an empty
string.

The preview {dfeed} API returns the following results, which show that
"foo-bar-19" has been converted to "foo_19":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_script_field": "foo_19"
  }
]
----------------------------------
[[ml-configuring-transform8]]
.Example 8: Splitting strings by domain name
[source,js]
--------------------------------------------------
PUT _ml/anomaly_detectors/test3
{
  "description":"DNS tunneling",
  "analysis_config":{
    "bucket_span": "30m",
    "influencers": ["clientip","hrd"],
    "detectors":[
      {
        "function":"high_info_content",
        "field_name": "sub",
        "over_field_name": "hrd",
        "exclude_frequent":"all"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test3
{
  "job_id": "test3",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields":{
    "sub":{
      "script":"return domainSplit(doc['query'].value).get(0);"
    },
    "hrd":{
      "script":"return domainSplit(doc['query'].value).get(1);"
    }
  }
}

GET _ml/datafeeds/datafeed-test3/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:needs-licence]

If you have a single field that contains a well-formed DNS domain name, you can
use the `domainSplit()` function to split the string into its highest registered
domain and the subdomain, which is everything to the left of the highest
registered domain. For example, the highest registered domain of
`www.ml.elastic.co` is `elastic.co` and the subdomain is `www.ml`. The
`domainSplit()` function returns an array of two values: the first value is the
subdomain; the second value is the highest registered domain.

The preview {dfeed} API returns the following results, which show that
"www.ml.elastic.co" has been split into "elastic.co" and "www.ml":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "clientip.keyword": "123.456.78.900",
    "hrd": "elastic.co",
    "sub": "www.ml"
  }
]
----------------------------------
[[ml-configuring-transform9]]
.Example 9: Transforming geo_point data
[source,js]
--------------------------------------------------
PUT _ml/anomaly_detectors/test4
{
  "analysis_config":{
    "bucket_span": "10m",
    "detectors":[
      {
        "function":"lat_long",
        "field_name": "my_coordinates"
      }
    ]
  },
  "data_description": {
    "time_field":"@timestamp",
    "time_format":"epoch_ms"
  }
}

PUT _ml/datafeeds/datafeed-test4
{
  "job_id": "test4",
  "indices": ["my_index"],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "my_coordinates": {
      "script": {
        "source": "doc['coords.lat'].value + ',' + doc['coords.lon'].value",
        "lang": "painless"
      }
    }
  }
}

GET _ml/datafeeds/datafeed-test4/_preview
--------------------------------------------------
// CONSOLE
// TEST[skip:needs-licence]

In {es}, location data can be stored in `geo_point` fields but this data type is
not supported natively in {ml} analytics. This example uses a script field to
transform the data into an appropriate format. For more information,
see <<ml-geo-functions>>.

The preview {dfeed} API returns the following results, which show that
`41.44` and `90.5` have been combined into "41.44,90.5":

[source,js]
----------------------------------
[
  {
    "@timestamp": 1490274000000,
    "my_coordinates": "41.44,90.5"
  }
]
----------------------------------