[role="xpack"]
[[ml-configuring-aggregation]]
= Aggregating data for faster performance

When you aggregate data, {es} automatically distributes the calculations across
your cluster. You can then feed this aggregated data into the {ml-features}
instead of the raw results, which reduces the volume of data that must be
analyzed.

[discrete]
[[aggs-requs-dfeeds]]
== Requirements

There are a number of requirements for using aggregations in {dfeeds}.

[discrete]
[[aggs-aggs]]
=== Aggregations

* Your aggregation must include a `date_histogram` aggregation or a top level
`composite` aggregation, which in turn must contain a `max` aggregation on the
time field. This ensures that the aggregated data is a time series and the
timestamp of each bucket is the time of the last record in the bucket.
* The `time_zone` parameter in the date histogram aggregation must be set to
`UTC`, which is the default value.
* The name of the aggregation and the name of the field that it operates on
must match. For example, if you use a `max` aggregation on a time field called
`responsetime`, the name of the aggregation must also be `responsetime` (see
the sketch after this list).
* For `composite` aggregation support, there must be exactly one
`date_histogram` value source. That value source must not be sorted in
descending order. Additional `composite` aggregation value sources are allowed,
such as `terms`.
* The `size` parameter of the non-composite aggregations must match the
cardinality of your data. A greater value of the `size` parameter increases the
memory requirement of the aggregation.
* If you set the `summary_count_field_name` property to a non-null value, the
{anomaly-job} expects to receive aggregated input. The property must be set to
the name of the field that contains the count of raw data points that have been
aggregated. It applies to all detectors in the job.
* The influencers or the partition fields must be included in the aggregation
of your {dfeed}, otherwise they are not included in the job analysis. For more
information on influencers, refer to <<ml-ad-influencers>>.
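
For example, a minimal sketch of the naming requirement, assuming a `max`
aggregation on a time field called `responsetime`:

[source,js]
----------------------------------
"aggregations": {
  "responsetime": { <1>
    "max": {
      "field": "responsetime" <2>
    }
  }
}
----------------------------------
// NOTCONSOLE
<1> The name of the aggregation...
<2> ...matches the name of the field it operates on.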

[discrete]
[[aggs-interval]]
=== Intervals

* The bucket span of your {anomaly-job} must be divisible by the value of the
`calendar_interval` or `fixed_interval` in your aggregation (with no
remainder); see the sketch after this list.
* If you specify a `frequency` for your {dfeed}, it must be divisible by the
`calendar_interval` or the `fixed_interval`.
* {anomaly-jobs-cap} cannot use `date_histogram` or `composite` aggregations
with an interval measured in months because the length of the month is not
fixed; they can use weeks or smaller units.
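
For example, a bucket span of `15m` (900 seconds) is compatible with a
`fixed_interval` of `300s`, since 900 / 300 = 3 with no remainder. A minimal
sketch of such a pairing (the `time` field name is illustrative):

[source,js]
----------------------------------
"analysis_config": {
  "bucket_span": "15m" <1>
},
...
"date_histogram": {
  "field": "time",
  "fixed_interval": "300s" <2>
}
----------------------------------
// NOTCONSOLE
<1> 900 seconds.
<2> 900 / 300 = 3, so the bucket span is evenly divisible by the interval.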

[discrete]
[[aggs-limits-dfeeds]]
== Limitations

* If your <<aggs-dfeeds,{dfeed} uses aggregations with nested `terms` aggs>>
and model plot is not enabled for the {anomaly-job}, neither the
**Single Metric Viewer** nor the **Anomaly Explorer** can plot and display an
anomaly chart. In these cases, an explanatory message is shown instead of the
chart.
* Your {dfeed} can contain multiple aggregations, but only the ones with names
that match values in the job configuration are fed to the job.

[discrete]
[[aggs-recommendations-dfeeds]]
== Recommendations

* When your detectors use <<ml-metric-functions,metric>> or
<<ml-sum-functions,sum>> analytical functions, it's recommended to set the
`date_histogram` or `composite` aggregation interval to a tenth of the bucket
span. This creates finer, more granular time buckets, which are ideal for this
type of analysis.
* When your detectors use <<ml-count-functions,count>> or
<<ml-rare-functions,rare>> functions, set the interval to the same value as the
bucket span.
* If you have multiple influencers or partition fields or if your field
cardinality is more than 1000, use
{ref}/search-aggregations-bucket-composite-aggregation.html[composite aggregations].
+
--
To determine the cardinality of your data, you can run searches such as the
following (a sizing sketch based on the result follows this list):

[source,js]
--------------------------------------------------
GET .../_search
{
  "aggs": {
    "service_cardinality": {
      "cardinality": {
        "field": "service"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
--
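
If a cardinality search like the one above reports, say, a value below 100, a
`terms` aggregation with `"size": 100` covers every value with some headroom.
A hypothetical sketch:

[source,js]
--------------------------------------------------
"service": {
  "terms": {
    "field": "service",
    "size": 100
  }
}
--------------------------------------------------
// NOTCONSOLE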

[discrete]
[[aggs-using-date-histogram]]
== Including aggregations in {anomaly-jobs}

When you create or update an {anomaly-job}, you can include aggregated fields
in the analysis configuration. In the {dfeed} configuration object, you can
define the aggregations.

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/kibana-sample-data-flights
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime", <1>
      "by_field_name": "airline" <1>
    }],
    "summary_count_field_name": "doc_count" <2>
  },
  "data_description": {
    "time_field":"time" <1>
  },
  "datafeed_config":{
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "time",
          "fixed_interval": "360s",
          "time_zone": "UTC"
        },
        "aggregations": {
          "time": { <3>
            "max": {"field": "time"}
          },
          "airline": { <4>
            "terms": {
              "field": "airline",
              "size": 100
            },
            "aggregations": {
              "responsetime": { <5>
                "avg": {
                  "field": "responsetime"
                }
              }
            }
          }
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup:farequote_data]
<1> The `airline`, `responsetime`, and `time` fields are aggregations. Only the
aggregated fields defined in the `analysis_config` object are analyzed by the
{anomaly-job}.
<2> The `summary_count_field_name` property is set to the `doc_count` field,
which is an aggregated field that contains the count of the aggregated data
points.
<3> The aggregations have names that match the fields that they operate on. The
`max` aggregation is named `time` and its field also needs to be `time`.
<4> The `terms` aggregation is named `airline` and its field is also named
`airline`.
<5> The `avg` aggregation is named `responsetime` and its field is also named
`responsetime`.

Use the following format to define a `date_histogram` aggregation to bucket by
time in your {dfeed}:

[source,js]
----------------------------------
"aggregations": {
  ["bucketing_aggregation": {
    "bucket_agg": {
      ...
    },
    "aggregations": {]
      "date_histogram_aggregation": {
        "date_histogram": {
          "field": "time"
        },
        "aggregations": {
          "timestamp": {
            "max": {
              "field": "time"
            }
          }
          [,"<first_term>": {
            "terms":{...
            }
            [,"aggregations" : {
              [<sub_aggregation>]+
            } ]
          }]
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE

[discrete]
[[aggs-using-composite]]
== Composite aggregations

Composite aggregations are optimized for queries that are either `match_all`
or `range` filters; use composite aggregations in your {dfeeds} for these
cases. Other types of queries may cause the `composite` aggregation to be
inefficient.

The following is an example of a job with a {dfeed} that uses a `composite`
aggregation to bucket the metrics based on time and terms:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/kibana-sample-data-flights-composite
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field":"time"
  },
  "datafeed_config":{
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "composite": {
          "size": 1000, <1>
          "sources": [
            {
              "time_bucket": { <2>
                "date_histogram": {
                  "field": "time",
                  "fixed_interval": "360s",
                  "time_zone": "UTC"
                }
              }
            },
            {
              "airline": { <3>
                "terms": {
                  "field": "airline"
                }
              }
            }
          ]
        },
        "aggregations": {
          "time": { <4>
            "max": {
              "field": "time"
            }
          },
          "responsetime": { <5>
            "avg": {
              "field": "responsetime"
            }
          }
        }
      }
    }
  }
}
----------------------------------
<1> The number of composite buckets to return in each search response. A larger
`size` means a faster {dfeed}, but more cluster resources are used when
searching.
<2> The required `date_histogram` composite aggregation source. Make sure it
is named differently than your desired time field.
<3> Instead of using a regular `terms` aggregation, add a composite
aggregation `terms` source with the name `airline`. Note that its name
is the same as the field.
<4> The required `max` aggregation whose name is the time field in the
job analysis config.
<5> The `avg` aggregation is named `responsetime` and its field is also named
`responsetime`.
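
If your {dfeed} also defines a query, keep it to `match_all` or a `range`
filter when you use a `composite` aggregation, as explained above. For
example, a hypothetical `range` filter on the `time` field:

[source,js]
----------------------------------
"query": {
  "range": {
    "time": {
      "gte": "now-30d"
    }
  }
}
----------------------------------
// NOTCONSOLE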

Use the following format to define a composite aggregation in your {dfeed}:

[source,js]
----------------------------------
"aggregations": {
  "composite_agg": {
    "sources": [
      {
        "date_histogram_agg": {
          "date_histogram": {
            "field": "time",
            ...settings...
          }
        }
      },
      ...other valid sources...
    ],
    ...composite agg settings...,
    "aggregations": {
      "timestamp": {
        "max": {
          "field": "time"
        }
      },
      ...other aggregations...
      [,"aggregations" : {
        [<sub_aggregation>]+
      } ]
    }
  }
}
----------------------------------
// NOTCONSOLE

[discrete]
[[aggs-dfeeds]]
== Nested aggregations

You can also use complex nested aggregations in {dfeeds}. The next example uses
the
{ref}/search-aggregations-pipeline-derivative-aggregation.html[`derivative` pipeline aggregation]
to find the first order derivative of the counter `system.network.out.bytes`
for each value of the field `beat.name`.

NOTE: `derivative` or other pipeline aggregations may not work within
`composite` aggregations. See
{ref}/search-aggregations-bucket-composite-aggregation.html#search-aggregations-bucket-composite-aggregation-pipeline-aggregations[composite aggregations and pipeline aggregations].

[source,js]
----------------------------------
"aggregations": {
  "beat.name": {
    "terms": {
      "field": "beat.name"
    },
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "5m"
        },
        "aggregations": {
          "@timestamp": {
            "max": {
              "field": "@timestamp"
            }
          },
          "bytes_out_average": {
            "avg": {
              "field": "system.network.out.bytes"
            }
          },
          "bytes_out_derivative": {
            "derivative": {
              "buckets_path": "bytes_out_average"
            }
          }
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE

[discrete]
[[aggs-single-dfeeds]]
== Single bucket aggregations

You can also use single bucket aggregations in {dfeeds}. The following example
shows two `filter` aggregations, each counting the number of entries for the
`error` field on a different server.

[source,js]
----------------------------------
{
  "job_id":"servers-unique-errors",
  "indices": ["logs-*"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "time",
        "fixed_interval": "360s",
        "time_zone": "UTC"
      },
      "aggregations": {
        "time": {
          "max": {"field": "time"}
        },
        "server1": {
          "filter": {"term": {"source": "server-name-1"}},
          "aggregations": {
            "server1_error_count": {
              "value_count": {
                "field": "error"
              }
            }
          }
        },
        "server2": {
          "filter": {"term": {"source": "server-name-2"}},
          "aggregations": {
            "server2_error_count": {
              "value_count": {
                "field": "error"
              }
            }
          }
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE
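
Because only aggregations whose names match values in the job configuration are
fed to the job, the detectors for this {dfeed} would reference the aggregation
names above. A hypothetical `analysis_config` fragment (the `high_sum` function
choice is illustrative):

[source,js]
----------------------------------
"detectors": [
  {
    "function": "high_sum",
    "field_name": "server1_error_count"
  },
  {
    "function": "high_sum",
    "field_name": "server2_error_count"
  }
]
----------------------------------
// NOTCONSOLE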

[discrete]
[[aggs-amd-dfeeds]]
== Using `aggregate_metric_double` field type in {dfeeds}

NOTE: It is not currently possible to use `aggregate_metric_double` type fields
in {dfeeds} without aggregations.

You can use fields with the
{ref}/aggregate-metric-double.html[`aggregate_metric_double`] field type in a
{dfeed} with aggregations. You must retrieve the `value_count` of the
`aggregate_metric_double` field in an aggregation and then use it as the
`summary_count_field_name` to provide the correct count that represents the
aggregation value.

In the following example, `presum` is an `aggregate_metric_double` type field
that has all the possible metrics: `[ min, max, sum, value_count ]`. To use an
`avg` aggregation on this field, you need to perform a `value_count`
aggregation on `presum` and then set the field that contains the aggregated
values, `my_count`, as the `summary_count_field_name`:

[source,js]
----------------------------------
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "function": "avg",
        "field_name": "my_avg"
      }
    ],
    "summary_count_field_name": "my_count" <1>
  },
  "data_description": {
    "time_field": "timestamp"
  },
  "datafeed_config": {
    "indices": [
      "my_index"
    ],
    "datafeed_id": "datafeed-id",
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "timestamp",
          "fixed_interval": "360s",
          "time_zone": "UTC"
        },
        "aggregations": {
          "timestamp": {
            "max": {"field": "timestamp"}
          },
          "my_avg": { <2>
            "avg": {
              "field": "presum"
            }
          },
          "my_count": { <3>
            "value_count": {
              "field": "presum"
            }
          }
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE
<1> The field `my_count` is set as the `summary_count_field_name`. This field
contains aggregated values from the `presum` `aggregate_metric_double` type
field (refer to footnote 3).
<2> The `avg` aggregation to use on the `presum` `aggregate_metric_double`
type field.
<3> The `value_count` aggregation on the `presum` `aggregate_metric_double`
type field. This aggregated field must be set as the `summary_count_field_name`
(refer to footnote 1) to make it possible to use the `aggregate_metric_double`
type field in another aggregation.