[role="xpack"]
[[ml-configuring-aggregation]]
=== Aggregating data for faster performance

By default, {dfeeds} fetch data from {es} using search and scroll requests.
It can be significantly more efficient, however, to aggregate data in {es}
and to configure your jobs to analyze aggregated data.

One of the benefits of aggregating data this way is that {es} automatically
distributes these calculations across your cluster. You can then feed this
aggregated data into {xpackml} instead of raw results, which
reduces the volume of data that must be considered while detecting anomalies.

There are some limitations to using aggregations in {dfeeds}, however.
Your aggregation must include a `date_histogram` aggregation, which in turn must
contain a `max` aggregation on the time field. This requirement ensures that the
aggregated data is a time series and the timestamp of each bucket is the time
of the last record in the bucket. If you use a terms aggregation and the
cardinality of a term is high, then the aggregation might not be effective and
you might want to just use the default search and scroll behavior.

When you create or update a job, you can include the names of aggregations, for
example:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/farequote
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field":"time"
  }
}
----------------------------------
// CONSOLE
// TEST[skip:setup:farequote_data]

In this example, the `airline`, `responsetime`, and `time` fields are
aggregations.

NOTE: When the `summary_count_field_name` property is set to a non-null value,
the job expects to receive aggregated input. The property must be set to the
name of the field that contains the count of raw data points that have been
aggregated. It applies to all detectors in the job.

The aggregations are defined in the {dfeed} as follows:

[source,js]
----------------------------------
PUT _xpack/ml/datafeeds/datafeed-farequote
{
  "job_id":"farequote",
  "indices": ["farequote"],
  "types": ["response"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "time",
        "interval": "360s",
        "time_zone": "UTC"
      },
      "aggregations": {
        "time": {
          "max": {"field": "time"}
        },
        "airline": {
          "terms": {
            "field": "airline",
            "size": 100
          },
          "aggregations": {
            "responsetime": {
              "avg": {
                "field": "responsetime"
              }
            }
          }
        }
      }
    }
  }
}
----------------------------------
// CONSOLE
// TEST[skip:setup:farequote_job]

In this example, the aggregations have names that match the fields that they
operate on.
That is to say, the `max` aggregation is named `time` and its
field is also `time`. The same is true for the aggregations with the names
`airline` and `responsetime`. Since you must create the job before you can
create the {dfeed}, synchronizing your aggregation and field names can simplify
these configuration steps.

IMPORTANT: If you use a `max` aggregation on a time field, the aggregation name
in the {dfeed} must match the name of the time field, as in the previous example.
For all other aggregations, if the aggregation name doesn't match the field name,
there are limitations in the drill-down functionality within the {ml} page in
{kib}.

{dfeeds-cap} support complex nested aggregations. This example uses the
`derivative` pipeline aggregation to find the first-order derivative of the
counter `system.network.out.bytes` for each value of the field `beat.name`:

[source,js]
----------------------------------
"aggregations": {
  "beat.name": {
    "terms": {
      "field": "beat.name"
    },
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "@timestamp",
          "interval": "5m"
        },
        "aggregations": {
          "@timestamp": {
            "max": {
              "field": "@timestamp"
            }
          },
          "bytes_out_average": {
            "avg": {
              "field": "system.network.out.bytes"
            }
          },
          "bytes_out_derivative": {
            "derivative": {
              "buckets_path": "bytes_out_average"
            }
          }
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE

When you define an aggregation in a {dfeed}, it must have the following form:

[source,js]
----------------------------------
"aggregations": {
  ["bucketing_aggregation": {
    "bucket_agg": {
      ...
    },
    "aggregations": {]
      "date_histogram_aggregation": {
        "date_histogram": {
          "field": "time"
        },
        "aggregations": {
          "timestamp": {
            "max": {
              "field": "time"
            }
          }
          [,"<first_term>": {
            "terms": {...}
            [,"aggregations": {
              [<sub_aggregation>]+
            }]
          }]
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE

The top-level aggregation must be either a
{ref}/search-aggregations-bucket.html[Bucket Aggregation] containing a single
sub-aggregation that is a `date_histogram` aggregation, or the top-level
aggregation must be the required `date_histogram` aggregation itself. There
must be exactly one `date_histogram` aggregation. For more information, see
{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date Histogram Aggregation].

NOTE: The `time_zone` parameter in the date histogram aggregation must be set
to `UTC`, which is the default value.

Each histogram bucket has a key, which is the bucket start time. This key cannot
be used for aggregations in {dfeeds}, however, because they need to know the
time of the latest record within a bucket. Otherwise, when you restart a {dfeed},
it continues from the start time of the histogram bucket and possibly fetches
the same data twice. The `max` aggregation for the time field is therefore
necessary to provide the time of the latest record within a bucket.

You can optionally specify a terms aggregation, which creates buckets for
different values of a field.

IMPORTANT: If you use a terms aggregation, by default it returns buckets for
the top ten terms. Thus if the cardinality of the term is greater than 10, not
all terms are analyzed.

You can change this behavior by setting the `size` parameter.
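For example, the following sketch raises the limit so that up to 1000 distinct
terms are analyzed. The `service` field name here is illustrative; it is not
part of the farequote examples above:

[source,js]
----------------------------------
"service": {
  "terms": {
    "field": "service",
    "size": 1000
  }
}
----------------------------------
// NOTCONSOLE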
To determine the cardinality of your data, you can run searches such as:

[source,js]
--------------------------------------------------
GET .../_search
{
  "aggs": {
    "service_cardinality": {
      "cardinality": {
        "field": "service"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

By default, {es} limits the maximum number of terms returned to 10000. For high
cardinality fields, the query might not run. It might return errors related to
circuit breaking exceptions that indicate that the data is too large. In such
cases, do not use aggregations in your {dfeed}. For more information, see
{ref}/search-aggregations-bucket-terms-aggregation.html[Terms Aggregation].

You can also optionally specify multiple sub-aggregations. The sub-aggregations
are aggregated for the buckets that were created by their parent aggregation.
For more information, see {ref}/search-aggregations.html[Aggregations].

TIP: If your detectors use metric or sum analytical functions, set the
`interval` of the date histogram aggregation to a tenth of the `bucket_span`
that was defined in the job. This suggestion creates finer, more granular time
buckets, which are ideal for this type of analysis. If your detectors use count
or rare functions, set `interval` to the same value as `bucket_span`. For more
information about analytical functions, see <<ml-functions>>.
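As a worked example, the farequote configurations above follow this advice: the
job's detector uses the metric function `mean` with a `bucket_span` of `60m`
(3600 seconds), so the {dfeed}'s date histogram uses an `interval` of `360s`,
one tenth of the bucket span. A hypothetical variant that used a count detector
would instead match the two values, along the lines of this sketch:

[source,js]
----------------------------------
"analysis_config": {
  "bucket_span": "60m",
  "detectors": [{
    "function": "count"
  }],
  "summary_count_field_name": "doc_count"
}
...
"date_histogram": {
  "field": "time",
  "interval": "3600s"
}
----------------------------------
// NOTCONSOLE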