[role="xpack"]
[testenv="basic"]
[[rollup-understanding-groups]]
== Understanding Groups

experimental[]

To preserve flexibility, Rollup Jobs are defined based on how future queries may need to use the data.  Traditionally, systems force
the admin to make decisions about what metrics to rollup and on what interval.  E.g. The average of `cpu_time` on an hourly basis.  This
is limiting; if, at a future date, the admin wishes to see the average of `cpu_time` on an hourly basis _and partitioned by `host_name`_,
they are out of luck.

Of course, the admin can decide to rollup the `[hour, host]` tuple on an hourly basis, but as the number of grouping keys grows, so does the
number of tuples the admin needs to configure.  Furthermore, these `[hour, host]` tuples are only useful for hourly rollups... daily, weekly,
or monthly rollups all require new configurations.

Rather than force the admin to decide ahead of time which individual tuples should be rolled up, Elasticsearch's Rollup jobs are configured
based on which groups are potentially useful to future queries.
For example, this configuration:

[source,js]
--------------------------------------------------
"groups" : {
  "date_histogram": {
    "field": "timestamp",
    "interval": "1h",
    "delay": "7d"
  },
  "terms": {
    "fields": ["hostname", "datacenter"]
  },
  "histogram": {
    "fields": ["load", "net_in", "net_out"],
    "interval": 5
  }
}
--------------------------------------------------
// NOTCONSOLE

allows `date_histogram` aggregations to be used on the `"timestamp"` field, `terms` aggregations to be used on the `"hostname"` and `"datacenter"`
fields, and `histogram` aggregations to be used on any of the `"load"`, `"net_in"`, and `"net_out"` fields.

Importantly, these aggs/fields can be used in any combination.  This aggregation:

[source,js]
--------------------------------------------------
"aggs" : {
  "hourly": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "1h"
    },
    "aggs": {
      "host_names": {
        "terms": {
          "field": "hostname"
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

is just as valid as this aggregation:

[source,js]
--------------------------------------------------
"aggs" : {
  "hourly": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "1h"
    },
    "aggs": {
      "data_center": {
        "terms": {
          "field": "datacenter"
        },
        "aggs": {
          "host_names": {
            "terms": {
              "field": "hostname"
            },
            "aggs": {
              "load_values": {
                "histogram": {
                  "field": "load",
                  "interval": 5
                }
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

You'll notice that the second aggregation is not only substantially larger, it also swapped the position of the terms aggregation on
`"hostname"`, illustrating how the order of aggregations does not matter to rollups.
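To make this concrete, the first aggregation above could be submitted as the body of a Rollup Search request. This is an illustrative sketch: the `sensor_rollup` index name is an assumption, not part of the configuration shown above.

[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "hourly": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h"
      },
      "aggs": {
        "host_names": {
          "terms": {
            "field": "hostname"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE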
Similarly, while the `date_histogram` is required
for rolling up data, it isn't required while querying (although it is often used).  For example, this is a valid aggregation for
Rollup Search to execute:

[source,js]
--------------------------------------------------
"aggs" : {
  "host_names": {
    "terms": {
      "field": "hostname"
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

Ultimately, when configuring `groups` for a job, think in terms of how you might wish to partition data in a query at a future date...
then include those in the config.  Because Rollup Search allows any order or combination of the grouped fields, you just need to decide
if a field is useful for aggregating later, and how you might wish to use it (terms, histogram, etc.).

=== Grouping Limitations with heterogeneous indices

There was previously a limitation in how Rollup could handle indices that had heterogeneous mappings (multiple, unrelated/non-overlapping
mappings).  The recommendation at the time was to configure a separate job per data "type".  For example, you might configure a separate
job for each Beats module that you had enabled (one for `process`, another for `filesystem`, etc).

This recommendation was driven by internal implementation details that caused document counts to be potentially incorrect if a single "merged"
job was used.

This limitation has since been alleviated.
As of 6.4.0, it is now considered best practice to combine all rollup configurations
into a single job.

As an example, if your index has two types of documents:

[source,js]
--------------------------------------------------
{
  "timestamp": 1516729294000,
  "temperature": 200,
  "voltage": 5.2,
  "node": "a"
}
--------------------------------------------------
// NOTCONSOLE

and

[source,js]
--------------------------------------------------
{
  "timestamp": 1516729294000,
  "price": 123,
  "title": "Foo"
}
--------------------------------------------------
// NOTCONSOLE

the best practice is to combine them into a single rollup job which covers both of these document types, like this:

[source,js]
--------------------------------------------------
PUT _rollup/job/combined
{
    "index_pattern": "data-*",
    "rollup_index": "data_rollup",
    "cron": "*/30 * * * * ?",
    "page_size": 1000,
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": ["node", "title"]
      }
    },
    "metrics": [
        {
            "field": "temperature",
            "metrics": ["min", "max", "sum"]
        },
        {
            "field": "price",
            "metrics": ["avg"]
        }
    ]
}
--------------------------------------------------
// NOTCONSOLE

=== Doc counts and overlapping jobs

There was previously an issue with document counts on "overlapping" job configurations, driven by the same internal implementation detail.
If there were two Rollup jobs saving to the same index, where one job is a "subset" of another job, it was possible that document counts
could be incorrect for certain aggregation arrangements.

This issue has also since been eliminated in 6.4.0.
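Once a combined job like the `combined` example above has populated its `data_rollup` index, a single Rollup Search can target either set of grouped fields. The request below is an illustrative sketch (the aggregation names `titles` and `avg_price` are hypothetical), aggregating only the "price"-type documents by `title`:

[source,js]
--------------------------------------------------
GET /data_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "titles": {
      "terms": {
        "field": "title"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE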