[role="xpack"]
[testenv="basic"]
[[rollup-job-config]]
=== Rollup Job Configuration

experimental[]
The Rollup Job Configuration contains all the details about how the rollup job should run, when it indexes documents,
and what future queries will be able to execute against the rollup index.

There are three main sections to the job configuration: the logistical details about the job (cron schedule, etc.), which fields
should be grouped on, and what metrics to collect for each group.

A full job configuration might look like this:
[source,js]
--------------------------------------------------
PUT _rollup/job/sensor
{
    "index_pattern": "sensor-*",
    "rollup_index": "sensor_rollup",
    "cron": "*/30 * * * * ?",
    "page_size": 1000,
    "groups": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "60m",
        "delay": "7d"
      },
      "terms": {
        "fields": ["hostname", "datacenter"]
      },
      "histogram": {
        "fields": ["load", "net_in", "net_out"],
        "interval": 5
      }
    },
    "metrics": [
        {
            "field": "temperature",
            "metrics": ["min", "max", "sum"]
        },
        {
            "field": "voltage",
            "metrics": ["avg"]
        }
    ]
}
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_index]
==== Logistical Details

In the above example, there are several pieces of logistical configuration for the job itself.
`{job_id}` (required)::
(string) In the endpoint URL, you specify the name of the job (`sensor` in the above example). This can be any alphanumeric string,
and uniquely identifies the data that is associated with the rollup job. The ID is persistent, in that it is stored with the rolled
up data. So if you create a job, let it run for a while, then delete the job, the data that the job rolled up will still be
associated with this job ID. You will be unable to create a new job with the same ID, as that could lead to problems with mismatched
job configurations.
`index_pattern` (required)::
(string) The index, or index pattern, that you wish to roll up. Supports wildcard-style patterns (`logstash-*`). The job will
attempt to roll up the entire index or index pattern. Once the "backfill" is finished, it will periodically (as defined by the cron)
look for new data and roll that up too.
`rollup_index` (required)::
(string) The index that you wish to store rollup results into. All the rollup data that is generated by the job will be
stored in this index. When searching the rollup data, this index will be used in the <<rollup-search,Rollup Search>> endpoint's URL.
The rollup index can be shared with other rollup jobs. The data is stored so that it doesn't interfere with unrelated jobs.
`cron` (required)::
(string) A cron string which defines when the rollup job should be executed. The cron string defines an interval of when to run
the job's indexer. When the interval triggers, the indexer will attempt to roll up the data in the index pattern. The cron pattern
is unrelated to the time interval of the data being rolled up. For example, you may wish to create hourly rollups of your documents (as
defined in the <<rollup-groups-config,grouping configuration>>) but only run the indexer on a daily basis at midnight, as defined by the cron.
The cron pattern is defined just like Watcher's Cron Schedule.
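For instance, here is a sketch of that scenario (the job name and field values are illustrative): the cron fires once a day at
midnight, while the data itself is still bucketed hourly by the `date_histogram` group.

[source,js]
--------------------------------------------------
PUT _rollup/job/sensor_daily_indexer
{
    "index_pattern": "sensor-*",
    "rollup_index": "sensor_rollup",
    "cron": "0 0 0 * * ?",
    "page_size": 1000,
    "groups": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "60m"
      }
    }
}
--------------------------------------------------
// NOTCONSOLE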
`page_size` (required)::
(int) The number of bucket results that should be processed on each iteration of the rollup indexer. A larger value
will tend to execute faster, but will require more memory during processing. This has no effect on how the data is rolled up; it is
merely used for tweaking the speed/memory cost of the indexer.
[NOTE]
The `index_pattern` cannot be a pattern that would also match the destination `rollup_index`. E.g. the pattern
`"foo-*"` would match the rollup index `"foo-rollup"`. This causes problems because the rollup job would attempt
to roll up its own data at runtime. If you attempt to configure a pattern that matches the `rollup_index`, an exception
will be thrown to prevent this behavior.
[[rollup-groups-config]]
==== Grouping Config

The `groups` section of the configuration is where you decide which fields should be grouped on, and with what aggregations. These
fields will then be available later for aggregating into buckets. For example, this configuration:
[source,js]
--------------------------------------------------
"groups" : {
  "date_histogram": {
    "field": "timestamp",
    "interval": "60m",
    "delay": "7d"
  },
  "terms": {
    "fields": ["hostname", "datacenter"]
  },
  "histogram": {
    "fields": ["load", "net_in", "net_out"],
    "interval": 5
  }
}
--------------------------------------------------
// NOTCONSOLE
Allows `date_histogram` aggregations to be used on the `"timestamp"` field, `terms` aggregations to be used on the `"hostname"` and `"datacenter"`
fields, and `histogram` aggregations to be used on any of the `"load"`, `"net_in"`, and `"net_out"` fields.

Importantly, these aggs/fields can be used in any combination. Think of the `groups` configuration as defining a set of tools that can
later be used in aggregations to partition the data. Unlike raw data, where any field can be queried at any time, with rollups you have to
think ahead about which fields and aggregations might be used. But Rollups provide enough flexibility that you simply need to determine
_which_ fields are needed, not _in what order_ they are needed.
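For example, given the configuration above, a later <<rollup-search,Rollup Search>> could freely combine the date histogram with
one of the `terms` fields (a sketch; the aggregation names are illustrative):

[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
    "size": 0,
    "aggregations": {
        "timeline": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "7d"
            },
            "aggs": {
                "by_datacenter": {
                    "terms": {
                        "field": "datacenter"
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE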
There are three types of groupings currently available:

===== Date Histogram

A `date_histogram` group aggregates a `date` field into time-based buckets. The `date_histogram` group is *mandatory* -- you currently
cannot roll up documents without a timestamp and a `date_histogram` group.

The `date_histogram` group has several parameters:
`field` (required)::
The date field that is to be rolled up.

`interval` (required)::
The interval of time buckets to be generated when rolling up. E.g. `"60m"` will produce 60 minute (hourly) rollups. This follows standard time formatting
syntax as used elsewhere in Elasticsearch. The `interval` defines only the _minimum_ interval that can be aggregated. If hourly (`"60m"`)
intervals are configured, <<rollup-search,Rollup Search>> can execute aggregations with 60m or greater (weekly, monthly, etc.) intervals.
So define the interval as the smallest unit that you wish to be able to query later.
Note: smaller, more granular intervals take up proportionally more space.
`delay`::
How long to wait before rolling up new documents. By default, the indexer attempts to roll up all data that is available. However, it
is not uncommon for data to arrive out of order, sometimes even a few days late. The indexer is unable to deal with data that arrives
after a time-span has been rolled up (e.g. there is no provision to update already-existing rollups).
Instead, you should specify a `delay` that matches the longest period of time you expect out-of-order data to arrive. E.g. a `delay` of
`"1d"` will instruct the indexer to roll up documents up to `"now - 1d"`, which provides a day of buffer time for out-of-order documents
to arrive.
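A sketch of the `"1d"` buffer described above (only the relevant group is shown):

[source,js]
--------------------------------------------------
"groups": {
  "date_histogram": {
    "field": "timestamp",
    "interval": "60m",
    "delay": "1d"
  }
}
--------------------------------------------------
// NOTCONSOLE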
`time_zone`::
Defines the time zone that the rollup documents are stored in. Unlike raw data, which can shift timezones on the fly, rolled documents have
to be stored with a specific timezone. By default, rollup documents are stored in `UTC`, but this can be changed with the `time_zone`
parameter.
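For example, a sketch of a grouping that stores hourly buckets in the `America/New_York` timezone instead of the default `UTC`:

[source,js]
--------------------------------------------------
"groups": {
  "date_histogram": {
    "field": "timestamp",
    "interval": "60m",
    "time_zone": "America/New_York"
  }
}
--------------------------------------------------
// NOTCONSOLE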
.Calendar vs Fixed time intervals
**********************************
Elasticsearch understands both "calendar" and "fixed" time intervals. Fixed time intervals are fairly easy to understand;
`"60s"` means sixty seconds. But what does `"1M"` mean? One month of time depends on which month we are talking about,
some months are longer or shorter than others. This is an example of "calendar" time, and the duration of that unit
depends on context. Calendar units are also affected by leap-seconds, leap-years, etc.

This is important because the buckets generated by Rollup will be in either calendar or fixed intervals, and will limit
how you can query them later (see <<rollup-search-limitations-intervals, Requests must be multiples of the config>>).

We recommend sticking with "fixed" time intervals, since they are easier to understand and are more flexible at query
time. Fixed intervals will introduce some drift in your data during leap-events, and you will have to think about months as a fixed
quantity (30 days) instead of the actual calendar length... but it is often easier than dealing with calendar units
at query time.

Multiples of units are always "fixed" (e.g. `"2h"` is always the fixed quantity of `7200` seconds). Single units can be
fixed or calendar depending on the unit:
[options="header"]
|=======
|Unit        |Calendar |Fixed
|millisecond |NA       |`1ms`, `10ms`, etc
|second      |NA       |`1s`, `10s`, etc
|minute      |`1m`     |`2m`, `10m`, etc
|hour        |`1h`     |`2h`, `10h`, etc
|day         |`1d`     |`2d`, `10d`, etc
|week        |`1w`     |NA
|month       |`1M`     |NA
|quarter     |`1q`     |NA
|year        |`1y`     |NA
|=======
For units that have both fixed and calendar forms, you may need to express the quantity in terms of the next
smaller unit. For example, if you want a fixed day (not a calendar day), you should specify `24h` instead of `1d`.
Similarly, if you want fixed hours, specify `60m` instead of `1h`. This is because the single quantity entails
calendar time, and limits you to querying by calendar time in the future.
**********************************
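For instance, a sketch of a `date_histogram` group that uses fixed daily buckets (`24h`) rather than calendar days (`1d`):

[source,js]
--------------------------------------------------
"groups": {
  "date_histogram": {
    "field": "timestamp",
    "interval": "24h"
  }
}
--------------------------------------------------
// NOTCONSOLE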
===== Terms

The `terms` group can be used on `keyword` or numeric fields, to allow bucketing via the `terms` aggregation at a later point. The `terms`
group is optional. If defined, the indexer will enumerate and store _all_ values of a field for each time-period. This can be potentially
costly for high-cardinality groups such as IP addresses, especially if the time-bucket is particularly sparse.

While it is unlikely that a rollup will ever be larger in size than the raw data, defining `terms` groups on multiple high-cardinality fields
can effectively reduce the compression of a rollup to a large extent. For that reason, you should be judicious about which high-cardinality
fields are included.

The `terms` group has a single parameter:

`fields` (required)::
The set of fields that you wish to collect terms for. This array can contain fields that are both `keyword` and numerics. Order
does not matter.
===== Histogram

The `histogram` group aggregates one or more numeric fields into numeric histogram intervals. This group is optional.

The `histogram` group has two parameters:

`fields` (required)::
The set of fields that you wish to build histograms for. All fields specified must be some kind of numeric. Order does not matter.

`interval` (required)::
The interval of histogram buckets to be generated when rolling up. E.g. `5` will create buckets that are five units wide
(`0-5`, `5-10`, etc). Note that only one interval can be specified in the `histogram` group, meaning that all fields being grouped via
the histogram must share the same interval.
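As with the date histogram, the configured interval is the smallest that can be queried later. Here is a sketch of a
<<rollup-search,Rollup Search>> that re-buckets the five-unit-wide rollups into ten-unit-wide buckets (a multiple of the
configured interval; the aggregation name is illustrative):

[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
    "size": 0,
    "aggregations": {
        "load_distribution": {
            "histogram": {
                "field": "load",
                "interval": 10
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE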
[[rollup-metrics-config]]
==== Metrics Config

After defining which groups should be generated for the data, you next configure which metrics should be collected. By default, only
the doc_counts are collected for each group. To make rollup useful, you will often add metrics like averages, mins, maxes, etc.

Metrics are defined on a per-field basis, and for each field you configure which metric should be collected. For example:
[source,js]
--------------------------------------------------
"metrics": [
    {
        "field": "temperature",
        "metrics": ["min", "max", "sum"]
    },
    {
        "field": "voltage",
        "metrics": ["avg"]
    }
]
--------------------------------------------------
// NOTCONSOLE
This configuration defines metrics over two fields, `"temperature"` and `"voltage"`. For the `"temperature"` field, we are collecting
the min, max and sum of the temperature. For `"voltage"`, we are collecting the average. These metrics are collected in a way that makes
them compatible with any combination of defined groups.

The `metrics` configuration accepts an array of objects, where each object has two parameters:

`field` (required)::
The field to collect metrics for. This must be a numeric of some kind.

`metrics` (required)::
An array of metrics to collect for the field. At least one metric must be configured. Acceptable metrics are min/max/sum/avg/value_count.
.Averages aren't composable?!
**********************************
If you've worked with rollups before, you may be cautious around averages. If an average is saved for a 10 minute
interval, it usually isn't useful for larger intervals. You cannot average six 10-minute averages to find an
hourly average (the average of averages is not equal to the total average).

For this reason, other systems tend to either omit the ability to average, or store the average at multiple intervals
to support more flexible querying.

Instead, the Rollup feature saves the `count` and `sum` for the defined time interval. This allows us to reconstruct
the average at any interval greater-than or equal to the defined interval. This gives maximum flexibility for
minimal storage costs... and you don't have to worry about average accuracies (no average of averages here!).
**********************************
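For example, even though the job above only stores hourly buckets for `"voltage"`, a <<rollup-search,Rollup Search>> can ask for
a daily average, and it will be reconstructed from the stored `sum` and `count` (a sketch; the aggregation names are illustrative):

[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
    "size": 0,
    "aggregations": {
        "daily": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "1d"
            },
            "aggs": {
                "avg_voltage": {
                    "avg": {
                        "field": "voltage"
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE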