[role="xpack"]
[testenv="basic"]
[[rollup-job-config]]
=== Rollup job configuration

experimental[]

The rollup job configuration contains all the details about how the rollup job should run, when it indexes documents,
and what future queries will be able to execute against the rollup index.

There are three main sections to the job configuration: the logistical details about the job (cron schedule, etc.), which fields
should be grouped on, and which metrics to collect for each group.

A full job configuration might look like this:

[source,console]
--------------------------------------------------
PUT _rollup/job/sensor
{
  "index_pattern": "sensor-*",
  "rollup_index": "sensor_rollup",
  "cron": "*/30 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "timestamp",
      "fixed_interval": "60m",
      "delay": "7d"
    },
    "terms": {
      "fields": ["hostname", "datacenter"]
    },
    "histogram": {
      "fields": ["load", "net_in", "net_out"],
      "interval": 5
    }
  },
  "metrics": [
    {
      "field": "temperature",
      "metrics": ["min", "max", "sum"]
    },
    {
      "field": "voltage",
      "metrics": ["avg"]
    }
  ]
}
--------------------------------------------------
// TEST[setup:sensor_index]

==== Logistical Details

In the above example, there are several pieces of logistical configuration for the job itself.

`{job_id}` (required)::
  (string) In the endpoint URL, you specify the name of the job (`sensor` in the above example). This can be any alphanumeric string
  and uniquely identifies the data that is associated with the rollup job. The ID is persistent, in that it is stored with the rolled
  up data. So if you create a job, let it run for a while, then delete the job, the data that the job rolled up will still be
  associated with this job ID. You will be unable to create a new job with the same ID, as that could lead to problems with mismatched
  job configurations.

`index_pattern` (required)::
  (string) The index, or index pattern, that you wish to roll up. Supports wildcard-style patterns (`logstash-*`). The job will
  attempt to roll up the entire index or index pattern. Once the "backfill" is finished, it will periodically (as defined by the cron)
  look for new data and roll that up too.

`rollup_index` (required)::
  (string) The index that rollup results will be stored in. All the rollup data generated by the job is
  stored in this index. When searching the rollup data, this index is used in the <<rollup-search,Rollup Search>> endpoint's URL.
  The rollup index can be shared with other rollup jobs; the data is stored so that it doesn't interfere with unrelated jobs.

`cron` (required)::
  (string) A cron string which defines when the rollup job should be executed. The cron string defines an interval of when to run
  the job's indexer. When the interval triggers, the indexer will attempt to roll up the data in the index pattern. The cron pattern
  is unrelated to the time interval of the data being rolled up. For example, you may wish to create hourly rollups of your documents (as
  defined in the <<rollup-groups-config,grouping configuration>>) but only run the indexer on a daily basis at midnight, as defined by the cron.
  The cron pattern is defined just like a Watcher cron schedule.

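For example, where the `sensor` job above triggers the indexer every 30 seconds, a job that runs once per day at midnight would use a cron like this (an illustrative fragment):

[source,js]
--------------------------------------------------
"cron": "0 0 0 * * ?"
--------------------------------------------------
// NOTCONSOLE
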
`page_size` (required)::
  (int) The number of bucket results that should be processed on each iteration of the rollup indexer. A larger value
  will tend to execute faster, but will require more memory during processing. This has no effect on how the data is rolled up; it is
  merely used for tweaking the speed/memory cost of the indexer.

[NOTE]
The `index_pattern` cannot be a pattern that would also match the destination `rollup_index`. E.g. the pattern
`"foo-*"` would match the rollup index `"foo-rollup"`. This causes problems because the rollup job would attempt
to roll up its own data at runtime. If you attempt to configure a pattern that matches the `rollup_index`, an exception
will be thrown to prevent this behavior.

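For example, a configuration like the following (a deliberately invalid sketch) would be rejected, because the pattern `sensor-*` matches the rollup index itself:

[source,js]
--------------------------------------------------
"index_pattern": "sensor-*",
"rollup_index": "sensor-rollup"
--------------------------------------------------
// NOTCONSOLE
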
[[rollup-groups-config]]
==== Grouping Config

The `groups` section of the configuration is where you decide which fields should be grouped on, and with what aggregations. These
fields will then be available later for aggregating into buckets. For example, this configuration:

[source,js]
--------------------------------------------------
"groups": {
  "date_histogram": {
    "field": "timestamp",
    "fixed_interval": "60m",
    "delay": "7d"
  },
  "terms": {
    "fields": ["hostname", "datacenter"]
  },
  "histogram": {
    "fields": ["load", "net_in", "net_out"],
    "interval": 5
  }
}
--------------------------------------------------
// NOTCONSOLE

allows `date_histogram` aggregations to be used on the `"timestamp"` field, `terms` aggregations on the `"hostname"` and `"datacenter"`
fields, and `histogram` aggregations on any of the `"load"`, `"net_in"` and `"net_out"` fields.

Importantly, these aggs/fields can be used in any combination. Think of the `groups` configuration as defining a set of tools that can
later be used in aggregations to partition the data. Unlike raw data, we have to think ahead about which fields and aggregations might be used.
But rollups provide enough flexibility that you simply need to determine _which_ fields are needed, not _in what order_ they are needed.

There are three types of groupings currently available:

===== Date Histogram

A `date_histogram` group aggregates a `date` field into time-based buckets. The `date_histogram` group is *mandatory* -- you currently
cannot roll up documents without a timestamp and a `date_histogram` group.

The `date_histogram` group has several parameters:

`field` (required)::
  The date field that is to be rolled up.

`calendar_interval` or `fixed_interval` (one is required)::
  The interval of time buckets to be generated when rolling up. E.g. `"60m"` will produce 60 minute (hourly) rollups. This follows standard time formatting
  syntax as used elsewhere in Elasticsearch. The interval defines the _minimum_ interval that can be aggregated later. If hourly (`"60m"`)
  intervals are configured, <<rollup-search,Rollup Search>> can execute aggregations with 60m or greater (weekly, monthly, etc.) intervals.
  So define the interval as the smallest unit that you wish to later query.

  Note: smaller, more granular intervals take up proportionally more space.

`delay`::
  How long to wait before rolling up new documents. By default, the indexer attempts to roll up all data that is available. However, it
  is not uncommon for data to arrive out of order, sometimes even a few days late. The indexer is unable to deal with data that arrives
  after a time-span has been rolled up (i.e. there is no provision to update already-existing rollups).

  Instead, you should specify a `delay` that matches the longest period of time you expect out-of-order data to arrive. E.g. a `delay` of
  `"1d"` instructs the indexer to roll up documents up to `"now - 1d"`, which provides a day of buffer time for out-of-order documents
  to arrive.

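Putting those parameters together, a group that buckets hourly but waits a day for late-arriving data might look like this (an illustrative fragment, recombining the parameters above):

[source,js]
--------------------------------------------------
"date_histogram": {
  "field": "timestamp",
  "fixed_interval": "60m",
  "delay": "1d"
}
--------------------------------------------------
// NOTCONSOLE
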
`time_zone`::
  Defines the time zone that the rollup documents are stored in. Unlike raw data, which can shift time zones on the fly, rolled documents have
  to be stored with a specific time zone. By default, rollup documents are stored in `UTC`, but this can be changed with the `time_zone`
  parameter.

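A group that stores its buckets in a specific time zone might be configured like so (an illustrative fragment; the zone ID shown is just an example):

[source,js]
--------------------------------------------------
"date_histogram": {
  "field": "timestamp",
  "fixed_interval": "60m",
  "time_zone": "America/New_York"
}
--------------------------------------------------
// NOTCONSOLE
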
.Calendar vs Fixed time intervals
**********************************
Elasticsearch understands both "calendar" and "fixed" time intervals. Fixed time intervals are fairly easy to understand;
`"60s"` means sixty seconds. But what does `"1M"` mean? One month of time depends on which month we are talking about;
some months are longer or shorter than others. This is an example of "calendar" time, where the duration of a unit
depends on context. Calendar units are also affected by leap seconds, leap years, etc.

This is important because the buckets generated by rollup will be in either calendar or fixed intervals, which limits
how you can query them later (see <<rollup-search-limitations-intervals, Requests must be multiples of the config>>).

We recommend sticking with "fixed" time intervals, since they are easier to understand and are more flexible at query
time. Fixed intervals will introduce some drift in your data during leap events, and you will have to think about months as a fixed
quantity (30 days) instead of the actual calendar length, but it is often easier than dealing with calendar units
at query time.

Multiples of units are always "fixed" (e.g. `"2h"` is always the fixed quantity `7200` seconds). Single units can be
fixed or calendar depending on the unit:

[options="header"]
|=======
|Unit |Calendar |Fixed
|millisecond |NA |`1ms`, `10ms`, etc
|second |NA |`1s`, `10s`, etc
|minute |`1m` |`2m`, `10m`, etc
|hour |`1h` |`2h`, `10h`, etc
|day |`1d` |`2d`, `10d`, etc
|week |`1w` |NA
|month |`1M` |NA
|quarter |`1q` |NA
|year |`1y` |NA
|=======

For units that have both a fixed and a calendar form, you may need to express the quantity in terms of the next
smaller unit. For example, if you want a fixed day (not a calendar day), you should specify `24h` instead of `1d`.
Similarly, if you want fixed hours, specify `60m` instead of `1h`. This is because the single-unit quantity is
interpreted as calendar time, which limits you to querying by calendar time in the future.

**********************************

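Following that advice, a job that wants fixed daily buckets (rather than calendar days) would be configured with `24h` instead of `1d` (an illustrative fragment):

[source,js]
--------------------------------------------------
"date_histogram": {
  "field": "timestamp",
  "fixed_interval": "24h"
}
--------------------------------------------------
// NOTCONSOLE
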
===== Terms

The `terms` group can be used on `keyword` or numeric fields to allow bucketing via the `terms` aggregation at a later point. The `terms`
group is optional. If defined, the indexer will enumerate and store _all_ values of a field for each time period. This can be potentially
costly for high-cardinality groups such as IP addresses, especially if the time bucket is particularly sparse.

While it is unlikely that a rollup will ever be larger in size than the raw data, defining `terms` groups on multiple high-cardinality fields
can greatly reduce the compression of a rollup. For that reason, you should be judicious about which high-cardinality fields are included.

The `terms` group has a single parameter:

`fields` (required)::
  The set of fields that you wish to collect terms for. This array can contain fields that are both `keyword` and numerics. Order
  does not matter.

===== Histogram

The `histogram` group aggregates one or more numeric fields into numeric histogram intervals. This group is optional.

The `histogram` group has two parameters:

`fields` (required)::
  The set of fields that you wish to build histograms for. All fields specified must be numeric of some kind. Order does not matter.

`interval` (required)::
  The interval of histogram buckets to be generated when rolling up. E.g. `5` will create buckets that are five units wide
  (`0-5`, `5-10`, etc). Note that only one interval can be specified in the `histogram` group, meaning that all fields being grouped via
  the histogram must share the same interval.

[[rollup-metrics-config]]
==== Metrics Config

After defining which groups should be generated for the data, you next configure which metrics should be collected. By default, only
the `doc_count` is collected for each group. To make rollup useful, you will often add metrics like averages, mins, maxes, etc.

Metrics are defined on a per-field basis; for each field, you configure which metrics should be collected. For example:

[source,js]
--------------------------------------------------
"metrics": [
  {
    "field": "temperature",
    "metrics": ["min", "max", "sum"]
  },
  {
    "field": "voltage",
    "metrics": ["avg"]
  }
]
--------------------------------------------------
// NOTCONSOLE

This configuration defines metrics over two fields, `"temperature"` and `"voltage"`. For the `"temperature"` field, we are collecting
the min, max and sum of the temperature. For `"voltage"`, we are collecting the average. These metrics are collected in a way that makes
them compatible with any combination of defined groups.

The `metrics` configuration accepts an array of objects, where each object has two parameters:

`field` (required)::
  The field to collect metrics for. This must be numeric of some kind.

`metrics` (required)::
  An array of metrics to collect for the field. At least one metric must be configured. Acceptable metrics are `min`, `max`, `sum`, `avg`
  and `value_count`.

.Averages aren't composable?!
**********************************
If you've worked with rollups before, you may be cautious around averages. If an average is saved for a 10 minute
interval, it usually isn't useful for larger intervals. You cannot average six 10-minute averages to find an
hourly average (the average of averages is not equal to the total average).

For this reason, other systems tend to either omit the ability to average, or store the average at multiple intervals
to support more flexible querying.

Instead, the Rollup feature saves the `count` and `sum` for the defined time interval. This allows us to reconstruct
the average at any interval greater than or equal to the defined interval. This gives maximum flexibility for
minimal storage cost, and you don't have to worry about average accuracy (no average of averages here!).
**********************************
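
For example, because the `sensor` job above configures an `avg` on `"voltage"`, an average can later be requested over any compatible interval through the <<rollup-search,Rollup Search>> endpoint (a sketch, assuming the `sensor_rollup` index from the earlier configuration exists):

[source,console]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "avg_voltage": {
      "avg": {
        "field": "voltage"
      }
    }
  }
}
--------------------------------------------------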