|
@@ -0,0 +1,42 @@
|
|
|
+[role="xpack"]
|
|
|
+[[ml-delayed-data-detection]]
|
|
|
+=== Handling delayed data
|
|
|
+
|
|
|
+Delayed data are documents that are indexed late. That is to say, they are data
|
|
|
+related to a time that the {dfeed} has already processed.
|
|
|
+
|
|
|
+When you create a datafeed, you can specify a {ref}/ml-datafeed-resource.html[`query_delay`] setting.
|
|
|
+This setting enables the datafeed to wait for some time past real-time, which means any "late" data in this period
|
|
|
+is fully indexed before the datafeed tries to gather it. However, if the setting is set too low, the datafeed may query
|
|
|
+for data before it has been indexed and consequently miss those documents. Conversely, if it is set too high,
|
|
|
+analysis drifts farther away from real-time. The balance that is struck depends upon each use case and
|
|
|
+the environmental factors of the cluster.
|
|
|
+
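As a rough illustration of the trade-off described above, the following sketch estimates how many documents a given `query_delay` would miss. It assumes a simplified model in which a document is missed whenever its ingest latency exceeds `query_delay`; the function name and the sample latencies are hypothetical and are not part of any {ml} API.

```python
from datetime import timedelta


def missed_fraction(ingest_latencies, query_delay):
    """Return the fraction of documents whose ingest latency exceeds query_delay.

    Simplified model (an assumption): the datafeed searches for a document's
    timestamp roughly query_delay after that timestamp, so a document that
    takes longer than query_delay to be indexed is missed.
    """
    if not ingest_latencies:
        return 0.0
    missed = sum(1 for latency in ingest_latencies if latency > query_delay)
    return missed / len(ingest_latencies)


# Hypothetical observed ingest latencies for four documents.
latencies = [timedelta(seconds=s) for s in (10, 45, 90, 200)]

low_delay = missed_fraction(latencies, timedelta(seconds=60))
high_delay = missed_fraction(latencies, timedelta(seconds=120))
```

A larger delay misses fewer documents in this model, at the cost of analysis that lags further behind real-time.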
|
|
|
+==== Why worry about delayed data?
|
|
|
+
|
|
|
+This is a particularly pertinent question. If data are delayed randomly (and consequently are missing from analysis),
|
|
|
+the results of certain types of functions are not significantly affected. The effect evens out in the end
|
|
|
+as the delayed data is distributed randomly. An example would be a `mean` metric for a field in a large collection of data.
|
|
|
+In this case, checking for delayed data may not provide much benefit. If data are consistently delayed, however, jobs with a `low_count` function may
|
|
|
+produce false positives. In this situation, it would be useful to see if data
|
|
|
+comes in after an anomaly is recorded so that you can determine a next course of action.
|
|
|
+
|
|
|
+==== How do we detect delayed data?
|
|
|
+
|
|
|
+In addition to the `query_delay` field, there is a
|
|
|
+{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config], which enables you to
|
|
|
+configure the datafeed to look in the past for delayed data. Every 15 minutes or every `check_window`,
|
|
|
+whichever is smaller, the datafeed triggers a document search over the configured indices. This search looks over a
|
|
|
+time span with a length of `check_window` ending with the latest finalized bucket. That time span is partitioned into buckets,
|
|
|
+whose length equals the bucket span of the associated job. The `doc_count` values of those buckets are then compared with the
|
|
|
+job's finalized analysis buckets to see whether any data has arrived since the analysis. If there is indeed missing data
|
|
|
+due to ingest delay, the end user is notified.
|
|
|
+
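The check described above can be sketched as follows. This is a minimal illustration of the comparison logic only, not the actual datafeed implementation: all function names are hypothetical, times are plain epoch seconds, and the per-bucket counts are assumed to have been fetched already.

```python
FIFTEEN_MINUTES = 15 * 60  # seconds


def check_interval(check_window):
    """The check runs every 15 minutes or every check_window, whichever is smaller."""
    return min(FIFTEEN_MINUTES, check_window)


def bucket_starts(latest_finalized_end, check_window, bucket_span):
    """Start times of the buckets in a window of length check_window that
    ends with the latest finalized bucket."""
    window_start = latest_finalized_end - check_window
    return list(range(window_start, latest_finalized_end, bucket_span))


def find_delayed(index_counts, analyzed_counts, starts):
    """Map bucket start -> number of documents that arrived after analysis.

    index_counts holds the current doc_count per bucket in the indices;
    analyzed_counts holds the count the job saw when it finalized the bucket.
    """
    missing = {}
    for start in starts:
        diff = index_counts.get(start, 0) - analyzed_counts.get(start, 0)
        if diff > 0:
            missing[start] = diff
    return missing


# Hypothetical example: a 30-minute check_window with 10-minute buckets.
starts = bucket_starts(latest_finalized_end=3600, check_window=1800, bucket_span=600)
delayed = find_delayed(
    index_counts={1800: 10, 2400: 12, 3000: 7},
    analyzed_counts={1800: 10, 2400: 9, 3000: 7},
    starts=starts,
)
```

In this example, only the bucket starting at 2400 has gained documents since it was analyzed, so it is the only one that would be reported.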
|
|
|
+==== What to do about delayed data?
|
|
|
+
|
|
|
+The most common course of action is to simply do nothing. For many functions and situations, ignoring the data is
|
|
|
+acceptable. However, if the amount of delayed data is too great or the situation calls for it, the next course
|
|
|
+of action to consider is to increase the `query_delay` of the datafeed. This increased delay allows more time for data to be
|
|
|
+indexed. If you have real-time constraints, however, an increased delay might not be desirable.
|
|
|
+In that case, you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].
|
|
|
+
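If you decide to increase `query_delay`, the change is made by updating the {dfeed}. The sketch below builds only the JSON body for such an update. The helper name is hypothetical, the values are examples, and the `enabled` and `check_window` fields are assumed to follow the delayed data check config linked earlier. The exact endpoint path for the update {dfeed} API depends on your version, so it is deliberately left out.

```python
import json


def build_datafeed_update(query_delay, check_window=None):
    """Build a request body that raises query_delay and, optionally,
    sets the window used by the delayed data check.

    Hypothetical helper: it only assembles a dictionary and does not
    contact a cluster.
    """
    body = {"query_delay": query_delay}
    if check_window is not None:
        body["delayed_data_check_config"] = {
            "enabled": True,
            "check_window": check_window,
        }
    return body


# Example values only; choose a delay that fits your ingest latency.
update_body = build_datafeed_update("120s", check_window="2h")
update_json = json.dumps(update_body, sort_keys=True)
```

Keeping the delayed data check enabled after the change lets you confirm whether the new delay is long enough.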
|