[role="xpack"]
[[ml-delayed-data-detection]]
=== Handling delayed data

Delayed data are documents that are indexed late. That is to say, they relate to
a time that the {dfeed} has already processed.

When you create a datafeed, you can specify a
{ref}/ml-datafeed-resource.html[`query_delay`] setting. This setting enables the
datafeed to wait for some time past real-time, which means any "late" data in
this period is fully indexed before the datafeed tries to gather it. However, if
the setting is too low, the datafeed may query for data before it has been
indexed and consequently miss those documents. Conversely, if it is too high,
analysis drifts farther away from real-time. The right balance depends on each
use case and the environmental factors of the cluster.

==== Why worry about delayed data?

This is a particularly pertinent question. If data are delayed randomly (and
consequently are missing from analysis), the results of certain types of
functions are largely unaffected. In these situations, the effect evens out
because the delayed data is distributed randomly. An example would be a `mean`
metric for a field in a large collection of data. In this case, checking for
delayed data may not provide much benefit. If data are consistently delayed,
however, jobs with a `low_count` function may produce false positives. In this
situation, it is useful to see whether data arrives after an anomaly is
recorded so that you can determine a next course of action.

==== How do we detect delayed data?

In addition to the `query_delay` field, there is a
{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config],
which enables you to configure the datafeed to look in the past for delayed data.
Every 15 minutes or every `check_window`, whichever is smaller, the datafeed
triggers a document search over the configured indices. This search covers a
time span with a length of `check_window` ending with the latest finalized bucket.
That time span is partitioned into buckets whose length equals the bucket span
of the associated job. The `doc_count` of each of those buckets is then compared
with the job's finalized analysis buckets to see whether any data has arrived
since the analysis. If data is indeed missing due to ingest delay, the end
user is notified. For example, you can see annotations in {kib} for the periods
where these delays occur.

==== What to do about delayed data?

The most common course of action is simply to do nothing. For many functions
and situations, ignoring the data is acceptable. However, if the amount of
delayed data is too great or the situation calls for it, the next course of
action to consider is to increase the `query_delay` of the datafeed. This
increased delay allows more time for data to be indexed. If you have real-time
constraints, however, an increased delay might not be desirable. In that case,
you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].
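
To make the check described under "How do we detect delayed data?" more concrete,
the following is a minimal Python sketch of the bucket comparison. It is an
illustration only, not the {dfeed} implementation: the names `check_interval`,
`bucket_counts`, `find_delayed_data`, and `MissingData` are hypothetical, times
are plain epoch seconds, and the document timestamps are passed in as a list
rather than fetched by a search.

```python
from dataclasses import dataclass

FIFTEEN_MINUTES = 15 * 60  # seconds


@dataclass
class MissingData:
    bucket_start: int  # epoch seconds at which the bucket begins
    missing_docs: int  # documents indexed after the job analyzed the bucket


def check_interval(check_window):
    """The check runs every 15 minutes or every check_window, whichever is smaller."""
    return min(FIFTEEN_MINUTES, check_window)


def bucket_counts(timestamps, window_start, window_end, bucket_span):
    """Count documents per bucket over [window_start, window_end)."""
    counts = {start: 0 for start in range(window_start, window_end, bucket_span)}
    for ts in timestamps:
        if window_start <= ts < window_end:
            offset = (ts - window_start) // bucket_span
            counts[window_start + offset * bucket_span] += 1
    return counts


def find_delayed_data(indexed_now, analyzed_counts, latest_finalized_end,
                      check_window, bucket_span):
    """Compare current per-bucket counts with the counts the job analyzed.

    indexed_now: timestamps of documents currently in the indices (assumed input).
    analyzed_counts: bucket start -> doc count seen when the bucket was finalized.
    """
    window_start = latest_finalized_end - check_window
    current = bucket_counts(indexed_now, window_start, latest_finalized_end,
                            bucket_span)
    return [
        MissingData(start, count - analyzed_counts.get(start, 0))
        for start, count in sorted(current.items())
        if count > analyzed_counts.get(start, 0)
    ]


# Hypothetical example: 60-second buckets, a 180-second check window.
analyzed = {0: 2, 60: 1, 120: 2}           # counts at finalization time
indexed = [5, 30, 65, 70, 110, 130, 150]   # two documents arrived late in bucket 60
delayed = find_delayed_data(indexed, analyzed, latest_finalized_end=180,
                            check_window=180, bucket_span=60)
```

In this example only the bucket starting at 60 holds more documents now than it
did when it was analyzed, so it is the only one reported. A real check would
obtain the current counts from a search over the configured indices instead of
an in-memory list.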