6 years ago · 19529da2db
--- a/docs/reference/ml/delayed-data-detection.asciidoc
+++ b/docs/reference/ml/delayed-data-detection.asciidoc
@@ -5,38 +5,49 @@
 
				 Delayed data are documents that are indexed late. That is to say, it is data
			
 
				 related to a time that the {dfeed} has already processed.
			
 
				 
			
 
				-When you create a datafeed, you can specify a {ref}/ml-datafeed-resource.html[`query_delay`] setting.
			
 
				-This setting enables the datafeed to wait for some time past real-time, which means any "late" data in this period
			
 
				-is fully indexed before the datafeed tries to gather it. However, if the setting is set too low, the datafeed may query
			
 
				-for data before it has been indexed and consequently miss that document. Conversely, if it is set too high,
			
 
				-analysis drifts farther away from real-time. The balance that is struck depends upon each use case and
			
 
				-the environmental factors of the cluster.
			
 
				+When you create a datafeed, you can specify a
			
 
				+{ref}/ml-datafeed-resource.html[`query_delay`] setting. This setting enables the
			
 
				+datafeed to wait for some time past real-time, which means any "late" data in
			
 
				+this period is fully indexed before the datafeed tries to gather it. However, if
			
 
				+the setting is set too low, the datafeed may query for data before it has been
			
 
				+indexed and consequently miss that document. Conversely, if it is set too high,
			
 
				+analysis drifts farther away from real-time. The balance that is struck depends
			
 
				+upon each use case and the environmental factors of the cluster.
			
 
				 
			
 
				 ==== Why worry about delayed data?
			
 
				 
			
 
				-This is a particularly prescient question. If data are delayed randomly (and consequently missing from analysis),
			
 
				-the results of certain types of functions are not really affected. It all comes out ok in the end 
			
 
				-as the delayed data is distributed randomly. An example would be a `mean` metric for a field in a large collection of data.
			
 
				-In this case, checking for delayed data may not provide much benefit. If data are consistently delayed, however, jobs with a `low_count` function may
			
 
				-provide false positives. In this situation, it would be useful to see if data
			
 
				-comes in after an anomaly is recorded so that you can determine a next course of action.
			
 
				+This is a particularly pertinent question. If data are delayed randomly (and
			
 
				+consequently are missing from analysis), the results of certain types of
			
 
				+functions are not really affected. In these situations, it all comes out okay in
			
 
				+the end as the delayed data is distributed randomly. An example would be a `mean`
			
 
				+metric for a field in a large collection of data. In this case, checking for
			
 
				+delayed data may not provide much benefit. If data are consistently delayed,
			
 
				+however, jobs with a `low_count` function may provide false positives. In this
			
 
				+situation, it would be useful to see if data comes in after an anomaly is
			
 
				+recorded so that you can determine a next course of action.
			
 
				 
			
 
				 ==== How do we detect delayed data?
			
 
				 
			
 
				 In addition to the `query_delay` field, there is a
			
 
				-{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config], which enables you to
			
 
				-configure the datafeed to look in the past for delayed data. Every 15 minutes or every `check_window`,
			
 
				-whichever is smaller, the datafeed triggers a document search over the configured indices. This search looks over a
			
 
				-time span with a length of `check_window` ending with the latest finalized bucket. That time span is partitioned into buckets,
			
 
				-whose length equals the bucket span of the associated job. The `doc_count` of those buckets are then compared with the
			
 
				-job's finalized analysis buckets to see whether any data has arrived since the analysis. If there is indeed missing data
			
 
				-due to their ingest delay, the end user is notified.
			
 
				+{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config],
			
 
				+which enables you to configure the datafeed to look in the past for delayed data.
			
 
				+Every 15 minutes or every `check_window`, whichever is smaller, the datafeed
			
 
				+triggers a document search over the configured indices. This search looks over a
			
 
				+time span with a length of `check_window` ending with the latest finalized bucket.
			
 
				+That time span is partitioned into buckets, whose length equals the bucket span
			
 
				+of the associated job. The `doc_count` of those buckets are then compared with
			
 
				+the job's finalized analysis buckets to see whether any data has arrived since
			
 
				+the analysis. If there is indeed missing data due to ingest delay, the end
			
 
				+user is notified. For example, you can see annotations in {kib} for the periods
			
 
				+where these delays occur.
			
 
				 
			
 
				 ==== What to do about delayed data?
			
 
				 
			
 
				-The most common course of action is to simply to do nothing. For many functions and situations ignoring the data is
			
 
				-acceptable. However, if the amount of delayed data is too great or the situation calls for it, the next course
			
 
				-of action to consider is to increase the `query_delay` of the datafeed. This increased delay allows more time for data to be
			
 
				-indexed. If you have real-time constraints, however, an increased delay might not be desirable.
			
 
				-In which case, you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed.]
			
 
				+The most common course of action is to simply do nothing. For many functions
			
 
				+and situations, ignoring the data is acceptable. However, if the amount of
			
 
				+delayed data is too great or the situation calls for it, the next course of
			
 
				+action to consider is to increase the `query_delay` of the datafeed. This
			
 
				+increased delay allows more time for data to be indexed. If you have real-time
			
 
				+constraints, however, an increased delay might not be desirable. In that case,
			
 
				+you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].
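
The detection steps described under "How do we detect delayed data?" can be sketched in Python as follows. This is an illustrative sketch only, not the Elasticsearch implementation: the function names `check_interval` and `find_delayed_buckets` and the dictionaries of per-bucket document counts are hypothetical, and the sketch assumes that the check window is aligned to bucket boundaries.

```python
from datetime import datetime, timedelta, timezone


def check_interval(check_window: timedelta) -> timedelta:
    """Return how often the check runs: 15 minutes or check_window, whichever is smaller."""
    return min(timedelta(minutes=15), check_window)


def find_delayed_buckets(
    latest_finalized_end: datetime,
    check_window: timedelta,
    bucket_span: timedelta,
    indexed_counts: dict,
    analyzed_counts: dict,
) -> dict:
    """Return {bucket_start: missing_doc_count} for buckets that gained documents.

    indexed_counts holds the doc_count per bucket found by searching the indices now.
    analyzed_counts holds the count the job saw when it finalized each bucket.
    Both are hypothetical inputs keyed by bucket start time.
    """
    missing = {}
    # The window has length check_window and ends with the latest finalized bucket.
    start = latest_finalized_end - check_window
    while start + bucket_span <= latest_finalized_end:
        now_indexed = indexed_counts.get(start, 0)
        seen_by_job = analyzed_counts.get(start, 0)
        if now_indexed > seen_by_job:
            # Documents arrived after the bucket was analyzed.
            missing[start] = now_indexed - seen_by_job
        start += bucket_span
    return missing


if __name__ == "__main__":
    end = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
    span = timedelta(minutes=30)
    first = end - timedelta(hours=2)
    indexed = {first: 100, first + span: 120}
    analyzed = {first: 100, first + span: 110}
    for bucket_start, count in find_delayed_buckets(
        end, timedelta(hours=2), span, indexed, analyzed
    ).items():
        print(f"{bucket_start.isoformat()}: {count} delayed documents")
```

A non-empty result corresponds to the case where the end user would be notified. An empty result means that no data arrived after analysis within the check window.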