[role="xpack"]
[[ml-delayed-data-detection]]
= Handling delayed data

Delayed data are documents that are indexed late. That is to say, they relate
to a time period that the {dfeed} has already processed.

When you create a {dfeed}, you can specify a
{ref}/ml-put-datafeed.html#ml-put-datafeed-request-body[`query_delay`] setting.
This setting makes the {dfeed} query a fixed interval behind real time, so that
any "late" data in this period is fully indexed before the {dfeed} tries
to gather it. However, if the setting is too low, the {dfeed} may query for
data before it has been indexed and consequently miss those documents.
Conversely, if it is set too high, analysis drifts farther away from real time.
The right balance depends on each use case and the environmental factors of the
cluster.
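
As a minimal sketch of where `query_delay` fits, the following Python snippet
builds a request body for the create {dfeed} API. The {dfeed} ID, job ID, index
name, and the `90s` value are placeholders chosen for illustration, not
recommendations.

```python
import json

# Hypothetical identifiers, used only for illustration.
DATAFEED_ID = "datafeed-example"
JOB_ID = "example-job"


def build_datafeed_body(job_id, indices, query_delay):
    """Return a request body for PUT _ml/datafeeds/<feed_id>."""
    return {
        "job_id": job_id,
        "indices": indices,
        # How far behind real time the datafeed queries, as a time-unit string.
        "query_delay": query_delay,
    }


body = build_datafeed_body(JOB_ID, ["example-logs"], "90s")
print(f"PUT _ml/datafeeds/{DATAFEED_ID}")
print(json.dumps(body, indent=2))
```

A larger `query_delay` gives slow documents more time to arrive, at the cost of
results that lag further behind real time.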

== Why worry about delayed data?

This is a pertinent question. If data are delayed randomly (and consequently
are missing from analysis), the results of certain types of functions are not
significantly affected. In these situations, the effect evens out because the
delayed data is distributed randomly. An example would be a `mean` metric for a
field in a large collection of data. In this case, checking for delayed data
may not provide much benefit. If data are consistently delayed, however,
{anomaly-jobs} with a `low_count` function might produce false positives. In
this situation, it is useful to see whether data arrives after an anomaly is
recorded so that you can determine a next course of action.
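
The difference between the two cases can be illustrated with a small
simulation. This is only a sketch on synthetic data: the distribution, the
sample size, and the 10% loss rate are arbitrary assumptions, and the code does
not reproduce how {anomaly-jobs} model data. It shows that randomly missing
documents barely move a mean, while a count drops in proportion to the loss.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic field values for one bucket; the distribution is an arbitrary choice.
values = [rng.gauss(100.0, 10.0) for _ in range(10_000)]

# Randomly "delay" roughly 10% of the documents so they miss the analysis.
on_time = [v for v in values if rng.random() >= 0.10]

# Relative change in the two statistics caused by the missing documents.
mean_change = abs(mean(on_time) - mean(values)) / mean(values)
count_change = (len(values) - len(on_time)) / len(values)

print(f"relative change in mean:  {mean_change:.4%}")
print(f"relative change in count: {count_change:.4%}")
```

A count-based function sees the full shortfall, which is why consistently
delayed data can look like an unusually low count.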

== How do we detect delayed data?

In addition to the `query_delay` field, there is a delayed data check
configuration (`delayed_data_check_config`), which enables you to configure the
{dfeed} to look in the past for delayed data. Every 15 minutes or every
`check_window`, whichever is smaller, the {dfeed} triggers a document search
over the configured indices. This search covers a time span with a length of
`check_window`, ending with the latest finalized bucket. That time span is
partitioned into buckets whose length equals the bucket span of the associated
{anomaly-job}. The `doc_count` of each bucket is then compared with the job's
finalized analysis buckets to see whether any data has arrived since the
analysis. If data is indeed missing because of ingest delay, the end user is
notified. For example, you can see annotations in {kib} for the periods where
these delays occur.
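
The comparison described above can be sketched roughly as follows. This is a
simplified illustration, not the actual implementation: the function names, the
dictionary-based bucket representation, and the sample counts are all
hypothetical.

```python
from datetime import timedelta


def check_interval(check_window: timedelta) -> timedelta:
    """Run every 15 minutes or every check_window, whichever is smaller."""
    return min(timedelta(minutes=15), check_window)


def find_delayed_buckets(finalized_counts, current_counts):
    """Return the buckets that gained documents after they were analyzed.

    Both arguments map a bucket start time to a document count:
    finalized_counts holds what the job saw when it finalized each bucket,
    current_counts holds what a fresh search of the indices returns now.
    """
    delayed = {}
    for bucket_start, seen_at_analysis in finalized_counts.items():
        now_indexed = current_counts.get(bucket_start, 0)
        missing = now_indexed - seen_at_analysis
        if missing > 0:
            delayed[bucket_start] = missing
    return delayed


# Hypothetical counts for three consecutive buckets.
finalized = {"10:00": 120, "10:15": 95, "10:30": 110}
current = {"10:00": 120, "10:15": 130, "10:30": 112}

print(check_interval(timedelta(hours=2)))
print(find_delayed_buckets(finalized, current))
```

Any bucket reported by such a comparison corresponds to a period where
documents arrived after the analysis had already moved on.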

== What to do about delayed data?

The most common course of action is simply to do nothing. For many functions
and situations, ignoring the data is acceptable. However, if the amount of
delayed data is too great or the situation calls for it, the next course of
action to consider is to increase the `query_delay` of the {dfeed}. This
increased delay allows more time for data to be indexed. If you have real-time
constraints, however, an increased delay might not be desirable. In that case,
you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].
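
For illustration, increasing `query_delay` on an existing {dfeed} amounts to
sending a new value to the update {dfeed} API. The snippet below only builds
and prints the request. The {dfeed} ID and the `5m` value are hypothetical
placeholders.

```python
import json

DATAFEED_ID = "datafeed-example"  # hypothetical datafeed ID


def build_query_delay_update(query_delay):
    """Return a request body for POST _ml/datafeeds/<feed_id>/_update."""
    return {"query_delay": query_delay}


update = build_query_delay_update("5m")
print(f"POST _ml/datafeeds/{DATAFEED_ID}/_update")
print(json.dumps(update))
```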