[role="xpack"]
[[ml-delayed-data-detection]]
= Handling delayed data

Delayed data are documents that are indexed late. That is to say, they relate
to a time that your {dfeed} has already processed, so your {anomaly-job} never
analyzes them.

When you create a {dfeed}, you can specify a
{ref}/ml-put-datafeed.html#ml-put-datafeed-request-body[`query_delay`] setting.
This setting makes the {dfeed} run some time behind real time, so that any
"late" data in this period is fully indexed before the {dfeed} tries to gather
it. However, if the delay is too short, the {dfeed} may query for data before
it has been indexed and consequently miss those documents. Conversely, if the
delay is too long, analysis drifts farther away from real time. The right
balance depends on your use case and on environmental factors in the cluster.
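
The trade-off can be illustrated with a deliberately simplified timing model.
The following sketch assumes that each bucket is queried exactly once, at the
bucket end plus `query_delay`, and that a document becomes searchable at its
event time plus its ingest latency. The function name and the timestamps are
hypothetical, and the real {dfeed} scheduling is more involved.

```python
from datetime import datetime, timedelta


def is_seen_by_datafeed(event_time, ingest_latency, bucket_end, query_delay):
    """Return True if the document is searchable when its bucket is queried.

    Simplified model (an assumption, not the actual datafeed scheduler): the
    bucket ending at `bucket_end` is queried once, at `bucket_end + query_delay`,
    and a document becomes searchable at `event_time + ingest_latency`.
    """
    searchable_at = event_time + ingest_latency
    queried_at = bucket_end + query_delay
    return searchable_at <= queried_at


bucket_end = datetime(2024, 1, 1, 12, 15)
event_time = datetime(2024, 1, 1, 12, 14, 30)  # falls in the bucket ending 12:15
latency = timedelta(seconds=90)                # searchable at 12:16:00

# With a 30s delay the bucket is queried at 12:15:30, before the document is searchable.
print(is_seen_by_datafeed(event_time, latency, bucket_end, timedelta(seconds=30)))   # False
# With a 120s delay the bucket is queried at 12:17:00, after the document is searchable.
print(is_seen_by_datafeed(event_time, latency, bucket_end, timedelta(seconds=120)))  # True
```

In this model a longer delay never loses documents, but every result arrives
later by the same amount.
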

IMPORTANT: If you get an error that says
`Datafeed missed XXXX documents due to ingest latency`, consider increasing
the value of `query_delay`. If that doesn't help, investigate the ingest latency
and its cause by comparing event and ingest timestamps. High latency is often
caused by bursts of ingested documents, misconfiguration of the ingest
pipeline, or misalignment of system clocks.
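
As a rough sketch of that comparison, the following snippet computes the ingest
latency for a handful of documents. It assumes that each document carries both
an event timestamp and an ingest timestamp as ISO 8601 strings. The field name
`ingested_at` and the sample documents are hypothetical, so substitute whatever
your ingest pipeline records.

```python
from datetime import datetime
from statistics import median


def ingest_latencies(docs, event_field="@timestamp", ingest_field="ingested_at"):
    """Return the ingest latency in seconds for each document.

    Both fields are expected to hold ISO 8601 strings. The field names are
    assumptions; use whatever your documents and ingest pipeline provide.
    """
    return [
        (datetime.fromisoformat(d[ingest_field])
         - datetime.fromisoformat(d[event_field])).total_seconds()
        for d in docs
    ]


# Hypothetical sample documents.
docs = [
    {"@timestamp": "2024-01-01T12:00:00+00:00", "ingested_at": "2024-01-01T12:00:05+00:00"},
    {"@timestamp": "2024-01-01T12:00:10+00:00", "ingested_at": "2024-01-01T12:00:40+00:00"},
    {"@timestamp": "2024-01-01T12:00:20+00:00", "ingested_at": "2024-01-01T12:03:20+00:00"},
]

latencies = ingest_latencies(docs)
print("median latency (s):", median(latencies))  # 30.0
print("max latency (s):", max(latencies))        # 180.0
```

A large gap between the median and the maximum suggests occasional bursts,
whereas a uniformly high latency points more toward pipeline configuration or
clock misalignment.
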

== Why worry about delayed data?

If data are delayed randomly (and consequently are missing from analysis), the
results of certain types of functions are barely affected, because the missing
data is distributed randomly and its effect averages out. An example is a
`mean` metric for a field in a large collection of data. In this case, checking
for delayed data may not provide much benefit.

If data are consistently delayed, however, {anomaly-jobs} with a `low_count`
function may produce false positives. In this situation, it is useful to see
whether data arrives after an anomaly is recorded so that you can determine a
next course of action.
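
The difference between the two cases can be seen in a small simulation. This
sketch uses synthetic values and an arbitrary 10% loss rate, both of which are
assumptions chosen only for illustration. It shows that randomly missing
documents barely move a mean, while the document count drops in proportion to
the missing data.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic values for one field (illustrative data only).
values = [random.gauss(100.0, 10.0) for _ in range(100_000)]

# Randomly delay about 10% of the documents so that the analysis never sees them.
seen = [v for v in values if random.random() >= 0.10]

mean_all = sum(values) / len(values)
mean_seen = sum(seen) / len(seen)
mean_shift = abs(mean_all - mean_seen)
count_ratio = len(seen) / len(values)

print(f"shift in mean: {mean_shift:.4f}")                # small compared with a mean of about 100
print(f"fraction of documents seen: {count_ratio:.3f}")  # about 0.9
```

A count-based function sees each bucket as emptier than it really is, which is
the kind of gap that a `low_count` detector can flag.
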

== How do we detect delayed data?

In addition to the `query_delay` field, there is a delayed data check config,
which enables you to configure the datafeed to look in the past for delayed data.
Every 15 minutes or every `check_window`, whichever is smaller, the datafeed
triggers a document search over the configured indices. This search covers a
time span with a length of `check_window` ending with the latest finalized bucket.
That time span is partitioned into buckets whose length equals the bucket span
of the associated {anomaly-job}. The `doc_count` of each of those buckets is then
compared with the job's finalized analysis buckets to see whether any data has
arrived since the analysis. If data is indeed missing because of ingest delay,
the end user is notified. For example, you can see annotations in {kib} for the
periods where these delays occur:

[role="screenshot"]
image::images/ml-annotations.png["Delayed data annotations in the Single Metric Viewer"]
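
The comparison described above can be sketched as follows. This is an
illustration of the idea, not the actual implementation: the function names are
hypothetical, times are plain epoch seconds, and `check_window` is assumed to
be a whole multiple of the bucket span.

```python
def check_window_buckets(latest_finalized_end, check_window, bucket_span):
    """Split the check window into (start, end) buckets, in epoch seconds.

    The window has length `check_window` and ends at the end of the latest
    finalized bucket; each bucket is `bucket_span` seconds long.
    """
    start = latest_finalized_end - check_window
    return [(t, t + bucket_span) for t in range(start, latest_finalized_end, bucket_span)]


def find_delayed_buckets(analyzed_counts, current_counts):
    """Return {bucket_start: missing_docs} for buckets that gained documents.

    analyzed_counts: doc counts the job saw when it finalized each bucket.
    current_counts:  doc counts returned by a fresh search of the indices.
    """
    return {
        start: current_counts[start] - seen
        for start, seen in analyzed_counts.items()
        if current_counts.get(start, 0) > seen
    }


# Hypothetical numbers: a 2-hour check window and a 15-minute bucket span.
buckets = check_window_buckets(latest_finalized_end=7200, check_window=7200, bucket_span=900)
analyzed = {start: 100 for start, _ in buckets}
current = dict(analyzed)
current[1800] = 112  # 12 documents arrived after this bucket was analyzed

print(find_delayed_buckets(analyzed, current))  # {1800: 12}
```

Only buckets whose current count exceeds the analyzed count are reported, so
buckets that were complete at analysis time produce no notification.
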

[IMPORTANT]
====
As the `doc_count` from an aggregation is compared with the
bucket results of the job, the delayed data check will not work correctly in the
following cases:

* if the datafeed uses aggregations and the job's `analysis_config` does not have its
`summary_count_field_name` set to `doc_count`,
* if the datafeed is _not_ using aggregations and `summary_count_field_name` is set to
any value.

If the datafeed is using aggregations then it's highly likely that the job's
`summary_count_field_name` should be set to `doc_count`. If
`summary_count_field_name` is set to any value other than `doc_count`, the
delayed data check for the datafeed must be disabled.
====
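
The two conditions above can be written as a small predicate. This is a sketch
for reasoning about your own configuration: the function name is hypothetical,
and the `enabled` flag under `delayed_data_check_config` reflects the {dfeed}
API as we understand it, so confirm it against the API reference for your
version.

```python
def delayed_data_check_is_reliable(uses_aggregations, summary_count_field_name=None):
    """Apply the two conditions listed above.

    uses_aggregations: whether the datafeed defines aggregations.
    summary_count_field_name: value from the job's analysis_config, or None if unset.
    """
    if uses_aggregations:
        return summary_count_field_name == "doc_count"
    return summary_count_field_name is None


# Datafeed setting that disables the check when it cannot be relied on.
DISABLE_CHECK = {"delayed_data_check_config": {"enabled": False}}

print(delayed_data_check_is_reliable(True, "doc_count"))     # True
print(delayed_data_check_is_reliable(True, "event_count"))   # False
print(delayed_data_check_is_reliable(False, "event_count"))  # False
print(delayed_data_check_is_reliable(False))                 # True
```
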

There is another tool for visualizing the delayed data on the *Annotations* tab
in the {anomaly-detect} job management page:

[role="screenshot"]
image::images/ml-datafeed-chart.png["Delayed data in the {dfeed} chart"]

== What to do about delayed data?

The most common course of action is simply to do nothing. For many functions
and situations, ignoring the data is acceptable. However, if the amount of
delayed data is too great or the situation calls for it, the next course of
action to consider is to increase the `query_delay` of the datafeed. This
increased delay allows more time for data to be indexed. If you have real-time
constraints, however, an increased delay might not be desirable. In that case,
you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].
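
If you decide to increase `query_delay`, one possible starting point is to
derive the new value from observed ingest latencies. The heuristic below (the
maximum observed latency plus a fixed margin) is an assumption made for
illustration, not an official recommendation, and the function name and sample
latencies are hypothetical. The resulting body could then be sent with the
update {dfeed} API.

```python
import json
import math


def suggest_query_delay(latencies_seconds, margin_seconds=30):
    """Suggest a `query_delay` that covers the slowest observed ingest latency.

    Heuristic for illustration only: maximum observed latency plus a margin,
    rounded up to a whole second.
    """
    return f"{math.ceil(max(latencies_seconds) + margin_seconds)}s"


# Hypothetical observed latencies, in seconds.
update_body = {"query_delay": suggest_query_delay([5.0, 30.0, 180.0])}
print(json.dumps(update_body))  # {"query_delay": "210s"}
```

A larger margin lowers the chance of missed documents but moves analysis
farther from real time, so weigh it against your latency requirements.
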