
[role="xpack"]
[[ml-configuring-populations]]
= Performing population analysis

Entities or events in your data can be considered anomalous when:

* Their behavior changes over time, relative to their own previous behavior, or
* Their behavior is different from that of other entities in a specified
population.

The latter method of detecting anomalies is known as _population analysis_. The
{ml} analytics build a profile of what a "typical" user, machine, or other
entity does over a specified time period and then identify when one is behaving
abnormally compared to the population.

This type of analysis is most useful when the behavior of the population as a
whole is mostly homogeneous and you want to identify unusual behavior. In
general, population analysis is not useful when members of the population
inherently have vastly different behavior. You can, however, segment your data
into groups that behave similarly and run these as separate jobs. For example,
you can use a query filter in the {dfeed} to segment your data or you can use
the `partition_field_name` to split the analysis for the different groups.
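
As a sketch of the second option, the following hypothetical job splits the
analysis by a `department` field. Both the job ID and the field name here are
illustrative assumptions, not part of any sample data set:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/partitioned-analysis
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytes",
        "partition_field_name": "department" <1>
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> A separate model is created for each distinct value of `department`, so
groups with very different behavior do not distort each other's baselines.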

Population analysis scales well and has a lower resource footprint than
individual analysis of each series. For example, you can analyze populations
of hundreds of thousands or millions of entities.

To specify the population, use the `over_field_name` property. For example:
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/population
{
  "description": "Population analysis",
  "analysis_config": {
    "bucket_span": "15m",
    "influencers": [
      "clientip"
    ],
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytes",
        "over_field_name": "clientip" <1>
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> This `over_field_name` property indicates that the metrics for each client
(as identified by their IP address) are analyzed relative to other clients
in each bucket.
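
If you are supplying data directly rather than using a {dfeed}, you can open
the job and post data through the APIs. The following is a minimal sketch
using the `population` job from the example above; the document shown is
illustrative only:

[source,console]
----------------------------------
POST _ml/anomaly_detectors/population/_open

POST _ml/anomaly_detectors/population/_data
{"timestamp": 1518762000000, "clientip": "192.168.0.1", "bytes": 1024}
----------------------------------
// TEST[skip:needs-licence]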

If your data is stored in {es}, you can use the population job wizard in {kib}
to create an {anomaly-job} with these same properties. For example, if you add
the sample web logs in {kib}, you can use the following job settings in the
population job wizard:

[role="screenshot"]
image::images/ml-population-job.png["Job settings in the population job wizard"]

After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example, you can view the results in the
**Anomaly Explorer**:

[role="screenshot"]
image::images/ml-population-results.png["Population analysis results in the Anomaly Explorer"]

As in this case, the results are often quite sparse. There might be just a few
data points for the selected time period. Population analysis is particularly
useful when you have many entities and the data for specific entities is
sporadic or sparse.
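
You can also retrieve the underlying anomaly records through the API instead
of {kib}. The following sketch requests the records for the `population` job,
sorted so that the highest-scoring anomalies come first; the sort choice is an
arbitrary example:

[source,console]
----------------------------------
GET _ml/anomaly_detectors/population/results/records
{
  "sort": "record_score",
  "desc": true
}
----------------------------------
// TEST[skip:needs-licence]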

If you click on a section in the timeline or swim lanes, you can see more
details about the anomalies:

[role="screenshot"]
image::images/ml-population-anomaly.png["Anomaly details for a specific user"]

In this example, the client IP address `30.156.16.164` received a low volume of
bytes on the date and time shown. This event is anomalous because the mean
value is three times lower than the expected value for the population.