[role="xpack"]
[[ml-dataframes]]
= {transforms-cap}

[partintro]
--

beta[]

{es} aggregations are a powerful and flexible feature that enables you to
summarize and retrieve complex insights about your data. For example, you can
summarize the number of web requests per day on a busy website, broken down by
geography and browser type. If you use the same data set to calculate something
as seemingly simple as a single number for the average duration of visitor web
sessions, however, you can quickly run out of memory.

Why does this occur? A web session duration is an example of a behavioral
attribute that is not held on any one log record; it has to be derived by
finding the first and last records for each session in the web logs. This
derivation requires some complex query expressions and a lot of memory to
connect all the data points. If you instead have an ongoing background process
that fuses related events from one index into entity-centric summaries in
another index, you get a more useful, joined-up picture. This is essentially
what _{dataframes}_ are.
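
To make the session-duration derivation concrete, here is a minimal sketch in plain Python. It does not use any {es} API; the record layout and the `session_id`, `timestamp`, and `url` field names are assumptions made up for illustration. It fuses event-centric log records into one entity-centric summary per session and only then computes the average duration.

```python
from datetime import datetime

# Hypothetical event-centric log records; the field names are illustrative
# assumptions, not a schema taken from the documentation.
logs = [
    {"session_id": "a", "timestamp": "2019-06-01T10:00:00", "url": "/"},
    {"session_id": "b", "timestamp": "2019-06-01T10:01:00", "url": "/"},
    {"session_id": "a", "timestamp": "2019-06-01T10:05:30", "url": "/cart"},
    {"session_id": "b", "timestamp": "2019-06-01T10:01:45", "url": "/help"},
    {"session_id": "a", "timestamp": "2019-06-01T10:12:00", "url": "/checkout"},
]


def summarize_sessions(records):
    """Fuse per-event records into one summary per session."""
    bounds = {}  # session_id -> (first timestamp, last timestamp, event count)
    for record in records:
        ts = datetime.fromisoformat(record["timestamp"])
        first, last, count = bounds.get(record["session_id"], (ts, ts, 0))
        bounds[record["session_id"]] = (min(first, ts), max(last, ts), count + 1)
    return {
        sid: {
            "events": count,
            "duration_seconds": (last - first).total_seconds(),
        }
        for sid, (first, last, count) in bounds.items()
    }


summaries = summarize_sessions(logs)

# The average is cheap once every session has its own summary record.
average = sum(s["duration_seconds"] for s in summaries.values()) / len(summaries)
```

The duration is not present on any input record; it exists only after the first and last events of each session have been connected, which is the work the background process described above would do continuously.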

[discrete]
[[ml-dataframes-usage]]
== When to use {dataframes}

You might want to consider using {dataframes} instead of aggregations when:

* You need a complete _feature index_ rather than a top-N set of items.
+
In {ml}, you often need a complete set of behavioral features rather than just
the top-N. For example, if you are predicting customer churn, you might look at
features such as the number of website visits in the last week, the total number
of sales, or the number of emails sent. The {stack} {ml-features} create models
based on this multi-dimensional feature space, so they benefit from full feature
indices ({dataframes}).
+
This scenario also applies when you are trying to search across the results of
an aggregation or multiple aggregations. Aggregation results can be ordered or
filtered, but there are
{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
and
{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
is constrained by the maximum number of buckets returned. If you want to search
all aggregation results, you need to create the complete {dataframe}. If you
need to sort or filter the aggregation results by multiple fields, {dataframes}
are particularly useful.
* You need to sort aggregation results by a pipeline aggregation.
+
{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
for sorting. Technically, this is because pipeline aggregations are run during
the reduce phase after all other aggregations have already completed. If you
create a {dataframe}, you can effectively perform multiple passes over the data.
* You want to create summary tables to optimize queries.
+
For example, if you have a high-level dashboard that is accessed by a large
number of users and it uses a complex aggregation over a large dataset, it may
be more efficient to create a {dataframe} to cache results. That way, each user
does not need to run the aggregation query.
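
The scenarios above share one idea: once a complete set of entity-centric summaries exists, it can be filtered, sorted, and processed again like any other data. The following plain-Python sketch illustrates that idea only; it does not call any {es} API, and the customer records, field names, and the `with_sales_per_visit` helper are hypothetical.

```python
# Hypothetical per-customer summaries, standing in for documents in a
# destination index. All names and values are illustrative assumptions.
customer_summaries = [
    {"customer": "c1", "visits_last_week": 12, "total_sales": 300.0, "emails_sent": 4},
    {"customer": "c2", "visits_last_week": 3, "total_sales": 950.0, "emails_sent": 1},
    {"customer": "c3", "visits_last_week": 12, "total_sales": 120.0, "emails_sent": 9},
    {"customer": "c4", "visits_last_week": 0, "total_sales": 40.0, "emails_sent": 2},
]


def with_sales_per_visit(rows):
    """Second pass: derive a new value from already-summarized fields."""
    enriched = []
    for row in rows:
        visits = row["visits_last_week"]
        ratio = row["total_sales"] / visits if visits else 0.0
        enriched.append({**row, "sales_per_visit": ratio})
    return enriched


enriched = with_sales_per_visit(customer_summaries)

# Filter and sort on several fields at once: customers with at least one
# visit, most visits first, ties broken by highest total sales.
active = [row for row in enriched if row["visits_last_week"] > 0]
by_visits_then_sales = sorted(
    active, key=lambda row: (-row["visits_last_week"], -row["total_sales"])
)

# Sort on the derived value, which exists only after the second pass.
by_sales_per_visit = sorted(
    enriched, key=lambda row: row["sales_per_visit"], reverse=True
)
```

Because every customer has a summary record, no entity is lost to a top-N cutoff, and the derived `sales_per_visit` value can be used as a sort key just like the stored fields.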

Though there are multiple ways to create {dataframes}, this content pertains
to one specific method: _{transforms}_.

* <<ml-transform-overview>>
* <<df-api-quickref>>
* <<dataframe-examples>>
* <<dataframe-troubleshooting>>
* <<dataframe-limitations>>

--

include::overview.asciidoc[]
include::checkpoints.asciidoc[]
include::api-quickref.asciidoc[]
include::dataframe-examples.asciidoc[]
include::troubleshooting.asciidoc[]
include::limitations.asciidoc[]