12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182 |
- [role="xpack"]
- [[ml-dataframes]]
- = {transforms-cap}
- [partintro]
- --
- beta[]
- {es} aggregations are a powerful and flexible feature that enable you to
- summarize and retrieve complex insights about your data. You can summarize
- complex things like the number of web requests per day on a busy website, broken
- down by geography and browser type. If you use the same data set to try to
- calculate something as simple as a single number for the average duration of
- visitor web sessions, however, you can quickly run out of memory.
- Why does this occur? A web session duration is an example of a behavioral
- attribute not held on any one log record; it has to be derived by finding the
- first and last records for each session in our weblogs. This derivation requires
- some complex query expressions and a lot of memory to connect all the data
- points. If you have an ongoing background process that fuses related events from
- one index into entity-centric summaries in another index, you get a more useful,
- joined-up picture--this is essentially what _{dataframes}_ are.
- [discrete]
- [[ml-dataframes-usage]]
- == When to use {dataframes}
- You might want to consider using {dataframes} instead of aggregations when:
- * You need a complete _feature index_ rather than a top-N set of items.
- +
- In {ml}, you often need a complete set of behavioral features rather just the
- top-N. For example, if you are predicting customer churn, you might look at
- features such as the number of website visits in the last week, the total number
- of sales, or the number of emails sent. The {stack} {ml-features} create models
- based on this multi-dimensional feature space, so they benefit from full feature
- indices ({dataframes}).
- +
- This scenario also applies when you are trying to search across the results of
- an aggregation or multiple aggregations. Aggregation results can be ordered or
- filtered, but there are
- {ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
- and
- {ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
- is constrained by the maximum number of buckets returned. If you want to search
- all aggregation results, you need to create the complete {dataframe}. If you
- need to sort or filter the aggregation results by multiple fields, {dataframes}
- are particularly useful.
- * You need to sort aggregation results by a pipeline aggregation.
- +
- {ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
- for sorting. Technically, this is because pipeline aggregations are run during
- the reduce phase after all other aggregations have already completed. If you
- create a {dataframe}, you can effectively perform multiple passes over the data.
- * You want to create summary tables to optimize queries.
- +
- For example, if you
- have a high level dashboard that is accessed by a large number of users and it
- uses a complex aggregation over a large dataset, it may be more efficient to
- create a {dataframe} to cache results. Thus, each user doesn't need to run the
- aggregation query.
- Though there are multiple ways to create {dataframes}, this content pertains
- to one specific method: _{transforms}_.
- * <<ml-transform-overview>>
- * <<df-api-quickref>>
- * <<dataframe-examples>>
- * <<dataframe-troubleshooting>>
- * <<dataframe-limitations>>
- --
- include::overview.asciidoc[]
- include::checkpoints.asciidoc[]
- include::api-quickref.asciidoc[]
- include::dataframe-examples.asciidoc[]
- include::troubleshooting.asciidoc[]
- include::limitations.asciidoc[]
|