@@ -1,73 +1,17 @@
[role="xpack"]
[[ml-dataframes]]
-= {transforms-cap}
+= Transforming data

[partintro]
--

-beta[]
-
-{es} aggregations are a powerful and flexible feature that enable you to
-summarize and retrieve complex insights about your data. You can summarize
-complex things like the number of web requests per day on a busy website, broken
-down by geography and browser type. If you use the same data set to try to
-calculate something as simple as a single number for the average duration of
-visitor web sessions, however, you can quickly run out of memory.
-
-Why does this occur? A web session duration is an example of a behavioral
-attribute not held on any one log record; it has to be derived by finding the
-first and last records for each session in our weblogs. This derivation requires
-some complex query expressions and a lot of memory to connect all the data
-points. If you have an ongoing background process that fuses related events from
-one index into entity-centric summaries in another index, you get a more useful,
-joined-up picture--this is essentially what _{dataframes}_ are.
-
-
-[discrete]
-[[ml-dataframes-usage]]
-== When to use {dataframes}
-
-You might want to consider using {dataframes} instead of aggregations when:
-
-* You need a complete _feature index_ rather than a top-N set of items.
-+
-In {ml}, you often need a complete set of behavioral features rather just the
-top-N. For example, if you are predicting customer churn, you might look at
-features such as the number of website visits in the last week, the total number
-of sales, or the number of emails sent. The {stack} {ml-features} create models
-based on this multi-dimensional feature space, so they benefit from full feature
-indices ({dataframes}).
-+
-This scenario also applies when you are trying to search across the results of
-an aggregation or multiple aggregations. Aggregation results can be ordered or
-filtered, but there are
-{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
-and
-{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
-is constrained by the maximum number of buckets returned. If you want to search
-all aggregation results, you need to create the complete {dataframe}. If you
-need to sort or filter the aggregation results by multiple fields, {dataframes}
-are particularly useful.
-
-* You need to sort aggregation results by a pipeline aggregation.
-+
-{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
-for sorting. Technically, this is because pipeline aggregations are run during
-the reduce phase after all other aggregations have already completed. If you
-create a {dataframe}, you can effectively perform multiple passes over the data.
-
-* You want to create summary tables to optimize queries.
-+
-For example, if you
-have a high level dashboard that is accessed by a large number of users and it
-uses a complex aggregation over a large dataset, it may be more efficient to
-create a {dataframe} to cache results. Thus, each user doesn't need to run the
-aggregation query.
-
-Though there are multiple ways to create {dataframes}, this content pertains
-to one specific method: _{transforms}_.
+{transforms-cap} enable you to convert existing {es} indices into summarized
+indices, which provide opportunities for new insights and analytics. For example,
+you can use {transforms} to pivot your data into entity-centric indices that
+summarize the behavior of users or sessions or other entities in your data.

* <<ml-transform-overview>>
+* <<ml-transforms-usage>>
* <<df-api-quickref>>
* <<dataframe-examples>>
* <<dataframe-troubleshooting>>
@@ -75,6 +19,7 @@ to one specific method: _{transforms}_.
--

include::overview.asciidoc[]
+include::usage.asciidoc[]
include::checkpoints.asciidoc[]
include::api-quickref.asciidoc[]
include::dataframe-examples.asciidoc[]