6 years ago · 769ff81548
--- a/docs/reference/transform/index.asciidoc
+++ b/docs/reference/transform/index.asciidoc
@@ -1,73 +1,17 @@
 [role="xpack"]
 [[ml-dataframes]]
-= {transforms-cap}
+= Transforming data
 
 [partintro]
 --
 
-beta[]
-
-{es} aggregations are a powerful and flexible feature that enable you to
-summarize and retrieve complex insights about your data. You can summarize
-complex things like the number of web requests per day on a busy website, broken
-down by geography and browser type. If you use the same data set to try to
-calculate something as simple as a single number for the average duration of
-visitor web sessions, however, you can quickly run out of memory.
-
-Why does this occur? A web session duration is an example of a behavioral
-attribute not held on any one log record; it has to be derived by finding the
-first and last records for each session in our weblogs. This derivation requires
-some complex query expressions and a lot of memory to connect all the data
-points. If you have an ongoing background process that fuses related events from
-one index into entity-centric summaries in another index, you get a more useful,
-joined-up picture--this is essentially what _{dataframes}_ are.
-
-
-[discrete]
-[[ml-dataframes-usage]]
-== When to use {dataframes}
-
-You might want to consider using {dataframes} instead of aggregations when:
-
-* You need a complete _feature index_ rather than a top-N set of items.
-+
-In {ml}, you often need a complete set of behavioral features rather just the
-top-N. For example, if you are predicting customer churn, you might look at
-features such as the number of website visits in the last week, the total number
-of sales, or the number of emails sent. The {stack} {ml-features} create models
-based on this multi-dimensional feature space, so they benefit from full feature
-indices ({dataframes}).
-+
-This scenario also applies when you are trying to search across the results of
-an aggregation or multiple aggregations. Aggregation results can be ordered or
-filtered, but there are
-{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
-and
-{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
-is constrained by the maximum number of buckets returned. If you want to search
-all aggregation results, you need to create the complete {dataframe}. If you
-need to sort or filter the aggregation results by multiple fields, {dataframes}
-are particularly useful.
-
-* You need to sort aggregation results by a pipeline aggregation.
-+
-{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
-for sorting. Technically, this is because pipeline aggregations are run during
-the reduce phase after all other aggregations have already completed. If you
-create a {dataframe}, you can effectively perform multiple passes over the data.
-
-* You want to create summary tables to optimize queries.
-+
-For example, if you
-have a high level dashboard that is accessed by a large number of users and it
-uses a complex aggregation over a large dataset, it may be more efficient to
-create a {dataframe} to cache results. Thus, each user doesn't need to run the
-aggregation query.
-
-Though there are multiple ways to create {dataframes}, this content pertains
-to one specific method: _{transforms}_.
+{transforms-cap} enable you to convert existing {es} indices into summarized
+indices, which provide opportunities for new insights and analytics. For example,
+you can use {transforms} to pivot your data into entity-centric indices that
+summarize the behavior of users or sessions or other entities in your data.
 
 * <<ml-transform-overview>>
+* <<ml-transforms-usage>>
 * <<df-api-quickref>>
 * <<dataframe-examples>>
 * <<dataframe-troubleshooting>>
@@ -75,6 +19,7 @@ to one specific method: _{transforms}_.
 --
 
 include::overview.asciidoc[]
+include::usage.asciidoc[]
 include::checkpoints.asciidoc[]
 include::api-quickref.asciidoc[]
 include::dataframe-examples.asciidoc[]
--- a/docs/reference/transform/troubleshooting.asciidoc
+++ b/docs/reference/transform/troubleshooting.asciidoc
@@ -1,3 +1,5 @@
+[role="xpack"]
+[testenv="basic"]
 [[dataframe-troubleshooting]]
 == Troubleshooting {transforms}
 [subs="attributes"]
--- a/docs/reference/transform/usage.asciidoc
+++ b/docs/reference/transform/usage.asciidoc
@@ -0,0 +1,56 @@
+[role="xpack"]
+[testenv="basic"]
+[[ml-transforms-usage]]
+== When to use {transforms}
+
+{es} aggregations are a powerful and flexible feature that enables you to
+summarize and retrieve complex insights about your data. You can summarize
+complex things like the number of web requests per day on a busy website, broken
+down by geography and browser type. If you use the same data set to try to
+calculate something as simple as a single number for the average duration of
+visitor web sessions, however, you can quickly run out of memory.
+
+Why does this occur? A web session duration is an example of a behavioral
+attribute not held on any one log record; it has to be derived by finding the
+first and last records for each session in our weblogs. This derivation requires
+some complex query expressions and a lot of memory to connect all the data
+points. If you have an ongoing background process that fuses related events from
+one index into entity-centric summaries in another index, you get a more useful,
+joined-up picture. This new index is sometimes referred to as a _{dataframe}_.
+
+You might want to consider using {transforms} instead of aggregations when:
+
+* You need a complete _feature index_ rather than a top-N set of items.
++
+In {ml}, you often need a complete set of behavioral features rather than just the
+top-N. For example, if you are predicting customer churn, you might look at
+features such as the number of website visits in the last week, the total number
+of sales, or the number of emails sent. The {stack} {ml-features} create models
+based on this multi-dimensional feature space, so they benefit from the full
+feature indices that are created by {transforms}.
++
+This scenario also applies when you are trying to search across the results of
+an aggregation or multiple aggregations. Aggregation results can be ordered or
+filtered, but there are
+{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
+and
+{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
+is constrained by the maximum number of buckets returned. If you want to search
+all aggregation results, you need to create the complete {dataframe}. If you
+need to sort or filter the aggregation results by multiple fields, {transforms}
+are particularly useful.
+
+* You need to sort aggregation results by a pipeline aggregation.
++
+{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
+for sorting. Technically, this is because pipeline aggregations are run during
+the reduce phase after all other aggregations have already completed. If you
+create a {transform}, you can effectively perform multiple passes over the data.
+
+* You want to create summary tables to optimize queries.
++
+For example, if you
+have a high-level dashboard that is accessed by a large number of users and it
+uses a complex aggregation over a large dataset, it may be more efficient to
+create a {transform} to cache results. Thus, each user doesn't need to run the
+aggregation query.
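
Note on the new usage.asciidoc text: the web-session example describes deriving a behavioral attribute (session duration) from the first and last log records of each session, then working from the entity-centric summaries. The following is a minimal Python sketch of that idea only, not of how {transforms} are implemented. The field names `session_id` and `timestamp` and both function names are hypothetical, chosen for illustration.

```python
from datetime import datetime


def summarize_sessions(events):
    """Fuse per-request log records into one summary per session.

    Each event is a dict with hypothetical keys "session_id" and
    "timestamp" (ISO 8601 string). Returns a dict keyed by session id.
    """
    bounds = {}
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        sid = event["session_id"]
        # Track the earliest and latest record seen for this session.
        first, last, count = bounds.get(sid, (ts, ts, 0))
        bounds[sid] = (min(first, ts), max(last, ts), count + 1)
    return {
        sid: {
            "requests": count,
            "duration_seconds": (last - first).total_seconds(),
        }
        for sid, (first, last, count) in bounds.items()
    }


def average_duration(summaries):
    """Average session duration computed from the summaries alone."""
    if not summaries:
        return 0.0
    total = sum(s["duration_seconds"] for s in summaries.values())
    return total / len(summaries)


events = [
    {"session_id": "a", "timestamp": "2020-01-01T10:00:00"},
    {"session_id": "b", "timestamp": "2020-01-01T10:01:00"},
    {"session_id": "a", "timestamp": "2020-01-01T10:05:00"},
]
summaries = summarize_sessions(events)
```

Once the per-session summaries exist, a question such as the average session duration is answered from the summaries rather than by rescanning every raw log record, which is the benefit the text attributes to entity-centric indices.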