[role="xpack"]
[testenv="basic"]
[[dataframe-examples]]
== {dataframe-transform-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++

beta[]

These examples demonstrate how to use {dataframe-transforms} to derive useful
insights from your data. All the examples use one of the
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
step-by-step example, see
<<ecommerce-dataframes,Transforming your data with {dataframes}>>.

* <<ecommerce-dataframes>>
* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>

include::ecommerce-example.asciidoc[]

[[example-best-customers]]
=== Finding your best customers

In this example, we use the eCommerce orders sample dataset to find the
customers who spent the most in our hypothetical webshop. Let's transform the
data such that the destination index contains the number of orders, the total
price of the orders, the average price per order, the average number of unique
products per order, and the total number of unique products for each customer.

[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : { <1>
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { <2>
      "user": { "terms": { "field": "user" }},
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<2> Two `group_by` fields have been selected. This means the {dataframe} will
contain a unique row per `user` and `customer_id` combination. Within this
dataset, both of these fields are unique. Including both in the {dataframe}
gives more context to the final results.

NOTE: In the example above, condensed JSON formatting is used for better
readability of the pivot object.

The preview {dataframe-transforms} API enables you to see the layout of the
{dataframe} in advance, populated with some sample values. For example:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "total_order_amt" : 3946.9765625,
      "order_count" : 59.0,
      "total_unique_products" : 116.0,
      "avg_unique_products_per_order" : 2.0,
      "customer_id" : "10",
      "user" : "recip",
      "avg_amt_per_order" : 66.89790783898304
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE
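
Once you are satisfied with the preview, you can create and start the
{dataframe-transform}. The following is a minimal sketch that reuses the
configuration shown above; the transform ID `ecommerce_customer_transform` is
an arbitrary example name, not part of the sample dataset:

[source,console]
----------------------------------
PUT _data_frame/transforms/ecommerce_customer_transform <1>
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : {
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": {
      "user": { "terms": { "field": "user" }},
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> The transform ID is an arbitrary example name. Unlike `_preview`, the
`dest` index is used when the transform runs.

You can then start the transform:

[source,console]
----------------------------------
POST _data_frame/transforms/ecommerce_customer_transform/_start
----------------------------------
// TEST[skip:setup kibana sample data]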

This {dataframe} makes it easier to answer questions such as:

* Which customers spend the most?

* Which customers spend the most per order?

* Which customers order most often?

* Which customers ordered the least number of different products?

It's possible to answer these questions using aggregations alone; however,
{dataframes} allow us to persist this data as a customer-centric index. This
enables us to analyze data at scale and gives more flexibility to explore and
navigate data from a customer-centric perspective. In some cases, it can even
make creating visualizations much simpler.
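
For example, assuming the transform above has been created and started, a
search like the following sketch returns the five customers with the highest
total spend from the destination index:

[source,console]
----------------------------------
GET sample_ecommerce_orders_by_customer/_search
{
  "size": 5,
  "sort": [
    { "total_order_amt": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]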

[[example-airline]]
=== Finding air carriers with the most delays

In this example, we use the Flights sample dataset to find out which air carrier
had the most delays. First, we filter the source data such that it excludes all
the cancelled flights by using a query filter. Then we transform the data to
contain the number of flights, the sum of delayed minutes, and the sum of the
flight minutes by air carrier. Finally, we use a
{ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_script`]
to determine what percentage of the flight time was taken up by delays.

[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { <1>
      "bool": {
        "filter": [
          { "term": { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { <3>
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { <4>
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> Filter the source data to select only flights that were not cancelled.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `Carrier` field which contains the airline name.
<4> This `bucket_script` performs calculations on the results that are returned
by the aggregation. In this particular example, it calculates what percentage of
travel time was taken up by delays.

The preview shows you that the new index would contain data like this for each
carrier:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "carrier" : "ES-Air",
      "flights_count" : 2802.0,
      "flight_mins_total" : 1436927.5130677223,
      "delay_time_percentage" : 9.335543983955839,
      "delay_mins_total" : 134145.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE

This {dataframe} makes it easier to answer questions such as:

* Which air carrier has the most delays as a percentage of flight time?

NOTE: This data is fictional and does not reflect actual delays
or flight stats for any of the featured destination or origin airports.
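
Assuming the transform has been created and started, a sketch like this sorts
the destination index by delay percentage to surface the carriers with the
most delays:

[source,console]
----------------------------------
GET sample_flight_delays_by_carrier/_search
{
  "sort": [
    { "delay_time_percentage": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]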


[[example-clientips]]
=== Finding suspicious client IPs by using scripted metrics

With {dataframe-transforms}, you can use
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[scripted
metric aggregations] on your data. These aggregations are flexible and make
it possible to perform very complex processing. Let's use scripted metrics to
identify suspicious client IPs in the web log sample dataset.

We transform the data such that the new index contains the sum of bytes and the
number of distinct URLs, agents, source locations, and geographic destinations
for each client IP. We also use a scripted metric aggregation to count the
specific types of HTTP responses that each client IP receives. Ultimately, the
example below transforms web log data into an entity-centric index where the
entity is `clientip`.

[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_logs",
    "query": { <1>
      "range" : {
        "timestamp" : {
          "gte" : "now-30d/d"
        }
      }
    }
  },
  "dest" : { <2>
    "index" : "sample_weblogs_by_clientip"
  },
  "pivot": {
    "group_by": { <3>
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "responses.counts": { <4>
        "scripted_metric": {
          "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
          "map_script": """
            def code = doc['response.keyword'].value;
            if (code.startsWith('5') || code.startsWith('4')) {
              state.responses.error += 1;
            } else if (code.startsWith('2')) {
              state.responses.success += 1;
            } else {
              state.responses.other += 1;
            }
            """,
          "combine_script": "state.responses",
          "reduce_script": """
            def counts = ['error': 0L, 'success': 0L, 'other': 0L];
            for (responses in states) {
              counts.error += responses['error'];
              counts.success += responses['success'];
              counts.other += responses['other'];
            }
            return counts;
            """
        }
      },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { <5>
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup kibana sample data]

<1> This range query limits the transform to documents that are within the last
30 days at the point in time the {dataframe-transform} checkpoint is processed.
For batch {dataframes} this occurs once.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `clientip` field.
<4> This `scripted_metric` performs a distributed operation on the web log data
to count specific types of HTTP responses (error, success, and other).
<5> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.

The preview shows you that the new index would contain data like this for each
client IP:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "geo" : {
        "src_dc" : 12.0,
        "dest_dc" : 9.0
      },
      "clientip" : "0.72.176.46",
      "agent_dc" : 3.0,
      "responses" : {
        "total" : 14.0,
        "counts" : {
          "other" : 0,
          "success" : 14,
          "error" : 0
        }
      },
      "bytes_sum" : 74808.0,
      "timestamp" : {
        "duration_ms" : 4.919943239E9,
        "min" : "2019-06-17T07:51:57.333Z",
        "max" : "2019-08-13T06:31:00.572Z"
      },
      "url_dc" : 11.0
    },
    ...
  ]
}
----------------------------------
// NOTCONSOLE

This {dataframe} makes it easier to answer questions such as:

* Which client IPs are transferring the most data?

* Which client IPs are interacting with a high number of different URLs?

* Which client IPs have high error rates?

* Which client IPs are interacting with a high number of destination countries?
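
For instance, assuming the transform has been created and started and the
destination index has numeric mappings for the response counts, the following
sketch finds the client IPs with the most error responses:

[source,console]
----------------------------------
GET sample_weblogs_by_clientip/_search
{
  "query": {
    "range": {
      "responses.counts.error": { "gt": 0 }
    }
  },
  "sort": [
    { "responses.counts.error": "desc" }
  ]
}
----------------------------------
// TEST[skip:setup kibana sample data]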