[DOCS] Adds further details and an example to how transform checkpointing works (#71615)

István Zoltán Szabó 4 years ago
parent
commit
51fe73081d
1 changed files with 35 additions and 11 deletions

+ 35 - 11
docs/reference/transform/checkpoints.asciidoc

@@ -10,7 +10,8 @@ destination index, it generates a _checkpoint_.
 
 If your {transform} runs only once, there is logically only one checkpoint. If 
 your {transform} runs continuously, however, it creates checkpoints as it 
-ingests and transforms new source data.
+ingests and transforms new source data. The `sync` property of the {transform} 
+configures checkpointing by specifying a time field.
 
 To create a checkpoint, the {ctransform}:
 
@@ -22,21 +23,25 @@ indices. This check is done based on the interval defined in the transform's
 +
 If the source indices remain unchanged or if a checkpoint is already in progress
 then it waits for the next timer.
++
+If changes are found, a checkpoint is created.
 
-. Identifies which entities have changed.
+. Identifies which entities and/or time buckets have changed.
 +
-The {transform} searches to see which entities have changed since the last time 
-it checked. The `sync` configuration object in the {transform} identifies a time 
-field in the source indices. The {transform} uses the values in that field to 
-synchronize the source and destination indices.
+The {transform} searches to see which entities or time buckets have changed 
+between the last and the new checkpoint. The {transform} uses this information
+to synchronize the source and destination indices with fewer operations than a
+full re-run.
  
-. Updates the destination index (the {dataframe}) with the changed entities.
+. Updates the destination index (the {dataframe}) with the changes.
 +
 --
-The {transform} applies changes related to either new or changed entities to the 
-destination index. The set of changed entities is paginated. For each page, the 
-{transform} performs a composite aggregation using a `terms` query. After all 
-the pages of changes have been applied, the checkpoint is complete.
+The {transform} applies changes related to either new or changed entities or
+time buckets to the destination index. The set of changes can be paginated. The
+{transform} performs a composite aggregation similar to the batch {transform} 
+operation; however, it also injects query filters based on the previous step to 
+reduce the amount of work. After all changes have been applied, the checkpoint 
+is complete.
 --
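
As a rough illustration of the last step, the sketch below builds a search body that pairs a composite aggregation with a `terms` filter limited to the changed entities. The function name `build_checkpoint_search`, the field `host.name`, the aggregation names, and the page size are assumptions made for this illustration. The requests a {transform} actually issues are not shown in this diff and may differ.

```python
from typing import Any, Dict, List, Optional


def build_checkpoint_search(
    changed_entities: List[str],
    group_field: str = "host.name",  # hypothetical group_by field
    page_size: int = 500,            # hypothetical page size
    after_key: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a search body: a composite aggregation plus an injected filter."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"entity": {"terms": {"field": group_field}}}],
    }
    if after_key is not None:
        # Resume after the last bucket returned by the previous page.
        composite["after"] = after_key

    return {
        "size": 0,  # only aggregation buckets are needed, not search hits
        "query": {
            "bool": {
                # Injected filter: restrict the search to entities that changed.
                "filter": [{"terms": {group_field: changed_entities}}]
            }
        },
        "aggs": {"by_entity": {"composite": composite}},
    }
```

Calling `build_checkpoint_search(["A", "C", "G"])` yields the body for the first page. Passing the `after_key` from a previous response yields the body for the next page.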
 
 This checkpoint process involves both search and indexing activity on the
@@ -49,6 +54,25 @@ support both the composite aggregation search and the indexing of its results.
 TIP: If the cluster experiences unsuitable performance degradation due to the
 {transform}, stop the {transform} and refer to <<transform-performance>>.
 
+
+[discrete]
+[[ml-transform-checkpoint-heuristics]]
+== Change detection heuristics
+
+When the {transform} runs in continuous mode, it updates the documents in the
+destination index as new data comes in. The {transform} uses a set of heuristics
+called change detection to update the destination index with fewer operations.
+
+For example, suppose the data is grouped by host name. Change detection 
+identifies which hosts have changed (for example, hosts `A`, `C`, and `G`) and 
+updates only the documents for those hosts. It does not update documents that 
+store information about hosts `B`, `D`, or any other host that has not changed.
+
+Another heuristic can be applied for time buckets when a `date_histogram` is 
+used to group by time buckets. Change detection detects which time buckets have 
+changed and updates only those.
+
+
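
The two heuristics described in the added section can be sketched in a few lines. This is a simplified model, not the {transform} implementation: the function name `detect_changes`, the document keys `host` and `timestamp`, the use of epoch seconds, and the one-hour bucket width are all assumptions chosen for illustration. It also assumes the same time field drives both synchronization and bucketing.

```python
from typing import Dict, Iterable, Set, Tuple, Union

Doc = Dict[str, Union[str, int]]


def detect_changes(
    docs: Iterable[Doc],
    last_checkpoint: int,        # epoch seconds of the previous checkpoint
    new_checkpoint: int,         # epoch seconds of the new checkpoint
    bucket_seconds: int = 3600,  # hypothetical date_histogram interval
) -> Tuple[Set[str], Set[int]]:
    """Return the hosts and time-bucket start times touched between checkpoints."""
    changed_hosts: Set[str] = set()
    changed_buckets: Set[int] = set()
    for doc in docs:
        ts = int(doc["timestamp"])
        # Only documents that arrived after the last checkpoint count as changes.
        if last_checkpoint < ts <= new_checkpoint:
            changed_hosts.add(str(doc["host"]))
            # Round down to the start of the bucket that contains ts.
            changed_buckets.add(ts - ts % bucket_seconds)
    return changed_hosts, changed_buckets
```

Only the hosts and buckets returned here would need to be recomputed in the destination index. Every other host or bucket can be left untouched.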
 [discrete]
 [[ml-transform-checkpoint-errors]]
 == Error handling