|
@@ -0,0 +1,201 @@
|
|
|
+[role="xpack"]
|
|
|
+[[transform-scale]]
|
|
|
+= Working with {transforms} at scale
|
|
|
+++++
|
|
|
+<titleabbrev>{transforms-cap} at scale</titleabbrev>
|
|
|
+++++
|
|
|
+
|
|
|
+{transforms-cap} convert existing {es} indices into summarized indices, which
|
|
|
+provide opportunities for new insights and analytics. The search and index
|
|
|
+operations performed by {transforms} use standard {es} features so similar
|
|
|
+considerations for working with {es} at scale are often applicable to
|
|
|
+{transforms}. If you experience performance issues, start by identifying the
|
|
|
+bottleneck areas (search, indexing, processing, or storage) then review the
|
|
|
+relevant considerations in this guide to improve performance. It also helps to
|
|
|
+understand how {transforms} work as different considerations apply depending on
|
|
|
+whether or not your transform is running in continuous mode or in batch.
|
|
|
+
|
|
|
+In this guide, you’ll learn how to:
|
|
|
+
|
|
|
+* Understand the impact of configuration options on the performance of
|
|
|
+ {transforms}.
|
|
|
+
|
|
|
+**Prerequisites:**
|
|
|
+
|
|
|
+These guildelines assume you have a {transform} you want to tune, and you’re
|
|
|
+already familiar with:
|
|
|
+
|
|
|
+* <<transform-overview,How {transforms} work>>.
|
|
|
+* <<transform-setup,How to set up {transforms}>>.
|
|
|
+* <<transform-checkpoints,How {transform} checkpoints work in continuous mode>>.
|
|
|
+
|
|
|
+The following considerations are not sequential – the numbers help to navigate
|
|
|
+between the list items; you can take action on one or more of them in any order.
|
|
|
+Most of the recommendations apply to both continuous and batch {transforms}. If
|
|
|
+a list item only applies to one {transform} type, this exception is highlighted
|
|
|
+in the description.
|
|
|
+
|
|
|
+The keywords in parenthesis at the end of each recommendation title indicates
|
|
|
+the bottleneck area that may be improved by following the given recommendation.
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[measure-performance]]
|
|
|
+== Measure {transforms} performance
|
|
|
+
|
|
|
+In order to optimize {transform} performance, start by identifying the areas
|
|
|
+where most work is being done. The **Stats** interface of the
|
|
|
+**{transforms-cap}** page in {kib} contains information that covers three main
|
|
|
+areas: indexing, searching, and processing time (alternatively, you can use the
|
|
|
+<<get-transform-stats, {transforms} stats API>>). If, for example, the results
|
|
|
+show that the highest proportion of time is spent on search, then prioritize
|
|
|
+efforts on optimizing the search query of the {transform}. {transforms-cap} also
|
|
|
+has https://esrally.readthedocs.io[Rally support] that makes it possible to run
|
|
|
+performance checks on {transforms} configurations if it is required. If you
|
|
|
+optimized the crucial factors and you still experience performance issues, you
|
|
|
+may also want to consider improving your hardware.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[frequency]]
|
|
|
+== 1. Optimize `frequency` (index)
|
|
|
+
|
|
|
+In a {ctransform}, the `frequency` configuration option sets the interval
|
|
|
+between checks for changes in the source indices. If changes are detected, then
|
|
|
+the source data is searched and the changes are applied to the destination
|
|
|
+index. Depending on your use case, you may wish to reduce the frequency at which
|
|
|
+changes are applied. By setting `frequency` to a higher value (maximum is one
|
|
|
+hour), the workload can be spread over time at the cost of less up-to-date data.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[increase-shards-dest-index]]
|
|
|
+== 2. Increase the number of shards of the destination index (index)
|
|
|
+
|
|
|
+Depending on the size of the destination index, you may consider increasing its
|
|
|
+shard count. {transforms-cap} use one shard by default when creating the
|
|
|
+destination index. To override the index settings, create the destination index
|
|
|
+before starting the {transform}. For more information about how the number of
|
|
|
+shards affects scalability and resilience, refer to <<scalability>>
|
|
|
+
|
|
|
+TIP: Use the <<preview-transform>> to check the settings that the {transform}
|
|
|
+would use to create the destination index. You can copy and adjust these in
|
|
|
+order to create the destination index prior to starting the {transform}.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[search-queries]]
|
|
|
+== 3. Profile and optimize your search queries (search)
|
|
|
+
|
|
|
+If you have defined a {transform} source index `query`, ensure it is as
|
|
|
+efficient as possible. Use the **Search Profiler** under **Dev Tools** in {kib}
|
|
|
+to get detailed timing information about the execution of individual components
|
|
|
+in the search request. Alternatively, you can use the <<search-profile>>. The
|
|
|
+results give you insight into how search requests are executed at a low level so
|
|
|
+that you can understand why certain requests are slow, and take steps to improve
|
|
|
+them.
|
|
|
+
|
|
|
+{transforms-cap} execute standard {es} search requests. There are different ways
|
|
|
+to write {es} queries, and some of them are more efficient than others. Consult
|
|
|
+<<tune-for-search-speed>> to learn more about {es} performance tuning.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[limit-source-query]]
|
|
|
+== 4. Limit the scope of the source query (search)
|
|
|
+
|
|
|
+Imagine your {ctransform} is configured to group by `IP` and calculate the sum
|
|
|
+of `bytes_sent`. For each checkpoint, a {ctransform} detects changes in the
|
|
|
+source data since the previous checkpoint, identifying the IPs for which new
|
|
|
+data has been ingested. Then it performs a second search, filtered for this
|
|
|
+group of IPs, in order to calculate the total `bytes_sent`. If this second
|
|
|
+search matches many shards, then this could be resource intensive. Consider
|
|
|
+limiting the scope that the source index pattern and query will match.
|
|
|
+
|
|
|
+Use an absolute time value as a date range filter in your source query (for
|
|
|
+example, greater than `2020-01-01T00:00:00`) to limit which historical indices
|
|
|
+are accessed. If you use a relative time value (for example, `now-30d`) then
|
|
|
+this date range is re-evaluated at the point of each checkpoint execution.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[optimize-shading-strategy]]
|
|
|
+== 5. Optimize the sharding strategy for the source index (search)
|
|
|
+
|
|
|
+There is no one-size-fits-all sharding strategy. A strategy that works in one
|
|
|
+environment may not scale in another. A good sharding strategy must account for
|
|
|
+your infrastructure, use case, and performance expectations.
|
|
|
+
|
|
|
+Too few shards may mean that the benefits of distributing the workload cannot be
|
|
|
+realised; however too many shards may impact your cluster health. To learn more
|
|
|
+about sizing your shards, read this <<size-your-shards,guide>>.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[tune-max-page-search-size]]
|
|
|
+== 6. Tune `max_page_search_size` (search)
|
|
|
+
|
|
|
+The `max_page_search_size` {transform} configuration option defines the number
|
|
|
+of buckets that are returned for each search request. The default value is 500.
|
|
|
+If you increase this value, you get better throughput at the cost of higher
|
|
|
+latency and memory usage.
|
|
|
+
|
|
|
+The ideal value of this parameter is highly dependent on your use case. If your
|
|
|
+{transform} executes memory-intensive aggregations – for example, cardinality or
|
|
|
+percentiles – then increasing `max_page_search_size` requires more available
|
|
|
+memory. If memory limits are exceeded, a circuit breaker exception occurs.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[indexed-fields-in-source]]
|
|
|
+== 7. Use indexed fields in your source indices (search)
|
|
|
+
|
|
|
+Runtime fields and scripted fields are not indexed fields; their values are only
|
|
|
+extracted or computed at search time. While these fields provide flexibility in
|
|
|
+how you access your data, they increase performance costs at search time. If
|
|
|
+{transform} performance using runtime fields or scripted fields is a concern,
|
|
|
+you may wish to consider using indexed fields instead. For performance reasons,
|
|
|
+we do not recommend using a runtime field as the time field that synchs a
|
|
|
+{ctransform}.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[index-sorting-group-by-ordering]]
|
|
|
+== 8. Use index sorting and `group_by` ordering (search, process)
|
|
|
+
|
|
|
+If you use more than one `group_by` field in your {transform}, then the order of
|
|
|
+the fields in conjunction with the use of <<index-modules-index-sorting>> may
|
|
|
+improve runtime.
|
|
|
+
|
|
|
+Index sorting enables you to store documents on disk in a specific order which
|
|
|
+can improve query efficiency. The ideal sorting logic depends on your use case,
|
|
|
+but the rule of thumb may be to sort the fields in descending order (high to low
|
|
|
+cardinality) starting with the time-based fields. Then put the time-based
|
|
|
+components first in the `group_by` if you have any, and then apply the same
|
|
|
+order to your `group_by` fields as configured for index sorting. Index sorting
|
|
|
+can be defined only once at index creation. If you don't already have index
|
|
|
+sorting on the index that you want to use as a source, consider reindexing it to
|
|
|
+a new, sorted index.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+[[disable-source-dest]]
|
|
|
+== 9. Disable the `_source` field on the destination index (storage)
|
|
|
+
|
|
|
+The <<mapping-source-field>> contains the original JSON document body that was
|
|
|
+passed at index time. The `_source` field itself is not indexed (and thus is not
|
|
|
+searchable), but it is still stored in the index and incurs a storage overhead.
|
|
|
+Consider disabling `_source` to save storage space if you have a large
|
|
|
+destination index. Disabling `_source` is only possible during index creation.
|
|
|
+
|
|
|
+NOTE: When the `_source` field is disabled, a number of features are not
|
|
|
+supported. Consult <<disable-source-field>> to understand the consequences
|
|
|
+before disabling it.
|
|
|
+
|
|
|
+
|
|
|
+[discrete]
|
|
|
+== Further reading
|
|
|
+
|
|
|
+* <<tune-for-search-speed>>
|
|
|
+* <<tune-for-indexing-speed>>
|
|
|
+* <<size-your-shards>>
|
|
|
+* <<ilm-index-lifecycle>>
|