- [role="xpack"]
- [[transform-scale]]
- = Working with {transforms} at scale
- ++++
- <titleabbrev>{transforms-cap} at scale</titleabbrev>
- ++++
- {transforms-cap} convert existing {es} indices into summarized indices, which
- provide opportunities for new insights and analytics. The search and index
- operations performed by {transforms} use standard {es} features, so the
- considerations for working with {es} at scale often apply to {transforms} as
- well. If you experience performance issues, start by identifying the
- bottleneck area (search, indexing, processing, or storage), then review the
- relevant considerations in this guide. It also helps to understand how
- {transforms} work, because different considerations apply depending on whether
- your {transform} runs in continuous mode or in batch mode.
- In this guide, you’ll learn how to:
- * Understand the impact of configuration options on the performance of
- {transforms}.
- **Prerequisites:**
- These guidelines assume you have a {transform} you want to tune, and you’re
- already familiar with:
- * <<transform-overview,How {transforms} work>>.
- * <<transform-setup,How to set up {transforms}>>.
- * <<transform-checkpoints,How {transform} checkpoints work in continuous mode>>.
- The following considerations are not sequential – the numbers help to navigate
- between the list items; you can take action on one or more of them in any order.
- Most of the recommendations apply to both continuous and batch {transforms}. If
- a list item only applies to one {transform} type, this exception is highlighted
- in the description.
- The keywords in parentheses at the end of each recommendation title indicate
- the bottleneck area that may be improved by following the given recommendation.
- [discrete]
- [[measure-performance]]
- == Measure {transforms} performance
- In order to optimize {transform} performance, start by identifying the areas
- where most work is being done. The **Stats** interface of the
- **{transforms-cap}** page in {kib} contains information that covers three main
- areas: indexing, searching, and processing time (alternatively, you can use the
- <<get-transform-stats, {transforms} stats API>>). If, for example, the results
- show that the highest proportion of time is spent on search, then prioritize
- efforts on optimizing the search query of the {transform}. {transforms-cap} also
- have https://esrally.readthedocs.io[Rally support], which makes it possible to
- run performance checks on {transform} configurations if required. If you have
- optimized the crucial factors and still experience performance issues, consider
- improving your hardware.
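- As a minimal sketch of that comparison, the following Python snippet takes the
- `stats` object of one {transform} from a stats API response and reports which
- phase accounts for the largest share of time. The field names
- `search_time_in_ms`, `index_time_in_ms`, and `processing_time_in_ms` are the
- ones the stats API reports; the helper name `time_shares` and the sample
- numbers are hypothetical.

```python
def time_shares(stats):
    """Return each phase's share of the total measured time (fractions of 1)."""
    phases = {
        "search": stats.get("search_time_in_ms", 0),
        "index": stats.get("index_time_in_ms", 0),
        "processing": stats.get("processing_time_in_ms", 0),
    }
    total = sum(phases.values())
    if total == 0:
        # Nothing measured yet, for example a transform that has not run.
        return {name: 0.0 for name in phases}
    return {name: ms / total for name, ms in phases.items()}


# Illustrative numbers only, not measurements from a real cluster.
sample_stats = {
    "search_time_in_ms": 6000,
    "index_time_in_ms": 3000,
    "processing_time_in_ms": 1000,
}

shares = time_shares(sample_stats)
bottleneck = max(shares, key=shares.get)
print(bottleneck, shares)
```

- With shares like these, the search-related recommendations in this guide would
- be the place to start.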
- [discrete]
- [[frequency]]
- == 1. Optimize `frequency` (index)
- In a {ctransform}, the `frequency` configuration option sets the interval
- between checks for changes in the source indices. If changes are detected, then
- the source data is searched and the changes are applied to the destination
- index. Depending on your use case, you may wish to apply changes less often. By
- setting `frequency` to a longer interval (the maximum is one hour), you spread
- the workload over time at the cost of less up-to-date data.
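- The following Python sketch builds a request body that changes only
- `frequency`, as you might send to the update {transform} API. The helper name
- `frequency_update_body` is hypothetical, and the one-second lower bound is an
- assumption; the one-hour upper bound is the maximum stated above.

```python
import json

MIN_FREQUENCY_SECONDS = 1      # assumed lower bound
MAX_FREQUENCY_SECONDS = 3600   # the one-hour maximum


def frequency_update_body(seconds):
    """Build an update body that sets `frequency` to the given interval."""
    if not MIN_FREQUENCY_SECONDS <= seconds <= MAX_FREQUENCY_SECONDS:
        raise ValueError("frequency must be between 1 second and 1 hour")
    return {"frequency": f"{seconds}s"}


# Check for changes every 15 minutes instead of more often.
body = frequency_update_body(15 * 60)
print(json.dumps(body))
```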
- [discrete]
- [[increase-shards-dest-index]]
- == 2. Increase the number of shards of the destination index (index)
- Depending on the size of the destination index, you may consider increasing its
- shard count. {transforms-cap} use one shard by default when creating the
- destination index. To override the index settings, create the destination index
- before starting the {transform}. For more information about how the number of
- shards affects scalability and resilience, refer to <<scalability>>.
- TIP: Use the <<preview-transform>> to check the settings that the {transform}
- would use to create the destination index. You can copy and adjust these in
- order to create the destination index prior to starting the {transform}.
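- As an illustration, this Python sketch builds a create index request body
- with an explicit shard count for the destination index. The helper name
- `dest_index_body` and the value `3` are hypothetical; in practice you would
- merge in the mappings and settings shown by the preview.

```python
import json


def dest_index_body(number_of_shards):
    """Build a create-index body that overrides the default shard count."""
    if number_of_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return {"settings": {"index": {"number_of_shards": number_of_shards}}}


body = dest_index_body(3)  # example value, not a recommendation
print(json.dumps(body, indent=2))
```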
- [discrete]
- [[search-queries]]
- == 3. Profile and optimize your search queries (search)
- If you have defined a {transform} source index `query`, ensure it is as
- efficient as possible. Use the **Search Profiler** under **Dev Tools** in {kib}
- to get detailed timing information about the execution of individual components
- in the search request. Alternatively, you can use the <<search-profile>>. The
- results give you insight into how search requests are executed at a low level so
- that you can understand why certain requests are slow, and take steps to improve
- them.
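- If you work with the profile API response directly, a small script can help
- to rank the query components by time. The Python sketch below walks the
- `profile.shards[].searches[].query[]` tree of a response and lists the slowest
- entries. The helper name `slowest_components` and the sample response are
- hypothetical and heavily trimmed; a real response contains more fields.

```python
def slowest_components(profile_response, top_n=3):
    """Flatten the query profile tree and return the slowest components."""
    flat = []

    def walk(node, shard_id):
        flat.append((node["time_in_nanos"], node["type"], node["description"], shard_id))
        for child in node.get("children", []):
            walk(child, shard_id)

    for shard in profile_response["profile"]["shards"]:
        for search in shard["searches"]:
            for node in search["query"]:
                walk(node, shard["id"])

    # A parent's time includes the time of its children.
    return sorted(flat, key=lambda item: item[0], reverse=True)[:top_n]


# Illustrative, trimmed response; the timings are made up.
sample_response = {
    "profile": {
        "shards": [
            {
                "id": "[node-1][logs-000001][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "BooleanQuery",
                                "description": "+ip:10.0.0.1 +status:200",
                                "time_in_nanos": 900,
                                "children": [
                                    {"type": "TermQuery", "description": "ip:10.0.0.1", "time_in_nanos": 200},
                                    {"type": "TermQuery", "description": "status:200", "time_in_nanos": 600},
                                ],
                            }
                        ]
                    }
                ],
            }
        ]
    }
}

for nanos, query_type, description, shard_id in slowest_components(sample_response):
    print(nanos, query_type, description, shard_id)
```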
- {transforms-cap} execute standard {es} search requests. There are different ways
- to write {es} queries, and some of them are more efficient than others. Consult
- <<tune-for-search-speed>> to learn more about {es} performance tuning.
- [discrete]
- [[limit-source-query]]
- == 4. Limit the scope of the source query (search)
- Imagine your {ctransform} is configured to group by `IP` and calculate the sum
- of `bytes_sent`. For each checkpoint, a {ctransform} detects changes in the
- source data since the previous checkpoint, identifying the IPs for which new
- data has been ingested. Then it performs a second search, filtered for this
- group of IPs, in order to calculate the total `bytes_sent`. If this second
- search matches many shards, then this could be resource intensive. Consider
- limiting the scope that the source index pattern and query will match.
- Use an absolute time value as a date range filter in your source query (for
- example, greater than `2020-01-01T00:00:00`) to limit which historical indices
- are accessed. If you use a relative time value (for example, `now-30d`) then
- this date range is re-evaluated at the point of each checkpoint execution.
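- A minimal sketch of such a source definition, written as a Python dictionary
- that mirrors the `source` object of a {transform} configuration. The index
- pattern `logs-*`, the field name `@timestamp`, and the helper name
- `source_with_time_floor` are assumptions for illustration.

```python
import json


def source_with_time_floor(index_pattern, time_field, not_before):
    """Build a transform `source` whose query ignores data older than a fixed date."""
    return {
        "index": index_pattern,
        "query": {
            "bool": {
                # An absolute value stays the same at every checkpoint,
                # unlike a relative value such as now-30d.
                "filter": [{"range": {time_field: {"gte": not_before}}}]
            }
        },
    }


source = source_with_time_floor("logs-*", "@timestamp", "2020-01-01T00:00:00")
print(json.dumps(source, indent=2))
```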
- [discrete]
- [[optimize-shading-strategy]]
- == 5. Optimize the sharding strategy for the source index (search)
- There is no one-size-fits-all sharding strategy. A strategy that works in one
- environment may not scale in another. A good sharding strategy must account for
- your infrastructure, use case, and performance expectations.
- Too few shards may mean that the benefits of distributing the workload cannot be
- realized; however, too many shards may impact your cluster health. To learn more
- about sizing your shards, read this <<size-your-shards,guide>>.
- [discrete]
- [[tune-max-page-search-size]]
- == 6. Tune `max_page_search_size` (search)
- The `max_page_search_size` {transform} configuration option defines the number
- of buckets that are returned for each search request. The default value is 500.
- If you increase this value, you get better throughput at the cost of higher
- latency and memory usage.
- The ideal value of this parameter is highly dependent on your use case. If your
- {transform} executes memory-intensive aggregations – for example, cardinality or
- percentiles – then increasing `max_page_search_size` requires more available
- memory. If memory limits are exceeded, a circuit breaker exception occurs.
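- The option lives under `settings` in the {transform} configuration. The
- Python sketch below builds that fragment and rejects values outside the range
- of 10 to 65536, which is assumed here to be the accepted range. The helper
- name `page_size_settings` is hypothetical.

```python
import json

DEFAULT_PAGE_SIZE = 500          # default stated above
MIN_PAGE_SIZE, MAX_PAGE_SIZE = 10, 65536  # assumed accepted range


def page_size_settings(max_page_search_size=DEFAULT_PAGE_SIZE):
    """Build the `settings` fragment that sets `max_page_search_size`."""
    if not MIN_PAGE_SIZE <= max_page_search_size <= MAX_PAGE_SIZE:
        raise ValueError("max_page_search_size is outside the assumed range")
    return {"settings": {"max_page_search_size": max_page_search_size}}


# A smaller page size can help when aggregations are memory intensive.
fragment = page_size_settings(200)
print(json.dumps(fragment))
```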
- [discrete]
- [[indexed-fields-in-source]]
- == 7. Use indexed fields in your source indices (search)
- Runtime fields and scripted fields are not indexed fields; their values are only
- extracted or computed at search time. While these fields provide flexibility in
- how you access your data, they increase performance costs at search time. If
- {transform} performance using runtime fields or scripted fields is a concern,
- you may wish to consider using indexed fields instead. For performance reasons,
- we do not recommend using a runtime field as the time field that synchronizes a
- {ctransform}.
- [discrete]
- [[index-sorting-group-by-ordering]]
- == 8. Use index sorting and `group_by` ordering (search, process)
- If you use more than one `group_by` field in your {transform}, then the order of
- the fields in conjunction with the use of <<index-modules-index-sorting>> may
- improve runtime.
- Index sorting enables you to store documents on disk in a specific order, which
- can improve query efficiency. The ideal sorting logic depends on your use case,
- but a rule of thumb is to sort the fields in descending order of cardinality
- (high to low), starting with the time-based fields. In the `group_by`, put the
- time-based components first if you have any, then apply the same order to the
- remaining `group_by` fields as configured for index sorting. Index sorting
- can be defined only once at index creation. If you don't already have index
- sorting on the index that you want to use as a source, consider reindexing it to
- a new, sorted index.
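- To illustrate keeping the two orders aligned, this Python sketch derives both
- the index sort setting for a new source index and the `group_by` object from
- one ordered field list. The field names, the one-hour interval, and the helper
- name `sorted_index_and_group_by` are hypothetical; the non-time fields are
- assumed to be listed from high to low cardinality.

```python
import json


def sorted_index_and_group_by(time_field, term_fields, interval="1h"):
    """Return (index settings, group_by) that share the same field order."""
    sort_fields = [time_field, *term_fields]
    index_settings = {"settings": {"index": {"sort.field": sort_fields}}}

    # Python dicts keep insertion order, so the serialized JSON keeps it too.
    group_by = {
        time_field: {"date_histogram": {"field": time_field, "fixed_interval": interval}}
    }
    for field in term_fields:
        group_by[field] = {"terms": {"field": field}}
    return index_settings, group_by


index_settings, group_by = sorted_index_and_group_by(
    "@timestamp", ["client_ip", "status_code"]
)
print(json.dumps(index_settings))
print(json.dumps(group_by))
```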
- [discrete]
- [[disable-source-dest]]
- == 9. Disable the `_source` field on the destination index (storage)
- The <<mapping-source-field>> contains the original JSON document body that was
- passed at index time. The `_source` field itself is not indexed (and thus is not
- searchable), but it is still stored in the index and incurs a storage overhead.
- Consider disabling `_source` to save storage space if you have a large
- destination index. Disabling `_source` is only possible during index creation.
- NOTE: When the `_source` field is disabled, a number of features are not
- supported. Consult <<disable-source-field>> to understand the consequences
- before disabling it.
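- Because the setting must be in place at index creation, it belongs in the
- create index request for the destination index. A minimal Python sketch of
- that body follows; the helper name `dest_index_without_source` is
- hypothetical, and the body omits the field mappings a real destination index
- would need.

```python
import json


def dest_index_without_source():
    """Build a create-index body whose mappings disable the `_source` field."""
    return {"mappings": {"_source": {"enabled": False}}}


body = dest_index_without_source()
print(json.dumps(body))
```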
- [discrete]
- == Further reading
- * <<tune-for-search-speed>>
- * <<tune-for-indexing-speed>>
- * <<size-your-shards>>
- * <<ilm-index-lifecycle>>