[[request-body-search-scroll]]
==== Scroll

While a `search` request returns a single ``page'' of results, the `scroll`
API can be used to retrieve large numbers of results (or even all results)
from a single search request, in much the same way as you would use a cursor
on a traditional database.

Scrolling is not intended for real-time user requests, but rather for
processing large amounts of data, e.g. in order to reindex the contents of one
index into a new index with a different configuration.

.Client support for scrolling and reindexing
*********************************************

Some of the officially supported clients provide helpers to assist with
scrolled searches and reindexing of documents from one index to another:

Perl::

    See https://metacpan.org/pod/Search::Elasticsearch::Client::5_0::Bulk[Search::Elasticsearch::Client::5_0::Bulk]
    and https://metacpan.org/pod/Search::Elasticsearch::Client::5_0::Scroll[Search::Elasticsearch::Client::5_0::Scroll]

Python::

    See http://elasticsearch-py.readthedocs.org/en/master/helpers.html[elasticsearch.helpers.*]

*********************************************

NOTE: The results that are returned from a scroll request reflect the state of
the index at the time that the initial `search` request was made, like a
snapshot in time. Subsequent changes to documents (index, update or delete)
will only affect later search requests.

In order to use scrolling, the initial search request should specify the
`scroll` parameter in the query string, which tells Elasticsearch how long it
should keep the ``search context'' alive (see <<scroll-search-context>>),
e.g. `?scroll=1m`.

[source,console]
--------------------------------------------------
POST /twitter/_search?scroll=1m
{
    "size": 100,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
--------------------------------------------------
// TEST[setup:twitter]

The result from the above request includes a `_scroll_id`, which should
be passed to the `scroll` API in order to retrieve the next batch of
results.

[source,console]
--------------------------------------------------
POST /_search/scroll <1>
{
    "scroll" : "1m", <2>
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" <3>
}
--------------------------------------------------
// TEST[continued s/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==/$body._scroll_id/]
<1> `GET` or `POST` can be used and the URL should not include the `index`
    name -- this is specified in the original `search` request instead.
<2> The `scroll` parameter tells Elasticsearch to keep the search context open
    for another `1m`.
<3> The `scroll_id` parameter specifies the scroll ID returned by the
    previous request.

The `size` parameter allows you to configure the maximum number of hits to be
returned with each batch of results. Each call to the `scroll` API returns the
next batch of results until there are no more results left to return, i.e. the
`hits` array is empty.

IMPORTANT: The initial search request and each subsequent scroll request each
return a `_scroll_id`. While the `_scroll_id` may change between requests, it
doesn't always change -- in any case, only the most recently received
`_scroll_id` should be used.
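
To make the loop concrete, here is a minimal sketch in Python, using the
`requests` library and assuming a cluster at `http://localhost:9200` with the
`twitter` index from the examples above. It pages through every hit and always
carries forward the most recently returned `_scroll_id`:

[source,python]
--------------------------------------------------
import requests

ES = "http://localhost:9200"  # assumed local cluster

# The initial search request opens the search context.
resp = requests.post(
    f"{ES}/twitter/_search",
    params={"scroll": "1m"},
    json={"size": 100, "query": {"match": {"title": "elasticsearch"}}},
).json()

scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

while hits:
    for hit in hits:
        print(hit["_id"])  # process each document here

    # Fetch the next batch, extending the search context by another 1m.
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "1m", "scroll_id": scroll_id},
    ).json()

    # Only the most recently received _scroll_id should be used.
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

# Free the search context as soon as the scroll is no longer needed.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": scroll_id})
--------------------------------------------------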

NOTE: If the request specifies aggregations, only the initial search response
will contain the aggregations results.

NOTE: Scroll requests have optimizations that make them faster when the sort
order is `_doc`. If you want to iterate over all documents regardless of the
order, this is the most efficient option:

[source,console]
--------------------------------------------------
GET /_search?scroll=1m
{
    "sort": [
        "_doc"
    ]
}
--------------------------------------------------
// TEST[setup:twitter]

[[scroll-search-context]]
===== Keeping the search context alive

A scroll returns all the documents which matched the search at the time of the
initial search request. It ignores any subsequent changes to these documents.
The `scroll_id` identifies a _search context_ which keeps track of everything
that {es} needs to return the correct documents. The search context is created
by the initial request and kept alive by subsequent requests.

The `scroll` parameter (passed to the `search` request and to every `scroll`
request) tells Elasticsearch how long it should keep the search context alive.
Its value (e.g. `1m`, see <<time-units>>) does not need to be long enough to
process all data -- it just needs to be long enough to process the previous
batch of results. Each `scroll` request (with the `scroll` parameter) sets a
new expiry time. If a `scroll` request doesn't pass in the `scroll`
parameter, then the search context will be freed as part of _that_ `scroll`
request.

Normally, the background merge process optimizes the index by merging together
smaller segments to create new, bigger segments. Once the smaller segments are
no longer needed they are deleted. This process continues during scrolling, but
an open search context prevents the old segments from being deleted since they
are still in use.

TIP: Keeping older segments alive means that more disk space and file handles
are needed. Ensure that you have configured your nodes to have ample free file
handles. See <<file-descriptors>>.

Additionally, if a segment contains deleted or updated documents then the
search context must keep track of whether each document in the segment was live
at the time of the initial search request. Ensure that your nodes have
sufficient heap space if you have many open scrolls on an index that is subject
to ongoing deletes or updates.

NOTE: To prevent issues caused by having too many scrolls open, the
user is not allowed to open scrolls past a certain limit. By default, the
maximum number of open scrolls is 500. This limit can be updated with the
`search.max_open_scroll_context` cluster setting.
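
For example, the limit can be raised with the cluster update settings API. A
minimal sketch in Python with `requests`, assuming the same local cluster as
above (the value `1000` is chosen purely for illustration):

[source,python]
--------------------------------------------------
import requests

# Raise the cluster-wide limit on open scroll contexts.
requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"search.max_open_scroll_context": 1000}},
)
--------------------------------------------------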

You can check how many search contexts are open with the
<<cluster-nodes-stats,nodes stats API>>:

[source,console]
---------------------------------------
GET /_nodes/stats/indices/search
---------------------------------------

===== Clear scroll API

Search contexts are automatically removed when the `scroll` timeout has been
exceeded. However, keeping scrolls open has a cost, as discussed in the
<<scroll-search-context,previous section>>, so scrolls should be explicitly
cleared as soon as the scroll is no longer being used, with the
`clear-scroll` API:

[source,console]
---------------------------------------
DELETE /_search/scroll
{
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
---------------------------------------
// TEST[catch:missing]

Multiple scroll IDs can be passed as an array:

[source,console]
---------------------------------------
DELETE /_search/scroll
{
    "scroll_id" : [
        "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==",
        "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB"
    ]
}
---------------------------------------
// TEST[catch:missing]

All search contexts can be cleared with the `_all` parameter:

[source,console]
---------------------------------------
DELETE /_search/scroll/_all
---------------------------------------

The `scroll_id` can also be passed as a query string parameter or in the
request body. Multiple scroll IDs can be passed as comma-separated values:

[source,console]
---------------------------------------
DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==,DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB
---------------------------------------
// TEST[catch:missing]

[[sliced-scroll]]
===== Sliced Scroll

For scroll queries that return a lot of documents it is possible to split the
scroll into multiple slices which can be consumed independently:

[source,console]
--------------------------------------------------
GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 0, <1>
        "max": 2 <2>
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 1,
        "max": 2
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
--------------------------------------------------
// TEST[setup:big_twitter]
<1> The id of the slice
<2> The maximum number of slices

The result from the first request returned documents that belong to the first
slice (id: 0) and the result from the second request returned documents that
belong to the second slice. Since the maximum number of slices is set to 2,
the union of the results of the two requests is equivalent to the results of a
scroll query without slicing.

By default the splitting is done first on the shards and then locally on each
shard using the `_id` field with the following formula:

`slice(doc) = floorMod(hashCode(doc._id), max)`

For instance, if the number of shards is equal to 2 and the user requested 4
slices, then slices 0 and 2 are assigned to the first shard and slices 1 and 3
are assigned to the second shard.
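
As a toy illustration of the formula above (this reproduces Java's
`String.hashCode` and `floorMod` semantics for basic ASCII IDs; it is not
Elasticsearch's internal implementation):

[source,python]
--------------------------------------------------
def hash_code(s: str) -> int:
    """Java's String.hashCode, reproduced in Python (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as a signed 32-bit integer, like Java.
    return h - 0x100000000 if h >= 0x80000000 else h

def slice_for(doc_id: str, max_slices: int) -> int:
    # Python's % already behaves like Java's floorMod for a positive divisor,
    # so the result is always in [0, max_slices).
    return hash_code(doc_id) % max_slices

# Every document ID is deterministically assigned to exactly one slice.
print(slice_for("0", 4))
--------------------------------------------------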

Each scroll is independent and can be processed in parallel like any scroll
request, as in the sketch below.
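
A minimal sketch of that parallel consumption in Python with `requests` and a
thread pool, under the same assumptions as the earlier examples (local
cluster, `twitter` index); each worker runs its own independent scroll over
one slice:

[source,python]
--------------------------------------------------
import requests
from concurrent.futures import ThreadPoolExecutor

ES = "http://localhost:9200"  # assumed local cluster
MAX_SLICES = 2

def consume_slice(slice_id: int) -> int:
    """Scroll through a single slice and return the number of hits seen."""
    resp = requests.post(
        f"{ES}/twitter/_search",
        params={"scroll": "1m"},
        json={
            "slice": {"id": slice_id, "max": MAX_SLICES},
            "query": {"match": {"title": "elasticsearch"}},
        },
    ).json()
    seen = 0
    while resp["hits"]["hits"]:
        seen += len(resp["hits"]["hits"])
        resp = requests.post(
            f"{ES}/_search/scroll",
            json={"scroll": "1m", "scroll_id": resp["_scroll_id"]},
        ).json()
    # Free this slice's search context once it is exhausted.
    requests.delete(f"{ES}/_search/scroll",
                    json={"scroll_id": resp["_scroll_id"]})
    return seen

# The union of all slices covers the full result set exactly once.
with ThreadPoolExecutor(max_workers=MAX_SLICES) as pool:
    total = sum(pool.map(consume_slice, range(MAX_SLICES)))
print(total)
--------------------------------------------------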

NOTE: If the number of slices is bigger than the number of shards the slice
filter is very slow on the first calls. It has a complexity of O(N) and a
memory cost equal to N bits per slice, where N is the total number of
documents in the shard. After a few calls the filter should be cached and
subsequent calls should be faster, but you should limit the number of sliced
queries you perform in parallel to avoid excessive memory use.

To avoid this cost entirely it is possible to use the `doc_values` of another
field to do the slicing, but the user must ensure that the field has the
following properties:

* The field is numeric.
* `doc_values` are enabled on that field.
* Every document should contain a single value. If a document has multiple
  values for the specified field, the first value is used.
* The value for each document should be set once when the document is created
  and never updated. This ensures that each slice gets deterministic results.
* The cardinality of the field should be high. This ensures that each slice
  gets approximately the same amount of documents.

[source,console]
--------------------------------------------------
GET /twitter/_search?scroll=1m
{
    "slice": {
        "field": "date",
        "id": 0,
        "max": 10
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
--------------------------------------------------
// TEST[setup:big_twitter]

For append-only time-based indices, the `timestamp` field can be used safely.

NOTE: By default the maximum number of slices allowed per scroll is limited to
1024. You can update the `index.max_slices_per_scroll` index setting to bypass
this limit.