[[paginate-search-results]]
== Paginate search results

By default, searches return the top 10 matching hits. To page through a larger
set of results, you can use the <<search-search,search API>>'s `from` and `size`
parameters. The `from` parameter defines the number of hits to skip, defaulting
to `0`. The `size` parameter is the maximum number of hits to return. Together,
these two parameters define a page of results.

[source,console]
----
GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
----

Avoid using `from` and `size` to page too deeply or request too many results at
once. Search requests usually span multiple shards. Each shard must load its
requested hits and the hits for any previous pages into memory. For deep pages
or large sets of results, these operations can significantly increase memory and
CPU usage, resulting in degraded performance or node failures.

By default, you cannot use `from` and `size` to page through more than 10,000
hits. This limit is a safeguard set by the
<<index-max-result-window,`index.max_result_window`>> index setting. If you need
to page through more than 10,000 hits, use the <<search-after,`search_after`>>
parameter instead.
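
If you only need slightly deeper `from`/`size` paging on a specific index and
accept the memory cost, the limit can be raised with the update index settings
API. The following is a minimal sketch, assuming an index named
`my-index-000001` and an illustrative value of `20000`; in general, prefer
`search_after` over raising this safeguard.

[source,console]
----
PUT /my-index-000001/_settings
{
  "index": {
    "max_result_window": 20000
  }
}
----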

WARNING: {es} uses Lucene's internal doc IDs as tie-breakers. These internal doc
IDs can be completely different across replicas of the same data. When paging
search hits, you might occasionally see that documents with the same sort values
are not ordered consistently.

[discrete]
[[search-after]]
=== Search after

You can use the `search_after` parameter to retrieve the next page of hits
using a set of <<sort-search-results,sort values>> from the previous page.

Using `search_after` requires multiple search requests with the same `query` and
`sort` values. If a <<near-real-time,refresh>> occurs between these requests,
the order of your results may change, causing inconsistent results across pages.
To prevent this, you can create a <<point-in-time-api,point in time (PIT)>> to
preserve the current index state over your searches.

[source,console]
----
POST /my-index-000001/_pit?keep_alive=1m
----
// TEST[setup:my_index]

The API returns a PIT ID.

[source,console-result]
----
{
  "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="
}
----
// TESTRESPONSE[s/"id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="/"id": $body.id/]

To get the first page of results, submit a search request with a `sort`
argument. If using a PIT, specify the PIT ID in the `pit.id` parameter.

IMPORTANT: We recommend you include a tiebreaker field in your `sort`. This
tiebreaker field should contain a unique value for each document. If you don't
include a tiebreaker field, your paged results could miss or duplicate hits.

[source,console]
----
GET /my-index-000001/_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", <1>
    "keep_alive": "1m"
  },
  "sort": [ <2>
    {"@timestamp": "asc"},
    {"tie_breaker_id": "asc"}
  ]
}
----
// TEST[catch:missing]
<1> PIT ID for the search.
<2> Sorts hits for the search.

The search response includes an array of `sort` values for each hit. If you used
a PIT, the response's `pit_id` parameter contains an updated PIT ID.

[source,console-result]
----
{
  "pit_id" : "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA==", <1>
  "took" : 17,
  "timed_out" : false,
  "_shards" : ...,
  "hits" : {
    "total" : ...,
    "max_score" : null,
    "hits" : [
      ...
      {
        "_index" : "my-index-000001",
        "_id" : "FaslK3QBySSL_rrj9zM5",
        "_score" : null,
        "_source" : ...,
        "sort" : [ <2>
          4098435132000,
          "FaslK3QBySSL_rrj9zM5"
        ]
      }
    ]
  }
}
----
// TESTRESPONSE[skip: unable to access PIT ID]
<1> Updated `id` for the point in time.
<2> Sort values for the last returned hit.

To get the next page of results, rerun the previous search using the last hit's
sort values as the `search_after` argument. If using a PIT, use the latest PIT
ID in the `pit.id` parameter. The search's `query` and `sort` arguments must
remain unchanged. If provided, the `from` argument must be `0` (default) or `-1`.

[source,console]
----
GET /my-index-000001/_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id": "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA==", <1>
    "keep_alive": "1m"
  },
  "sort": [
    {"@timestamp": "asc"},
    {"tie_breaker_id": "asc"}
  ],
  "search_after": [ <2>
    4098435132000,
    "FaslK3QBySSL_rrj9zM5"
  ]
}
----
// TEST[catch:missing]
<1> PIT ID returned by the previous search.
<2> Sort values from the previous search's last hit.

You can repeat this process to get additional pages of results. If using a PIT,
you can extend the PIT's retention period using the `keep_alive` parameter of
each search request.

When you're finished, you should delete your PIT.

[source,console]
----
DELETE /_pit
{
  "id" : "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA=="
}
----
// TEST[catch:missing]

[discrete]
[[scroll-search-results]]
=== Scroll search results

IMPORTANT: We no longer recommend using the scroll API for deep pagination. If
you need to preserve the index state while paging through more than 10,000 hits,
use the <<search-after,`search_after`>> parameter with a point in time (PIT).

While a `search` request returns a single ``page'' of results, the `scroll`
API can be used to retrieve large numbers of results (or even all results)
from a single search request, in much the same way as you would use a cursor
on a traditional database.

Scrolling is not intended for real-time user requests, but rather for
processing large amounts of data, e.g. in order to reindex the contents of one
data stream or index into a new data stream or index with a different
configuration.

.Client support for scrolling and reindexing
*********************************************

Some of the officially supported clients provide helpers to assist with
scrolled searches and reindexing:

Perl::
See https://metacpan.org/pod/Search::Elasticsearch::Client::5_0::Bulk[Search::Elasticsearch::Client::5_0::Bulk]
and https://metacpan.org/pod/Search::Elasticsearch::Client::5_0::Scroll[Search::Elasticsearch::Client::5_0::Scroll]

Python::
See https://elasticsearch-py.readthedocs.org/en/master/helpers.html[elasticsearch.helpers.*]

JavaScript::
See {jsclient-current}/client-helpers.html[client.helpers.*]

*********************************************

NOTE: The results that are returned from a scroll request reflect the state of
the data stream or index at the time that the initial `search` request was made,
like a snapshot in time. Subsequent changes to documents (index, update or delete)
will only affect later search requests.

In order to use scrolling, the initial search request should specify the
`scroll` parameter in the query string, which tells Elasticsearch how long it
should keep the ``search context'' alive (see <<scroll-search-context>>),
e.g. `?scroll=1m`.

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
--------------------------------------------------
// TEST[setup:my_index]

The result from the above request includes a `_scroll_id`, which should
be passed to the `scroll` API in order to retrieve the next batch of
results.

[source,console]
--------------------------------------------------
POST /_search/scroll <1>
{
  "scroll" : "1m", <2>
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" <3>
}
--------------------------------------------------
// TEST[continued s/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==/$body._scroll_id/]
<1> `GET` or `POST` can be used and the URL should not include the `index`
name -- this is specified in the original `search` request instead.
<2> The `scroll` parameter tells Elasticsearch to keep the search context open
for another `1m`.
<3> The `scroll_id` parameter returned by the previous request.

The `size` parameter allows you to configure the maximum number of hits to be
returned with each batch of results. Each call to the `scroll` API returns the
next batch of results until there are no more results left to return, i.e. the
`hits` array is empty.

IMPORTANT: The initial search request and each subsequent scroll request each
return a `_scroll_id`. While the `_scroll_id` may change between requests, it
doesn't always change. In any case, only the most recently received `_scroll_id`
should be used.

NOTE: If the request specifies aggregations, only the initial search response
will contain the aggregation results.
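
For example, the following sketch attaches a `date_histogram` aggregation to the
initial scroll request. It assumes a `@timestamp` field on `my-index-000001`, and
the aggregation name `per_day` is only illustrative. The aggregation results appear
only in this first response; later calls to the scroll API return only hits.

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?scroll=1m
{
  "size": 100,
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      }
    }
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
--------------------------------------------------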

NOTE: Scroll requests have optimizations that make them faster when the sort
order is `_doc`. If you want to iterate over all documents regardless of the
order, this is the most efficient option:

[source,console]
--------------------------------------------------
GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}
--------------------------------------------------
// TEST[setup:my_index]

[discrete]
[[scroll-search-context]]
==== Keeping the search context alive

A scroll returns all the documents which matched the search at the time of the
initial search request. It ignores any subsequent changes to these documents.
The `scroll_id` identifies a _search context_ which keeps track of everything
that {es} needs to return the correct documents. The search context is created
by the initial request and kept alive by subsequent requests.

The `scroll` parameter (passed to the `search` request and to every `scroll`
request) tells Elasticsearch how long it should keep the search context alive.
Its value (e.g. `1m`, see <<time-units>>) does not need to be long enough to
process all data -- it just needs to be long enough to process the previous
batch of results. Each `scroll` request (with the `scroll` parameter) sets a
new expiry time. If a `scroll` request doesn't pass in the `scroll`
parameter, then the search context will be freed as part of _that_ `scroll`
request.
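
As a sketch of that last point, a final `scroll` request that omits the `scroll`
parameter (reusing the example `scroll_id` from above) returns the next batch of
results and then lets the search context be freed once the request completes:

[source,console]
--------------------------------------------------
POST /_search/scroll
{
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
--------------------------------------------------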

Normally, the background merge process optimizes the index by merging together
smaller segments to create new, bigger segments. Once the smaller segments are
no longer needed they are deleted. This process continues during scrolling, but
an open search context prevents the old segments from being deleted since they
are still in use.

TIP: Keeping older segments alive means that more disk space and file handles
are needed. Ensure that you have configured your nodes to have ample free file
handles. See <<file-descriptors>>.

Additionally, if a segment contains deleted or updated documents then the
search context must keep track of whether each document in the segment was live
at the time of the initial search request. Ensure that your nodes have
sufficient heap space if you have many open scrolls on an index that is subject
to ongoing deletes or updates.

NOTE: To prevent issues caused by having too many scrolls open, users are not
allowed to open scrolls past a certain limit. By default, the maximum number of
open scrolls is 500. This limit can be updated with the
`search.max_open_scroll_context` cluster setting.
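
For instance, a minimal sketch of raising that limit dynamically with the cluster
update settings API (the value `1024` here is only illustrative):

[source,console]
--------------------------------------------------
PUT /_cluster/settings
{
  "persistent": {
    "search.max_open_scroll_context": 1024
  }
}
--------------------------------------------------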

You can check how many search contexts are open with the
<<cluster-nodes-stats,nodes stats API>>:

[source,console]
---------------------------------------
GET /_nodes/stats/indices/search
---------------------------------------

[discrete]
[[clear-scroll]]
==== Clear scroll

Search contexts are automatically removed when the `scroll` timeout has been
exceeded. However, keeping scrolls open has a cost, as discussed in the
<<scroll-search-context,previous section>>, so scrolls should be explicitly
cleared as soon as they are no longer being used, using the
`clear-scroll` API:

[source,console]
---------------------------------------
DELETE /_search/scroll
{
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
---------------------------------------
// TEST[catch:missing]

Multiple scroll IDs can be passed as an array:

[source,console]
---------------------------------------
DELETE /_search/scroll
{
  "scroll_id" : [
    "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==",
    "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB"
  ]
}
---------------------------------------
// TEST[catch:missing]

All search contexts can be cleared with the `_all` parameter:

[source,console]
---------------------------------------
DELETE /_search/scroll/_all
---------------------------------------

The `scroll_id` can also be passed as a query string parameter or in the request body.
Multiple scroll IDs can be passed as comma-separated values:

[source,console]
---------------------------------------
DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==,DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB
---------------------------------------
// TEST[catch:missing]

[discrete]
[[slice-scroll]]
==== Sliced scroll

For scroll queries that return a lot of documents, it is possible to split the
scroll into multiple slices that can be consumed independently:

[source,console]
--------------------------------------------------
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "id": 0, <1>
    "max": 2 <2>
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}

GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
--------------------------------------------------
// TEST[setup:my_index_big]
<1> The id of the slice
<2> The maximum number of slices

The first request returns documents that belong to the first slice (id: 0) and the
second request returns documents that belong to the second slice. Since the maximum
number of slices is set to 2, the union of the results of the two requests is
equivalent to the results of a scroll query without slicing.

By default the splitting is done first on the shards and then locally on each shard
using the `_id` field with the following formula:

`slice(doc) = floorMod(hashCode(doc._id), max)`

For instance, if the number of shards is equal to 2 and the user requested 4 slices,
then slices 0 and 2 are assigned to the first shard and slices 1 and 3 are assigned
to the second shard.

Each scroll is independent and can be processed in parallel like any scroll request.

NOTE: If the number of slices is bigger than the number of shards, the slice filter
is very slow on the first calls: it has a complexity of O(N) and a memory cost equal
to N bits per slice, where N is the total number of documents in the shard. After a
few calls the filter should be cached and subsequent calls should be faster, but you
should limit the number of sliced queries you perform in parallel to avoid exhausting
memory.

To avoid this cost entirely, it is possible to use the `doc_values` of another field
to do the slicing, but the user must ensure that the field has the following
properties:

* The field is numeric.
* `doc_values` are enabled on that field.
* Every document should contain a single value. If a document has multiple values for the specified field, the first value is used.
* The value for each document should be set once when the document is created and never updated. This ensures that each
slice gets deterministic results.
* The cardinality of the field should be high. This ensures that each slice gets approximately the same amount of documents.

[source,console]
--------------------------------------------------
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "field": "@timestamp",
    "id": 0,
    "max": 10
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
--------------------------------------------------
// TEST[setup:my_index_big]

For append-only time-based indices, the `timestamp` field can be used safely.

NOTE: By default the maximum number of slices allowed per scroll is limited to 1024.
You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
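
A minimal sketch of raising that limit on a single index with the update index
settings API follows; the index name and the value `2048` are only illustrative.

[source,console]
----
PUT /my-index-000001/_settings
{
  "index": {
    "max_slices_per_scroll": 2048
  }
}
----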