- [[docs-delete-by-query]]
- == Delete By Query API
- experimental[The delete-by-query API is new and should still be considered experimental. The API may change in ways that are not backwards compatible]
- The simplest usage of `_delete_by_query` just performs a deletion on every
- document that matches a query. Here is the API:
- [source,js]
- --------------------------------------------------
- POST twitter/_delete_by_query
- {
- "query": { <1>
- "match": {
- "message": "some message"
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[setup:big_twitter]
- <1> The query must be passed as a value to the `query` key, in the same
- way as the <<search-search,Search API>>. You can also use the `q`
- parameter in the same way as the Search API.
- That will return something like this:
- [source,js]
- --------------------------------------------------
- {
- "took" : 147,
- "timed_out": false,
- "deleted": 119,
- "batches": 1,
- "version_conflicts": 0,
- "noops": 0,
- "retries": {
- "bulk": 0,
- "search": 0
- },
- "throttled_millis": 0,
- "requests_per_second": -1.0,
- "throttled_until_millis": 0,
- "total": 119,
- "failures" : [ ]
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"took" : 147/"took" : "$body.took"/]
- `_delete_by_query` gets a snapshot of the index when it starts and deletes what
- it finds using `internal` versioning. That means that you'll get a version
- conflict if the document changes between the time when the snapshot was taken
- and when the delete request is processed. When the versions match the document
- is deleted.
- NOTE: Since `internal` versioning does not support the value 0 as a valid
- version number, documents with version equal to zero cannot be deleted using
- `_delete_by_query` and will fail the request.
- During the `_delete_by_query` execution, multiple search requests are sequentially
- executed in order to find all the matching documents to delete. Every time a batch
- of documents is found, a corresponding bulk request is executed to delete all
- these documents. If a search or bulk request is rejected, `_delete_by_query`
- relies on a default policy to retry rejected requests (up to 10 times, with
- exponential back off). Reaching the maximum retries limit causes `_delete_by_query`
- to abort, and all failures are returned in the `failures` element of the response.
- The deletions that have been performed still stick. In other words, the process
- is not rolled back, only aborted. While the first failure causes the abort, all
- failures that are returned by the failing bulk request are returned in the `failures`
- element; therefore it's possible for there to be quite a few failed entities.
- If you'd like to count version conflicts rather than cause them to abort then
- set `conflicts=proceed` on the url or `"conflicts": "proceed"` in the request body.
- Back to the API format, you can limit `_delete_by_query` to a single type. This
- will only delete `tweet` documents from the `twitter` index:
- [source,js]
- --------------------------------------------------
- POST twitter/tweet/_delete_by_query?conflicts=proceed
- {
- "query": {
- "match_all": {}
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[setup:twitter]
- It's also possible to delete documents of multiple indexes and multiple
- types at once, just like the search API:
- [source,js]
- --------------------------------------------------
- POST twitter,blog/tweet,post/_delete_by_query
- {
- "query": {
- "match_all": {}
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[s/^/PUT twitter\nPUT blog\n/]
- If you provide `routing` then the routing is copied to the scroll query,
- limiting the process to the shards that match that routing value:
- [source,js]
- --------------------------------------------------
- POST twitter/_delete_by_query?routing=1
- {
- "query": {
- "range" : {
- "age" : {
- "gte" : 10
- }
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[setup:twitter]
- By default `_delete_by_query` uses scroll batches of 1000. You can change the
- batch size with the `scroll_size` URL parameter:
- [source,js]
- --------------------------------------------------
- POST twitter/_delete_by_query?scroll_size=5000
- {
- "query": {
- "term": {
- "user": "kimchy"
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[setup:twitter]
- [float]
- === URL Parameters
- In addition to the standard parameters like `pretty`, the Delete By Query API
- also supports `refresh`, `wait_for_completion`, `wait_for_active_shards`, and `timeout`.
- Sending `refresh` will refresh all shards involved in the delete by query
- once the request completes. This is different from the Delete API's `refresh`
- parameter which causes just the shard that received the delete request
- to be refreshed.
- If the request contains `wait_for_completion=false` then Elasticsearch will
- perform some preflight checks, launch the request, and then return a `task`
- which can be used with <<docs-delete-by-query-task-api,Tasks APIs>>
- to cancel or get the status of the task. Elasticsearch will also create a
- record of this task as a document at `.tasks/task/${taskId}`. This is yours
- to keep or remove as you see fit. When you are done with it, delete it so
- Elasticsearch can reclaim the space it uses.
- `wait_for_active_shards` controls how many copies of a shard must be active
- before proceeding with the request. See <<index-wait-for-active-shards,here>>
- for details. `timeout` controls how long each write request waits for unavailable
- shards to become available. Both work exactly how they work in the
- <<docs-bulk,Bulk API>>.
- `requests_per_second` can be set to any positive decimal number (`1.4`, `6`,
- `1000`, etc) and throttles the number of requests per second that the delete-by-query
- issues, or it can be set to `-1` to disable throttling. The throttling is done by
- waiting between bulk batches so that it can manipulate the scroll timeout. The
- wait time is the difference between the target time for the batch,
- `requests_in_the_batch / requests_per_second`, and the time it took the batch to
- complete. Since the batch isn't broken into multiple bulk requests, large batch
- sizes will cause Elasticsearch to create many requests and then wait for a while
- before starting the next set. This is "bursty" instead of "smooth". The default
- is `-1`.
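The wait-time arithmetic can be sketched in a few lines of Python. This is only an illustration, not Elasticsearch's implementation: the function name `throttle_wait_seconds` and its parameters are hypothetical, and it assumes the target time for a batch is the batch size divided by `requests_per_second`.

```python
def throttle_wait_seconds(batch_size, requests_per_second, batch_seconds):
    """Seconds to sleep after a batch so the average rate stays at the target.

    Hypothetical helper: batch_size is the number of requests in the batch,
    batch_seconds is how long the batch took to complete.
    """
    if requests_per_second == -1:
        # -1 disables throttling, so there is never any wait.
        return 0.0
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive or -1")
    # Assumption: the time the batch "should" take at the requested rate.
    target_seconds = batch_size / requests_per_second
    # A batch that already took longer than the target needs no extra wait.
    return max(0.0, target_seconds - batch_seconds)


if __name__ == "__main__":
    # A 1000-document batch at 500 requests per second that took half a second.
    print(throttle_wait_seconds(1000, 500, 0.5))
```

With a large `scroll_size` the whole batch is sent at once and the wait that follows is long, which is the "bursty" behaviour described above.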
- [float]
- === Response body
- The JSON response looks like this:
- [source,js]
- --------------------------------------------------
- {
- "took" : 639,
- "deleted": 0,
- "batches": 1,
- "version_conflicts": 2,
- "retries": 0,
- "throttled_millis": 0,
- "failures" : [ ]
- }
- --------------------------------------------------
- `took`::
- The number of milliseconds from start to end of the whole operation.
- `deleted`::
- The number of documents that were successfully deleted.
- `batches`::
- The number of scroll responses pulled back by the delete by query.
- `version_conflicts`::
- The number of version conflicts that the delete by query hit.
- `retries`::
- The number of retries that the delete by query did in response to a full queue.
- `throttled_millis`::
- Number of milliseconds the request slept to conform to `requests_per_second`.
- `failures`::
- Array of all indexing failures. If this is non-empty then the request aborted
- because of those failures. See `conflicts` for how to prevent version conflicts
- from aborting the operation.
- [float]
- [[docs-delete-by-query-task-api]]
- === Works with the Task API
- You can fetch the status of any running delete-by-query requests with the
- <<tasks,Task API>>:
- [source,js]
- --------------------------------------------------
- GET _tasks?detailed=true&actions=*/delete/byquery
- --------------------------------------------------
- // CONSOLE
- The response looks like:
- [source,js]
- --------------------------------------------------
- {
- "nodes" : {
- "r1A2WoRbTwKZ516z6NEs5A" : {
- "name" : "r1A2WoR",
- "transport_address" : "127.0.0.1:9300",
- "host" : "127.0.0.1",
- "ip" : "127.0.0.1:9300",
- "attributes" : {
- "testattr" : "test",
- "portsfile" : "true"
- },
- "tasks" : {
- "r1A2WoRbTwKZ516z6NEs5A:36619" : {
- "node" : "r1A2WoRbTwKZ516z6NEs5A",
- "id" : 36619,
- "type" : "transport",
- "action" : "indices:data/write/delete/byquery",
- "status" : { <1>
- "total" : 6154,
- "updated" : 0,
- "created" : 0,
- "deleted" : 3500,
- "batches" : 36,
- "version_conflicts" : 0,
- "noops" : 0,
- "retries": 0,
- "throttled_millis": 0
- },
- "description" : ""
- }
- }
- }
- }
- }
- --------------------------------------------------
- <1> this object contains the actual status. It is just like the response json
- with the important addition of the `total` field. `total` is the total number
- of operations that the delete by query expects to perform. You can estimate the
- progress by adding the `updated`, `created`, and `deleted` fields. The request
- will finish when their sum is equal to the `total` field.
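As a rough sketch of that progress estimate, the following Python helper sums the three counters and divides by `total`. The function name `estimate_progress` is hypothetical, and the input is assumed to be the `status` object from the task response, already parsed into a dictionary.

```python
def estimate_progress(status):
    """Return the completed fraction (0.0 to 1.0) of a task, or None if unknown.

    `status` is assumed to be the parsed `status` object of a task response.
    """
    total = status.get("total", 0)
    if total <= 0:
        # Nothing to estimate against yet.
        return None
    done = sum(status.get(key, 0) for key in ("updated", "created", "deleted"))
    # Clamp in case the counters briefly exceed the expected total.
    return min(1.0, done / total)


if __name__ == "__main__":
    # Counter values copied from the example task status above.
    sample_status = {"total": 6154, "updated": 0, "created": 0, "deleted": 3500}
    print(f"{estimate_progress(sample_status):.1%}")
```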
- With the task id you can look up the task directly:
- [source,js]
- --------------------------------------------------
- GET /_tasks/taskId:1
- --------------------------------------------------
- // CONSOLE
- // TEST[catch:missing]
- The advantage of this API is that it integrates with `wait_for_completion=false`
- to transparently return the status of completed tasks. If the task is completed
- and `wait_for_completion=false` was set on it then it'll come back with a
- `results` or an `error` field. The cost of this feature is the document that
- `wait_for_completion=false` creates at `.tasks/task/${taskId}`. It is up to
- you to delete that document.
- [float]
- [[docs-delete-by-query-cancel-task-api]]
- === Works with the Cancel Task API
- Any Delete By Query can be canceled using the <<tasks,Task Cancel API>>:
- [source,js]
- --------------------------------------------------
- POST _tasks/task_id:1/_cancel
- --------------------------------------------------
- // CONSOLE
- The `task_id` can be found using the tasks API above.
- Cancellation should happen quickly but might take a few seconds. The task status
- API above will continue to list the task until it wakes to cancel itself.
- [float]
- [[docs-delete-by-query-rethrottle]]
- === Rethrottling
- The value of `requests_per_second` can be changed on a running delete by query
- using the `_rethrottle` API:
- [source,js]
- --------------------------------------------------
- POST _delete_by_query/task_id:1/_rethrottle?requests_per_second=-1
- --------------------------------------------------
- // CONSOLE
- The `task_id` can be found using the tasks API above.
- Just like when setting it on the `_delete_by_query` API `requests_per_second`
- can be either `-1` to disable throttling or any decimal number
- like `1.7` or `12` to throttle to that level. Rethrottling that speeds up the
- query takes effect immediately but rethrottling that slows down the query will
- take effect after completing the current batch. This prevents scroll
- timeouts.
- [float]
- [[docs-delete-by-query-manual-slice]]
- === Manually slicing
- Delete-by-query supports <<sliced-scroll>> allowing you to manually parallelize
- the process relatively easily:
- [source,js]
- ----------------------------------------------------------------
- POST twitter/_delete_by_query
- {
- "slice": {
- "id": 0,
- "max": 2
- },
- "query": {
- "range": {
- "likes": {
- "lt": 10
- }
- }
- }
- }
- POST twitter/_delete_by_query
- {
- "slice": {
- "id": 1,
- "max": 2
- },
- "query": {
- "range": {
- "likes": {
- "lt": 10
- }
- }
- }
- }
- ----------------------------------------------------------------
- // CONSOLE
- // TEST[setup:big_twitter]
- Which you can verify works with:
- [source,js]
- ----------------------------------------------------------------
- GET _refresh
- POST twitter/_search?size=0&filter_path=hits.total
- {
- "query": {
- "range": {
- "likes": {
- "lt": 10
- }
- }
- }
- }
- ----------------------------------------------------------------
- // CONSOLE
- // TEST[continued]
- Which results in a sensible `total` like this one:
- [source,js]
- ----------------------------------------------------------------
- {
- "hits": {
- "total": 0
- }
- }
- ----------------------------------------------------------------
- // TESTRESPONSE
- [float]
- [[docs-delete-by-query-automatic-slice]]
- === Automatic slicing
- You can also let delete-by-query automatically parallelize using
- <<sliced-scroll>> to slice on `_uid`:
- [source,js]
- ----------------------------------------------------------------
- POST twitter/_delete_by_query?refresh&slices=5
- {
- "query": {
- "range": {
- "likes": {
- "lt": 10
- }
- }
- }
- }
- ----------------------------------------------------------------
- // CONSOLE
- // TEST[setup:big_twitter]
- Which you also can verify works with:
- [source,js]
- ----------------------------------------------------------------
- POST twitter/_search?size=0&filter_path=hits.total
- {
- "query": {
- "range": {
- "likes": {
- "lt": 10
- }
- }
- }
- }
- ----------------------------------------------------------------
- // CONSOLE
- // TEST[continued]
- Which results in a sensible `total` like this one:
- [source,js]
- ----------------------------------------------------------------
- {
- "hits": {
- "total": 0
- }
- }
- ----------------------------------------------------------------
- // TESTRESPONSE
- Adding `slices` to `_delete_by_query` just automates the manual process used in
- the section above, creating sub-requests which means it has some quirks:
- * You can see these requests in the
- <<docs-delete-by-query-task-api,Tasks APIs>>. These sub-requests are "child"
- tasks of the task for the request with `slices`.
- * Fetching the status of the task for the request with `slices` only contains
- the status of completed slices.
- * These sub-requests are individually addressable for things like cancellation
- and rethrottling.
- * Rethrottling the request with `slices` will rethrottle each unfinished
- sub-request proportionally.
- * Canceling the request with `slices` will cancel each sub-request.
- * Due to the nature of `slices` each sub-request won't get a perfectly even
- portion of the documents. All documents will be addressed, but some slices may
- be larger than others. Expect larger slices to have a more even distribution.
- * Parameters like `requests_per_second` and `size` on a request with `slices`
- are distributed proportionally to each sub-request. Combine that with the point
- above about distribution being uneven and you should conclude that using
- `size` with `slices` might not result in exactly `size` documents being
- `_delete_by_query`ed.
- * Each sub-request gets a slightly different snapshot of the source index
- though these are all taken at approximately the same time.
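The interaction between `size` and uneven slices can be illustrated with a small Python sketch. It assumes, purely for illustration, that `size` and `requests_per_second` are split evenly between sub-requests. The function names and the per-slice match counts are hypothetical and do not come from Elasticsearch itself.

```python
def per_slice_rate(requests_per_second, slices):
    """Throttle given to each sub-request, assuming an even split."""
    if requests_per_second == -1:
        # Unthrottled stays unthrottled for every sub-request.
        return -1
    return requests_per_second / slices


def deleted_with_slices(size, matches_per_slice):
    """Documents deleted when each slice may delete at most its share of `size`.

    `matches_per_slice` lists how many matching documents each slice holds.
    """
    share = size // len(matches_per_slice)  # assumption: even integer split
    # A slice cannot delete more than it holds, nor more than its share.
    return sum(min(matches, share) for matches in matches_per_slice)


if __name__ == "__main__":
    # Two slices, one holding far fewer matching documents than the other.
    print(per_slice_rate(100, 2))
    print(deleted_with_slices(100, [30, 200]))
```

In this example the first slice runs out of matching documents before using up its share, so fewer than `size` documents are deleted overall.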
- [float]
- [[docs-delete-by-query-picking-slices]]
- === Picking the number of slices
- At this point we have a few recommendations around the number of `slices` to
- use (the `max` parameter in the slice API if manually parallelizing):
- * Don't use large numbers. `500` creates fairly massive CPU thrash.
- * It is more efficient from a query performance standpoint to use some multiple
- of the number of shards in the source index.
- * Using exactly as many slices as there are shards in the source index is the
- most efficient from a query performance standpoint.
- * Deletion performance should scale linearly across available resources with
- the number of `slices`.
- * Whether deletion or query performance dominates that process depends on lots
- of factors like the documents being deleted and the cluster doing the
- deleting.