123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270 |
- [[plugins-delete-by-query]]
- === Delete By Query Plugin
- The delete-by-query plugin adds support for deleting all of the documents
- (from one or more indices) which match the specified query. It is a
- replacement for the problematic _delete-by-query_ functionality which has been
- removed from Elasticsearch core.
- Internally, it uses the {ref}/search-request-scroll.html#scroll-scan[Scan/Scroll]
- and {ref}/docs-bulk.html[Bulk] APIs to delete documents in an efficient and
- safe manner. It is slower than the old _delete-by-query_ functionality, but
- fixes the problems with the previous implementation.
- To understand more about why we removed delete-by-query from core and about
- the semantics of the new implementation, see
- <<delete-by-query-plugin-reason>>.
- [TIP]
- ============================================
- Queries which match large numbers of documents may run for a long time,
- as every document has to be deleted individually. Don't use _delete-by-query_
- to clean out all or most documents in an index. Rather create a new index and
- perhaps reindex the documents you want to keep.
- ============================================
- [float]
- ==== Installation
- This plugin can be installed using the plugin manager:
- [source,sh]
- ----------------------------------------------------------------
- sudo bin/plugin install delete-by-query
- ----------------------------------------------------------------
- The plugin must be installed on every node in the cluster, and each node must
- be restarted after installation.
- [float]
- ==== Removal
- The plugin can be removed with the following command:
- [source,sh]
- ----------------------------------------------------------------
- sudo bin/plugin remove delete-by-query
- ----------------------------------------------------------------
- The node must be stopped before removing the plugin.
- [[delete-by-query-usage]]
- ==== Using Delete-by-Query
- The query can either be provided using a simple query string as
- a parameter:
- [source,shell]
- --------------------------------------------------
- DELETE /twitter/tweet/_query?q=user:kimchy
- --------------------------------------------------
- // AUTOSENSE
- or using the {ref}/query-dsl.html[Query DSL] defined within the request body:
- [source,js]
- --------------------------------------------------
- DELETE /twitter/tweet/_query
- {
- "query": { <1>
- "term": {
- "user": "kimchy"
- }
- }
- }
- --------------------------------------------------
- // AUTOSENSE
- <1> The query must be passed as a value to the `query` key, in the same way as
- the {ref}/search-search.html[search api].
- Both of the above examples end up doing the same thing, which is to delete all
- tweets from the twitter index for the user `kimchy`.
- Delete-by-query supports deletion across
- {ref}/search-search.html#search-multi-index-type[multiple indices and multiple types].
- [float]
- === Query-string parameters
- The following query string parameters are supported:
- `q`::
- Instead of using the {ref}/query-dsl.html[Query DSL] to pass a `query` in the request
- body, you can use the `q` query string parameter to specify a query using
- {ref}/query-dsl-query-string-query.html#query-string-syntax[`query_string` syntax].
- In this case, the following additional parameters are supported: `df`,
- `analyzer`, `default_operator`, `lowercase_expanded_terms`,
- `analyze_wildcard` and `lenient`.
- See {ref}/search-uri-request.html[URI search request] for details.
- `size`::
- The number of hits returned *per shard* by the {ref}/search-request-scroll.html#scroll-scan[scan]
- request. Defaults to 10. May also be specified in the request body.
- `timeout`::
- The maximum execution time of the delete by query process. Once expired, no
- more documents will be deleted.
- `routing`::
- A comma separated list of routing values to control which shards the delete by
- query request should be executed on.
- When using the `q` parameter, the following additional parameters are
- supported (as explained in {ref}/search-uri-request.html[URI search request]): `df`, `analyzer`,
- `default_operator`.
- [float]
- === Response body
- The JSON response looks like this:
- [source,js]
- --------------------------------------------------
- {
- "took" : 639,
- "timed_out" : false,
- "_indices" : {
- "_all" : {
- "found" : 5901,
- "deleted" : 5901,
- "missing" : 0,
- "failed" : 0
- },
- "twitter" : {
- "found" : 5901,
- "deleted" : 5901,
- "missing" : 0,
- "failed" : 0
- }
- },
- "failures" : [ ]
- }
- --------------------------------------------------
- Internally, the query is used to execute an initial
- {ref}/search-request-scroll.html#scroll-scan[scroll/scan] request. As hits are
- pulled from the scroll API, they are passed to the {ref}/docs-bulk.html[Bulk
- API] for deletion.
- IMPORTANT: Delete by query will only delete the version of the document that
- was visible to search at the time the request was executed. Any documents
- that have been reindexed or updated during execution will not be deleted.
- Since documents can be updated or deleted by external operations during the
- _scan-scroll-bulk_ process, the plugin keeps track of different counters for
- each index, with the totals displayed under the `_all` index. The counters
- are as follows:
- `found`::
- The number of documents matching the query for the given index.
- `deleted`::
- The number of documents successfully deleted for the given index.
- `missing`::
- The number of documents that were missing when the plugin tried to delete
- them. Missing documents were present when the original query was run, but have
- already been deleted by another process.
- `failed`::
- The number of documents that failed to be deleted for the given index. A
- document may fail to be deleted if it has been updated to a new version by
- another process, or if the shard containing the document has gone missing due
- to hardware failure, for example.
- [[delete-by-query-plugin-reason]]
- ==== Why Delete-By-Query is a plugin
- The old delete-by-query API in Elasticsearch 1.x was fast but problematic. We
- decided to remove the feature from Elasticsearch for these reasons:
- Forward compatibility::
- The old implementation wrote a delete-by-query request, including the
- query, to the transaction log. This meant that, when upgrading to a new
- version, old unsupported queries which cannot be executed might exist in
- the translog, thus causing data corruption.
- Consistency and correctness::
- The old implementation executed the query and deleted all matching docs on
- the primary first. It then repeated this procedure on each replica shard.
- There was no guarantee that the queries on the primary and the replicas
- matched the same document, so it was quite possible to end up with
- different documents on each shard copy.
- Resiliency::
- The old implementation could cause out-of-memory exceptions, merge storms,
- and dramatic slow downs if used incorrectly.
- [float]
- === New delete-by-query implementation
- The new implementation, provided by this plugin, is built internally
- using {ref}/search-request-scroll.html#scroll-scan[scan and scroll] to return
- the document IDs and versions of all the documents that need to be deleted.
- It then uses the {ref}/docs-bulk.html[`bulk` API] to do the actual deletion.
- This can have performance as well as visibility implications. Delete-by-query
- now has the following semantics:
- non-atomic::
- A delete-by-query may fail at any time while some documents matching the
- query have already been deleted.
- try-once::
- A delete-by-query may fail at any time and will not retry it's execution.
- All retry logic is left to the user.
- syntactic sugar::
- A delete-by-query is equivalent to a scan/scroll search and corresponding
- bulk-deletes by ID.
- point-in-time::
- A delete-by-query will only delete the documents that are visible at the
- point in time the delete-by-query was started, equivalent to the
- scan/scroll API.
- consistent::
- A delete-by-query will yield consistent results across all replicas of a
- shard.
- forward-compatible::
- A delete-by-query will only send IDs to the shards as deletes such that no
- queries are stored in the transaction logs that might not be supported in
- the future.
- visibility::
- The effect of a delete-by-query request will not be visible to search
- until the user refreshes the index, or the index is refreshed
- automatically.
- The new implementation suffers from two issues, which is why we decided to
- move the functionality to a plugin instead of replacing the feautre in core:
- * It is not as fast as the previous implementation. For most use cases, this
- difference should not be noticeable but users running delete-by-query on
- many matching documents may be affected.
- * There is currently no way to monitor or cancel a running delete-by-query
- request, except for the `timeout` parameter.
- We have plans to solve both of these issues in a later version of Elasticsearch.
|