  1. [[plugins-delete-by-query]]
  2. === Delete By Query Plugin
  3. The delete-by-query plugin adds support for deleting all of the documents
  4. (from one or more indices) which match the specified query. It is a
  5. replacement for the problematic _delete-by-query_ functionality which has been
  6. removed from Elasticsearch core.
  7. Internally, it uses the {ref}/search-request-scroll.html[Scroll]
  8. and {ref}/docs-bulk.html[Bulk] APIs to delete documents in an efficient and
  9. safe manner. It is slower than the old _delete-by-query_ functionality, but
  10. fixes the problems with the previous implementation.
  11. To understand more about why we removed delete-by-query from core and about
  12. the semantics of the new implementation, see
  13. <<delete-by-query-plugin-reason>>.
  14. [TIP]
  15. ============================================
  16. Queries which match large numbers of documents may run for a long time,
  17. as every document has to be deleted individually. Don't use _delete-by-query_
  18. to clean out all or most documents in an index. Instead, create a new index and,
  19. if needed, reindex the documents you want to keep.
  20. ============================================
  21. [float]
  22. ==== Installation
  23. This plugin can be installed using the plugin manager:
  24. [source,sh]
  25. ----------------------------------------------------------------
  26. sudo bin/plugin install delete-by-query
  27. ----------------------------------------------------------------
  28. The plugin must be installed on every node in the cluster, and each node must
  29. be restarted after installation.
  30. [float]
  31. ==== Removal
  32. The plugin can be removed with the following command:
  33. [source,sh]
  34. ----------------------------------------------------------------
  35. sudo bin/plugin remove delete-by-query
  36. ----------------------------------------------------------------
  37. The node must be stopped before removing the plugin.
  38. [[delete-by-query-usage]]
  39. ==== Using Delete-by-Query
  40. The query can either be provided using a simple query string as
  41. a parameter:
  42. [source,shell]
  43. --------------------------------------------------
  44. DELETE /twitter/tweet/_query?q=user:kimchy
  45. --------------------------------------------------
  46. // AUTOSENSE
  47. or using the {ref}/query-dsl.html[Query DSL] defined within the request body:
  48. [source,js]
  49. --------------------------------------------------
  50. DELETE /twitter/tweet/_query
  51. {
  52.   "query": { <1>
  53.     "term": {
  54.       "user": "kimchy"
  55.     }
  56.   }
  57. }
  58. --------------------------------------------------
  59. // AUTOSENSE
  60. <1> The query must be passed as a value to the `query` key, in the same way as
  61. the {ref}/search-search.html[search API].
  62. Both of the above examples end up doing the same thing, which is to delete all
  63. tweets from the twitter index for the user `kimchy`.
  64. Delete-by-query supports deletion across
  65. {ref}/search-search.html#search-multi-index-type[multiple indices and multiple types].
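As a rough illustration of the two request styles above, the following Python sketch builds the request line and body for either form without sending anything. The helper name `build_delete_by_query` is hypothetical and is not part of the plugin or of any client library. Only the `/{index}/{type}/_query` endpoint shape and the `q` and `query` keys come from the examples above, and joining several indices or types with commas is assumed to follow the search API convention.

```python
import json
from urllib.parse import quote, urlencode


def build_delete_by_query(indices, types=(), q=None, query=None):
    """Return (method, path, body) for a delete-by-query request.

    Hypothetical helper for illustration only; it sends nothing.
    """
    if (q is None) == (query is None):
        raise ValueError("provide exactly one of q or query")
    # Multiple indices or types are joined with commas (assumed convention).
    segments = [",".join(indices)]
    if types:
        segments.append(",".join(types))
    segments.append("_query")
    path = "/" + "/".join(quote(s, safe=",") for s in segments)
    if q is not None:
        # Simple query string passed as the `q` parameter.
        return "DELETE", path + "?" + urlencode({"q": q}), None
    # Query DSL wrapped in the `query` key of the request body.
    return "DELETE", path, json.dumps({"query": query})


uri_form = build_delete_by_query(["twitter"], ["tweet"], q="user:kimchy")
body_form = build_delete_by_query(
    ["twitter"], ["tweet"], query={"term": {"user": "kimchy"}}
)
```

Both tuples target the same path; they differ only in whether the query travels in the URL or in the body.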
  66. [float]
  67. === Query-string parameters
  68. The following query string parameters are supported:
  69. `q`::
  70. Instead of using the {ref}/query-dsl.html[Query DSL] to pass a `query` in the request
  71. body, you can use the `q` query string parameter to specify a query using
  72. {ref}/query-dsl-query-string-query.html#query-string-syntax[`query_string` syntax].
  73. In this case, the following additional parameters are supported: `df`,
  74. `analyzer`, `default_operator`, `lowercase_expanded_terms`,
  75. `analyze_wildcard` and `lenient`.
  76. See {ref}/search-uri-request.html[URI search request] for details.
  77. `size`::
  78. The number of hits returned by the {ref}/search-request-scroll.html[scroll]
  79. request. Defaults to 10. May also be specified in the request body.
  80. `timeout`::
  81. The maximum execution time of the delete-by-query process. Once expired, no
  82. more documents will be deleted.
  83. `routing`::
  84. A comma-separated list of routing values to control which shards the
  85. delete-by-query request should be executed on.
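To make the parameter list concrete, here is a small Python sketch that assembles a query string from the parameters described above. The function name `delete_by_query_params` and the sample values are made up for illustration. The time-value format passed to `timeout` (`30s`) is an assumption, since the text above does not specify it.

```python
from urllib.parse import urlencode

# Extra parameters that only make sense together with `q`.
Q_OPTIONS = {
    "df", "analyzer", "default_operator",
    "lowercase_expanded_terms", "analyze_wildcard", "lenient",
}


def delete_by_query_params(q=None, size=None, timeout=None, routing=(), **q_options):
    """Build the query string for a delete-by-query request (illustrative only)."""
    unknown = set(q_options) - Q_OPTIONS
    if unknown:
        raise ValueError("unsupported parameters: %s" % sorted(unknown))
    if q_options and q is None:
        raise ValueError("these parameters only apply together with q")
    params = {}
    if q is not None:
        params["q"] = q
        params.update(sorted(q_options.items()))
    if size is not None:
        params["size"] = size
    if timeout is not None:
        params["timeout"] = timeout
    if routing:
        # Routing values are sent as one comma-separated list.
        params["routing"] = ",".join(routing)
    # Keep ":" and "," readable instead of percent-encoding them.
    return urlencode(params, safe=",:")


query_string = delete_by_query_params(
    q="user:kimchy",
    default_operator="AND",
    size=100,
    timeout="30s",  # assumed time-value format
    routing=["kimchy", "elastic"],
)
```

The resulting string would be appended after `?` to a path such as `/twitter/tweet/_query`.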
  89. [float]
  90. === Response body
  91. The JSON response looks like this:
  92. [source,js]
  93. --------------------------------------------------
  94. {
  95.   "took" : 639,
  96.   "timed_out" : false,
  97.   "_indices" : {
  98.     "_all" : {
  99.       "found" : 5901,
  100.       "deleted" : 5901,
  101.       "missing" : 0,
  102.       "failed" : 0
  103.     },
  104.     "twitter" : {
  105.       "found" : 5901,
  106.       "deleted" : 5901,
  107.       "missing" : 0,
  108.       "failed" : 0
  109.     }
  110.   },
  111.   "failures" : [ ]
  112. }
  113. --------------------------------------------------
  114. Internally, the query is used to execute an initial
  115. {ref}/search-request-scroll.html[scroll] request. As hits are
  116. pulled from the scroll API, they are passed to the {ref}/docs-bulk.html[Bulk
  117. API] for deletion.
  118. IMPORTANT: Delete by query will only delete the version of the document that
  119. was visible to search at the time the request was executed. Any documents
  120. that have been reindexed or updated during execution will not be deleted.
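The scroll-then-bulk hand-off described above can be pictured with the following Python sketch, which turns one page of scroll hits into a newline-delimited bulk body of delete actions. It is not the plugin's actual code. The function name `hits_to_bulk_deletes` is hypothetical, and carrying the hit's version into each action as `_version` is an assumption about how the version check is expressed.

```python
import json


def hits_to_bulk_deletes(hits):
    """Turn scroll hits into a bulk request body made of delete actions.

    Illustrative only; each hit is expected to carry _index, _type and _id,
    plus _version when versions were requested in the scroll search.
    """
    lines = []
    for hit in hits:
        action = {
            "_index": hit["_index"],
            "_type": hit["_type"],
            "_id": hit["_id"],
        }
        if "_version" in hit:
            # Assumption: the version seen by the scroll is sent along so that
            # a document updated in the meantime is not deleted.
            action["_version"] = hit["_version"]
        lines.append(json.dumps({"delete": action}, sort_keys=True))
    # The bulk format is newline-delimited and ends with a newline.
    return "".join(line + "\n" for line in lines)


page = [
    {"_index": "twitter", "_type": "tweet", "_id": "1", "_version": 3},
    {"_index": "twitter", "_type": "tweet", "_id": "2", "_version": 1},
]
bulk_body = hits_to_bulk_deletes(page)
```

Each scroll page would produce one such body, repeated until the scroll returns no more hits.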
  121. Since documents can be updated or deleted by external operations during the
  122. _scroll-bulk_ process, the plugin keeps track of different counters for
  123. each index, with the totals displayed under the `_all` index. The counters
  124. are as follows:
  125. `found`::
  126. The number of documents matching the query for the given index.
  127. `deleted`::
  128. The number of documents successfully deleted for the given index.
  129. `missing`::
  130. The number of documents that were missing when the plugin tried to delete
  131. them. Missing documents were present when the original query was run, but have
  132. already been deleted by another process.
  133. `failed`::
  134. The number of documents that failed to be deleted for the given index. A
  135. document may fail to be deleted if it has been updated to a new version by
  136. another process, or if the shard containing the document has gone missing due
  137. to hardware failure, for example.
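A caller will usually want to know whether anything was left behind. The Python sketch below reads the counters from a response shaped like the example above. The function names `fully_deleted` and `indices_with_failures` are hypothetical, and treating `missing` documents as harmless (they are already gone) is a judgment call rather than something the plugin prescribes.

```python
def fully_deleted(response):
    """True if the request finished in time and nothing failed to delete."""
    totals = response["_indices"]["_all"]
    return (
        not response["timed_out"]
        and not response["failures"]
        and totals["failed"] == 0
    )


def indices_with_failures(response):
    """Names of the indices that report at least one failed deletion."""
    return sorted(
        name
        for name, counters in response["_indices"].items()
        if name != "_all" and counters["failed"] > 0
    )


# The example response shown above, as a Python dict.
response = {
    "took": 639,
    "timed_out": False,
    "_indices": {
        "_all": {"found": 5901, "deleted": 5901, "missing": 0, "failed": 0},
        "twitter": {"found": 5901, "deleted": 5901, "missing": 0, "failed": 0},
    },
    "failures": [],
}
ok = fully_deleted(response)
retry_candidates = indices_with_failures(response)
```

Because the plugin does not retry on its own, a check like this is one place where user-side retry logic could hook in.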
  138. [[delete-by-query-plugin-reason]]
  139. ==== Why Delete-By-Query is a plugin
  140. The old delete-by-query API in Elasticsearch 1.x was fast but problematic. We
  141. decided to remove the feature from Elasticsearch for these reasons:
  142. Forward compatibility::
  143. The old implementation wrote a delete-by-query request, including the
  144. query, to the transaction log. This meant that, when upgrading to a new
  145. version, old unsupported queries which cannot be executed might exist in
  146. the translog, thus causing data corruption.
  147. Consistency and correctness::
  148. The old implementation executed the query and deleted all matching docs on
  149. the primary first. It then repeated this procedure on each replica shard.
  150. There was no guarantee that the queries on the primary and the replicas
  151. matched the same documents, so it was quite possible to end up with
  152. different documents on each shard copy.
  153. Resiliency::
  154. The old implementation could cause out-of-memory exceptions, merge storms,
  155. and dramatic slowdowns if used incorrectly.
  156. [float]
  157. === New delete-by-query implementation
  158. The new implementation, provided by this plugin, is built internally
  159. using {ref}/search-request-scroll.html[scroll] to return
  160. the document IDs and versions of all the documents that need to be deleted.
  161. It then uses the {ref}/docs-bulk.html[`bulk` API] to do the actual deletion.
  162. This can have performance as well as visibility implications. Delete-by-query
  163. now has the following semantics:
  164. non-atomic::
  165. A delete-by-query may fail at any time, after some documents matching the
  166. query have already been deleted.
  167. try-once::
  168. A delete-by-query may fail at any time and will not retry its execution.
  169. All retry logic is left to the user.
  170. syntactic sugar::
  171. A delete-by-query is equivalent to a scroll search ordered by `_doc` and
  172. corresponding bulk-deletes by ID.
  173. point-in-time::
  174. A delete-by-query will only delete the documents that are visible at the
  175. point in time the delete-by-query was started, equivalent to the
  176. scan/scroll API.
  177. consistent::
  178. A delete-by-query will yield consistent results across all replicas of a
  179. shard.
  180. forward-compatible::
  181. A delete-by-query will only send IDs to the shards as deletes such that no
  182. queries are stored in the transaction logs that might not be supported in
  183. the future.
  184. visibility::
  185. The effect of a delete-by-query request will not be visible to search
  186. until the user refreshes the index, or the index is refreshed
  187. automatically.
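The semantics listed above can be explored with a toy, in-memory model. The Python sketch below is not the plugin's implementation: the index is a plain dict, `delete_by_query` and `between_batches` are hypothetical names, and the callback merely simulates other processes touching documents while the request runs. It shows how a point-in-time snapshot of IDs and versions, followed by batched deletes by ID, yields the `found`, `deleted`, `missing` and `failed` counters.

```python
def delete_by_query(index, matches, size=10, between_batches=None):
    """Toy model: index maps doc id -> (version, document)."""
    # Point-in-time: record ids and versions visible when the request starts.
    snapshot = [
        (doc_id, version)
        for doc_id, (version, doc) in index.items()
        if matches(doc)
    ]
    counters = {"found": len(snapshot), "deleted": 0, "missing": 0, "failed": 0}
    for start in range(0, len(snapshot), size):
        # One "scroll page" of ids, deleted one by one like a bulk request.
        for doc_id, version in snapshot[start:start + size]:
            current = index.get(doc_id)
            if current is None:
                counters["missing"] += 1   # already deleted by someone else
            elif current[0] != version:
                counters["failed"] += 1    # updated since the snapshot
            else:
                del index[doc_id]
                counters["deleted"] += 1
        if between_batches is not None:
            between_batches(index)         # simulate concurrent writers
    return counters


def concurrent_writer(index):
    # Another process deletes doc "3" and updates doc "4" mid-run.
    index.pop("3", None)
    if "4" in index:
        index["4"] = (2, {"user": "kimchy"})


tweets = {
    "1": (1, {"user": "kimchy"}),
    "2": (1, {"user": "kimchy"}),
    "3": (1, {"user": "kimchy"}),
    "4": (1, {"user": "kimchy"}),
    "5": (1, {"user": "other"}),
}
result = delete_by_query(
    tweets,
    lambda doc: doc["user"] == "kimchy",
    size=2,
    between_batches=concurrent_writer,
)
```

In this run the first batch deletes two documents, after which the simulated writer removes one document and updates another, so the second batch records one `missing` and one `failed`.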
  188. The new implementation suffers from two issues, which is why we decided to
  189. move the functionality to a plugin instead of replacing the feature in core:
  190. * It is not as fast as the previous implementation. For most use cases, this
  191. difference should not be noticeable, but users running delete-by-query on
  192. many matching documents may be affected.
  193. * There is currently no way to monitor or cancel a running delete-by-query
  194. request, except for the `timeout` parameter.
  195. We have plans to solve both of these issues in a later version of Elasticsearch.