
[[query-dsl-common-terms-query]]
=== Common Terms Query

The `common` terms query is a modern alternative to stopwords which
improves the precision and recall of search results (by taking stopwords
into account), without sacrificing performance.

[float]
==== The problem

Every term in a query has a cost. A search for `"The brown fox"`
requires three term queries, one for each of `"the"`, `"brown"` and
`"fox"`, all of which are executed against all documents in the index.
The query for `"the"` is likely to match many documents and thus has a
much smaller impact on relevance than the other two terms.

Previously, the solution to this problem was to ignore terms with high
frequency. By treating `"the"` as a _stopword_, we reduce the index size
and reduce the number of term queries that need to be executed.

The problem with this approach is that, while stopwords have a small
impact on relevance, they are still important. If we remove stopwords,
we lose precision (eg we are unable to distinguish between `"happy"`
and `"not happy"`) and we lose recall (eg text like `"The The"` or
`"To be or not to be"` would simply not exist in the index).

[float]
==== The solution

The `common` terms query divides the query terms into two groups: more
important (ie _low frequency_ terms) and less important (ie _high
frequency_ terms which would previously have been stopwords).

First it searches for documents which match the more important terms.
These are the terms which appear in fewer documents and have a greater
impact on relevance.

Then, it executes a second query for the less important terms -- terms
which appear frequently and have a low impact on relevance. But instead
of calculating the relevance score for *all* matching documents, it only
calculates the `_score` for documents already matched by the first
query. In this way the high frequency terms can improve the relevance
calculation without paying the cost of poor performance.

If a query consists only of high frequency terms, then a single query is
executed as an `AND` (conjunction) query, in other words all terms are
required. Even though each individual term will match many documents,
the combination of terms narrows down the resultset to only the most
relevant. The single query can also be executed as an `OR` with a
specific
<<query-dsl-minimum-should-match,`minimum_should_match`>>,
in which case a high enough value should probably be used.

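To illustrate, if every term in a query such as `"to be or not to be"` sits
above the `cutoff_frequency`, the default behaviour is roughly equivalent to
the following `bool` query (duplicate terms omitted; this is only a sketch of
the conjunction described above, the actual rewrite happens internally):

[source,js]
--------------------------------------------------
{
    "bool": {
        "must": [
            { "term": { "body": "to"}},
            { "term": { "body": "be"}},
            { "term": { "body": "or"}},
            { "term": { "body": "not"}}
        ]
    }
}
--------------------------------------------------
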
Terms are allocated to the high or low frequency groups based on the
`cutoff_frequency`, which can be specified as an absolute frequency
(`>=1`) or as a relative frequency (`0.0 .. 1.0`). (Remember that document
frequencies are computed on a per shard level as explained in the blog post
{defguide}/relevance-is-broken.html[Relevance is broken].)

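The examples below all use a relative cutoff of `0.001` (0.1% of the documents
on a shard). An absolute cutoff works the same way but counts documents
directly; as a minimal sketch with an illustrative value, the following treats
any term that appears in 100 or more documents on a shard as high frequency:

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "the quick brown fox",
            "cutoff_frequency": 100
        }
    }
}
--------------------------------------------------
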
Perhaps the most interesting property of this query is that it adapts to
domain specific stopwords automatically. For example, on a video hosting
site, common terms like `"clip"` or `"video"` will automatically behave
as stopwords without the need to maintain a manual list.

[float]
==== Examples

In this example, words that have a document frequency greater than 0.1%
(eg `"this"` and `"is"`) will be treated as _common terms_.

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "this is bonsai cool",
            "cutoff_frequency": 0.001
        }
    }
}
--------------------------------------------------

The number of terms which should match can be controlled with the
<<query-dsl-minimum-should-match,`minimum_should_match`>>
(`high_freq`, `low_freq`), `low_freq_operator` (default `"or"`) and
`high_freq_operator` (default `"or"`) parameters.

For low frequency terms, set the `low_freq_operator` to `"and"` to make
all terms required:

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "nelly the elephant as a cartoon",
            "cutoff_frequency": 0.001,
            "low_freq_operator": "and"
        }
    }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
    "bool": {
        "must": [
            { "term": { "body": "nelly"}},
            { "term": { "body": "elephant"}},
            { "term": { "body": "cartoon"}}
        ],
        "should": [
            { "term": { "body": "the"}},
            { "term": { "body": "as"}},
            { "term": { "body": "a"}}
        ]
    }
}
--------------------------------------------------

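The high frequency group can be controlled in the same way. As a sketch (this
variant is not part of the original examples), setting `high_freq_operator` to
`"and"` combines the high frequency terms with `AND` rather than the default
`OR`:

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "nelly the elephant as a cartoon",
            "cutoff_frequency": 0.001,
            "high_freq_operator": "and"
        }
    }
}
--------------------------------------------------
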
Alternatively use
<<query-dsl-minimum-should-match,`minimum_should_match`>>
to specify a minimum number or percentage of low frequency terms which
must be present, for instance:

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "nelly the elephant as a cartoon",
            "cutoff_frequency": 0.001,
            "minimum_should_match": 2
        }
    }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
    "bool": {
        "must": {
            "bool": {
                "should": [
                    { "term": { "body": "nelly"}},
                    { "term": { "body": "elephant"}},
                    { "term": { "body": "cartoon"}}
                ],
                "minimum_should_match": 2
            }
        },
        "should": [
            { "term": { "body": "the"}},
            { "term": { "body": "as"}},
            { "term": { "body": "a"}}
        ]
    }
}
--------------------------------------------------

[float]
===== `minimum_should_match`

A different
<<query-dsl-minimum-should-match,`minimum_should_match`>>
can be applied for low and high frequency terms with the additional
`low_freq` and `high_freq` parameters. Here is an example when providing
additional parameters (note the change in structure):

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "nelly the elephant not as a cartoon",
            "cutoff_frequency": 0.001,
            "minimum_should_match": {
                "low_freq" : 2,
                "high_freq" : 3
            }
        }
    }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
    "bool": {
        "must": {
            "bool": {
                "should": [
                    { "term": { "body": "nelly"}},
                    { "term": { "body": "elephant"}},
                    { "term": { "body": "cartoon"}}
                ],
                "minimum_should_match": 2
            }
        },
        "should": {
            "bool": {
                "should": [
                    { "term": { "body": "the"}},
                    { "term": { "body": "not"}},
                    { "term": { "body": "as"}},
                    { "term": { "body": "a"}}
                ],
                "minimum_should_match": 3
            }
        }
    }
}
--------------------------------------------------

In this case it means the high frequency terms only have an impact on
relevance when there are at least three of them. But the most
interesting use of the
<<query-dsl-minimum-should-match,`minimum_should_match`>>
for high frequency terms is when there are only high frequency terms:

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "how not to be",
            "cutoff_frequency": 0.001,
            "minimum_should_match": {
                "low_freq" : 2,
                "high_freq" : 3
            }
        }
    }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
    "bool": {
        "should": [
            { "term": { "body": "how"}},
            { "term": { "body": "not"}},
            { "term": { "body": "to"}},
            { "term": { "body": "be"}}
        ],
        "minimum_should_match": "3<50%"
    }
}
--------------------------------------------------

The generated query for the high frequency terms is then slightly less
restrictive than with an `AND`.

The `common` terms query also supports `boost`, `analyzer` and
`disable_coord` as parameters.

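As a sketch of how these options might be combined with the parameters shown
above (the values here are illustrative only):

[source,js]
--------------------------------------------------
{
    "common": {
        "body": {
            "query": "nelly the elephant as a cartoon",
            "cutoff_frequency": 0.001,
            "low_freq_operator": "and",
            "boost": 2,
            "analyzer": "standard",
            "disable_coord": true
        }
    }
}
--------------------------------------------------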