- [[query-dsl-common-terms-query]]
- === Common Terms Query
- The `common` terms query is a modern alternative to stopwords which
- improves the precision and recall of search results (by taking stopwords
- into account), without sacrificing performance.
- [float]
- ==== The problem
- Every term in a query has a cost. A search for `"The brown fox"`
- requires three term queries, one for each of `"the"`, `"brown"` and
- `"fox"`, all of which are executed against all documents in the index.
- The query for `"the"` is likely to match many documents and thus has a
- much smaller impact on relevance than the other two terms.
- Previously, the solution to this problem was to ignore terms with high
- frequency. By treating `"the"` as a _stopword_, we reduce the index size
- and reduce the number of term queries that need to be executed.
- The problem with this approach is that, while stopwords have a small
- impact on relevance, they are still important. If we remove stopwords,
- we lose precision (e.g. we are unable to distinguish between `"happy"`
- and `"not happy"`) and we lose recall (e.g. text like `"The The"` or
- `"To be or not to be"` would simply not exist in the index).
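To make the loss concrete, here is a minimal Python sketch that applies a small stopword list to the phrases mentioned above. The `STOPWORDS` set and the `analyze` helper are hypothetical and chosen only for illustration; they do not reproduce any particular analyzer.

```python
# Hypothetical stopword list, chosen only for illustration.
STOPWORDS = {"the", "to", "be", "or", "not", "a", "as", "is", "this"}


def analyze(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]


# Precision loss: both phrases reduce to the same single token.
print(analyze("happy"))
print(analyze("not happy"))

# Recall loss: nothing is left to index.
print(analyze("To be or not to be"))
print(analyze("The The"))
```

With this list, `"happy"` and `"not happy"` become indistinguishable, and the last two phrases produce no tokens at all.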
- [float]
- ==== The solution
- The `common` terms query divides the query terms into two groups: more
- important (i.e. _low frequency_ terms) and less important (i.e. _high
- frequency_ terms which would previously have been stopwords).
- First it searches for documents which match the more important terms.
- These are the terms which appear in fewer documents and have a greater
- impact on relevance.
- Then, it executes a second query for the less important terms -- terms
- which appear frequently and have a low impact on relevance. But instead
- of calculating the relevance score for *all* matching documents, it only
- calculates the `_score` for documents already matched by the first
- query. In this way the high frequency terms can improve the relevance
- calculation without paying the cost of poor performance.
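The two phases can be pictured with a toy in-memory model. This is only a sketch: the `common_terms_search` function, the match-count scoring and the `HIGH_FREQ_WEIGHT` constant are assumptions made for illustration and are unrelated to how scores are really computed.

```python
# Arbitrary weight for high frequency matches; real scoring is far more involved.
HIGH_FREQ_WEIGHT = 0.1


def common_terms_search(docs, low_freq_terms, high_freq_terms):
    """Toy two-phase search over docs, a dict of doc id -> set of terms."""
    scores = {}
    # Phase 1: only documents matching a low frequency term become candidates.
    for doc_id, terms in docs.items():
        hits = sum(1 for t in low_freq_terms if t in terms)
        if hits:
            scores[doc_id] = float(hits)
    # Phase 2: high frequency terms are scored for those candidates only.
    for doc_id in scores:
        extra = sum(1 for t in high_freq_terms if t in docs[doc_id])
        scores[doc_id] += HIGH_FREQ_WEIGHT * extra
    return scores


docs = {
    1: {"the", "brown", "fox"},
    2: {"the", "lazy", "dog"},  # matches only "the", so it is never scored
    3: {"brown", "bear"},
}
print(common_terms_search(docs, ["brown", "fox"], ["the"]))
```

Document 2 contains the high frequency term but no low frequency term, so it never enters the result, while document 1 gets a small bonus for also containing `"the"`.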
- If a query consists only of high frequency terms, then a single query is
- executed as an `AND` (conjunction) query; in other words, all terms are
- required. Even though each individual term will match many documents,
- the combination of terms narrows down the result set to only the most
- relevant. The single query can also be executed as an `OR` with a
- specific
- <<query-dsl-minimum-should-match,`minimum_should_match`>>;
- in this case a high enough value should probably be used.
- Terms are allocated to the high or low frequency groups based on the
- `cutoff_frequency`, which can be specified as an absolute frequency
- (`>=1`) or as a relative frequency (`0.0 .. 1.0`). (Remember that document
- frequencies are computed on a per shard level as explained in the blog post
- {defguide}/relevance-is-broken.html[Relevance is broken].)
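The allocation rule can be sketched in a few lines of Python. The `split_terms` helper and the document frequencies below are hypothetical, and the exact boundary handling (a strict "greater than" comparison) is an assumption rather than a statement about the implementation.

```python
def split_terms(terms, doc_freq, num_docs, cutoff_frequency):
    """Split terms into (low_freq, high_freq) lists.

    doc_freq maps term -> number of documents containing it
    (made-up statistics for a single shard).
    """
    # Values >= 1 are read as absolute document counts,
    # values below 1 as a fraction of all documents.
    if cutoff_frequency >= 1:
        max_doc_freq = cutoff_frequency
    else:
        max_doc_freq = cutoff_frequency * num_docs
    low, high = [], []
    for term in terms:
        if doc_freq.get(term, 0) > max_doc_freq:
            high.append(term)
        else:
            low.append(term)
    return low, high


doc_freq = {"this": 9000, "is": 8500, "bonsai": 4, "cool": 7}
terms = ["this", "is", "bonsai", "cool"]

print(split_terms(terms, doc_freq, 10_000, 0.001))  # relative: 0.1% of 10,000 docs
print(split_terms(terms, doc_freq, 10_000, 5))      # absolute: more than 5 docs
```

Because the split depends only on observed document frequencies, a term such as `"cool"` moves between groups as the cutoff or the corpus changes, without any list being maintained by hand.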
- Perhaps the most interesting property of this query is that it adapts to
- domain specific stopwords automatically. For example, on a video hosting
- site, common terms like `"clip"` or `"video"` will automatically behave
- as stopwords without the need to maintain a manual list.
- [float]
- ==== Examples
- In this example, words that have a document frequency greater than 0.1%
- (e.g. `"this"` and `"is"`) will be treated as _common terms_.
- [source,js]
- --------------------------------------------------
- {
-     "common": {
-         "body": {
-             "query": "this is bonsai cool",
-             "cutoff_frequency": 0.001
-         }
-     }
- }
- --------------------------------------------------
- The number of terms which should match can be controlled with the
- <<query-dsl-minimum-should-match,`minimum_should_match`>>
- (`high_freq`, `low_freq`), `low_freq_operator` (default `"or"`) and
- `high_freq_operator` (default `"or"`) parameters.
- For low frequency terms, set the `low_freq_operator` to `"and"` to make
- all terms required:
- [source,js]
- --------------------------------------------------
- {
-     "common": {
-         "body": {
-             "query": "nelly the elephant as a cartoon",
-             "cutoff_frequency": 0.001,
-             "low_freq_operator": "and"
-         }
-     }
- }
- --------------------------------------------------
- which is roughly equivalent to:
- [source,js]
- --------------------------------------------------
- {
-     "bool": {
-         "must": [
-             { "term": { "body": "nelly"}},
-             { "term": { "body": "elephant"}},
-             { "term": { "body": "cartoon"}}
-         ],
-         "should": [
-             { "term": { "body": "the"}},
-             { "term": { "body": "as"}},
-             { "term": { "body": "a"}}
-         ]
-     }
- }
- --------------------------------------------------
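The mapping from a `common` query to the rough `bool` equivalent shown above can be sketched as a small Python helper. The `to_bool_query` function is hypothetical, it expects the terms to be split into groups already, and its handling of the `"or"` case (a nested `bool` with `minimum_should_match` of 1) is an assumption.

```python
import json


def to_bool_query(field, low_freq_terms, high_freq_terms, low_freq_operator="or"):
    """Build a rough bool equivalent for terms that are already split into groups."""

    def term(t):
        return {"term": {field: t}}

    low_clauses = [term(t) for t in low_freq_terms]
    if low_freq_operator == "and":
        # Every low frequency term is required.
        must = low_clauses
    else:
        # Assumption: "or" means at least one low frequency term has to match.
        must = {"bool": {"should": low_clauses, "minimum_should_match": 1}}
    # High frequency terms are optional and only influence the score.
    return {"bool": {"must": must, "should": [term(t) for t in high_freq_terms]}}


query = to_bool_query(
    "body", ["nelly", "elephant", "cartoon"], ["the", "as", "a"], "and"
)
print(json.dumps(query, indent=4))
```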
- Alternatively, use
- <<query-dsl-minimum-should-match,`minimum_should_match`>>
- to specify a minimum number or percentage of low frequency terms which
- must be present, for instance:
- [source,js]
- --------------------------------------------------
- {
-     "common": {
-         "body": {
-             "query": "nelly the elephant as a cartoon",
-             "cutoff_frequency": 0.001,
-             "minimum_should_match": 2
-         }
-     }
- }
- --------------------------------------------------
- which is roughly equivalent to:
- [source,js]
- --------------------------------------------------
- {
-     "bool": {
-         "must": {
-             "bool": {
-                 "should": [
-                     { "term": { "body": "nelly"}},
-                     { "term": { "body": "elephant"}},
-                     { "term": { "body": "cartoon"}}
-                 ],
-                 "minimum_should_match": 2
-             }
-         },
-         "should": [
-             { "term": { "body": "the"}},
-             { "term": { "body": "as"}},
-             { "term": { "body": "a"}}
-         ]
-     }
- }
- --------------------------------------------------
- [float]
- ===== minimum_should_match
- A different
- <<query-dsl-minimum-should-match,`minimum_should_match`>>
- can be applied for low and high frequency terms with the additional
- `low_freq` and `high_freq` parameters. Here is an example when providing
- additional parameters (note the change in structure):
- [source,js]
- --------------------------------------------------
- {
-     "common": {
-         "body": {
-             "query": "nelly the elephant not as a cartoon",
-             "cutoff_frequency": 0.001,
-             "minimum_should_match": {
-                 "low_freq" : 2,
-                 "high_freq" : 3
-             }
-         }
-     }
- }
- --------------------------------------------------
- which is roughly equivalent to:
- [source,js]
- --------------------------------------------------
- {
-     "bool": {
-         "must": {
-             "bool": {
-                 "should": [
-                     { "term": { "body": "nelly"}},
-                     { "term": { "body": "elephant"}},
-                     { "term": { "body": "cartoon"}}
-                 ],
-                 "minimum_should_match": 2
-             }
-         },
-         "should": {
-             "bool": {
-                 "should": [
-                     { "term": { "body": "the"}},
-                     { "term": { "body": "not"}},
-                     { "term": { "body": "as"}},
-                     { "term": { "body": "a"}}
-                 ],
-                 "minimum_should_match": 3
-             }
-         }
-     }
- }
- --------------------------------------------------
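Read as a per-document check, the `bool` query above can be approximated with the following Python sketch. The `evaluate` helper is hypothetical and models documents as plain sets of terms; it only shows which clause decides matching and which one can contribute to the score.

```python
def evaluate(doc_terms, low_terms, high_terms, low_min, high_min):
    """Return (matches, high_freq_scored) for one document's set of terms."""
    low_hits = sum(1 for t in low_terms if t in doc_terms)
    high_hits = sum(1 for t in high_terms if t in doc_terms)
    # The "must" clause: enough low frequency terms have to be present.
    matches = low_hits >= low_min
    # The "should" clause: contributes to the score only if it is satisfied.
    high_freq_scored = matches and high_hits >= high_min
    return matches, high_freq_scored


LOW = ["nelly", "elephant", "cartoon"]
HIGH = ["the", "not", "as", "a"]

print(evaluate({"nelly", "the", "elephant"}, LOW, HIGH, 2, 3))
print(evaluate({"nelly", "the", "elephant", "as", "a", "cartoon"}, LOW, HIGH, 2, 3))
print(evaluate({"the", "not", "as", "a", "cartoon"}, LOW, HIGH, 2, 3))
```

The first document matches without any contribution from the high frequency terms, the second matches and is boosted by them, and the third does not match at all even though it contains every high frequency term.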
- In this case the high frequency terms affect relevance only when at
- least three of them are present. But the most
- interesting use of the
- <<query-dsl-minimum-should-match,`minimum_should_match`>>
- for high frequency terms is when there are only high frequency terms:
- [source,js]
- --------------------------------------------------
- {
-     "common": {
-         "body": {
-             "query": "how not to be",
-             "cutoff_frequency": 0.001,
-             "minimum_should_match": {
-                 "low_freq" : 2,
-                 "high_freq" : 3
-             }
-         }
-     }
- }
- --------------------------------------------------
- which is roughly equivalent to:
- [source,js]
- --------------------------------------------------
- {
-     "bool": {
-         "should": [
-             { "term": { "body": "how"}},
-             { "term": { "body": "not"}},
-             { "term": { "body": "to"}},
-             { "term": { "body": "be"}}
-         ],
-         "minimum_should_match": 3
-     }
- }
- --------------------------------------------------
- The query generated for the high frequency terms is then slightly less
- restrictive than with an `AND`.
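The difference from a strict `AND` can be checked with a short Python sketch. The `matches` helper is hypothetical and treats a document as a set of whitespace-separated terms; the value 3 mirrors the `high_freq` setting of the `common` query above.

```python
def matches(doc_terms, query_terms, required):
    """True if at least `required` of the query terms occur in the document."""
    return sum(1 for t in query_terms if t in doc_terms) >= required


query = "how not to be".split()
doc = set("to be or not to be".split())

print(matches(doc, query, 3))           # high_freq of 3: three of four terms suffice
print(matches(doc, query, len(query)))  # AND: all four terms are required
```

The document lacks `"how"`, so it satisfies the three-term minimum but would be rejected by an `AND`.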
- The `common` terms query also supports `boost`, `analyzer` and
- `disable_coord` as parameters.