123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304 |
- [[recipes]]
- == Recipes
- [float]
- [[mixing-exact-search-with-stemming]]
- === Mixing exact search with stemming
- When building a search application, stemming is often a must as it is desirable
- for a query on `skiing` to match documents that contain `ski` or `skis`. But
- what if a user wants to search for `skiing` specifically? The typical way to do
- this would be to use a <<multi-fields,multi-field>> in order to have the same
- content indexed in two different ways:
- [source,js]
- --------------------------------------------------
- PUT index
- {
- "settings": {
- "analysis": {
- "analyzer": {
- "english_exact": {
- "tokenizer": "standard",
- "filter": [
- "lowercase"
- ]
- }
- }
- }
- },
- "mappings": {
- "type": {
- "properties": {
- "body": {
- "type": "text",
- "analyzer": "english",
- "fields": {
- "exact": {
- "type": "text",
- "analyzer": "english_exact"
- }
- }
- }
- }
- }
- }
- }
- PUT index/type/1
- {
- "body": "Ski resort"
- }
- PUT index/type/2
- {
- "body": "A pair of skis"
- }
- POST index/_refresh
- --------------------------------------------------
- // CONSOLE
- With such a setup, searching for `ski` on `body` would return both documents:
- [source,js]
- --------------------------------------------------
- GET index/_search
- {
- "query": {
- "simple_query_string": {
- "fields": [ "body" ],
- "query": "ski"
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[continued]
- [source,js]
- --------------------------------------------------
- {
- "took": 2,
- "timed_out": false,
- "_shards": {
- "total": 5,
- "successful": 5,
- "failed": 0
- },
- "hits": {
- "total": 2,
- "max_score": 0.25811607,
- "hits": [
- {
- "_index": "index",
- "_type": "type",
- "_id": "2",
- "_score": 0.25811607,
- "_source": {
- "body": "A pair of skis"
- }
- },
- {
- "_index": "index",
- "_type": "type",
- "_id": "1",
- "_score": 0.25811607,
- "_source": {
- "body": "Ski resort"
- }
- }
- ]
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"took": 2,/"took": "$body.took",/]
- On the other hand, searching for `ski` on `body.exact` would only return
- document `1` since the analysis chain of `body.exact` does not perform
- stemming.
- [source,js]
- --------------------------------------------------
- GET index/_search
- {
- "query": {
- "simple_query_string": {
- "fields": [ "body.exact" ],
- "query": "ski"
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[continued]
- [source,js]
- --------------------------------------------------
- {
- "took": 1,
- "timed_out": false,
- "_shards": {
- "total": 5,
- "successful": 5,
- "failed": 0
- },
- "hits": {
- "total": 1,
- "max_score": 0.25811607,
- "hits": [
- {
- "_index": "index",
- "_type": "type",
- "_id": "1",
- "_score": 0.25811607,
- "_source": {
- "body": "Ski resort"
- }
- }
- ]
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"took": 1,/"took": "$body.took",/]
- This is not something that is easy to expose to end users, as we would need to
- have a way to figure out whether they are looking for an exact match or not and
- redirect to the appropriate field accordingly. Also what to do if only parts of
- the query need to be matched exactly while other parts should still take
- stemming into account?
- Fortunately, the `query_string` and `simple_query_string` queries have a feature
- that allows to solve exactly this problem: `quote_field_suffix`. It allows to
- tell Elasticsearch that words that appear in between quotes should be redirected
- to a different field, see below:
- [source,js]
- --------------------------------------------------
- GET index/_search
- {
- "query": {
- "simple_query_string": {
- "fields": [ "body" ],
- "quote_field_suffix": ".exact",
- "query": "\"ski\""
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[continued]
- [source,js]
- --------------------------------------------------
- {
- "took": 2,
- "timed_out": false,
- "_shards": {
- "total": 5,
- "successful": 5,
- "failed": 0
- },
- "hits": {
- "total": 1,
- "max_score": 0.25811607,
- "hits": [
- {
- "_index": "index",
- "_type": "type",
- "_id": "1",
- "_score": 0.25811607,
- "_source": {
- "body": "Ski resort"
- }
- }
- ]
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"took": 2,/"took": "$body.took",/]
- In that case, since `ski` was in-between quotes, it was searched on the
- `body.exact` field due to the `quote_field_suffix` parameter, so only document
- `1` matched. This allows users to mix exact search with stemmed search as they
- like.
- [float]
- [[consistent-scoring]]
- === Getting consistent scoring
- The fact that Elasticsearch operates with shards and replicas adds challenges
- when it comes to having good scoring.
- [float]
- ==== Scores are not reproducible
- Say the same user runs the same request twice in a row and documents do not come
- back in the same order both times, this is a pretty bad experience isn't it?
- Unfortunately this is something that can happen if you have replicas
- (`index.number_of_replicas` is greater than 0). The reason is that Elasticsearch
- selects the shards that the query should go to in a round-robin fashion, so it
- is quite likely if you run the same query twice in a row that it will go to
- different copies of the same shard.
- Now why is it a problem? Index statistics are an important part of the score.
- And these index statistics may be different across copies of the same shard
- due to deleted documents. As you may know when documents are deleted or updated,
- the old document is not immediately removed from the index, it is just marked
- as deleted and it will only be removed from disk on the next time that the
- segment this old document belongs to is merged. However for practical reasons,
- those deleted documents are taken into account for index statistics. So imagine
- that the primary shard just finished a large merge that removed lots of deleted
- documents, then it might have index statistics that are sufficiently different
- from the replica (which still have plenty of deleted documents) so that scores
- are different too.
- The recommended way to work around this issue is to use a string that identifies
- the user that is logged is (a user id or session id for instance) as a
- <<search-request-preference,preference>>. This ensures that all queries of a
- given user are always going to hit the same shards, so scores remain more
- consistent across queries.
- This work around has another benefit: when two documents have the same score,
- they will be sorted by their internal Lucene doc id (which is unrelated to the
- `_id` or `_uid`) by default. However these doc ids could be different across
- copies of the same shard. So by always hitting the same shard, we would get
- more consistent ordering of documents that have the same scores.
- [float]
- ==== Relevancy looks wrong
- If you notice that two documents with the same content get different scores or
- that an exact match is not ranked first, then the issue might be related to
- sharding. By default, Elasticsearch makes each shard responsible for producing
- its own scores. However since index statistics are an important contributor to
- the scores, this only works well if shards have similar index statistics. The
- assumption is that since documents are routed evenly to shards by default, then
- index statistics should be very similar and scoring would work as expected.
- However in the event that you either
- - use routing at index time,
- - query multiple _indices_,
- - or have too little data in your index
- then there are good chances that all shards that are involved in the search
- request do not have similar index statistics and relevancy could be bad.
- If you have a small dataset, the easiest way to work around this issue is to
- index everything into an index that has a single shard
- (`index.number_of_shards: 1`). Then index statistics will be the same for all
- documents and scores will be consistent.
- Otherwise the recommended way to work around this issue is to use the
- <<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make
- Elasticsearch perform an inital round trip to all involved shards, asking
- them for their index statistics relatively to the query, then the coordinating
- node will merge those statistics and send the merged statistics alongside the
- request when asking shards to perform the `query` phase, so that shards can
- use these global statistics rather than their own statistics in order to do the
- scoring.
- In most cases, this additional round trip should be very cheap. However in the
- event that your query contains a very large number of fields/terms or fuzzy
- queries, beware that gathering statistics alone might not be cheap since all
- terms have to be looked up in the terms dictionaries in order to look up
- statistics.
|