[[consistent-scoring]]
=== Getting consistent scoring

The fact that Elasticsearch operates with shards and replicas adds challenges
when it comes to having good scoring.

[discrete]
==== Scores are not reproducible

Say the same user runs the same request twice in a row and documents do not come
back in the same order both times: this is a pretty bad experience, isn't it?
Unfortunately this is something that can happen if you have replicas
(`index.number_of_replicas` is greater than 0). The reason is that Elasticsearch
selects the shards that the query should go to in a round-robin fashion, so it
is quite likely that two runs of the same query will go to different copies of
the same shard.

Now why is this a problem? Index statistics are an important part of the score,
and these index statistics may be different across copies of the same shard
due to deleted documents. When documents are deleted or updated, the old
document is not immediately removed from the index: it is only marked as
deleted, and it will only be removed from disk the next time the segment it
belongs to is merged. However, for practical reasons, those deleted documents
are taken into account for index statistics. So imagine that the primary shard
just finished a large merge that removed lots of deleted documents. It might
then have index statistics that are sufficiently different from the replica
(which still has plenty of deleted documents) that scores differ too.

The recommended way to work around this issue is to use a string that identifies
the logged-in user (a user id or session id for instance) as a
<<search-preference,preference>>. This ensures that all queries of a
given user always hit the same shards, so scores remain more
consistent across queries.
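
For example, the sketch below pins all requests of one user to the same shard
copies by passing a custom `preference` value as a URL parameter. The index
name `my-index`, the `body` field, and the value `my-user-123` are illustrative
placeholders rather than part of the examples used elsewhere on this page.

[source,console]
--------------------------------------------------
GET my-index/_search?preference=my-user-123
{
  "query": {
    "match": { "body": "elasticsearch" }
  }
}
--------------------------------------------------
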
This workaround has another benefit: when two documents have the same score,
they will be sorted by their internal Lucene doc id (which is unrelated to the
`_id`) by default. However, these doc ids could be different across copies of
the same shard. So by always hitting the same shard, we get a more consistent
ordering of documents that have the same scores.

[discrete]
==== Relevancy looks wrong

If you notice that two documents with the same content get different scores or
that an exact match is not ranked first, then the issue might be related to
sharding. By default, Elasticsearch makes each shard responsible for producing
its own scores. However, since index statistics are an important contributor to
the scores, this only works well if shards have similar index statistics. The
assumption is that since documents are routed evenly to shards by default,
index statistics should be very similar and scoring would work as expected.
However, in the event that you either:

- use routing at index time,
- query multiple _indices_,
- or have too little data in your index,

then the shards that are involved in the search request will likely not have
similar index statistics, and relevancy could suffer.

If you have a small dataset, the easiest way to work around this issue is to
index everything into an index that has a single shard
(`index.number_of_shards: 1`), which is the default. Then index statistics
will be the same for all documents and scores will be consistent.
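
A minimal sketch of creating such a single-shard index explicitly is shown
below; the index name `my-index` is a placeholder, and the setting is only
spelled out for clarity since one primary shard is already the default.

[source,console]
--------------------------------------------------
PUT my-index
{
  "settings": {
    "index": {
      "number_of_shards": 1
    }
  }
}
--------------------------------------------------
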
Otherwise the recommended way to work around this issue is to use the
<<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This makes
Elasticsearch perform an initial round trip to all involved shards, asking
them for their index statistics relative to the query. The coordinating
node then merges those statistics and sends the merged statistics alongside the
request when asking shards to perform the `query` phase, so that shards
use these global statistics rather than their own statistics to compute
scores.
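
As an illustrative sketch (assuming an index named `my-index` with a `body`
text field), the search type is selected through the `search_type` URL
parameter:

[source,console]
--------------------------------------------------
GET my-index/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": { "body": "elasticsearch" }
  }
}
--------------------------------------------------
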
In most cases, this additional round trip should be very cheap. However, in the
event that your query contains a very large number of fields/terms or fuzzy
queries, be aware that gathering statistics alone might not be cheap, since all
terms have to be looked up in the terms dictionaries in order to retrieve their
statistics.

[[static-scoring-signals]]
=== Incorporating static relevance signals into the score

Many domains have static signals that are known to be correlated with relevance.
For instance {wikipedia}/PageRank[PageRank] and URL length are
two commonly used features for web search in order to tune the score of web
pages independently of the query.

There are two main queries that allow combining static score contributions with
textual relevance, e.g. as computed with BM25:

- <<query-dsl-script-score-query,`script_score` query>>
- <<query-dsl-rank-feature-query,`rank_feature` query>>

For instance imagine that you have a `pagerank` field that you wish to
combine with the BM25 score so that the final score is equal to
`score = bm25_score + pagerank / (10 + pagerank)`.

With the <<query-dsl-script-score-query,`script_score` query>> the query would
look like this:

//////////////////////////

[source,console]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "long"
      }
    }
  }
}
--------------------------------------------------

//////////////////////////

[source,console]
--------------------------------------------------
GET index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match": { "body": "elasticsearch" }
      },
      "script": {
        "source": "_score + saturation(doc['pagerank'].value, 10)" <1>
      }
    }
  }
}
--------------------------------------------------
//TEST[continued]

<1> `pagerank` must be mapped as a <<number>>

With the <<query-dsl-rank-feature-query,`rank_feature` query>>, it would
look like this:

//////////////////////////

[source,console]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "rank_feature"
      }
    }
  }
}
--------------------------------------------------
// TEST

//////////////////////////

[source,console]
--------------------------------------------------
GET _search
{
  "query": {
    "bool": {
      "must": {
        "match": { "body": "elasticsearch" }
      },
      "should": {
        "rank_feature": {
          "field": "pagerank", <1>
          "saturation": {
            "pivot": 10
          }
        }
      }
    }
  }
}
--------------------------------------------------

<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field

While both options would return similar scores, there are trade-offs:
<<query-dsl-script-score-query,script_score>> provides a lot of flexibility,
enabling you to combine the text relevance score with static signals as you
prefer. On the other hand, the <<query-dsl-rank-feature-query,`rank_feature` query>>
only exposes a couple of ways to incorporate static signals into the score.
However, it relies on the <<rank-feature,`rank_feature`>> and
<<rank-features,`rank_features`>> fields, which index values in a special way
that allows the <<query-dsl-rank-feature-query,`rank_feature` query>> to skip
over non-competitive documents and get the top matches of a query faster.