scoring.asciidoc 7.2 KB

[[consistent-scoring]]
=== Getting consistent scoring

The fact that Elasticsearch operates with shards and replicas adds challenges
when it comes to having good scoring.

[discrete]
==== Scores are not reproducible

Say the same user runs the same request twice in a row and documents do not come
back in the same order both times. This is a pretty bad experience, isn't it?
Unfortunately this is something that can happen if you have replicas
(`index.number_of_replicas` is greater than 0). The reason is that Elasticsearch
selects the shards that the query should go to in a round-robin fashion, so it
is quite likely if you run the same query twice in a row that it will go to
different copies of the same shard.

Now why is it a problem? Index statistics are an important part of the score,
and these index statistics may be different across copies of the same shard
due to deleted documents. As you may know, when documents are deleted or updated,
the old document is not immediately removed from the index. It is just marked
as deleted and is only removed from disk the next time the segment it belongs
to is merged. However, for practical reasons, those deleted documents are still
taken into account for index statistics. So imagine that the primary shard just
finished a large merge that removed lots of deleted documents: it might then
have index statistics that are sufficiently different from the replica (which
still has plenty of deleted documents) that scores are different too.

The recommended way to work around this issue is to use a string that identifies
the user that is logged in (a user id or session id for instance) as a
<<search-preference,preference>>. This ensures that all queries of a
given user always hit the same shards, so scores remain more
consistent across queries.

This workaround has another benefit: when two documents have the same score,
they are sorted by their internal Lucene doc id (which is unrelated to the
`_id`) by default. However, these doc ids could be different across copies of
the same shard. So by always hitting the same shard, you get more
consistent ordering of documents that have the same score.

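
As a minimal sketch of this workaround, the identifier can be passed through
the `preference` request parameter. The index name `index`, the `body` field,
and the value `xyzabc123` are placeholders chosen for illustration. In practice
the value would be the user or session identifier:

[source,console]
--------------------------------------------------
GET index/_search?preference=xyzabc123
{
  "query": {
    "match": { "body": "elasticsearch" }
  }
}
--------------------------------------------------
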
[discrete]
==== Relevancy looks wrong

If you notice that two documents with the same content get different scores or
that an exact match is not ranked first, then the issue might be related to
sharding. By default, Elasticsearch makes each shard responsible for producing
its own scores. However, since index statistics are an important contributor to
the scores, this only works well if shards have similar index statistics. The
assumption is that since documents are routed evenly to shards by default,
index statistics should be very similar and scoring should work as expected.
However, in the event that you either:

- use routing at index time,
- query multiple _indices_,
- or have too little data in your index

then there are good chances that the shards involved in the search
request do not all have similar index statistics, and relevancy could be bad.

If you have a small dataset, the easiest way to work around this issue is to
index everything into an index that has a single shard
(`index.number_of_shards: 1`), which is the default. Then index statistics
will be the same for all documents and scores will be consistent.

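
As a sketch of the single-shard setup, the index could be created with an
explicit shard count. The index name `my-small-index` is a placeholder chosen
for illustration, and setting the value explicitly is only needed if your
defaults differ:

[source,console]
--------------------------------------------------
PUT my-small-index
{
  "settings": {
    "index": {
      "number_of_shards": 1
    }
  }
}
--------------------------------------------------
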
Otherwise the recommended way to work around this issue is to use the
<<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This makes
Elasticsearch perform an initial round trip to all involved shards, asking
them for their index statistics relative to the query. The coordinating
node then merges those statistics and sends the merged statistics alongside the
request when asking shards to perform the `query` phase, so that shards can
use these global statistics rather than their own statistics to do the
scoring.

In most cases, this additional round trip should be very cheap. However, if
your query contains a very large number of fields/terms or fuzzy
queries, beware that gathering statistics alone might not be cheap, since all
terms have to be looked up in the terms dictionaries in order to look up
statistics.

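
For illustration, the `dfs_query_then_fetch` search type can be requested
through the `search_type` request parameter. This is a sketch that assumes an
index named `index` with a `body` text field. Both names are placeholders:

[source,console]
--------------------------------------------------
GET index/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": { "body": "elasticsearch" }
  }
}
--------------------------------------------------
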
[[static-scoring-signals]]
=== Incorporating static relevance signals into the score

Many domains have static signals that are known to be correlated with relevance.
For instance, {wikipedia}/PageRank[PageRank] and URL length are
two commonly used features for web search in order to tune the score of web
pages independently of the query.

There are two main queries that allow combining static score contributions with
textual relevance, e.g. as computed with BM25:

- <<query-dsl-script-score-query,`script_score` query>>
- <<query-dsl-rank-feature-query,`rank_feature` query>>

For instance, imagine that you have a `pagerank` field that you wish to
combine with the BM25 score so that the final score is equal to
`score = bm25_score + pagerank / (10 + pagerank)`.

With the <<query-dsl-script-score-query,`script_score` query>> the query would
look like this:

//////////////////////////

[source,console]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "long"
      }
    }
  }
}
--------------------------------------------------

//////////////////////////

[source,console]
--------------------------------------------------
GET index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match": { "body": "elasticsearch" }
      },
      "script": {
        "source": "_score + saturation(doc['pagerank'].value, 10)" <1>
      }
    }
  }
}
--------------------------------------------------
//TEST[continued]

<1> `pagerank` must be mapped as a <<number>>

With the <<query-dsl-rank-feature-query,`rank_feature` query>>, it would
look like this:

//////////////////////////

[source,console]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "rank_feature"
      }
    }
  }
}
--------------------------------------------------
// TEST

//////////////////////////

[source,console]
--------------------------------------------------
GET _search
{
  "query": {
    "bool": {
      "must": {
        "match": { "body": "elasticsearch" }
      },
      "should": {
        "rank_feature": {
          "field": "pagerank", <1>
          "saturation": {
            "pivot": 10
          }
        }
      }
    }
  }
}
--------------------------------------------------

<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field

While both options would return similar scores, there are trade-offs:
<<query-dsl-script-score-query,script_score>> provides a lot of flexibility,
enabling you to combine the text relevance score with static signals as you
prefer. On the other hand, the <<query-dsl-rank-feature-query,`rank_feature` query>> only
exposes a couple of ways to incorporate static signals into the score. However,
it relies on the <<rank-feature,`rank_feature`>> and
<<rank-features,`rank_features`>> fields, which index values in a special way
that allows the <<query-dsl-rank-feature-query,`rank_feature` query>> to skip
over non-competitive documents and get the top matches of a query faster.