|
@@ -60,8 +60,8 @@ request do not have similar index statistics and relevancy could be bad.
|
|
|
|
|
|
If you have a small dataset, the easiest way to work around this issue is to
|
|
|
index everything into an index that has a single shard
|
|
|
-(`index.number_of_shards: 1`). Then index statistics will be the same for all
|
|
|
-documents and scores will be consistent.
|
|
|
+(`index.number_of_shards: 1`), which is the default. Then index statistics
|
|
|
+will be the same for all documents and scores will be consistent.
|
|
|
|
|
|
Otherwise the recommended way to work around this issue is to use the
|
|
|
<<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make
|
|
@@ -78,3 +78,125 @@ queries, beware that gathering statistics alone might not be cheap since all
|
|
|
terms have to be looked up in the terms dictionaries in order to look up
|
|
|
statistics.
|
|
|
|
|
|
+[[static-scoring-signals]]
|
|
|
+=== Incorporating static relevance signals into the score
|
|
|
+
|
|
|
+Many domains have static signals that are known to be correlated with relevance.
|
|
|
+For instance, https://en.wikipedia.org/wiki/PageRank[PageRank] and URL length are
|
|
|
+two commonly used features for web search in order to tune the score of web
|
|
|
+pages independently of the query.
|
|
|
+
|
|
|
+There are two main queries that allow combining static score contributions with
|
|
|
+textual relevance, e.g. as computed with BM25:
|
|
|
+ - <<query-dsl-script-score-query,`script_score` query>>
|
|
|
+ - <<query-dsl-rank-feature-query,`rank_feature` query>>
|
|
|
+
|
|
|
+For instance, imagine that you have a `pagerank` field that you wish to
|
|
|
+combine with the BM25 score so that the final score is equal to
|
|
|
+`score = bm25_score + pagerank / (10 + pagerank)`.
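
To make the target formula concrete, the following Python sketch evaluates it for a few PageRank values. The function names and the sample BM25 score of 2.0 are illustrative assumptions, not part of Elasticsearch; only the pivot of 10 comes from the formula above.

```python
def saturation(value: float, pivot: float) -> float:
    """Saturation curve value / (pivot + value); lies in [0, 1) for value >= 0."""
    return value / (pivot + value)


def combined_score(bm25_score: float, pagerank: float, pivot: float = 10.0) -> float:
    """Add the saturated static signal to the textual relevance score."""
    return bm25_score + saturation(pagerank, pivot)


if __name__ == "__main__":
    # Hypothetical BM25 score of 2.0 combined with increasing PageRank values.
    for pagerank in (0, 10, 90):
        print(pagerank, combined_score(2.0, pagerank))
```

A document whose PageRank equals the pivot contributes exactly 0.5, and the contribution approaches but never reaches 1 as PageRank grows, so under this formula the static signal adds less than 1 to the BM25 score.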
|
|
|
+
|
|
|
+With the <<query-dsl-script-score-query,`script_score` query>> the query would
|
|
|
+look like this:
|
|
|
+
|
|
|
+//////////////////////////
|
|
|
+
|
|
|
+[source,js]
|
|
|
+--------------------------------------------------
|
|
|
+PUT index
|
|
|
+{
|
|
|
+ "mappings": {
|
|
|
+ "properties": {
|
|
|
+ "body": {
|
|
|
+ "type": "text"
|
|
|
+ },
|
|
|
+ "pagerank": {
|
|
|
+ "type": "long"
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+}
|
|
|
+--------------------------------------------------
|
|
|
+// CONSOLE
|
|
|
+// TEST
|
|
|
+
|
|
|
+//////////////////////////
|
|
|
+
|
|
|
+[source,js]
|
|
|
+--------------------------------------------------
|
|
|
+GET index/_search
|
|
|
+{
|
|
|
+ "query" : {
|
|
|
+ "script_score" : {
|
|
|
+ "query" : {
|
|
|
+ "match": { "body": "elasticsearch" }
|
|
|
+ },
|
|
|
+ "script" : {
|
|
|
+ "source" : "_score + saturation(doc['pagerank'].value, 10)" <1>
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+}
|
|
|
+--------------------------------------------------
|
|
|
+// CONSOLE
|
|
|
+// TEST[continued]
|
|
|
+<1> `pagerank` must be mapped as a <<number,numeric>> field
|
|
|
+
|
|
|
+With the <<query-dsl-rank-feature-query,`rank_feature` query>>, it would
+look like this:
|
|
|
+
|
|
|
+//////////////////////////
|
|
|
+
|
|
|
+[source,js]
|
|
|
+--------------------------------------------------
|
|
|
+PUT index
|
|
|
+{
|
|
|
+ "mappings": {
|
|
|
+ "properties": {
|
|
|
+ "body": {
|
|
|
+ "type": "text"
|
|
|
+ },
|
|
|
+ "pagerank": {
|
|
|
+ "type": "rank_feature"
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+}
|
|
|
+--------------------------------------------------
|
|
|
+// CONSOLE
|
|
|
+// TEST
|
|
|
+
|
|
|
+//////////////////////////
|
|
|
+
|
|
|
+[source,js]
|
|
|
+--------------------------------------------------
|
|
|
+GET index/_search
|
|
|
+{
|
|
|
+ "query" : {
|
|
|
+ "bool" : {
|
|
|
+ "must": {
|
|
|
+ "match": { "body": "elasticsearch" }
|
|
|
+ },
|
|
|
+ "should": {
|
|
|
+ "rank_feature": {
|
|
|
+ "field": "pagerank", <1>
|
|
|
+ "saturation": {
|
|
|
+ "pivot": 10
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+}
|
|
|
+--------------------------------------------------
|
|
|
+// CONSOLE
|
|
|
+<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field
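
If you build these requests programmatically, the two bodies can be generated side by side. The following Python sketch only assembles the JSON; the helper names are hypothetical, it assumes the `body` and `pagerank` fields from the examples above, and it uses the additive form of the target formula. Sending the request is left to whichever client you use.

```python
import json


def script_score_body(text: str, pivot: int = 10) -> dict:
    """Hypothetical helper: script_score request adding saturated pagerank to _score."""
    return {
        "query": {
            "script_score": {
                "query": {"match": {"body": text}},
                "script": {
                    "source": f"_score + saturation(doc['pagerank'].value, {pivot})"
                },
            }
        }
    }


def rank_feature_body(text: str, pivot: int = 10) -> dict:
    """Hypothetical helper: bool query whose should clause is a rank_feature query."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"body": text}},
                "should": {
                    "rank_feature": {
                        "field": "pagerank",
                        "saturation": {"pivot": pivot},
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(script_score_body("elasticsearch"), indent=2))
    print(json.dumps(rank_feature_body("elasticsearch"), indent=2))
```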
|
|
|
+
|
|
|
+While both options would return similar scores, there are trade-offs:
+<<query-dsl-script-score-query,`script_score`>> provides a lot of flexibility,
+enabling you to combine the text relevance score with static signals as you
+prefer. On the other hand, the
+<<query-dsl-rank-feature-query,`rank_feature` query>> only exposes a couple of
+ways to incorporate static signals into the score. However, it relies on the
+<<rank-feature,`rank_feature`>> and <<rank-features,`rank_features`>> fields,
+which index values in a special way that allows the query to skip over
+non-competitive documents and get the top matches of a query faster.
|