
Add initial documentation for RRF (#95687)

This is a follow up to (#93396) that adds documentation for RRF including an example with a 
breakdown of the RRF formula.
Jack Conradson, 2 years ago
commit 24c600748a
2 changed files with 432 additions and 4 deletions
  1. docs/reference/search/rrf.asciidoc (+423, -2)
  2. docs/reference/search/search.asciidoc (+9, -2)

docs/reference/search/rrf.asciidoc (+423, -2)

@@ -1,6 +1,427 @@
 [[rrf]]
 === Reciprocal rank fusion
 
-Reciprocal Rank Fusion (RRF) is a simple method to combine document result sets
-from multiple queries where the queries' document scores may be unrelated.
+preview::[]
+
+https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf[Reciprocal rank fusion (RRF)]
+is a method for combining multiple result sets with different relevance
+indicators into a single result set. RRF requires no tuning, and the different
+relevance indicators do not have to be related to each other to achieve high-quality
+results.
+
+RRF uses the following formula to determine the score for ranking each document:
+
+[source,python]
+----
+score = 0.0
+for q in queries:
+    if d in result(q):
+        score += 1.0 / ( k + rank( result(q), d ) )
+return score
+
+# where
+# k is a ranking constant
+# q is a query in the set of queries
+# d is a document in the result set of q
+# result(q) is the result set of q
+# rank( result(q), d ) is d's rank within result(q), starting from 1
+----
+// NOTCONSOLE
+
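+As a concrete illustration, here is a minimal, runnable sketch of the same
+calculation applied to two hypothetical ranked lists of document IDs. The
+document IDs, list contents, and the `rrf_scores` helper are invented for this
+example and are not part of any Elasticsearch API.
+
+[source,python]
+----
+# Illustrative only: combine two hypothetical ranked lists with the RRF formula.
+def rrf_scores(ranked_lists, k=60):
+    """Return a dict of document ID to RRF score for the given ranked lists."""
+    scores = {}
+    for ranked in ranked_lists:
+        for rank, doc_id in enumerate(ranked, start=1):
+            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
+    return scores
+
+# two hypothetical result sets, best match first
+query_hits = ["doc-a", "doc-b", "doc-c"]
+knn_hits = ["doc-b", "doc-d", "doc-a"]
+
+for doc_id, score in sorted(rrf_scores([query_hits, knn_hits]).items(),
+                            key=lambda item: item[1], reverse=True):
+    print(doc_id, round(score, 4))
+----
+// NOTCONSOLE
+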
+[[rrf-api]]
+==== Reciprocal rank fusion API
+
+You can use RRF as part of a <<search-search, search>> to combine and rank
+documents using multiple result sets from:
+
+* 1 query and 1 or more kNN searches
+* 2 or more kNN searches
+
+The `rrf` parameter is an optional object defined as part of a search request's
+<<request-body-rank, rank parameter>>. The `rrf` object contains the following
+parameters:
+
+`rank_constant`::
+(Optional, integer) This value determines how much influence documents in
+individual result sets per query have over the final ranked result set. A higher
+value gives lower-ranked documents more influence. This value must be greater
+than or equal to `1`. Defaults to `60`.
+
+`window_size`::
+(Optional, integer) This value determines the size of the individual result sets
+per query. A higher value will improve result relevance at the cost of performance.
+The final ranked result set is pruned down to the search request's
+<<search-size-param, size>>. `window_size` must be greater than or equal to `size`
+and greater than or equal to `1`. Defaults to `100`.
+
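+To get a feel for how `rank_constant` changes the weight of lower-ranked
+documents, the short sketch below (illustrative only, not part of the API)
+prints the per-document contribution `1.0 / (rank_constant + rank)` for a few
+ranks and constants.
+
+[source,python]
+----
+# Illustrative only: a larger rank_constant flattens the difference in
+# contribution between top-ranked and lower-ranked documents.
+for rank_constant in (1, 20, 60):
+    contributions = [round(1.0 / (rank_constant + rank), 4) for rank in (1, 2, 10, 50)]
+    print(rank_constant, contributions)
+----
+// NOTCONSOLE
+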
+An example request using RRF:
+
+[source,console]
+----
+GET example-index/_search
+{
+    "query": {
+        "term": {
+            "text": "shoes"
+        }
+    },
+    "knn": {
+        "field": "vector",
+        "query_vector": [1.25, 2, 3.5],
+        "k": 50,
+        "num_candidates": 100
+    },
+    "rank": {
+        "rrf": {
+            "window_size": 50,
+            "rank_constant": 20
+        }
+    }
+}
+----
+// TEST[skip:example fragment]
+
+In the above example, we first execute the kNN search to get its global top 50
+results. Then we execute the query to get its global top 50 results. Afterwards,
+on a coordinating node, we combine the kNN search results with the query results
+and rank them using the RRF method to get the final top 10 results (the default
+`size`).
+
+Note that if `k` from a kNN search is larger than `window_size`, the results are
+truncated to `window_size`. If `k` is smaller than `window_size`, the result set
+contains only `k` results.
+
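+The following sketch is a rough model of those sizes, not actual Elasticsearch
+code; the function name and return values are purely illustrative.
+
+[source,python]
+----
+# Rough model: how many documents each stage of an RRF search works with.
+def rrf_result_sizes(k, window_size, size):
+    knn_set_size = min(k, window_size)  # each kNN result set is capped at window_size
+    query_set_size = window_size        # the query contributes its top window_size hits
+    final_size = size                   # the combined ranking is pruned down to size
+    return knn_set_size, query_set_size, final_size
+
+print(rrf_result_sizes(k=50, window_size=50, size=10))  # (50, 50, 10)
+----
+// NOTCONSOLE
+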
+[[rrf-supported-features]]
+==== Reciprocal rank fusion supported features
+
+RRF does support:
+
+* <<search-from-param, from>>
+* <<search-aggregations, aggregations>>
+
+RRF does not currently support:
+
+* <<search-api-scroll-query-param, scroll>>
+* <<search-api-pit, point in time>>
+* <<search-sort-param, sort>>
+* <<rescore, rescore>>
+* <<search-suggesters, suggesters>>
+* <<highlighting, highlighting>>
+* <<collapse-search-results, collapse>>
+* <<request-body-search-explain, explain>>
+* <<profiling-queries, profiling>>
+
+Using unsupported features as part of a search with RRF results in an exception.
+
+[[rrf-full-example]]
+==== Reciprocal rank fusion full example
+
+We begin by creating a mapping for an index with a text field, a vector field,
+and an integer field, and then index several documents. For this example, we use
+a vector with only a single dimension to make the ranking easier to explain.
+
+[source,console]
+----
+PUT example-index
+{
+  "mappings": {
+        "properties": {
+            "text" : {
+                "type" : "text"
+            },
+            "vector": {
+                "type": "dense_vector",
+                "dims": 1,
+                "index": true,
+                "similarity": "l2_norm"
+            },
+            "integer" : {
+                "type" : "integer"
+            }
+        }
+    }
+}
+
+PUT example-index/_doc/1
+{
+    "text" : "rrf",
+    "vector" : [5],
+    "integer": 1
+}
+
+PUT example-index/_doc/2
+{
+    "text" : "rrf rrf",
+    "vector" : [4],
+    "integer": 2
+}
+
+PUT example-index/_doc/3
+{
+    "text" : "rrf rrf rrf",
+    "vector" : [3],
+    "integer": 1
+}
+
+PUT example-index/_doc/4
+{
+    "text" : "rrf rrf rrf rrf",
+    "integer": 2
+}
+
+PUT example-index/_doc/5
+{
+    "vector" : [0],
+    "integer": 1
+}
+
+POST example-index/_refresh
+----
+// TEST
+
+We now execute a search using RRF with a query, a kNN search, and
+a terms aggregation.
+
+[source,console]
+----
+GET example-index/_search
+{
+    "query": {
+        "term": {
+            "text": "rrf"
+        }
+    },
+    "knn": {
+        "field": "vector",
+        "query_vector": [3],
+        "k": 5,
+        "num_candidates": 5
+    },
+    "rank": {
+        "rrf": {
+            "window_size": 5,
+            "rank_constant": 1
+        }
+    },
+    "size": 3,
+    "aggs": {
+        "int_count": {
+            "terms": {
+                "field": "integer"
+            }
+        }
+    }
+}
+----
+// TEST[continued]
+
+We receive the response with ranked `hits` and the terms aggregation result.
+Note that `_score` is `null`, and we instead use `_rank` to show our top-ranked
+documents.
+
+[source,console-result]
+----
+{
+    "took": ...,
+    "timed_out" : false,
+    "_shards" : {
+        "total" : 1,
+        "successful" : 1,
+        "skipped" : 0,
+        "failed" : 0
+    },
+    "hits" : {
+        "total" : {
+            "value" : 5,
+            "relation" : "eq"
+        },
+        "max_score" : null,
+        "hits" : [
+            {
+                "_index" : "example-index",
+                "_id" : "3",
+                "_score" : null,
+                "_rank" : 1,
+                "_source" : {
+                    "integer" : 1,
+                    "vector" : [
+                        3
+                    ],
+                    "text" : "rrf rrf rrf"
+                }
+            },
+            {
+                "_index" : "example-index",
+                "_id" : "2",
+                "_score" : null,
+                "_rank" : 2,
+                "_source" : {
+                    "integer" : 2,
+                    "vector" : [
+                        4
+                    ],
+                    "text" : "rrf rrf"
+                }
+            },
+            {
+                "_index" : "example-index",
+                "_id" : "4",
+                "_score" : null,
+                "_rank" : 3,
+                "_source" : {
+                    "integer" : 2,
+                    "text" : "rrf rrf rrf rrf"
+                }
+            }
+        ]
+    },
+    "aggregations" : {
+        "int_count" : {
+            "doc_count_error_upper_bound" : 0,
+            "sum_other_doc_count" : 0,
+            "buckets" : [
+                {
+                    "key" : 1,
+                    "doc_count" : 3
+                },
+                {
+                    "key" : 2,
+                    "doc_count" : 2
+                }
+            ]
+        }
+    }
+}
+----
+// TESTRESPONSE[s/: \.\.\./: $body.$_path/]
+
+Let's break down how these hits were ranked. We start by running the query and
+the kNN search separately to collect their individual hits.
+
+First, we look at the hits for the query.
+
+[source,console-result]
+----
+"hits" : [
+    {
+        "_index" : "example-index",
+        "_id" : "4",
+        "_score" : 0.16152832,              <1>
+        "_source" : {
+            "integer" : 2,
+            "text" : "rrf rrf rrf rrf"
+        }
+    },
+    {
+        "_index" : "example-index",
+        "_id" : "3",                        <2>
+        "_score" : 0.15876243,
+        "_source" : {
+            "integer" : 1,
+            "vector" : [3],
+            "text" : "rrf rrf rrf"
+        }
+    },
+    {
+        "_index" : "example-index",
+        "_id" : "2",                        <3>
+        "_score" : 0.15350538,
+        "_source" : {
+            "integer" : 2,
+            "vector" : [4],
+            "text" : "rrf rrf"
+        }
+    },
+    {
+        "_index" : "example-index",
+        "_id" : "1",                        <4>
+        "_score" : 0.13963442,
+        "_source" : {
+            "integer" : 1,
+            "vector" : [5],
+            "text" : "rrf"
+        }
+    }
+]
+----
+// TEST[skip:example fragment]
+<1> rank 1, `_id` 4
+<2> rank 2, `_id` 3
+<3> rank 3, `_id` 2
+<4> rank 4, `_id` 1
+
+Note that our first hit doesn't have a value for the `vector` field. Now,
+we look at the results for the kNN search.
+
+[source,console-result]
+----
+"hits" : [
+    {
+        "_index" : "example-index",
+        "_id" : "3",                   <1>
+        "_score" : 1.0,
+        "_source" : {
+            "integer" : 1,
+            "vector" : [3],
+            "text" : "rrf rrf rrf"
+        }
+    },
+    {
+        "_index" : "example-index",
+        "_id" : "2",                   <2>
+        "_score" : 0.5,
+        "_source" : {
+            "integer" : 2,
+            "vector" : [4],
+            "text" : "rrf rrf"
+        }
+    },
+    {
+        "_index" : "example-index",
+        "_id" : "1",                   <3>
+        "_score" : 0.2,
+        "_source" : {
+            "integer" : 1,
+            "vector" : [5],
+            "text" : "rrf"
+        }
+    },
+    {
+        "_index" : "example-index",
+        "_id" : "5",                   <4>
+        "_score" : 0.1,
+        "_source" : {
+            "integer" : 1,
+            "vector" : [0]
+        }
+    }
+]
+----
+// TEST[skip:example fragment]
+<1> rank 1, `_id` 3
+<2> rank 2, `_id` 2
+<3> rank 3, `_id` 1
+<4> rank 4, `_id` 5
+
+We can now take the two individually ranked result sets and apply the
+RRF formula to them to get our final ranking.
+
+[source,python]
+----
+# doc  | query     | knn       | score
+_id: 1 = 1.0/(1+4) + 1.0/(1+3) = 0.4500
+_id: 2 = 1.0/(1+3) + 1.0/(1+2) = 0.5833
+_id: 3 = 1.0/(1+2) + 1.0/(1+1) = 0.8333
+_id: 4 = 1.0/(1+1)             = 0.5000
+_id: 5 =             1.0/(1+4) = 0.2000
+----
+// NOTCONSOLE
+
+We rank the documents based on the RRF formula. With a `window_size` of `5` and
+a `size` of `3`, the bottom `2` docs are truncated from the RRF result set. We
+end with `_id: 3` as `_rank: 1`, `_id: 2` as `_rank: 2`, and `_id: 4` as
+`_rank: 3`. This ranking matches the result set from the original RRF search as
+expected.
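+
+The same final ordering can be reproduced with a short, self-contained sketch.
+The ranked lists below are the `_id` values from the two result sets shown
+above; the rest of the code is illustrative and not part of Elasticsearch.
+
+[source,python]
+----
+# Reproduce the hand calculation above: combine the two ranked lists with
+# rank_constant=1, sort by RRF score, and keep the top size=3 documents.
+query_ranking = ["4", "3", "2", "1"]  # _id values from the query hits, best first
+knn_ranking = ["3", "2", "1", "5"]    # _id values from the kNN hits, best first
+rank_constant = 1
+size = 3
+
+scores = {}
+for ranking in (query_ranking, knn_ranking):
+    for rank, doc_id in enumerate(ranking, start=1):
+        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
+
+final = sorted(scores, key=scores.get, reverse=True)[:size]
+print(final)   # ['3', '2', '4']
+print({doc_id: round(score, 4) for doc_id, score in scores.items()})
+----
+// NOTCONSOLE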
 

docs/reference/search/search.asciidoc (+9, -2)

@@ -102,6 +102,7 @@ Defaults to `open`.
 (Optional, Boolean) If `true`, returns detailed information about score
 computation as part of a hit. Defaults to `false`.
 
+[[search-from-param]]
 include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=from]
 +
 By default, you cannot page through more than 10,000 hits using the `from` and
@@ -231,6 +232,7 @@ searches.
 (Optional, Boolean) If `true`, returns sequence number and primary term of the
 last modification of each hit. See <<optimistic-concurrency-control>>.
 
+[[search-size-param]]
 `size`::
 (Optional, integer) Defines the number of hits to return. Defaults to `10`.
 +
@@ -238,6 +240,7 @@ By default, you cannot page through more than 10,000 hits using the `from` and
 `size` parameters. To page through more hits, use the
 <<search-after,`search_after`>> parameter.
 
+[[search-sort-param]]
 `sort`::
 (Optional, string) A comma-separated list of <field>:<direction> pairs.
 
@@ -527,6 +530,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=knn-similarity]
 Minimum <<relevance-scores,`_score`>> for matching documents. Documents with a
 lower `_score` are not included in the search results.
 
+[[search-api-pit]]
 `pit`::
 (Optional, object)
 Limits the search to a <<point-in-time-api,point in time (PIT)>>. If you provide
@@ -552,8 +556,11 @@ Period of time used to extend the life of the PIT.
 
 [[request-body-rank]]
 `rank`::
-(Optional, object) Defines a method for combining and ranking `1` standard query
-with `1..n` knn searches or no standard query with `2..n` knn searches.
+(Optional, object) Defines a method for combining and ranking result sets from
+either:
++
+* 1 query and 1 or more kNN searches
+* 2 or more kNN searches
 +
 .Ranking methods
 [%collapsible%open]