[[search-rank-eval]]
=== Ranking evaluation API
++++
<titleabbrev>Ranking evaluation</titleabbrev>
++++

Allows you to evaluate the quality of ranked search results over a set of
typical search queries.

[[search-rank-eval-api-request]]
==== {api-request-title}

`GET /<target>/_rank_eval`

`POST /<target>/_rank_eval`

[[search-rank-eval-api-desc]]
==== {api-description-title}

The ranking evaluation API allows you to evaluate the quality of ranked search
results over a set of typical search queries. Given this set of queries and a
list of manually rated documents, the `_rank_eval` endpoint calculates and
returns typical information retrieval metrics like _mean reciprocal rank_,
_precision_ or _discounted cumulative gain_.

Search quality evaluation starts with looking at the users of your search
application, and the things that they are searching for. Users have a specific
_information need_; for example, they are looking for a gift in a web shop or want
to book a flight for their next holiday. They usually enter some search terms
into a search box or some other web form. All of this information, together with
meta information about the user (for example the browser, location, earlier
preferences and so on) then gets translated into a query to the underlying
search system.

The challenge for search engineers is to tweak this translation process from
user entries to a concrete query, in such a way that the search results contain
the most relevant information with respect to the user's information need. This
can only be done if the search result quality is evaluated constantly across a
representative test suite of typical user queries, so that improvements in the
rankings for one particular query don't negatively affect the ranking for
other types of queries.

In order to get started with search quality evaluation, you need three basic
things:

. A collection of documents you want to evaluate your query performance against,
  usually one or more data streams or indices.
. A collection of typical search requests that users enter into your system.
. A set of document ratings that represent the documents' relevance with respect
  to a search request.

It is important to note that one set of document ratings is needed per test
query, and that the relevance judgements are based on the information need of
the user that entered the query.

The ranking evaluation API provides a convenient way to use this information in
a ranking evaluation request to calculate different search evaluation metrics.
This gives you a first estimation of your overall search quality, as well as a
measurement to optimize against when fine-tuning various aspects of the query
generation in your application.

[[search-rank-eval-api-path-params]]
==== {api-path-parms-title}

`<target>`::
(Optional, string)
Comma-separated list of data streams, indices, and index aliases used to limit
the request. Wildcard (`*`) expressions are supported.
+
To target all data streams and indices in a cluster, omit this parameter or use
`_all` or `*`.
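
For example, a request can target several indices at once, either as a
comma-separated list or as a wildcard pattern such as `my-index-*`. The sketch
below assumes two hypothetical indices, `my-index-000001` and `my-index-000002`,
and uses the request body format described in
<<search-rank-eval-api-example,the examples below>>:

[source,js]
--------------------------------
GET /my-index-000001,my-index-000002/_rank_eval
{
  "requests": [
    {
      "id": "amsterdam_query",
      "request": { "query": { "match": { "text": "amsterdam" } } },
      "ratings": [
        { "_index": "my-index-000001", "_id": "doc1", "rating": 1 }
      ]
    }
  ],
  "metric": { "precision": { "k": 10 } }
}
--------------------------------
// NOTCONSOLE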

[[search-rank-eval-api-query-params]]
==== {api-query-parms-title}

include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=allow-no-indices]
+
Defaults to `true`.

include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=expand-wildcards]
+
--
Defaults to `open`.
--

include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=index-ignore-unavailable]

[[search-rank-eval-api-example]]
==== {api-examples-title}

In its most basic form, a request to the `_rank_eval` endpoint has two sections:

[source,js]
-----------------------------
GET /my-index-000001/_rank_eval
{
  "requests": [ ... ], <1>
  "metric": { <2>
    "mean_reciprocal_rank": { ... } <3>
  }
}
-----------------------------
// NOTCONSOLE

<1> a set of typical search requests, together with their provided ratings
<2> definition of the evaluation metric to calculate
<3> a specific metric and its parameters

The request section contains several search requests typical to your
application, along with the document ratings for each particular search request.

[source,js]
-----------------------------
GET /my-index-000001/_rank_eval
{
  "requests": [
    {
      "id": "amsterdam_query", <1>
      "request": { <2>
        "query": { "match": { "text": "amsterdam" } }
      },
      "ratings": [ <3>
        { "_index": "my-index-000001", "_id": "doc1", "rating": 0 },
        { "_index": "my-index-000001", "_id": "doc2", "rating": 3 },
        { "_index": "my-index-000001", "_id": "doc3", "rating": 1 }
      ]
    },
    {
      "id": "berlin_query",
      "request": {
        "query": { "match": { "text": "berlin" } }
      },
      "ratings": [
        { "_index": "my-index-000001", "_id": "doc1", "rating": 1 }
      ]
    }
  ]
}
-----------------------------
// NOTCONSOLE

<1> The search request's ID, used to group result details later.
<2> The query being evaluated.
<3> A list of document ratings. Each entry contains the following arguments:
+
--
- `_index`: The document's index. For data streams, this should be the
  document's backing index.
- `_id`: The document ID.
- `rating`: The document's relevance with regard to this search request.
--

A document `rating` can be any integer value that expresses the relevance of the
document on a user-defined scale. For some of the metrics, just giving a binary
rating (for example `0` for irrelevant and `1` for relevant) will be sufficient,
while other metrics can use a more fine-grained scale.

===== Template-based ranking evaluation

As an alternative to having to provide a single query per test request, it is
possible to specify query templates in the evaluation request and later refer to
them. This way, queries with a similar structure that differ only in their
parameters don't have to be repeated all the time in the `requests` section.
In typical search systems, where user inputs usually get filled into a small
set of query templates, this helps make the evaluation request more succinct.

[source,js]
--------------------------------
GET /my-index-000001/_rank_eval
{
   [...]
  "templates": [
    {
      "id": "match_one_field_query", <1>
      "template": { <2>
        "inline": {
          "query": {
            "match": { "{{field}}": { "query": "{{query_string}}" }}
          }
        }
      }
    }
  ],
  "requests": [
    {
      "id": "amsterdam_query",
      "ratings": [ ... ],
      "template_id": "match_one_field_query", <3>
      "params": { <4>
        "query_string": "amsterdam",
        "field": "text"
      }
    },
    [...]
}
--------------------------------
// NOTCONSOLE

<1> the template id
<2> the template definition to use
<3> a reference to a previously defined template
<4> the parameters to use to fill the template
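
Because a template is referenced by its ID, additional test requests can reuse
it with nothing more than a different set of parameters. As a minimal sketch
(the rating shown here is hypothetical), the `berlin_query` request from the
earlier example could be expressed against the same template like this:

[source,js]
--------------------------------
{
  "id": "berlin_query",
  "ratings": [
    { "_index": "my-index-000001", "_id": "doc1", "rating": 1 }
  ],
  "template_id": "match_one_field_query",
  "params": {
    "query_string": "berlin",
    "field": "text"
  }
}
--------------------------------
// NOTCONSOLE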

===== Available evaluation metrics

The `metric` section determines which of the available evaluation metrics
will be used. The following metrics are supported:

[discrete]
[[k-precision]]
===== Precision at K (P@k)

This metric measures the proportion of relevant results in the top k search results.
It's a form of the well-known
{wikipedia}/Evaluation_measures_(information_retrieval)#Precision[Precision]
metric that only looks at the top k documents. It is the fraction of relevant
documents in those first k results. A precision at 10 (P@10) value of 0.6 then
means 6 out of the 10 top hits are relevant with respect to the user's
information need.

P@k works well as a simple evaluation metric that has the benefit of being easy
to understand and explain. Documents in the collection need to be rated as either
relevant or irrelevant with respect to the current query. P@k is a set-based
metric and does not take into account the position of the relevant documents
within the top k results, so a ranking of ten results that contains one
relevant result in position 10 is just as good as a ranking of ten results
that contains one relevant result in position 1.
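
Expressed as a formula, and simplified to ignore the `ignore_unlabeled` option
described below, the metric divides the number of relevant documents found in
the top k results by k, so the P@10 example above works out as:

[latexmath]
++++
P@k = \frac{|\{\text{relevant documents in the top } k\}|}{k},
\qquad
P@10 = \frac{6}{10} = 0.6
++++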

[source,console]
--------------------------------
GET /my-index-000001/_rank_eval
{
  "requests": [
    {
      "id": "JFK query",
      "request": { "query": { "match_all": {} } },
      "ratings": []
    } ],
  "metric": {
    "precision": {
      "k": 20,
      "relevant_rating_threshold": 1,
      "ignore_unlabeled": false
    }
  }
}
--------------------------------
// TEST[setup:my_index]

The `precision` metric takes the following optional parameters:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`k` |sets the maximum number of documents retrieved per query. This value will act in place of the usual `size` parameter
in the query. Defaults to 10.
|`relevant_rating_threshold` |sets the rating threshold above which documents are considered to be
"relevant". Defaults to `1`.
|`ignore_unlabeled` |controls how unlabeled documents in the search results are counted.
If set to `true`, unlabeled documents are ignored and count neither as relevant nor as irrelevant. If set to `false` (the default), they are treated as irrelevant.
|=======================================================================

[discrete]
[[k-recall]]
===== Recall at K (R@k)

This metric measures the fraction of all relevant results that are retrieved in
the top k search results. It's a form of the well-known
{wikipedia}/Evaluation_measures_(information_retrieval)#Recall[Recall]
metric. It is the fraction of relevant documents in those first k results
relative to all possible relevant results. A recall at 10 (R@10) value of 0.5 then
means 4 out of 8 relevant documents, with respect to the user's information
need, were retrieved in the 10 top hits.

R@k works well as a simple evaluation metric that has the benefit of being easy
to understand and explain. Documents in the collection need to be rated as either
relevant or irrelevant with respect to the current query. R@k is a set-based
metric and does not take into account the position of the relevant documents
within the top k results, so a ranking of ten results that contains one
relevant result in position 10 is just as good as a ranking of ten results
that contains one relevant result in position 1.
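
Expressed as a formula (again a simplified sketch), the metric divides the
number of relevant documents found in the top k results by the total number of
relevant documents, so the R@10 example above works out as:

[latexmath]
++++
R@k = \frac{|\{\text{relevant documents in the top } k\}|}{|\{\text{all relevant documents}\}|},
\qquad
R@10 = \frac{4}{8} = 0.5
++++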

[source,console]
--------------------------------
GET /my-index-000001/_rank_eval
{
  "requests": [
    {
      "id": "JFK query",
      "request": { "query": { "match_all": {} } },
      "ratings": []
    } ],
  "metric": {
    "recall": {
      "k": 20,
      "relevant_rating_threshold": 1
    }
  }
}
--------------------------------
// TEST[setup:my_index]

The `recall` metric takes the following optional parameters:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`k` |sets the maximum number of documents retrieved per query. This value will act in place of the usual `size` parameter
in the query. Defaults to 10.
|`relevant_rating_threshold` |sets the rating threshold above which documents are considered to be
"relevant". Defaults to `1`.
|=======================================================================

[discrete]
===== Mean reciprocal rank

For every query in the test suite, this metric calculates the reciprocal of the
rank of the first relevant document. For example, finding the first relevant
result in position 3 means the reciprocal rank is 1/3. The reciprocal rank for
each query is averaged across all queries in the test suite to give the
{wikipedia}/Mean_reciprocal_rank[mean reciprocal rank].
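
As a worked example, with hypothetical ranks chosen only for illustration:
for a test suite of three queries whose first relevant documents appear at
ranks 3, 1 and 2 respectively, the mean reciprocal rank is:

[latexmath]
++++
MRR = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\text{rank}_i}
    = \frac{1}{3}\left(\frac{1}{3} + \frac{1}{1} + \frac{1}{2}\right)
    \approx 0.61
++++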

[source,console]
--------------------------------
GET /my-index-000001/_rank_eval
{
  "requests": [
    {
      "id": "JFK query",
      "request": { "query": { "match_all": {} } },
      "ratings": []
    } ],
  "metric": {
    "mean_reciprocal_rank": {
      "k": 20,
      "relevant_rating_threshold": 1
    }
  }
}
--------------------------------
// TEST[setup:my_index]

The `mean_reciprocal_rank` metric takes the following optional parameters:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`k` |sets the maximum number of documents retrieved per query. This value will act in place of the usual `size` parameter
in the query. Defaults to 10.
|`relevant_rating_threshold` |sets the rating threshold above which documents are considered to be
"relevant". Defaults to `1`.
|=======================================================================

[discrete]
===== Discounted cumulative gain (DCG)

In contrast to the two metrics above,
{wikipedia}/Discounted_cumulative_gain[discounted cumulative gain]
takes both the rank and the rating of the search results into account.
The assumption is that highly relevant documents are more useful to the user
when they appear at the top of the result list. Therefore, the DCG formula
discounts the contribution that high ratings for documents at lower search
ranks make to the overall DCG metric.
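
As a sketch rather than a statement of the exact computation, one common
formulation (the exponential-gain variant discussed in the linked article) sums
a gain term derived from each rating, discounted logarithmically by rank. For
example, ratings of 3 and 2 at ranks 1 and 2 would contribute roughly:

[latexmath]
++++
DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\frac{2^3 - 1}{\log_2 2} + \frac{2^2 - 1}{\log_2 3} \approx 7 + 1.89 \approx 8.89
++++

With `normalize` set to `true`, this sum is divided by the DCG of the ideal
ordering of the rated documents, yielding the normalized DCG described in the
linked article.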

[source,console]
--------------------------------
GET /my-index-000001/_rank_eval
{
  "requests": [
    {
      "id": "JFK query",
      "request": { "query": { "match_all": {} } },
      "ratings": []
    } ],
  "metric": {
    "dcg": {
      "k": 20,
      "normalize": false
    }
  }
}
--------------------------------
// TEST[setup:my_index]

The `dcg` metric takes the following optional parameters:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`k` |sets the maximum number of documents retrieved per query. This value will act in place of the usual `size` parameter
in the query. Defaults to 10.
|`normalize` | If set to `true`, this metric will calculate the {wikipedia}/Discounted_cumulative_gain#Normalized_DCG[Normalized DCG].
|=======================================================================

[discrete]
===== Expected Reciprocal Rank (ERR)

Expected Reciprocal Rank (ERR) is an extension of the classical reciprocal rank
for the graded relevance case (Olivier Chapelle, Donald Metzler, Ya Zhang, and
Pierre Grinspan. 2009.
https://olivier.chapelle.cc/pub/err.pdf[Expected reciprocal rank for graded relevance].)

It is based on the assumption of a cascade model of search, in which a user
scans through ranked search results in order and stops at the first document
that satisfies the information need. For this reason, it is a good metric for
question answering and navigation queries, but less so for survey-oriented
information needs where the user is interested in finding many relevant
documents in the top k results.

The metric models the expectation of the reciprocal of the position at which a
user stops reading through the result list. This means that a relevant document
in a top ranking position will have a large contribution to the overall score.
However, the same document will contribute much less to the score if it appears
at a lower rank, and even less if it is preceded by other relevant (though
perhaps less relevant) documents. In this way, the ERR metric discounts
documents that are shown after very relevant documents. This introduces a
notion of dependency in the ordering of relevant documents that metrics like
precision or DCG don't account for.
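
In the formulation given in the cited paper, each rated document is assigned a
probability of satisfying the user, derived from its relevance grade latexmath:[g_i]
and the highest grade in use latexmath:[g_{max}] (which corresponds to the
`maximum_relevance` parameter below); ERR is then the expected reciprocal of the
position at which the user stops:

[latexmath]
++++
R_i = \frac{2^{g_i} - 1}{2^{g_{max}}},
\qquad
ERR = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} \left(1 - R_i\right)
++++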

[source,console]
--------------------------------
GET /my-index-000001/_rank_eval
{
  "requests": [
    {
      "id": "JFK query",
      "request": { "query": { "match_all": {} } },
      "ratings": []
    } ],
  "metric": {
    "expected_reciprocal_rank": {
      "maximum_relevance": 3,
      "k": 20
    }
  }
}
--------------------------------
// TEST[setup:my_index]

The `expected_reciprocal_rank` metric takes the following parameters:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
| `maximum_relevance` | Mandatory parameter. The highest relevance grade used in the user-supplied
relevance judgments.
|`k` | sets the maximum number of documents retrieved per query. This value will act in place of the usual `size` parameter
in the query. Defaults to 10.
|=======================================================================

===== Response format

The response of the `_rank_eval` endpoint contains the overall calculated result
for the defined quality metric, a `details` section with a breakdown of results
for each query in the test suite, and an optional `failures` section that shows
errors for individual queries. The response has the following format:

[source,js]
--------------------------------
{
  "rank_eval": {
    "metric_score": 0.4, <1>
    "details": {
      "my_query_id1": { <2>
        "metric_score": 0.6, <3>
        "unrated_docs": [ <4>
          {
            "_index": "my-index-000001",
            "_id": "1960795"
          }, ...
        ],
        "hits": [
          {
            "hit": { <5>
              "_index": "my-index-000001",
              "_type": "page",
              "_id": "1528558",
              "_score": 7.0556192
            },
            "rating": 1
          }, ...
        ],
        "metric_details": { <6>
          "precision": {
            "relevant_docs_retrieved": 6,
            "docs_retrieved": 10
          }
        }
      },
      "my_query_id2": { [...] }
    },
    "failures": { [...] }
  }
}
--------------------------------
// NOTCONSOLE

<1> the overall evaluation quality calculated by the defined metric
<2> the `details` section contains one entry for every query in the original `requests` section, keyed by the search request id
<3> the `metric_score` in the `details` section shows the contribution of this query to the global quality metric score
<4> the `unrated_docs` section contains an `_index` and `_id` entry for each document in the search result for this
query that didn't have a rating. This can be used to ask the user to supply ratings for these documents
<5> the `hits` section shows a grouping of the search results with their supplied ratings
<6> the `metric_details` give additional information about the calculated quality metric (e.g. how many of the retrieved
documents were relevant). The content varies for each metric but allows for better interpretation of the results