Add docs for kNN search endpoint (#80378)

This commit adds docs for the new `_knn_search` endpoint.

The docs are written as an API reference. They say little about how exactly
kNN search works or how the endpoint contrasts with `script_score` queries.
We plan to add a high-level guide on kNN search that will explain this in
depth.

Relates to #78473.
Julie Tibshirani 3 years ago
parent
commit
8ca693b271

+ 1 - 12
docs/reference/eql/eql-search-api.asciidoc

@@ -206,18 +206,7 @@ returned.
 +
 A greater `fetch_size` value often increases search speed but uses more memory.
 
-`fields`::
-(Optional, array of strings and objects)
-Array of wildcard (`*`) patterns. The response returns values for field names
-matching these patterns in the `fields` property of each hit.
-+
-You can specify items in the array as a string or object.
-+
-.Properties of `fields` objects
-[%collapsible%open]
-====
-include::{es-repo-dir}/search/search.asciidoc[tag=fields-api-props]
-====
+include::{es-repo-dir}/search/search.asciidoc[tag=fields-param-def]
 
 `filter`::
 (Optional, <<query-dsl,Query DSL object>>)

+ 21 - 16
docs/reference/mapping/types/dense-vector.asciidoc

@@ -5,18 +5,21 @@
 <titleabbrev>Dense vector</titleabbrev>
 ++++
 
-The `dense_vector` field type stores dense vectors of float values.
+The `dense_vector` field type stores dense vectors of float values. Dense
+vector fields can be used in the following ways:
 
-You can use `dense_vector` fields in
-<<query-dsl-script-score-query,`script_score`>> queries to score documents.
-They can also be indexed to support efficient k-nearest neighbor search. Dense
-vector fields do not support aggregations, sorting, or other query types.
+* In <<query-dsl-script-score-query,`script_score`>> queries, to score
+documents matching a filter
+* In the <<knn-search, kNN search API>>, to find the _k_ most similar vectors
+to a query vector
+
+The `dense_vector` type does not support aggregations or sorting.
 
 You add a `dense_vector` field as an array of floats:
 
 [source,console]
 --------------------------------------------------
-PUT my-index-000001
+PUT my-index
 {
   "mappings": {
     "properties": {
@@ -31,13 +34,13 @@ PUT my-index-000001
   }
 }
 
-PUT my-index-000001/_doc/1
+PUT my-index/_doc/1
 {
   "my_text" : "text1",
   "my_vector" : [0.5, 10, 6]
 }
 
-PUT my-index-000001/_doc/2
+PUT my-index/_doc/2
 {
   "my_text" : "text2",
   "my_vector" : [-0.5, 10, 10]
@@ -63,12 +66,13 @@ similarity.
 
 In many cases, a brute-force kNN search is not efficient enough. For this
 reason, the `dense_vector` type supports indexing vectors into a specialized
-data structure to support fast kNN search. You can enable indexing through the
-`index` parameter:
+data structure to support fast kNN retrieval through the
+<<knn-search, kNN search API>>. You can enable indexing by setting the `index`
+parameter:
 
 [source,console]
 --------------------------------------------------
-PUT my-index-000002
+PUT my-index-2
 {
   "mappings": {
     "properties": {
@@ -84,9 +88,9 @@ PUT my-index-000002
 --------------------------------------------------
 <1> When `index` is enabled, you must define the vector similarity to use in kNN search
 
-{es} uses the https://arxiv.org/abs/1603.09320[HNSW algorithm] to
-support efficient kNN search. Like most kNN algorithms, HNSW is an approximate
-method that sacrifices result accuracy for improved speed.
+{es} uses the https://arxiv.org/abs/1603.09320[HNSW algorithm] to support
+efficient kNN search. Like most kNN algorithms, HNSW is an approximate method
+that sacrifices result accuracy for improved speed.
 
 NOTE: Indexing vectors for approximate kNN search is an expensive process. It can take
 substantial time to ingest documents that contain vector fields with `index`
@@ -107,9 +111,10 @@ Number of vector dimensions. Can't exceed `2048`.
 
 `index`::
 (Optional, Boolean)
-If `true`, you can search this field using the kNN search API. Defaults to
-`false`.
+If `true`, you can search this field using the <<knn-search, kNN search API>>.
+Defaults to `false`.
 
+[[dense-vector-similarity]]
 `similarity`::
 (Required^*^, string)
 The vector similarity metric to use in kNN search. Documents are ranked by

+ 5 - 2
docs/reference/search.asciidoc

@@ -15,10 +15,11 @@ exception of the <<search-explain,explain API>>.
 * <<search-multi-search>>
 * <<async-search>>
 * <<point-in-time-api>>
-* <<scroll-api>>
-* <<clear-scroll-api>>
+* <<knn-search>>
 * <<search-suggesters>>
 * <<search-terms-enum>>
+* <<scroll-api>>
+* <<clear-scroll-api>>
 
 [discrete]
 [[search-testing-apis]]
@@ -50,6 +51,8 @@ include::search/async-search.asciidoc[]
 
 include::search/point-in-time-api.asciidoc[]
 
+include::search/knn-search.asciidoc[]
+
 include::search/scroll-api.asciidoc[]
 
 include::search/clear-scroll-api.asciidoc[]

+ 141 - 0
docs/reference/search/knn-search.asciidoc

@@ -0,0 +1,141 @@
+[[knn-search]]
+=== kNN search API
+++++
+<titleabbrev>kNN search</titleabbrev>
+++++
+
+experimental::[]
+
+Performs a k-nearest neighbor (kNN) search and returns the matching documents.
+
+////
+[source,console]
+----
+PUT my-index
+{
+  "mappings": {
+    "properties": {
+      "image_vector": {
+        "type": "dense_vector",
+        "dims": 3,
+        "index": true,
+        "similarity": "l2_norm"
+      }
+    }
+  }
+}
+
+PUT my-index/_doc/1?refresh
+{
+  "image_vector" : [0.5, 10, 6]
+}
+----
+////
+
+[source,console]
+----
+GET my-index/_knn_search
+{
+  "knn": {
+    "field": "image_vector",
+    "query_vector": [0.3, 0.1, 1.2],
+    "k": 10,
+    "num_candidates": 100
+  },
+  "_source": ["name", "date"]
+}
+----
+// TEST[continued]
+
+[[knn-search-api-request]]
+==== {api-request-title}
+
+`GET <target>/_knn_search`
+
+`POST <target>/_knn_search`
+
+[[knn-search-api-prereqs]]
+==== {api-prereq-title}
+
+* If the {es} {security-features} are enabled, you must have the `read`
+<<privileges-list-indices,index privilege>> for the target data stream, index,
+or alias.
+
+[[knn-search-api-desc]]
+==== {api-description-title}
+
+The kNN search API performs a k-nearest neighbor (kNN) search on a
+<<dense-vector,`dense_vector`>> field. Given a query vector, it finds the _k_
+closest vectors and returns those documents as search hits.
+
+{es} uses the https://arxiv.org/abs/1603.09320[HNSW algorithm] to support
+efficient kNN search. Like most kNN algorithms, HNSW is an approximate method
+that sacrifices result accuracy for improved search speed. This means the
+results returned are not always the true _k_ closest neighbors.
+
+[[knn-search-api-path-params]]
+==== {api-path-parms-title}
+
+`<target>`::
+(Optional, string) Comma-separated list of data streams, indices, and aliases
+to search. Supports wildcards (`*`). To search all data streams and indices,
+use `*` or `_all`.
+
+WARNING: kNN search does not yet work with <<filter-alias,filtered aliases>>.
+Running a kNN search against a filtered alias may incorrectly result in fewer
+than _k_ hits.
+
+[role="child_attributes"]
+[[knn-search-api-query-params]]
+==== {api-query-parms-title}
+
+include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=routing]
+
+[role="child_attributes"]
+[[knn-search-api-request-body]]
+==== {api-request-body-title}
+
+`knn`::
+(Required, object) Defines the kNN query to run.
++
+.Properties of `knn` object
+[%collapsible%open]
+====
+`field`::
+(Required, string) The name of the vector field to search against. Must be a
+<<index-vectors-knn-search, `dense_vector` field with indexing enabled>>.
+
+`query_vector`::
+(Required, array of floats) Query vector. Must have the same number of
+dimensions as the vector field you are searching against.
+
+`k`::
+(Required, integer) Number of nearest neighbors to return as top hits. This
+value must be less than `num_candidates`.
+
+`num_candidates`::
+(Required, integer) The number of nearest neighbor candidates to consider per
+shard. Cannot exceed 10,000. {es} collects `num_candidates` results from each
+shard, then merges them to find the top `k` results. Increasing
+`num_candidates` tends to improve the accuracy of the final `k` results.
+====
+
+include::{es-repo-dir}/search/search.asciidoc[tag=docvalue-fields-def]
+include::{es-repo-dir}/search/search.asciidoc[tag=fields-param-def]
+include::{es-repo-dir}/search/search.asciidoc[tag=source-filtering-def]
+include::{es-repo-dir}/search/search.asciidoc[tag=stored-fields-def]
+
+[role="child_attributes"]
+[[knn-search-api-response-body]]
+==== {api-response-body-title}
+
+A kNN search response has the exact same structure as a
+<<search-api-response-body, search API response>>. However, certain sections
+have a meaning specific to kNN search:
+
+* The <<search-api-response-body-score,document `_score`>> is determined by
+the similarity between the query and document vector. See
+<<dense-vector-similarity, `similarity`>>.
+* The `hits.total` object contains the total number of nearest neighbor
+candidates considered, which is `num_candidates * num_shards`. The
+`hits.total.relation` will always be `eq`, indicating an exact value.
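
The `num_candidates` description in this file says that {es} collects candidates from each shard and then merges them to find the top `k` results. The following is a minimal sketch of that collect-and-merge step only. It assumes an exact brute-force scan with Euclidean distance inside each shard rather than HNSW, and the function names, the in-memory `shards` layout, and documents `3` and `4` are hypothetical illustrations, not part of {es}.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shard_candidates(shard_docs, query_vector, num_candidates):
    """Return up to `num_candidates` (distance, doc_id) pairs for one shard.

    Assumption: an exact scan stands in for the approximate HNSW lookup.
    """
    scored = (
        (l2_distance(vector, query_vector), doc_id)
        for doc_id, vector in shard_docs.items()
    )
    return heapq.nsmallest(num_candidates, scored)


def knn_search(shards, query_vector, k, num_candidates):
    """Collect candidates from every shard, then keep the k closest overall."""
    merged = []
    for shard_docs in shards:
        merged.extend(shard_candidates(shard_docs, query_vector, num_candidates))
    return [doc_id for _, doc_id in heapq.nsmallest(k, merged)]


# Hypothetical two-shard layout. Documents 1 and 2 reuse the vectors from the
# dense_vector example; documents 3 and 4 are made up for illustration.
shards = [
    {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]},
    {"3": [0.0, 0.0, 1.0], "4": [5.0, 5.0, 5.0]},
]

top_hits = knn_search(shards, [0.3, 0.1, 1.2], k=2, num_candidates=2)
print(top_hits)  # -> ['3', '4']
```

In this sketch, a larger `num_candidates` only widens the per-shard pool. With an approximate index, a wider pool is what tends to improve the accuracy of the final `k` results.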

+ 1 - 1
docs/reference/search/search-vector-tile-api.asciidoc

@@ -290,7 +290,7 @@ You can specify fields in the array as a string or object.
 .Properties of `fields` objects
 [%collapsible%open]
 ====
-include::search.asciidoc[tag=fields-api-props]
+include::search.asciidoc[tag=fields-param-props]
 ====
 
 include::search-vector-tile-api.asciidoc[tag=grid-precision]

+ 38 - 17
docs/reference/search/search.asciidoc

@@ -92,7 +92,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=df]
 
 `docvalue_fields`::
 (Optional, string) A comma-separated list of fields to return as the docvalue
-representation of a field for each hit.
+representation of a field for each hit. See <<docvalue-fields>>.
 
 include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=expand-wildcards]
 +
@@ -241,11 +241,12 @@ By default, you cannot page through more than 10,000 hits using the `from` and
 `sort`::
 (Optional, string) A comma-separated list of <field>:<direction> pairs.
 
+[[search-source-param]]
 `_source`::
 (Optional)
 Indicates which <<mapping-source-field,source fields>> are returned for matching
 documents. These fields are returned in the `hits._source` property of
-the search response. Defaults to `true`.
+the search response. Defaults to `true`. See <<source-filtering>>.
 +
 .Valid values for `_source`
 [%collapsible%open]
@@ -275,7 +276,7 @@ purposes.
 `stored_fields`::
 (Optional, string) A comma-separated list of stored fields to return as part
 of a hit. If no fields are specified, no stored fields are included in the
-response.
+response. See <<stored-fields>>.
 +
 If this field is specified, the `_source` parameter defaults to `false`. You can
 pass `_source: true` to return both source fields and
@@ -344,13 +345,14 @@ If `true`, returns document version as part of a hit. Defaults to `false`.
 ==== {api-request-body-title}
 
 [[search-docvalue-fields-param]]
+// tag::docvalue-fields-def[]
 `docvalue_fields`::
 (Optional, array of strings and objects)
-Array of wildcard (`*`) patterns. The request returns doc values for field names
-matching these patterns in the `hits.fields` property of the response.
+Array of field patterns. The request returns values for field names matching
+these patterns in the `hits.fields` property of the response.
 +
-You can specify items in the array as a string or object.
-See <<docvalue-fields>>.
+You can specify items in the array as a string or object. See
+<<docvalue-fields>>.
 +
 .Properties of `docvalue_fields` objects
 [%collapsible%open]
@@ -371,19 +373,22 @@ pattern].
 +
 For other field data types, this parameter is not supported.
 ====
+// end::docvalue-fields-def[]
 
 [[search-api-fields]]
+// tag::fields-param-def[]
 `fields`::
 (Optional, array of strings and objects)
-Array of wildcard (`*`) patterns. The request returns values for field names
-matching these patterns in the `hits.fields` property of the response.
+Array of field patterns. The request returns values for field names matching
+these patterns in the `hits.fields` property of the response.
 +
-You can specify items in the array as a string or object.
+You can specify items in the array as a string or object. See
+<<search-fields-param>>.
 +
 .Properties of `fields` objects
 [%collapsible%open]
 ====
-// tag::fields-api-props[]
+// tag::fields-param-props[]
 `field`::
 (Required, string) Field to return. Supports wildcards (`*`).
 
@@ -425,8 +430,21 @@ returns the tile as a base64-encoded string.
 square with equal sides. Defaults to `4096`.
 ========
 --
-// end::fields-api-props[]
+// end::fields-param-props[]
 ====
+// end::fields-param-def[]
+
+[[search-stored-fields-param]]
+// tag::stored-fields-def[]
+`stored_fields`::
+(Optional, string) A comma-separated list of stored fields to return as part
+of a hit. If no fields are specified, no stored fields are included in the
+response. See <<stored-fields>>.
++
+If this option is specified, the `_source` parameter defaults to `false`. You
+can pass `_source: true` to return both source fields and stored fields in the
+search response.
+// end::stored-fields-def[]
 
 [[request-body-search-explain]]
 `explain`::
@@ -579,11 +597,12 @@ By default, you cannot page through more than 10,000 hits using the `from` and
 `size` parameters. To page through more hits, use the
 <<search-after,`search_after`>> parameter.
 
+// tag::source-filtering-def[]
 `_source`::
 (Optional)
 Indicates which <<mapping-source-field,source fields>> are returned for matching
 documents. These fields are returned in the `hits._source` property of
-the search response. Defaults to `true`.
+the search response. Defaults to `true`. See <<source-filtering>>.
 +
 .Valid values for `_source`
 [%collapsible%open]
@@ -623,6 +642,7 @@ If this property is specified, only these source fields are returned. You can
 exclude fields from this subset using the `excludes` property.
 =====
 ====
+// end::source-filtering-def[]
 
 [[stats-groups]]
 `stats`::
@@ -733,25 +753,25 @@ Contains returned documents and metadata.
 ====
 `total`::
 (object)
-Metadata about the number of returned documents.
+Metadata about the number of matching documents.
 +
 .Properties of `total`
 [%collapsible%open]
 =====
 `value`::
 (integer)
-Total number of returned documents.
+Total number of matching documents.
 
 `relation`::
 (string)
-Indicates whether the number of returned documents in the `value`
+Indicates whether the number of matching documents in the `value`
 parameter is accurate or a lower bound.
 +
 .Values of `relation`:
 [%collapsible%open]
 ======
 `eq`:: Accurate
-`gte`:: Lower bound, including returned documents
+`gte`:: Lower bound
 ======
 =====
 
@@ -799,6 +819,7 @@ or specify which source fields to return.
 Contains field values for the documents. These fields must be specified in the
 request using one or more of the following request parameters:
 
+* <<search-fields-param,`fields`>>
 * <<search-docvalue-fields-param,`docvalue_fields`>>
 * <<script-fields,`script_fields`>>
 * <<stored-fields,`stored_fields`>>
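
The `knn` request body documented in `knn-search.asciidoc` above carries a few constraints: `k` must be less than `num_candidates`, `num_candidates` cannot exceed 10,000, and `query_vector` must have the same number of dimensions as the target field. A client could check these before sending a request. The sketch below shows one way to do that. The helper `build_knn_search_body` and its `dims` argument are hypothetical and do not belong to any {es} client library.

```python
import json

# Upper bound on num_candidates stated in the kNN search reference.
MAX_NUM_CANDIDATES = 10_000


def build_knn_search_body(field, query_vector, k, num_candidates, dims, source=None):
    """Build a `_knn_search` request body after checking the documented limits.

    `dims` is the `dims` value from the field mapping (an assumption: the
    caller already knows it). Raises ValueError if a constraint is violated.
    """
    if len(query_vector) != dims:
        raise ValueError(
            f"query_vector has {len(query_vector)} dimensions, expected {dims}"
        )
    if num_candidates > MAX_NUM_CANDIDATES:
        raise ValueError("num_candidates cannot exceed 10,000")
    if k >= num_candidates:
        raise ValueError("k must be less than num_candidates")

    body = {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }
    if source is not None:
        # Optional source filtering, as in the example request.
        body["_source"] = source
    return body


# Mirrors the example request from the kNN search reference.
body = build_knn_search_body(
    field="image_vector",
    query_vector=[0.3, 0.1, 1.2],
    k=10,
    num_candidates=100,
    dims=3,
    source=["name", "date"],
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of `GET <target>/_knn_search` or `POST <target>/_knn_search` with any HTTP client.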