Update `dense_vector` docs with kNN indexing options (#80306)

This commit updates the `dense_vector` docs to include information on the new
`index`, `similarity`, and `index_options` parameters. It also tries to clarify
the difference between `similarity` and `index_options` with the existing
parameters that have the same name.

Relates to #78473.
Julie Tibshirani 4 years ago
parent
commit
075d08eb64

+ 3 - 7
docs/reference/mapping/params/index-options.asciidoc

@@ -2,13 +2,9 @@
 === `index_options`
 
 The `index_options` parameter controls what information is added to the
-inverted index for search and highlighting purposes.
-
-[WARNING]
-====
-The `index_options` parameter is intended for use with <<text,`text`>> fields
-only. Avoid using `index_options` with other field data types.
-====
+inverted index for search and highlighting purposes. Only term-based field
+types like <<text,`text`>> and <<keyword,`keyword`>> support this
+configuration.
 
 The parameter accepts one of the following values. Each value retrieves
 information from the previous listed values. For example, `freqs` contains

+ 5 - 5
docs/reference/mapping/params/similarity.asciidoc

@@ -1,12 +1,12 @@
 [[similarity]]
 === `similarity`
 
-Elasticsearch allows you to configure a scoring algorithm or _similarity_ per
-field. The `similarity` setting provides a simple way of choosing a similarity
-algorithm other than the default `BM25`, such as `boolean`.
+{es} allows you to configure a text scoring algorithm or _similarity_
+per field. The `similarity` setting provides a simple way of choosing a
+text similarity algorithm other than the default `BM25`, such as `boolean`.
 
-Similarities are mostly useful for <<text,`text`>> fields, but can also apply
-to other field types.
+Only text-based field types like <<text,`text`>> and <<keyword,`keyword`>>
+support this configuration.
 
 Custom similarities can be configured by tuning the parameters of the built-in
 similarities. For more details about these expert options, see the

+ 129 - 8
docs/reference/mapping/types/dense-vector.asciidoc

@@ -6,14 +6,14 @@
 <titleabbrev>Dense vector</titleabbrev>
 ++++
 
-A `dense_vector` field stores dense vectors of float values.
-The maximum number of dimensions that can be in a vector should
-not exceed 2048. A `dense_vector` field is a single-valued field.
+The `dense_vector` field type stores dense vectors of float values.
 
-`dense_vector` fields do not support querying, sorting or aggregating. They can
-only be accessed in scripts through the dedicated <<vector-functions,vector functions>>.
+You can use `dense_vector` fields in
+<<query-dsl-script-score-query,`script_score`>> queries to score documents.
+They can also be indexed to support efficient k-nearest neighbor search. Dense
+vector fields do not support aggregations, sorting, or other query types.
 
-You index a dense vector as an array of floats.
+You add a `dense_vector` field as an array of floats:
 
 [source,console]
 --------------------------------------------------
@@ -23,7 +23,7 @@ PUT my-index-000001
     "properties": {
       "my_vector": {
         "type": "dense_vector",
-        "dims": 3  <1>
+        "dims": 3
       },
       "my_text" : {
         "type" : "keyword"
@@ -46,4 +46,125 @@ PUT my-index-000001/_doc/2
 
 --------------------------------------------------
 
-<1> dims – the number of dimensions in the vector, required parameter.
+NOTE: Unlike most other data types, dense vectors are always single-valued.
+It is not possible to store multiple values in one `dense_vector` field.
+
+[[index-vectors-knn-search]]
+==== Index vectors for kNN search
+
+experimental::[]
+
+A _k-nearest neighbor_ (kNN) search finds the _k_ nearest
+vectors to a query vector, as measured by a similarity metric.
+
+Dense vector fields can be used to rank documents in
+<<query-dsl-script-score-query,`script_score` queries>>. This lets you perform
+a brute-force kNN search by scanning all documents and ranking them by
+similarity.
+
+In many cases, a brute-force kNN search is not efficient enough. For this
+reason, the `dense_vector` type supports indexing vectors into a specialized
+data structure to support fast kNN search. You can enable indexing through the
+`index` parameter:
+
+[source,console]
+--------------------------------------------------
+PUT my-index-000002
+{
+  "mappings": {
+    "properties": {
+      "my_vector": {
+        "type": "dense_vector",
+        "dims": 3,
+        "index": true,
+        "similarity": "dot_product" <1>
+      }
+    }
+  }
+}
+--------------------------------------------------
+<1> When `index` is enabled, you must define the vector similarity to use in kNN search
+
+{es} uses the https://arxiv.org/abs/1603.09320[HNSW algorithm] to
+support efficient kNN search. Like most kNN algorithms, HNSW is an approximate
+method that sacrifices result accuracy for improved speed.
+
+NOTE: Indexing vectors for approximate kNN search is an expensive process. It can take
+substantial time to ingest documents that contain vector fields with `index`
+enabled.
+
+[role="child_attributes"]
+[[dense-vector-params]]
+==== Parameters for dense vector fields
+
+The following mapping parameters are accepted:
+
+`dims`::
+(Required, integer)
+Number of vector dimensions. Can't exceed `2048`.
+
+`index`::
+(Optional, Boolean)
+If `true`, you can search this field using the kNN search API. Defaults to
+`false`.
+
+`similarity`::
+(Required^*^, string)
+The vector similarity metric to use in kNN search. Documents are ranked by
+their vector field's similarity to the query vector. The `_score` of each
+document will be derived from the similarity, in a way that ensures scores are
+positive and that a larger score corresponds to a higher ranking.
++
+^*^ If `index` is `true`, this parameter is required.
++
+.Valid values for `similarity`
+[%collapsible%open]
+====
+`l2_norm`:::
+Computes similarity based on the L^2^ distance (also known as Euclidean
+distance) between the vectors. The document `_score` is computed as
+`1 / (1 + l2_norm(query, vector)^2)`.
+
+`dot_product`:::
+Computes the dot product of two vectors. This option provides an optimized way
+to perform cosine similarity. In order to use it, all vectors must be of unit
+length, including both document and query vectors. The document `_score` is
+computed as `(1 + dot_product(query, vector)) / 2`.
+
+`cosine`:::
+Computes the cosine similarity. Note that the most efficient way to perform
+cosine similarity is to normalize all vectors to unit length, and instead use
+`dot_product`. You should only use `cosine` if you need to preserve the
+original vectors and cannot normalize them in advance. The document `_score`
+is computed as `(1 + cosine(query, vector)) / 2`.
+====
+
+NOTE: Although they are conceptually related, the `similarity` parameter is
+different from <<text,`text`>> field <<similarity,`similarity`>> and accepts
+a distinct set of options.
+
+`index_options`::
+(Optional, object)
+An optional section that configures the kNN indexing algorithm. The HNSW
+algorithm has two internal parameters that influence how the data structure is
+built. These can be adjusted to improve the accuracy of results, at the
+expense of slower indexing speed. When `index_options` is provided, all of its
+properties must be defined.
++
+.Properties of `index_options`
+[%collapsible%open]
+====
+`type`:::
+(Required, string)
+The type of kNN algorithm to use. Currently only `hnsw` is supported.
+
+`m`:::
+(Required, integer)
+The number of neighbors each node will be connected to in the HNSW graph.
+Defaults to `16`.
+
+`ef_construction`:::
+(Required, integer)
+The number of candidates to track while assembling the list of nearest
+neighbors for each new node. Defaults to `100`.
+====
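
Illustration (not part of the commit): the `_score` formulas documented above for `l2_norm`, `dot_product`, and `cosine` can be checked with a small plain-Python sketch. It also shows the brute-force ranking that the docs contrast with HNSW, and why normalizing vectors to unit length makes `dot_product` agree with `cosine`. The helper names (`score`, `normalize`, `brute_force_knn`) and the sample vectors are hypothetical and are not Elasticsearch APIs; this only mirrors the documented arithmetic, not the actual implementation.

```python
import math


def dot_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_norm(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity; undefined for zero-length vectors."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def normalize(v):
    """Scale a vector to unit length (needed before using dot_product)."""
    norm = math.sqrt(dot_product(v, v))
    return [x / norm for x in v]


def score(similarity, query, vector):
    """Document score as described in the docs text for each similarity."""
    if similarity == "l2_norm":
        return 1 / (1 + l2_norm(query, vector) ** 2)
    if similarity == "dot_product":
        # Assumes both vectors already have unit length.
        return (1 + dot_product(query, vector)) / 2
    if similarity == "cosine":
        return (1 + cosine(query, vector)) / 2
    raise ValueError(f"unknown similarity: {similarity}")


def brute_force_knn(similarity, query, docs, k):
    """Exact kNN: score every document, then keep the k best ids."""
    ranked = sorted(
        docs.items(),
        key=lambda item: score(similarity, query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical sample data, for illustration only.
docs = {"1": [1.0, 0.0, 0.0], "2": [0.0, 1.0, 0.0], "3": [0.9, 0.1, 0.0]}
query = [1.0, 0.0, 0.0]
top_two = brute_force_knn("cosine", query, docs, k=2)
```

With unit-length inputs, `score("dot_product", normalize(q), normalize(v))` matches `score("cosine", q, v)` up to floating-point error, which is the reason the docs recommend normalizing in advance and using `dot_product`. The brute-force scan touches every document, so its cost grows linearly with index size; that is the inefficiency the `index: true` option is meant to avoid.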