
[DOCS] Adds docs to semantic text (#108311)

Co-authored-by: Carlos Delgado <6339205+carlosdelest@users.noreply.github.com>
Co-authored-by: Mike Pellegrini <mike.pellegrini@elastic.co>
Co-authored-by: Kathleen DeRusso <kathleen.derusso@elastic.co>
István Zoltán Szabó 1 year ago
parent
commit
95ce898436

+ 1 - 1
docs/reference/mapping/types.asciidoc

@@ -75,7 +75,7 @@ markup. Used for identifying named entities.
 <<completion-suggester,`completion`>>:: Used for auto-complete suggestions.
 <<search-as-you-type,`search_as_you_type`>>:: `text`-like type for
 as-you-type completion.
-<<semantic-text, `semantic_text`>>:: 
+<<semantic-text, `semantic_text`>>:: Used for performing <<semantic-search,semantic search>>.
 <<token-count,`token_count`>>:: A count of tokens in a text.
 
 

+ 132 - 1
docs/reference/mapping/types/semantic-text.asciidoc

@@ -5,4 +5,135 @@
 <titleabbrev>Semantic text</titleabbrev>
 ++++
 
-The documentation page for the `semantic_text` field type.
+beta[]
+
+The `semantic_text` field type automatically generates embeddings for text
+content using an inference endpoint.
+
+The field's `inference_id` parameter specifies the inference endpoint that is used to generate the embeddings.
+You can create the inference endpoint by using the <<put-inference-api>>.
+This field type and the <<query-dsl-semantic-query,`semantic` query>> type make it simpler to perform semantic search on your data.
+
+Using `semantic_text`, you don't need to specify how to generate embeddings for
+your data or how to index them. The inference endpoint automatically determines
+the embedding generation, indexing, and query to use.
+
+[source,console]
+------------------------------------------------------------
+PUT my-index-000001
+{
+  "mappings": {
+    "properties": {
+      "inference_field": { 
+        "type": "semantic_text",
+        "inference_id": "my-elser-endpoint"
+      }
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+
+[discrete]
+[[semantic-text-params]]
+==== Parameters for `semantic_text` fields
+
+`inference_id`::
+(Required, string)  
+Inference endpoint that will be used to generate the embeddings for the field.
+Use the <<put-inference-api>> to create the endpoint.
+
+
+[discrete]
+[[infer-endpoint-validation]]
+==== {infer-cap} endpoint validation
+
+The `inference_id` is not validated when the mapping is created, but when documents are ingested into the index.
+When the first document is indexed, the `inference_id` is used to generate the underlying indexing structures for the field.
+
+
+[discrete]
+[[auto-text-chunking]]
+==== Automatic text chunking
+
+{infer-cap} endpoints have a limit on the amount of text they can process.
+To allow large amounts of text to be used in semantic search, `semantic_text` automatically splits longer text into smaller passages, called _chunks_, when needed.
+
+Each chunk includes the text subpassage and the corresponding embedding generated from it.
+When querying, the individual passages are automatically searched for each document, and the most relevant passage is used to compute the score.
+
+
+[discrete]
+[[semantic-text-structure]]
+==== `semantic_text` structure
+
+Once a document is ingested, a `semantic_text` field will have the following structure:
+
+[source,console-result]
+------------------------------------------------------------
+"inference_field": {
+  "text": "these are not the droids you're looking for", <1>
+  "inference": {
+    "inference_id": "my-elser-endpoint", <2>
+    "model_settings": { <3>
+      "task_type": "sparse_embedding"
+    },
+    "chunks": [ <4>
+      {
+        "text": "these are not the droids you're looking for",
+        "embeddings": {
+          (...)
+        }
+      }
+    ]
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+<1> The field will become an object structure to accommodate both the original
+text and the inference results.
+<2> The `inference_id` used to generate the embeddings.
+<3> Model settings, including the task type and dimensions/similarity if
+applicable.
+<4> Inference results will be grouped in chunks, each with its corresponding
+text and embeddings.
+
+Refer to <<semantic-search-semantic-text,this tutorial>> to learn more about
+semantic search using `semantic_text` and the `semantic` query.
+
+
+[discrete]
+[[custom-indexing]]
+==== Customizing `semantic_text` indexing
+
+`semantic_text` uses defaults for indexing data based on the specified {infer}
+endpoint. It enables you to get started with semantic search quickly by providing
+automatic {infer} and a dedicated query, so you don't need to provide further
+details.
+
+If you want to customize data indexing, use the
+<<sparse-vector,`sparse_vector`>> or <<dense-vector,`dense_vector`>> field
+types and create an ingest pipeline with an
+<<inference-processor, {infer} processor>> to generate the embeddings.
+<<semantic-search-inference,This tutorial>> walks you through the process.
+
+[discrete]
+[[update-script]]
+==== Updates to `semantic_text` fields
+
+Updates that use scripts are not supported when the index contains a `semantic_text` field.
+
+
+[discrete]
+[[copy-to-support]]
+==== `copy_to` support
+
+The `semantic_text` field type can be the target of
+<<copy-to,`copy_to` fields>>. This means you can use a single `semantic_text`
+field to collect the values of other fields for semantic search. Each value has
+its embeddings calculated separately; each field value becomes a separate set of
+chunks in the resulting embeddings.
+
+This imposes a restriction on bulk updates to documents with `semantic_text`.
+In bulk requests, all fields that are copied to a `semantic_text` field must have a value to ensure that every embedding is calculated correctly.
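+
+For example, a mapping along these lines (the field names here are hypothetical) collects the values of two `text` fields into a single `semantic_text` field:
+
+[source,console]
+------------------------------------------------------------
+PUT my-index-000002
+{
+  "mappings": {
+    "properties": {
+      "title": {
+        "type": "text",
+        "copy_to": "inference_field"
+      },
+      "description": {
+        "type": "text",
+        "copy_to": "inference_field"
+      },
+      "inference_field": {
+        "type": "semantic_text",
+        "inference_id": "my-elser-endpoint"
+      }
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]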

+ 187 - 0
docs/reference/query-dsl/semantic-query.asciidoc

@@ -0,0 +1,187 @@
+[[query-dsl-semantic-query]]
+=== Semantic query
+++++
+<titleabbrev>Semantic</titleabbrev>
+++++
+
+beta[]
+
+The `semantic` query type enables you to perform <<semantic-search,semantic search>> on data stored in a <<semantic-text,`semantic_text`>> field.
+
+
+[discrete]
+[[semantic-query-example]]
+==== Example request
+
+[source,console]
+------------------------------------------------------------
+GET my-index-000001/_search
+{
+  "query": {
+    "semantic": {
+      "field": "inference_field",
+      "query": "Best surfing places"
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+
+[discrete]
+[[semantic-query-params]]
+==== Top-level parameters for `semantic`
+
+`field`::
+(Required, string)
+The `semantic_text` field to perform the query on.
+
+`query`::
+(Required, string)
+The query text to be searched for on the field.
+
+
+Refer to <<semantic-search-semantic-text,this tutorial>> to learn more about semantic search using `semantic_text` and the `semantic` query.
+
+[discrete]
+[[hybrid-search-semantic]]
+==== Hybrid search with the `semantic` query
+
+The `semantic` query can be used as part of a hybrid search, where the `semantic` query is combined with lexical queries.
+For example, the query below finds documents with the `title` field matching "mountain lake", and combines them with results from a semantic search on the `title_semantic` field, which is a `semantic_text` field.
+The combined documents are then scored, and the top three scoring documents are returned.
+
+[source,console]
+------------------------------------------------------------
+POST my-index/_search
+{
+  "size" : 3,
+  "query": {
+    "bool": {
+      "should": [
+        {
+          "match": {
+            "title": {
+              "query": "mountain lake",
+              "boost": 1
+            }
+          }
+        },
+        {
+          "semantic": {
+            "field": "title_semantic",
+            "query": "mountain lake",
+            "boost": 2
+          }
+        }
+      ]
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+You can also use `semantic_text` as part of <<rrf,Reciprocal Rank Fusion>> to make ranking relevant results easier:
+
+[source,console]
+------------------------------------------------------------
+GET my-index/_search
+{
+  "retriever": {
+    "rrf": {
+      "retrievers": [
+        {
+          "standard": {
+            "query": {
+              "term": {
+                "text": "shoes"
+              }
+            }
+          }
+        },
+        {
+          "semantic": {
+            "field": "semantic_field",
+            "query": "shoes"
+          }
+        }
+      ],
+      "rank_window_size": 50,
+      "rank_constant": 20
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+
+[discrete]
+[[advanced-search]]
+==== Advanced search on `semantic_text` fields
+
+The `semantic` query uses default settings for searching on `semantic_text` fields for ease of use.
+If you want to fine-tune a search on a `semantic_text` field, you need to know the task type used by the `inference_id` configured in `semantic_text`.
+You can find the task type using the <<get-inference-api>>, and check the `task_type` associated with the {infer} service.
+Depending on the `task_type`, use either the <<query-dsl-sparse-vector-query,`sparse_vector`>> or the <<query-dsl-knn-query,`knn`>> query for greater flexibility and customization.
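+
+For example, you can check the endpoint configuration, including its `task_type`, with the <<get-inference-api>> (using the `my-elser-endpoint` endpoint from the earlier examples):
+
+[source,console]
+------------------------------------------------------------
+GET _inference/my-elser-endpoint
+------------------------------------------------------------
+// TEST[skip:TBD]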
+
+
+[discrete]
+[[search-sparse-inference]]
+==== Search with `sparse_embedding` inference
+
+When the {infer} endpoint uses a `sparse_embedding` model, you can use a <<query-dsl-sparse-vector-query,`sparse_vector` query>> on a <<semantic-text,`semantic_text`>> field in the following way:
+
+[source,console]
+------------------------------------------------------------
+GET test-index/_search
+{
+  "query": {
+    "nested": {
+      "path": "inference_field.inference.chunks",
+      "query": {
+        "sparse_vector": {
+          "field": "inference_field.inference.chunks.embeddings",
+          "inference_id": "my-inference-id",
+          "query": "mountain lake"
+        }
+      }
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+You can customize the `sparse_vector` query to include specific settings, like <<sparse-vector-query-with-pruning-config-and-rescore-example,pruning configuration>>.
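+
+For example, the query above can be extended with pruning options as a sketch (the threshold values here are illustrative only):
+
+[source,console]
+------------------------------------------------------------
+GET test-index/_search
+{
+  "query": {
+    "nested": {
+      "path": "inference_field.inference.chunks",
+      "query": {
+        "sparse_vector": {
+          "field": "inference_field.inference.chunks.embeddings",
+          "inference_id": "my-inference-id",
+          "query": "mountain lake",
+          "prune": true,
+          "pruning_config": {
+            "tokens_freq_ratio_threshold": 5,
+            "tokens_weight_threshold": 0.4
+          }
+        }
+      }
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]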
+
+
+[discrete]
+[[search-text-inferece]]
+==== Search with `text_embedding` inference
+
+When the {infer} endpoint uses a `text_embedding` model, you can use a <<query-dsl-knn-query,`knn` query>> on a `semantic_text` field in the following way:
+
+[source,console]
+------------------------------------------------------------
+GET test-index/_search
+{
+  "query": {
+    "nested": {
+      "path": "inference_field.inference.chunks",
+      "query": {
+        "knn": {
+          "field": "inference_field.inference.chunks.embeddings",
+          "query_vector_builder": {
+            "text_embedding": {
+              "model_id": "my_inference_id",
+              "model_text": "mountain lake"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+You can customize the `knn` query to include specific settings, like `num_candidates` and `k`.
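+
+For example, the query above can be extended with these settings as a sketch (the values are illustrative only):
+
+[source,console]
+------------------------------------------------------------
+GET test-index/_search
+{
+  "query": {
+    "nested": {
+      "path": "inference_field.inference.chunks",
+      "query": {
+        "knn": {
+          "field": "inference_field.inference.chunks.embeddings",
+          "num_candidates": 100,
+          "k": 10,
+          "query_vector_builder": {
+            "text_embedding": {
+              "model_id": "my_inference_id",
+              "model_text": "mountain lake"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]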

+ 5 - 0
docs/reference/query-dsl/special-queries.asciidoc

@@ -32,6 +32,9 @@ This query allows a script to act as a filter. Also see the
 <<query-dsl-script-score-query,`script_score` query>>::
 A query that allows to modify the score of a sub-query with a script.
 
+<<query-dsl-semantic-query,`semantic` query>>::
+A query that allows you to perform semantic search.
+
 <<query-dsl-wrapper-query,`wrapper` query>>::
 A query that accepts other queries as json or yaml string.
 
@@ -55,6 +58,8 @@ include::script-query.asciidoc[]
 
 include::script-score-query.asciidoc[]
 
+include::semantic-query.asciidoc[]
+
 include::wrapper-query.asciidoc[]
 
 include::pinned-query.asciidoc[]

+ 267 - 0
docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc

@@ -0,0 +1,267 @@
+[[semantic-search-semantic-text]]
+=== Tutorial: semantic search with `semantic_text`
+++++
+<titleabbrev>Semantic search with `semantic_text`</titleabbrev>
+++++
+
+beta[]
+
+This tutorial shows you how to use the semantic text feature to perform semantic search on your data.
+
+Semantic text simplifies the {infer} workflow by providing {infer} at ingestion time and sensible default values automatically.
+You don't need to define model-related settings and parameters, or create {infer} ingest pipelines.
+
+The recommended way to use <<semantic-search,semantic search>> in the {stack} is to follow the `semantic_text` workflow.
+When you need more control over indexing and query settings, you can still use the complete {infer} workflow (refer to <<semantic-search-inference,this tutorial>> to review the process).
+
+This tutorial uses the <<inference-example-elser,`elser` service>> for demonstration, but you can use any service and its supported models offered by the {infer-cap} API.
+
+
+[discrete]
+[[semantic-text-requirements]]
+==== Requirements
+
+To use the `semantic_text` field type, you must have an {infer} endpoint deployed in
+your cluster using the <<put-inference-api>>.
+
+
+[discrete]
+[[semantic-text-infer-endpoint]]
+==== Create the {infer} endpoint
+
+Create an inference endpoint by using the <<put-inference-api>>:
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/sparse_embedding/my-elser-endpoint <1>
+{
+  "service": "elser", <2>
+  "service_settings": {
+    "num_allocations": 1,
+    "num_threads": 1
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+<1> The task type is `sparse_embedding` in the path as the `elser` service will
+be used and ELSER creates sparse vectors. The `inference_id` is
+`my-elser-endpoint`.
+<2> The `elser` service is used in this example.
+
+
+[discrete]
+[[semantic-text-index-mapping]]
+==== Create the index mapping
+
+You must create the mapping of the destination index, which is the index that
+contains the embeddings that the inference endpoint will generate based on your
+input text. The destination index must have a field with the
+<<semantic-text,`semantic_text`>> field type to index the output of the
+inference endpoint you use.
+
+[source,console]
+------------------------------------------------------------
+PUT semantic-embeddings
+{
+  "mappings": {
+    "properties": {
+      "semantic_text": { <1>
+        "type": "semantic_text", <2>
+        "inference_id": "my-elser-endpoint" <3>
+      },
+      "content": { <4>
+        "type": "text",
+        "copy_to": "semantic_text" <5>
+      }
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+<1> The name of the field to contain the generated embeddings.
+<2> The field to contain the embeddings is a `semantic_text` field.
+<3> The `inference_id` is the inference endpoint you created in the previous step.
+It will be used to generate the embeddings based on the input text.
+Every time you ingest data into the related `semantic_text` field, this endpoint will be used for creating the vector representation of the text.
+<4> The field to store the text reindexed from a source index in the  <<semantic-text-reindex-data,Reindex the data>> step.
+<5> The textual data stored in the `content` field will be copied to `semantic_text` and processed by the {infer} endpoint.
+The `semantic_text` field will store the embeddings generated based on the input data.
+
+
+[discrete]
+[[semantic-text-load-data]]
+==== Load data
+
+In this step, you load the data from which you later create embeddings.
+
+Use the `msmarco-passagetest2019-top1000` data set, which is a subset of the MS
+MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by
+a list of relevant text passages. All unique passages, along with their IDs,
+have been extracted from that data set and compiled into a
+https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
+
+Download the file and upload it to your cluster using the
+{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
+in the {ml-app} UI. Assign the name `id` to the first column and `content` to
+the second column. The index name is `test-data`. Once the upload is complete,
+you can see an index named `test-data` with 182,469 documents.
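+
+You can verify the document count with a quick check:
+
+[source,console]
+------------------------------------------------------------
+GET test-data/_count
+------------------------------------------------------------
+// TEST[skip:TBD]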
+
+
+[discrete]
+[[semantic-text-reindex-data]]
+==== Reindex the data
+
+Create the embeddings from the text by reindexing the data from the `test-data`
+index to the `semantic-embeddings` index. The data in the `content` field will
+be reindexed into the `content` field of the destination index.
+The `content` field data will be copied to the `semantic_text` field as a result
+of the `copy_to` parameter set in the index mapping creation step. The copied
+data will be processed by the {infer} endpoint associated with the
+`semantic_text` field.
+
+[source,console]
+------------------------------------------------------------
+POST _reindex?wait_for_completion=false
+{
+  "source": { 
+    "index": "test-data",
+    "size": 10 <1>
+  },
+  "dest": {
+    "index": "semantic-embeddings"
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+<1> The default batch size for reindexing is 1000. Reducing `size` to a smaller
+number makes the reindexing process update more frequently, which enables you to
+follow the progress closely and detect errors early.
+
+The call returns a task ID to monitor the progress:
+
+[source,console]
+------------------------------------------------------------
+GET _tasks/<task_id>
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+It is recommended to cancel the reindexing process if you don't want to wait
+until it is fully complete, which might take a long time for an {infer} endpoint with few assigned resources:
+
+[source,console]
+------------------------------------------------------------
+POST _tasks/<task_id>/_cancel
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+
+[discrete]
+[[semantic-text-semantic-search]]
+==== Semantic search
+
+After the data set has been enriched with the embeddings, you can query the data
+using semantic search. Provide the `semantic_text` field name and the query text
+in a `semantic` query type. The {infer} endpoint used to generate the embeddings
+for the `semantic_text` field will be used to process the query text.
+
+[source,console]
+------------------------------------------------------------
+GET semantic-embeddings/_search
+{
+  "query": {
+    "semantic": { 
+      "field": "semantic_text", <1>
+      "query": "How to avoid muscle soreness while running?" <2>
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+<1> The `semantic_text` field on which you want to perform the search.
+<2> The query text.
+
+As a result, you receive the top 10 documents that are closest in meaning to the
+query from the `semantic-embeddings` index:
+
+[source,console-result]
+------------------------------------------------------------
+"hits": [
+  {
+    "_index": "semantic-embeddings",
+    "_id": "6DdEuo8B0vYIvzmhoEtt",
+    "_score": 24.972616,
+    "_source": {
+      "semantic_text": {
+        "inference": {
+          "inference_id": "my-elser-endpoint",
+          "model_settings": {
+            "task_type": "sparse_embedding"
+          },
+          "chunks": [
+            {
+              "text": "There are a few foods and food groups that will help to fight inflammation and delayed onset muscle soreness (both things that are inevitable after a long, hard workout) when you incorporate them into your postworkout eats, whether immediately after your run or at a meal later in the day. Advertisement. Advertisement.",
+              "embeddings": {
+                (...)
+              }
+            }
+          ]
+        }
+      },
+      "id": 1713868,
+      "content": "There are a few foods and food groups that will help to fight inflammation and delayed onset muscle soreness (both things that are inevitable after a long, hard workout) when you incorporate them into your postworkout eats, whether immediately after your run or at a meal later in the day. Advertisement. Advertisement."
+    }
+  },
+  {
+    "_index": "semantic-embeddings",
+    "_id": "-zdEuo8B0vYIvzmhplLX",
+    "_score": 22.143118,
+    "_source": {
+      "semantic_text": {
+        "inference": {
+          "inference_id": "my-elser-endpoint",
+          "model_settings": {
+            "task_type": "sparse_embedding"
+          },
+          "chunks": [
+            {
+              "text": "During Your Workout. There are a few things you can do during your workout to help prevent muscle injury and soreness. According to personal trainer and writer for Iron Magazine, Marc David, doing warm-ups and cool-downs between sets can help keep muscle soreness to a minimum.",
+              "embeddings": {
+                (...)
+              }
+            }
+          ]
+        }
+      },
+      "id": 3389244,
+      "content": "During Your Workout. There are a few things you can do during your workout to help prevent muscle injury and soreness. According to personal trainer and writer for Iron Magazine, Marc David, doing warm-ups and cool-downs between sets can help keep muscle soreness to a minimum."
+    }
+  },
+  {
+    "_index": "semantic-embeddings",
+    "_id": "77JEuo8BdmhTuQdXtQWt",
+    "_score": 21.506052,
+    "_source": {
+      "semantic_text": {
+        "inference": {
+          "inference_id": "my-elser-endpoint",
+          "model_settings": {
+            "task_type": "sparse_embedding"
+          },
+          "chunks": [
+            {
+              "text": "This is especially important if the soreness is due to a weightlifting routine. For this time period, do not exert more than around 50% of the level of effort (weight, distance and speed) that caused the muscle groups to be sore.",
+              "embeddings": {
+                (...)
+              }
+            }
+          ]
+        }
+      },
+      "id": 363742,
+      "content": "This is especially important if the soreness is due to a weightlifting routine. For this time period, do not exert more than around 50% of the level of effort (weight, distance and speed) that caused the muscle groups to be sore."
+    }
+  },
+  (...)
+]
+------------------------------------------------------------
+// NOTCONSOLE
+

+ 1 - 0
docs/reference/search/search-your-data/semantic-search.asciidoc

@@ -135,5 +135,6 @@ include::{es-ref-dir}/tab-widgets/semantic-search/hybrid-search-widget.asciidoc[
 ** The https://github.com/elastic/elasticsearch-labs[`elasticsearch-labs`] repo contains a number of interactive semantic search examples in the form of executable Python notebooks, using the {es} Python client
 
 include::semantic-search-elser.asciidoc[]
+include::semantic-search-semantic-text.asciidoc[]
 include::semantic-search-inference.asciidoc[]
 include::cohere-es.asciidoc[]