[[semantic-search-semantic-text]]
=== Tutorial: semantic search with `semantic_text`

++++
<titleabbrev>Semantic search with `semantic_text`</titleabbrev>
++++

beta[]

This tutorial shows you how to use the semantic text feature to perform semantic search on your data.

Semantic text simplifies the {infer} workflow by providing {infer} at ingestion time and sensible default values automatically.
You don't need to define model-related settings and parameters, or create {infer} ingest pipelines.

The recommended way to use <<semantic-search,semantic search>> in the {stack} is to follow the `semantic_text` workflow.
When you need more control over indexing and query settings, you can still use the complete {infer} workflow (refer to <<semantic-search-inference,this tutorial>> to review the process).

This tutorial uses the <<inference-example-elser,`elser` service>> for demonstration, but you can use any service and its supported models offered by the {infer-cap} API.


[discrete]
[[semantic-text-requirements]]
==== Requirements

To use the `semantic_text` field type, you must have an {infer} endpoint deployed in
your cluster using the <<put-inference-api>>.


[discrete]
[[semantic-text-infer-endpoint]]
==== Create the {infer} endpoint

Create an {infer} endpoint by using the <<put-inference-api>>:

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-endpoint <1>
{
  "service": "elser", <2>
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The task type is `sparse_embedding` in the path as the `elser` service will
be used and ELSER creates sparse vectors. The `inference_id` is
`my-elser-endpoint`.
<2> The `elser` service is used in this example.

[discrete]
[[semantic-text-index-mapping]]
==== Create the index mapping

You must create the mapping of the destination index - the index that will
contain the embeddings that the {infer} endpoint generates based on your input
text. The destination index must have a field with the
<<semantic-text,`semantic_text`>> field type to index the output of the used
{infer} endpoint.

[source,console]
------------------------------------------------------------
PUT semantic-embeddings
{
  "mappings": {
    "properties": {
      "semantic_text": { <1>
        "type": "semantic_text", <2>
        "inference_id": "my-elser-endpoint" <3>
      },
      "content": { <4>
        "type": "text",
        "copy_to": "semantic_text" <5>
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The name of the field that will contain the generated embeddings.
<2> The field that will contain the embeddings is a `semantic_text` field.
<3> The `inference_id` is the {infer} endpoint you created in the previous step.
It will be used to generate the embeddings based on the input text.
Every time you ingest data into the related `semantic_text` field, this endpoint will be used for creating the vector representation of the text.
<4> The field to store the text reindexed from a source index in the <<semantic-text-reindex-data,Reindex the data>> step.
<5> The textual data stored in the `content` field will be copied to `semantic_text` and processed by the {infer} endpoint.
The `semantic_text` field will store the embeddings generated based on the input data.

[discrete]
[[semantic-text-load-data]]
==== Load data

In this step, you load the data that you later use to create embeddings from.

Use the `msmarco-passagetest2019-top1000` data set, which is a subset of the MS
MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by
a list of relevant text passages. All unique passages, along with their IDs,
have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI. Assign the name `id` to the first column and `content` to
the second column. The index name is `test-data`. Once the upload is complete,
you can see an index named `test-data` with 182,469 documents.

[discrete]
[[semantic-text-reindex-data]]
==== Reindex the data

Create the embeddings from the text by reindexing the data from the `test-data`
index to the `semantic-embeddings` index. The data in the `content` field will
be reindexed into the `content` field of the destination index. The reindexed
data will then be copied to the `semantic_text` field because of the `copy_to`
parameter set in the index mapping creation step, and processed by the {infer}
endpoint associated with the `semantic_text` field.


[source,console]
------------------------------------------------------------
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 10 <1>
  },
  "dest": {
    "index": "semantic-embeddings"
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The default batch size for reindexing is 1000. Setting `size` to a smaller
number makes the updates of the reindexing process quicker, which enables you
to follow the progress closely and detect errors early.


The call returns a task ID to monitor the progress:

[source,console]
------------------------------------------------------------
GET _tasks/<task_id>
------------------------------------------------------------
// TEST[skip:TBD]

You can cancel the reindexing process if you don't want to wait until it is
fully complete, which might take a long time for an {infer} endpoint with few
assigned resources:

[source,console]
------------------------------------------------------------
POST _tasks/<task_id>/_cancel
------------------------------------------------------------
// TEST[skip:TBD]


[discrete]
[[semantic-text-semantic-search]]
==== Semantic search

After the data set has been enriched with the embeddings, you can query the
data using semantic search. Provide the `semantic_text` field name and the
query text in a `semantic` query type. The {infer} endpoint used to generate
the embeddings for the `semantic_text` field will be used to process the
query text.

[source,console]
------------------------------------------------------------
GET semantic-embeddings/_search
{
  "query": {
    "semantic": {
      "field": "semantic_text", <1>
      "query": "How to avoid muscle soreness while running?" <2>
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The `semantic_text` field on which you want to perform the search.
<2> The query text.

As a result, you receive the top 10 documents that are closest in meaning to the
query from the `semantic-embeddings` index:

[source,console-result]
------------------------------------------------------------
"hits": [
  {
    "_index": "semantic-embeddings",
    "_id": "6DdEuo8B0vYIvzmhoEtt",
    "_score": 24.972616,
    "_source": {
      "semantic_text": {
        "inference": {
          "inference_id": "my-elser-endpoint",
          "model_settings": {
            "task_type": "sparse_embedding"
          },
          "chunks": [
            {
              "text": "There are a few foods and food groups that will help to fight inflammation and delayed onset muscle soreness (both things that are inevitable after a long, hard workout) when you incorporate them into your postworkout eats, whether immediately after your run or at a meal later in the day. Advertisement. Advertisement.",
              "embeddings": {
                (...)
              }
            }
          ]
        }
      },
      "id": 1713868,
      "content": "There are a few foods and food groups that will help to fight inflammation and delayed onset muscle soreness (both things that are inevitable after a long, hard workout) when you incorporate them into your postworkout eats, whether immediately after your run or at a meal later in the day. Advertisement. Advertisement."
    }
  },
  {
    "_index": "semantic-embeddings",
    "_id": "-zdEuo8B0vYIvzmhplLX",
    "_score": 22.143118,
    "_source": {
      "semantic_text": {
        "inference": {
          "inference_id": "my-elser-endpoint",
          "model_settings": {
            "task_type": "sparse_embedding"
          },
          "chunks": [
            {
              "text": "During Your Workout. There are a few things you can do during your workout to help prevent muscle injury and soreness. According to personal trainer and writer for Iron Magazine, Marc David, doing warm-ups and cool-downs between sets can help keep muscle soreness to a minimum.",
              "embeddings": {
                (...)
              }
            }
          ]
        }
      },
      "id": 3389244,
      "content": "During Your Workout. There are a few things you can do during your workout to help prevent muscle injury and soreness. According to personal trainer and writer for Iron Magazine, Marc David, doing warm-ups and cool-downs between sets can help keep muscle soreness to a minimum."
    }
  },
  {
    "_index": "semantic-embeddings",
    "_id": "77JEuo8BdmhTuQdXtQWt",
    "_score": 21.506052,
    "_source": {
      "semantic_text": {
        "inference": {
          "inference_id": "my-elser-endpoint",
          "model_settings": {
            "task_type": "sparse_embedding"
          },
          "chunks": [
            {
              "text": "This is especially important if the soreness is due to a weightlifting routine. For this time period, do not exert more than around 50% of the level of effort (weight, distance and speed) that caused the muscle groups to be sore.",
              "embeddings": {
                (...)
              }
            }
          ]
        }
      },
      "id": 363742,
      "content": "This is especially important if the soreness is due to a weightlifting routine. For this time period, do not exert more than around 50% of the level of effort (weight, distance and speed) that caused the muscle groups to be sore."
    }
  },
  (...)
]
------------------------------------------------------------
// NOTCONSOLE
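When you consume the response programmatically, only a few fields of each hit are usually needed. The following is a small helper, assuming the response shape shown above, that pulls the score and original passage out of a `_search` response dictionary:

```python
def top_passages(search_response, field="content"):
    """Return (score, text) pairs for each hit in a _search response."""
    return [
        (hit["_score"], hit["_source"][field])
        for hit in search_response["hits"]["hits"]
    ]


# A response trimmed to just the fields the helper reads:
response = {
    "hits": {
        "hits": [
            {"_score": 24.97, "_source": {"content": "passage one"}},
            {"_score": 22.14, "_source": {"content": "passage two"}},
        ]
    }
}
pairs = top_passages(response)
```

The helper reads `content` rather than `semantic_text` because `content` holds the original text while `semantic_text` holds the inference metadata and embeddings.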