lqb
/
elasticsearch
spegling av https://gitee.com/mirrors/elasticsearch.git


			
				
					
						
						
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203
							[[semantic-text-hybrid-search]]
=== Tutorial: hybrid search with `semantic_text`
++++
<titleabbrev>Hybrid search with `semantic_text`</titleabbrev>
++++

This tutorial demonstrates how to perform hybrid search, combining semantic search with traditional full-text search.

In hybrid search, semantic search retrieves results based on the meaning of the text, while full-text search focuses on exact word matches. By combining both methods, hybrid search delivers more relevant results, particularly in cases where relying on a single approach may not be sufficient.

The recommended way to use hybrid search in the {stack} is following the `semantic_text` workflow. 
This tutorial uses the <<infer-service-elasticsearch,`elasticsearch` service>> for demonstration, but you can use any service and their supported models offered by the {infer-cap} API.

[discrete]
[[hybrid-search-create-index-mapping]]
==== Create an index mapping

The destination index will contain both the embeddings for semantic search and the original text field for full-text search. This structure enables the combination of semantic search and full-text search.

[source,console]
------------------------------------------------------------
PUT semantic-embeddings
{
  "mappings": {
    "properties": {
      "semantic_text": { <1>
        "type": "semantic_text", 
      },
      "content": { <2>
        "type": "text",
        "copy_to": "semantic_text" <3>
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The name of the field to contain the generated embeddings for semantic search.
<2> The name of the field to contain the original text for lexical search.
<3> The textual data stored in the `content` field will be copied to `semantic_text` and processed by the {infer} endpoint.

[NOTE]
====
If you want to run a search on indices that were populated by web crawlers or connectors, you have to
<<indices-put-mapping,update the index mappings>> for these indices to
include the `semantic_text` field. Once the mapping is updated, you'll need to run a full web crawl or a full connector sync. This ensures that all existing
documents are reprocessed and updated with the new semantic embeddings, enabling hybrid search on the updated data.
====

[discrete]
[[semantic-text-hybrid-load-data]]
==== Load data

In this step, you load the data that you later use to create embeddings from.

Use the `msmarco-passagetest2019-top1000` data set, which is a subset of the MS MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that data set and compiled into a https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the [Data Visualizer](docs-content://manage-data/ingest/upload-data-files.md) in the {ml-app} UI. After your data is analyzed, click **Override settings**. Under **Edit field names**, assign `id` to the first column and `content` to the second. Click **Apply**, then **Import**. Name the index `test-data`, and click **Import**. After the upload is complete, you will see an index named `test-data` with 182,469 documents.

[discrete]
[[hybrid-search-reindex-data]]
==== Reindex the data for hybrid search

Reindex the data from the `test-data` index into the `semantic-embeddings` index.
The data in the `content` field of the source index is copied into the `content` field of the destination index. 
The `copy_to` parameter set in the index mapping creation ensures that the content is copied into the `semantic_text` field. The data is processed by the {infer} endpoint at ingest time to generate embeddings.

[NOTE]
====
This step uses the reindex API to simulate data ingestion. If you are working with data that has already been indexed,
rather than using the `test-data` set, reindexing is still required to ensure that the data is processed by the {infer} endpoint
and the necessary embeddings are generated.
====

[source,console]
------------------------------------------------------------
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 10 <1>
  },
  "dest": {
    "index": "semantic-embeddings"
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The default batch size for reindexing is 1000. Reducing size to a smaller
number makes the update of the reindexing process quicker which enables you to
follow the progress closely and detect errors early.

The call returns a task ID to monitor the progress:

[source,console]
------------------------------------------------------------
GET _tasks/<task_id>
------------------------------------------------------------
// TEST[skip:TBD]

Reindexing large datasets can take a long time. You can test this workflow using only a subset of the dataset.

To cancel the reindexing process and generate embeddings for the subset that was reindexed:

[source,console]
------------------------------------------------------------
POST _tasks/<task_id>/_cancel
------------------------------------------------------------
// TEST[skip:TBD]

[discrete]
[[hybrid-search-perform-search]]
==== Perform hybrid search

After reindexing the data into the `semantic-embeddings` index, you can perform hybrid search by using <<rrf,reciprocal rank fusion (RRF)>>. RRF is a technique that merges the rankings from both semantic and lexical queries, giving more weight to results that rank high in either search. This ensures that the final results are balanced and relevant.
To extract the most relevant fragments from the original text and query, you can use the <<highlighting,highlight parameter>>:

[source,console]
------------------------------------------------------------
GET semantic-embeddings/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": { <1>
            "query": {
              "match": {
                "content": "How to avoid muscle soreness while running?" <2>
              }
            }
          }
        },
        {
          "standard": { <3>
            "query": {
              "semantic": {
                "field": "semantic_text", <4>
                "query": "How to avoid muscle soreness while running?"
              }
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
        "semantic_text": {
            "number_of_fragments": 2  <5>
        }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The first `standard` retriever represents the traditional lexical search.
<2> Lexical search is performed on the `content` field using the specified phrase.
<3> The second `standard` retriever refers to the semantic search.
<4> The `semantic_text` field is used to perform the semantic search.
<5> Specifies the maximum number of fragments to return. See <<semantic-text-highlighting, semantic text highlighting>> for a more complete example.

After performing the hybrid search, the query will return the top 10 documents that match both semantic and lexical search criteria. The results include detailed information about each document:

[source,console-result]
------------------------------------------------------------
{
  "took": 107,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 473,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "semantic-embeddings",
        "_id": "wv65epIBEMBRnhfTsOFM", 
        "_score": 0.032786883,
        "_rank": 1,
        "_source": {
          "id": 8408852,
          "content": "What so many out there do not realize is the importance of (...)"
        },
        "highlight" : {
            "semantic_text" : [
                "... fragment_1 ...",
                "... fragment_2 ..."
            ]
        }
      }
    ]
  }
}
------------------------------------------------------------
// NOTCONSOLE