- [[index-modules-similarity]]
- == Similarity module
- A similarity (scoring / ranking model) defines how matching documents
- are scored. Similarity is per field, meaning that via the mapping one
- can define a different similarity per field.
- Configuring a custom similarity is considered an expert feature, and the
- built-in similarities are most likely sufficient, as described in
- <<similarity>>.
- [float]
- [[configuration]]
- === Configuring a similarity
- Most built-in and custom similarities have options that can be
- configured via the index settings, as shown below. These options
- can be provided when creating an index or when updating index
- settings.
- [source,js]
- --------------------------------------------------
- PUT /index
- {
- "settings" : {
- "index" : {
- "similarity" : {
- "my_similarity" : {
- "type" : "DFR",
- "basic_model" : "g",
- "after_effect" : "l",
- "normalization" : "h2",
- "normalization.h2.c" : "3.0"
- }
- }
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- Here we configure the DFRSimilarity so it can be referenced as
- `my_similarity` in mappings, as illustrated in the example below:
- [source,js]
- --------------------------------------------------
- PUT /index/_mapping/book
- {
- "properties" : {
- "title" : { "type" : "text", "similarity" : "my_similarity" }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[continued]
- [float]
- === Available similarities
- [float]
- [[bm25]]
- ==== BM25 similarity (*default*)
- TF/IDF based similarity that has built-in tf normalization and
- is supposed to work better for short fields (like names). See
- http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
- This similarity has the following options:
- [horizontal]
- `k1`::
- Controls non-linear term frequency normalization
- (saturation). The default value is `1.2`.
- `b`::
- Controls to what degree document length normalizes tf values.
- The default value is `0.75`.
- `discount_overlaps`::
- Determines whether overlap tokens (tokens with
- 0 position increment) are ignored when computing norms. By default this
- is true, meaning overlap tokens do not count when computing norms.
- Type name: `BM25`
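To make the roles of `k1` and `b` concrete, here is a minimal sketch of the textbook Okapi BM25 score for a single term. It is an illustration under stated assumptions, not the exact code path Elasticsearch runs: the function name `bm25_term_score`, its parameter names, and the sample statistics are all hypothetical.

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one term in one document (illustrative only)."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b = 0 ignores field length, b = 1 normalizes fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating term frequency: approaches (k1 + 1) as freq grows.
    tf = freq * (k1 + 1.0) / (freq + k1 * length_norm)
    return idf * tf

# Hypothetical statistics: the same single occurrence in a short and a long field.
short = bm25_term_score(freq=1, doc_len=5, avg_doc_len=20, doc_count=1000, doc_freq=10)
long_ = bm25_term_score(freq=1, doc_len=80, avg_doc_len=20, doc_count=1000, doc_freq=10)
print(short > long_)  # one occurrence counts for more in the shorter field
```

Raising `k1` lets repeated occurrences keep adding to the score for longer before saturating, while lowering `b` toward `0` makes the score less sensitive to field length.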
- [float]
- [[classic-similarity]]
- ==== Classic similarity
- The classic similarity that is based on the TF/IDF model. This
- similarity has the following option:
- `discount_overlaps`::
- Determines whether overlap tokens (tokens with
- 0 position increment) are ignored when computing norms. By default this
- is true, meaning overlap tokens do not count when computing norms.
- Type name: `classic`
- [float]
- [[drf]]
- ==== DFR similarity
- Similarity that implements the
- http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
- from randomness] framework. This similarity has the following options:
- [horizontal]
- `basic_model`::
- Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
- `after_effect`::
- Possible values: `no`, `b` and `l`.
- `normalization`::
- Possible values: `no`, `h1`, `h2`, `h3` and `z`.
- All options except `no` need a normalization value (for example, `normalization.h2.c`).
- Type name: `DFR`
- [float]
- [[dfi]]
- ==== DFI similarity
- Similarity that implements the http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
- model.
- This similarity has the following options:
- [horizontal]
- `independence_measure`:: Possible values: `standardized`, `saturated` and `chisquared`.
- Type name: `DFI`
- [float]
- [[ib]]
- ==== IB similarity
- http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
- based model]. The algorithm is based on the concept that the information content in any symbolic 'distribution'
- sequence is primarily determined by the repetitive usage of its basic elements.
- For written texts this challenge would correspond to comparing the writing styles of different authors.
- This similarity has the following options:
- [horizontal]
- `distribution`:: Possible values: `ll` and `spl`.
- `lambda`:: Possible values: `df` and `ttf`.
- `normalization`:: Same as in `DFR` similarity.
- Type name: `IB`
- [float]
- [[lm_dirichlet]]
- ==== LM Dirichlet similarity
- http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
- Dirichlet similarity]. This similarity has the following options:
- [horizontal]
- `mu`:: Defaults to `2000`.
- Type name: `LMDirichlet`
- [float]
- [[lm_jelinek_mercer]]
- ==== LM Jelinek Mercer similarity
- http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
- Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
- [horizontal]
- `lambda`:: The optimal value depends on both the collection and the query. The optimal value is around `0.1`
- for title queries and `0.7` for long queries. Defaults to `0.1`. As the value approaches `0`, documents that match more query terms are ranked higher than those that match fewer terms.
- Type name: `LMJelinekMercer`
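The effect of `lambda` on ranking can be sketched with the textbook query-likelihood model under Jelinek-Mercer smoothing. This is an illustration only and is not claimed to be the exact formula the Lucene class evaluates: the function `jm_query_log_likelihood`, the two toy documents, and the collection probabilities are all made up for the example.

```python
import math

def jm_query_log_likelihood(query_terms, doc_term_freqs, doc_len, collection_probs, lam):
    """Query log-likelihood with Jelinek-Mercer smoothing (textbook form)."""
    total = 0.0
    for term in query_terms:
        p_doc = doc_term_freqs.get(term, 0) / doc_len   # probability of the term in the document
        p_coll = collection_probs[term]                 # probability of the term in the collection
        # Interpolate the document model with the collection model.
        total += math.log((1.0 - lam) * p_doc + lam * p_coll)
    return total

query = ["a", "b"]
collection = {"a": 0.01, "b": 0.01}   # hypothetical collection statistics
both_terms = {"a": 1, "b": 1}         # matches both query terms once
one_term = {"a": 5}                   # matches only one term, but five times

for lam in (0.1, 0.9):
    s_both = jm_query_log_likelihood(query, both_terms, 10, collection, lam)
    s_one = jm_query_log_likelihood(query, one_term, 10, collection, lam)
    winner = "both-terms doc" if s_both > s_one else "one-term doc"
    print(lam, winner)
```

With a small `lambda` a missing query term is penalized heavily, so the document matching both terms wins. With a large `lambda` the collection model dominates and the document with the higher frequency of a single term can come out ahead.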
- [float]
- [[scripted_similarity]]
- ==== Scripted similarity
- A similarity that allows you to use a script in order to specify how scores
- should be computed. For instance, the example below shows how to reimplement
- TF-IDF:
- [source,js]
- --------------------------------------------------
- PUT /index
- {
- "settings": {
- "number_of_shards": 1,
- "similarity": {
- "scripted_tfidf": {
- "type": "scripted",
- "script": {
- "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
- }
- }
- }
- },
- "mappings": {
- "doc": {
- "properties": {
- "field": {
- "type": "text",
- "similarity": "scripted_tfidf"
- }
- }
- }
- }
- }
- PUT /index/doc/1
- {
- "field": "foo bar foo"
- }
- PUT /index/doc/2
- {
- "field": "bar baz"
- }
- POST /index/_refresh
- GET /index/_search?explain=true
- {
- "query": {
- "query_string": {
- "query": "foo^1.7",
- "default_field": "field"
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- Which yields:
- [source,js]
- --------------------------------------------------
- {
- "took": 12,
- "timed_out": false,
- "_shards": {
- "total": 1,
- "successful": 1,
- "skipped": 0,
- "failed": 0
- },
- "hits": {
- "total": 1,
- "max_score": 1.9508477,
- "hits": [
- {
- "_shard": "[index][0]",
- "_node": "OzrdjxNtQGaqs4DmioFw9A",
- "_index": "index",
- "_type": "doc",
- "_id": "1",
- "_score": 1.9508477,
- "_source": {
- "field": "foo bar foo"
- },
- "_explanation": {
- "value": 1.9508477,
- "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
- "details": [
- {
- "value": 1.9508477,
- "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
- "details": [
- {
- "value": 1.0,
- "description": "weight",
- "details": []
- },
- {
- "value": 1.7,
- "description": "query.boost",
- "details": []
- },
- {
- "value": 2.0,
- "description": "field.docCount",
- "details": []
- },
- {
- "value": 4.0,
- "description": "field.sumDocFreq",
- "details": []
- },
- {
- "value": 5.0,
- "description": "field.sumTotalTermFreq",
- "details": []
- },
- {
- "value": 1.0,
- "description": "term.docFreq",
- "details": []
- },
- {
- "value": 2.0,
- "description": "term.totalTermFreq",
- "details": []
- },
- {
- "value": 2.0,
- "description": "doc.freq",
- "details": []
- },
- {
- "value": 3.0,
- "description": "doc.length",
- "details": []
- }
- ]
- }
- ]
- }
- }
- ]
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"took": 12/"took" : $body.took/]
- // TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]
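The reported score can be checked by hand from the statistics listed in the explanation. The following sketch simply re-evaluates the script's formula in Python with those values; the Python variable names are chosen here for readability and are not part of any API.

```python
import math

# Statistics copied from the explain output above.
query_boost = 1.7        # query.boost
field_doc_count = 2.0    # field.docCount
term_doc_freq = 1.0      # term.docFreq
doc_freq = 2.0           # doc.freq
doc_length = 3.0         # doc.length

# Same expressions as in the script source.
tf = math.sqrt(doc_freq)
idf = math.log((field_doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0
norm = 1 / math.sqrt(doc_length)
score = query_boost * tf * idf * norm
print(score)
```

The result agrees with the `_score` of `1.9508477` up to single-precision rounding.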
- You might have noticed that a significant part of the script depends on
- statistics that are the same for every document. It is possible to make the
- above slightly more efficient by providing a `weight_script` which will
- compute the document-independent part of the score and will be available
- under the `weight` variable. When no `weight_script` is provided, `weight`
- is equal to `1`. The `weight_script` has access to the same variables as
- the `script` except `doc` since it is supposed to compute a
- document-independent contribution to the score.
- The configuration below gives the same tf-idf scores but is slightly
- more efficient:
- [source,js]
- --------------------------------------------------
- PUT /index
- {
- "settings": {
- "number_of_shards": 1,
- "similarity": {
- "scripted_tfidf": {
- "type": "scripted",
- "weight_script": {
- "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
- },
- "script": {
- "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
- }
- }
- }
- },
- "mappings": {
- "doc": {
- "properties": {
- "field": {
- "type": "text",
- "similarity": "scripted_tfidf"
- }
- }
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
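To see why splitting the computation does not change the score, the two scripts can be mirrored as plain Python functions. This is a sketch of the arithmetic only: the function names are hypothetical, and the input values assume the same two example documents and the `foo^1.7` query used earlier.

```python
import math

def weight_script(query_boost, field_doc_count, term_doc_freq):
    # Document-independent part: evaluated once per term, not per document.
    idf = math.log((field_doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0
    return query_boost * idf

def script(weight, doc_freq, doc_length):
    # Document-dependent part: evaluated for every matching document.
    tf = math.sqrt(doc_freq)
    norm = 1 / math.sqrt(doc_length)
    return weight * tf * norm

def single_script(query_boost, field_doc_count, term_doc_freq, doc_freq, doc_length):
    # The original one-script formulation, for comparison.
    tf = math.sqrt(doc_freq)
    idf = math.log((field_doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0
    norm = 1 / math.sqrt(doc_length)
    return query_boost * tf * idf * norm

weight = weight_script(1.7, 2.0, 1.0)
split_score = script(weight, 2.0, 3.0)
combined_score = single_script(1.7, 2.0, 1.0, 2.0, 3.0)
print(weight, split_score, combined_score)
```

Both formulations multiply the same four factors, so they agree up to floating-point rounding. The saving comes from computing the logarithm once per term rather than once per matching document.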
- ////////////////////
- [source,js]
- --------------------------------------------------
- PUT /index/doc/1
- {
- "field": "foo bar foo"
- }
- PUT /index/doc/2
- {
- "field": "bar baz"
- }
- POST /index/_refresh
- GET /index/_search?explain=true
- {
- "query": {
- "query_string": {
- "query": "foo^1.7",
- "default_field": "field"
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- // TEST[continued]
- [source,js]
- --------------------------------------------------
- {
- "took": 1,
- "timed_out": false,
- "_shards": {
- "total": 1,
- "successful": 1,
- "skipped": 0,
- "failed": 0
- },
- "hits": {
- "total": 1,
- "max_score": 1.9508477,
- "hits": [
- {
- "_shard": "[index][0]",
- "_node": "OzrdjxNtQGaqs4DmioFw9A",
- "_index": "index",
- "_type": "doc",
- "_id": "1",
- "_score": 1.9508477,
- "_source": {
- "field": "foo bar foo"
- },
- "_explanation": {
- "value": 1.9508477,
- "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
- "details": [
- {
- "value": 1.9508477,
- "description": "score from ScriptedSimilarity(weightScript=[Script{type=inline, lang='painless', idOrCode='double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;', options={}, params={}}], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;', options={}, params={}}]) computed from:",
- "details": [
- {
- "value": 2.3892908,
- "description": "weight",
- "details": []
- },
- {
- "value": 1.7,
- "description": "query.boost",
- "details": []
- },
- {
- "value": 2.0,
- "description": "field.docCount",
- "details": []
- },
- {
- "value": 4.0,
- "description": "field.sumDocFreq",
- "details": []
- },
- {
- "value": 5.0,
- "description": "field.sumTotalTermFreq",
- "details": []
- },
- {
- "value": 1.0,
- "description": "term.docFreq",
- "details": []
- },
- {
- "value": 2.0,
- "description": "term.totalTermFreq",
- "details": []
- },
- {
- "value": 2.0,
- "description": "doc.freq",
- "details": []
- },
- {
- "value": 3.0,
- "description": "doc.length",
- "details": []
- }
- ]
- }
- ]
- }
- }
- ]
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"took": 1/"took" : $body.took/]
- // TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]
- ////////////////////
- Type name: `scripted`
- [float]
- [[default-base]]
- ==== Default Similarity
- By default, Elasticsearch will use whatever similarity is configured as
- `default`.
- You can change the default similarity for all fields in an index when
- it is <<indices-create-index,created>>:
- [source,js]
- --------------------------------------------------
- PUT /index
- {
- "settings": {
- "index": {
- "similarity": {
- "default": {
- "type": "classic"
- }
- }
- }
- }
- }
- --------------------------------------------------
- // CONSOLE
- If you want to change the default similarity after creating the index
- you must <<indices-open-close,close>> your index, send the following
- request and <<indices-open-close,open>> it again afterwards:
- [source,js]
- --------------------------------------------------
- POST /index/_close
- PUT /index/_settings
- {
- "index": {
- "similarity": {
- "default": {
- "type": "classic"
- }
- }
- }
- }
- POST /index/_open
- --------------------------------------------------
- // CONSOLE
- // TEST[continued]