[[index-modules-similarity]]
== Similarity module

A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is configured per field, so the mapping can
assign a different similarity to each field.

Similarity only applies to `text` and `keyword` fields.

Configuring a custom similarity is considered an expert feature. The
built-in similarities described in <<similarity>> are most likely
sufficient.

[discrete]
[[configuration]]
=== Configuring a similarity

Most existing or custom similarities have options that can be set
in the index settings, as shown below. These settings can be
provided when creating an index or when updating index settings.

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
--------------------------------------------------

Here we configure the DFR similarity so it can be referenced as
`my_similarity` in mappings, as illustrated in the example below:

[source,console]
--------------------------------------------------
PUT /index/_mapping
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
--------------------------------------------------
// TEST[continued]

[discrete]
=== Available similarities

[discrete]
[[bm25]]
==== BM25 similarity (*default*)

TF/IDF based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
{wikipedia}/Okapi_BM25[Okapi BM25] for more details.
This similarity has the following options:

[horizontal]
`k1`::
    Controls non-linear term frequency normalization
    (saturation). The default value is `1.2`.

`b`::
    Controls to what degree document length normalizes tf values.
    The default value is `0.75`.

`discount_overlaps`::
    Determines whether overlap tokens (tokens with a position
    increment of 0) are ignored when computing norms. The default
    is `true`, meaning overlap tokens do not count when computing norms.

Type name: `BM25`

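
As a minimal sketch of how these options fit together, the request below defines a custom BM25 similarity. The name `my_bm25` and the chosen `k1` and `b` values are illustrative assumptions, not recommendations. The layout mirrors the DFR example under <<configuration>>.

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "k1": 1.3,
          "b": 0.5,
          "discount_overlaps": true
        }
      }
    }
  }
}
--------------------------------------------------

A field would then opt in by setting `"similarity": "my_bm25"` in its mapping.
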
[discrete]
[[dfr]]
==== DFR similarity

Similarity that implements the
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`].

`after_effect`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].

`normalization`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].

All values except `no` need a normalization value, such as the
`normalization.h2.c` setting used with `h2` in <<configuration>>.

Type name: `DFR`

[discrete]
[[dfi]]
==== DFI similarity

Similarity that implements the https://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
model.
This similarity has the following options:

[horizontal]
`independence_measure`:: Possible values:
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].

When using this similarity, it is highly recommended *not* to remove stop words
in order to get good relevance. Also beware that terms whose frequency is less
than the expected frequency will get a score equal to 0.

Type name: `DFI`

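
For illustration, a DFI similarity could be declared as sketched below. The similarity name `my_dfi` is a hypothetical choice, and `standardized` is picked only as one of the listed values.

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_dfi": {
          "type": "DFI",
          "independence_measure": "standardized"
        }
      }
    }
  }
}
--------------------------------------------------
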
[discrete]
[[ib]]
==== IB similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information
based model]. The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts this challenge would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`:: Possible values:
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
`lambda`:: Possible values:
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
`normalization`:: Same as in `DFR` similarity.

Type name: `IB`

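
A sketch of an IB configuration that combines the three options might look as follows. The name `my_ib` is hypothetical, and the `normalization.h2.c` value is simply borrowed from the DFR example under <<configuration>>, not a tuned setting.

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_ib": {
          "type": "IB",
          "distribution": "ll",
          "lambda": "df",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
--------------------------------------------------
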
[discrete]
[[lm_dirichlet]]
==== LM Dirichlet similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity]. This similarity has the following options:

[horizontal]
`mu`:: Defaults to `2000`.

The scoring formula in the paper assigns negative scores to terms that have
fewer occurrences than predicted by the language model. Lucene does not
allow negative scores, so such terms get a score of 0.

Type name: `LMDirichlet`

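
As a small sketch, the request below declares an LM Dirichlet similarity and sets `mu` explicitly to its documented default. The name `my_lm_dirichlet` is a hypothetical choice.

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_dirichlet": {
          "type": "LMDirichlet",
          "mu": 2000
        }
      }
    }
  }
}
--------------------------------------------------
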
[discrete]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:

[horizontal]
`lambda`:: The optimal value depends on both the collection and the query. The optimal value is around `0.1`
for title queries and `0.7` for long queries. Defaults to `0.1`. As the value approaches `0`, documents that match more query terms will be ranked higher than those that match fewer terms.

Type name: `LMJelinekMercer`

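
For example, an index that mostly serves long queries might raise `lambda` toward the `0.7` value mentioned above. This is only a sketch, and the name `my_lm_jm` is hypothetical.

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_jm": {
          "type": "LMJelinekMercer",
          "lambda": 0.7
        }
      }
    }
  }
}
--------------------------------------------------
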
[discrete]
[[scripted_similarity]]
==== Scripted similarity

A similarity that allows you to use a script to specify how scores
should be computed. For instance, the example below shows how to reimplement
TF-IDF:

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}

PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------

Which yields:

[source,console-result]
--------------------------------------------------
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 12/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

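
The score in this response can be checked by hand from the values listed in the explanation (`query.boost` 1.7, `doc.freq` 2, `field.docCount` 2, `term.docFreq` 1, `doc.length` 3). The block below is only a worked restatement of the script's arithmetic, rounded to four decimal places, and it assumes the natural logarithm that `Math.log` computes:

[latexmath]
++++
\begin{aligned}
\mathrm{tf} &= \sqrt{2} \approx 1.4142 \\
\mathrm{idf} &= \ln\left(\frac{2 + 1}{1 + 1}\right) + 1 \approx 1.4055 \\
\mathrm{norm} &= \frac{1}{\sqrt{3}} \approx 0.5774 \\
\mathrm{score} &= 1.7 \times \mathrm{tf} \times \mathrm{idf} \times \mathrm{norm} \approx 1.9508
\end{aligned}
++++
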
WARNING: While scripted similarities provide a lot of flexibility, there is
a set of rules that they need to satisfy. Failing to do so could make
Elasticsearch silently return wrong top hits or fail with internal errors at
search time:

- Returned scores must be positive.
- All other variables remaining equal, scores must not decrease when
  `doc.freq` increases.
- All other variables remaining equal, scores must not increase when
  `doc.length` increases.

You might have noticed that a significant part of the above script depends on
statistics that are the same for every document. It is possible to make the
above slightly more efficient by providing a `weight_script`, which
computes the document-independent part of the score and makes it available
under the `weight` variable. When no `weight_script` is provided, `weight`
is equal to `1`. The `weight_script` has access to the same variables as
the `script` except `doc`, since it is supposed to compute a
document-independent contribution to the score.

The configuration below will give the same tf-idf scores but is slightly
more efficient:

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}
--------------------------------------------------

////////////////////

[source,console]
--------------------------------------------------
PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// TEST[continued]

[source,console-result]
--------------------------------------------------
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[Script{type=inline, lang='painless', idOrCode='double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;', options={}, params={}}], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 2.3892908,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 1/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

////////////////////

Type name: `scripted`

[discrete]
[[default-base]]
==== Default Similarity

By default, Elasticsearch will use whatever similarity is configured as
`default`.

You can change the default similarity for all fields in an index when
it is <<indices-create-index,created>>:

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}
--------------------------------------------------

If you want to change the default similarity after creating the index,
you must <<indices-open-close,close>> the index, send the following
request, and then <<indices-open-close,open>> it again:

[source,console]
--------------------------------------------------
POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "boolean"
      }
    }
  }
}

POST /index/_open
--------------------------------------------------
// TEST[continued]