[[index-modules-similarity]]
== Similarity module

A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is per field, meaning that via the mapping one
can define a different similarity per field.

Configuring a custom similarity is considered an expert feature and the
built-in similarities are most likely sufficient, as described in
<<similarity>>.

[float]
[[configuration]]
=== Configuring a similarity

Most existing or custom similarities have configuration options which
can be configured via the index settings as shown below. The index
options can be provided when creating an index or updating index
settings.

[source,js]
--------------------------------------------------
PUT /index
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_similarity" : {
          "type" : "DFR",
          "basic_model" : "g",
          "after_effect" : "l",
          "normalization" : "h2",
          "normalization.h2.c" : "3.0"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Here we configure the DFRSimilarity so it can be referenced as
`my_similarity` in mappings, as illustrated in the example below:

[source,js]
--------------------------------------------------
PUT /index/_mapping/book
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

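
Similarity options can also be supplied by updating the settings of an
existing index. The request sequence below is a minimal sketch of that path.
It assumes that, as with the default similarity described later in this
document, the index has to be closed before the similarity settings are
changed and reopened afterwards. The name `my_other_similarity` is
hypothetical and only used for illustration:

[source,js]
--------------------------------------------------
POST /index/_close

PUT /index/_settings
{
  "index" : {
    "similarity" : {
      "my_other_similarity" : { <1>
        "type" : "DFR",
        "basic_model" : "g",
        "after_effect" : "l",
        "normalization" : "h2",
        "normalization.h2.c" : "3.0"
      }
    }
  }
}

POST /index/_open
--------------------------------------------------
// CONSOLE
// TEST[continued]

<1> Illustrative name, not part of the original example. The options simply
repeat the DFR configuration shown above.
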
[float]
=== Available similarities

[float]
[[bm25]]
==== BM25 similarity (*default*)

TF/IDF based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
This similarity has the following options:

[horizontal]
`k1`::
    Controls non-linear term frequency normalization
    (saturation). The default value is `1.2`.

`b`::
    Controls to what degree document length normalizes tf values.
    The default value is `0.75`.

`discount_overlaps`::
    Determines whether overlap tokens (tokens with
    0 position increment) are ignored when computing norm. By default this
    is true, meaning overlap tokens do not count when computing norms.

Type name: `BM25`

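
As a minimal sketch of how the options above fit together, the request below
defines a BM25 similarity with explicit values. The name `my_bm25` and the
value chosen for `b` are illustrative assumptions, not recommendations:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_bm25" : { <1>
          "type" : "BM25",
          "k1" : 1.2,
          "b" : 0.5, <2>
          "discount_overlaps" : true
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Hypothetical name. A field refers to it through the `similarity` mapping
parameter, in the same way as `my_similarity` earlier in this document.
<2> Lower than the default of `0.75`, so document length has less influence on
the normalized tf values.
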
[float]
[[classic-similarity]]
==== Classic similarity

The classic similarity that is based on the TF/IDF model. This
similarity has the following option:

`discount_overlaps`::
    Determines whether overlap tokens (tokens with
    0 position increment) are ignored when computing norm. By default this
    is true, meaning overlap tokens do not count when computing norms.

Type name: `classic`

[float]
[[drf]]
==== DFR similarity

Similarity that implements the
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
    Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.

`after_effect`::
    Possible values: `no`, `b` and `l`.

`normalization`::
    Possible values: `no`, `h1`, `h2`, `h3` and `z`.

All `normalization` options except the first (`no`) need a normalization
value, such as `normalization.h2.c` for `h2`.

Type name: `DFR`

[float]
[[dfi]]
==== DFI similarity

Similarity that implements the http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
model.
This similarity has the following options:

[horizontal]
`independence_measure`:: Possible values: `standardized`, `saturated` and `chisquared`.

Type name: `DFI`

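
A DFI similarity could be declared as in the sketch below. The name `my_dfi`
is a hypothetical label, and `standardized` is picked arbitrarily from the
values listed above:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_dfi" : { <1>
          "type" : "DFI",
          "independence_measure" : "standardized"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Illustrative name only.
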
[float]
[[ib]]
==== IB similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
based model]. The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts this challenge would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`::  Possible values: `ll` and `spl`.
`lambda`::        Possible values: `df` and `ttf`.
`normalization`:: Same as in `DFR` similarity.

Type name: `IB`

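
The three options combine as in the following sketch. The name `my_ib` is
hypothetical, and the `h2` normalization with its `normalization.h2.c` value is
borrowed from the DFR example at the top of this document, on the assumption
that it is configured the same way here:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_ib" : { <1>
          "type" : "IB",
          "distribution" : "ll",
          "lambda" : "df",
          "normalization" : "h2", <2>
          "normalization.h2.c" : "3.0"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Illustrative name only.
<2> Assumed to follow the DFR normalization settings, as the option list above
states.
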
[float]
[[lm_dirichlet]]
==== LM Dirichlet similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity]. This similarity has the following options:

[horizontal]
`mu`:: Defaults to `2000`.

Type name: `LMDirichlet`

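
For illustration, a configuration that sets `mu` explicitly might look like
the sketch below. The name `my_lm_dirichlet` is hypothetical, and the value
shown is just the documented default:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_lm_dirichlet" : { <1>
          "type" : "LMDirichlet",
          "mu" : 2000
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Illustrative name only.
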
[float]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:

[horizontal]
`lambda`:: The optimal value depends on both the collection and the query. The optimal value is around `0.1`
for title queries and `0.7` for long queries. Defaults to `0.1`. When the value approaches `0`, documents that match more query terms will be ranked higher than those that match fewer terms.

Type name: `LMJelinekMercer`

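
Following the guidance above, an index that mostly serves long queries could
raise `lambda` as in this sketch. The name `my_lm_jm` is hypothetical, and
`0.7` is simply the value suggested above for long queries:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_lm_jm" : { <1>
          "type" : "LMJelinekMercer",
          "lambda" : 0.7
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Illustrative name only.
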
[float]
[[scripted_similarity]]
==== Scripted similarity

A similarity that allows you to use a script in order to specify how scores
should be computed. For instance, the example below shows how to reimplement
TF-IDF:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}

PUT /index/doc/1
{
  "field": "foo bar foo"
}

PUT /index/doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// CONSOLE

Which yields:

[source,js]
--------------------------------------------------
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4.0,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5.0,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1.0,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3.0,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 12/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

You might have noticed that a significant part of the script depends on
statistics that are the same for every document. It is possible to make the
above slightly more efficient by providing a `weight_script` which will
compute the document-independent part of the score and will be available
under the `weight` variable. When no `weight_script` is provided, `weight`
is equal to `1`. The `weight_script` has access to the same variables as
the `script` except `doc` since it is supposed to compute a
document-independent contribution to the score.

The configuration below will give the same tf-idf scores but is slightly
more efficient:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

////////////////////

[source,js]
--------------------------------------------------
PUT /index/doc/1
{
  "field": "foo bar foo"
}

PUT /index/doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[source,js]
--------------------------------------------------
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[Script{type=inline, lang='painless', idOrCode='double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;', options={}, params={}}], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 2.3892908,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4.0,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5.0,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1.0,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3.0,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 1/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

////////////////////

Type name: `scripted`

[float]
[[default-base]]
==== Default Similarity

By default, Elasticsearch will use whatever similarity is configured as
`default`.

You can change the default similarity for all fields in an index when
it is <<indices-create-index,created>>:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "classic"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

If you want to change the default similarity after creating the index,
you must <<indices-open-close,close>> your index, send the following
request and <<indices-open-close,open>> it again afterwards:

[source,js]
--------------------------------------------------
POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "classic"
      }
    }
  }
}

POST /index/_open
--------------------------------------------------
// CONSOLE
// TEST[continued]

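
To confirm that the change took effect, the index settings can be read back.
This is a small sketch that assumes the standard get-settings endpoint. The
response is expected to contain the `similarity` block that was sent, but its
exact shape is not shown here:

[source,js]
--------------------------------------------------
GET /index/_settings
--------------------------------------------------
// CONSOLE
// TEST[continued]
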