[[vector-functions]]
===== Functions for vector fields

NOTE: During vector functions' calculation, all matched documents are
linearly scanned. Thus, expect the query time to grow linearly
with the number of matched documents. For this reason, we recommend
limiting the number of matched documents with a `query` parameter.

This is the list of available vector functions and vector access methods:

1. <<vector-functions-cosine,`cosineSimilarity`>> – calculates cosine similarity
2. <<vector-functions-dot-product,`dotProduct`>> – calculates dot product
3. <<vector-functions-l1,`l1norm`>> – calculates L^1^ distance
4. <<vector-functions-hamming,`hamming`>> – calculates Hamming distance
5. <<vector-functions-l2,`l2norm`>> – calculates L^2^ distance
6. <<vector-functions-accessing-vectors,`doc[<field>].vectorValue`>> – returns a vector's value as an array of floats
7. <<vector-functions-accessing-vectors,`doc[<field>].magnitude`>> – returns a vector's magnitude

NOTE: The `cosineSimilarity` function is not supported for `bit` vectors.

NOTE: The recommended way to access dense vectors is through the
`cosineSimilarity`, `dotProduct`, `l1norm` or `l2norm` functions. Note,
however, that you should call these functions only once per script. For example,
don't use these functions in a loop to calculate the similarity between a
document vector and multiple other vectors. If you need that functionality,
reimplement these functions yourself by
<<vector-functions-accessing-vectors,accessing vector values directly>>.

Let's create an index with a `dense_vector` mapping and index a couple
of documents into it.

[source,console]
--------------------------------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_dense_vector": {
        "type": "dense_vector",
        "index": false,
        "dims": 3
      },
      "my_byte_dense_vector": {
        "type": "dense_vector",
        "index": false,
        "dims": 3,
        "element_type": "byte"
      },
      "status" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "my_dense_vector": [0.5, 10, 6],
  "my_byte_dense_vector": [0, 10, 6],
  "status" : "published"
}

PUT my-index-000001/_doc/2
{
  "my_dense_vector": [-0.5, 10, 10],
  "my_byte_dense_vector": [0, 10, 10],
  "status" : "published"
}

POST my-index-000001/_refresh

--------------------------------------------------
// TESTSETUP

[[vector-functions-cosine]]
====== Cosine similarity

The `cosineSimilarity` function calculates the measure of
cosine similarity between a given query vector and document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published" <1>
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0", <2>
        "params": {
          "query_vector": [4, 3.4, -0.2]  <3>
        }
      }
    }
  }
}
--------------------------------------------------

<1> To restrict the number of documents on which script score calculation is applied, provide a filter.
<2> The script adds 1.0 to the cosine similarity to prevent the score from being negative.
<3> To take advantage of the script optimizations, provide a query vector as a script parameter.

NOTE: If a document's dense vector field has a number of dimensions
different from the query's vector, an error will be thrown.
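
To make the arithmetic behind the script concrete, here is a minimal Python sketch that recomputes `cosineSimilarity(...) + 1.0` for the two sample documents outside of {es}. The helper `cosine_similarity` is a hypothetical stand-in written for this illustration, not the built-in function, and it uses double precision, so the last digits may differ from the scores a real search returns.

```python
import math

def cosine_similarity(a, b):
    """Plain-Python cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        # Mirrors the documented behavior: mismatched dimensions are an error.
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vector = [4, 3.4, -0.2]
doc_vectors = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

# Same shape as the script: shift the similarity from [-1, 1] into [0, 2].
cosine_scores = {
    doc_id: cosine_similarity(query_vector, vec) + 1.0
    for doc_id, vec in doc_vectors.items()
}
```

With these inputs, document `1` scores roughly 1.57 and document `2` roughly 1.40, so document `1` ranks first.
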

[[vector-functions-dot-product]]
====== Dot product

The `dotProduct` function calculates the measure of
dot product between a given query vector and document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": """
          double value = dotProduct(params.query_vector, 'my_dense_vector');
          return sigmoid(1, Math.E, -value); <1>
        """,
        "params": {
          "query_vector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------

<1> Using the standard sigmoid function prevents scores from being negative.
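
The following Python sketch shows why `sigmoid(1, Math.E, -value)` behaves like the standard logistic function. It assumes the script-score `sigmoid(value, k, a)` helper is defined as `value^a / (k^a + value^a)`; both `dot_product` and `sigmoid` here are illustrative reimplementations under that assumption, not the built-in functions.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sigmoid(value, k, a):
    # Assumed definition of the script-score helper: value^a / (k^a + value^a).
    return value ** a / (k ** a + value ** a)

query_vector = [4, 3.4, -0.2]
doc_vector = [0.5, 10, 6]  # "my_dense_vector" of document 1

value = dot_product(query_vector, doc_vector)
# With value=1 and k=e this reduces to 1 / (1 + e^(-dot)), the logistic function.
score = sigmoid(1, math.e, -value)
```

Under this definition the score always lies between 0 and 1. Large positive dot products, such as the one for document `1` here, saturate very close to 1.
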

[[vector-functions-l1]]
====== L^1^ distance (Manhattan distance)

The `l1norm` function calculates L^1^ distance
(Manhattan distance) between a given query vector and
document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l1norm(params.queryVector, 'my_dense_vector'))", <1>
        "params": {
          "queryVector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------

<1> Unlike `cosineSimilarity`, which represents similarity, `l1norm` and
`l2norm` (shown below) represent distances or differences. This means that
the more similar the vectors are, the lower the values produced by the
`l1norm` and `l2norm` functions will be.
Thus, because more similar vectors should score higher,
we invert the output of `l1norm` and `l2norm`. Also, to avoid
division by 0 when a document vector matches the query exactly,
we add `1` to the denominator.
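
As a worked example of the inversion described above, this Python sketch recomputes `1 / (1 + l1norm(...))` for the sample documents. The function `l1_distance` is a hypothetical helper written for this illustration; it only mirrors the Manhattan distance formula and is not the built-in `l1norm`.

```python
def l1_distance(a, b):
    """Manhattan distance: sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

query_vector = [4, 3.4, -0.2]
doc_vectors = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

# Smaller distance -> larger score; the +1 keeps the denominator non-zero.
l1_scores = {
    doc_id: 1 / (1 + l1_distance(query_vector, vec))
    for doc_id, vec in doc_vectors.items()
}

# An exact match has distance 0 and therefore the maximum score of 1.
exact_match_score = 1 / (1 + l1_distance(query_vector, query_vector))
```

Document `1` is at distance 16.3 from the query and document `2` at 21.3, so document `1` receives the higher score.
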

[[vector-functions-hamming]]
====== Hamming distance

The `hamming` function calculates {wikipedia}/Hamming_distance[Hamming distance] between a given query vector and
document vectors. It is only available for byte and bit vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "(24 - hamming(params.queryVector, 'my_byte_dense_vector')) / 24", <1>
        "params": {
          "queryVector": [4, 3, 0]
        }
      }
    }
  }
}
--------------------------------------------------

<1> Calculate the Hamming distance and normalize it by the bits to get a score between 0 and 1.
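
The normalization above divides by 24 because each of the 3 byte dimensions contributes 8 bits. The Python sketch below counts differing bits for the sample byte vectors; `hamming_distance` is a hypothetical helper for illustration only, and the sketch uses floating-point division for the final score.

```python
def hamming_distance(a, b):
    """Number of differing bits between two equal-length byte vectors."""
    # "& 0xFF" maps signed byte values (-128..127) onto their 8-bit patterns.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))

query_vector = [4, 3, 0]
doc_vectors = {"1": [0, 10, 6], "2": [0, 10, 10]}
total_bits = 3 * 8  # 3 dimensions, 8 bits each

hamming_scores = {
    doc_id: (total_bits - hamming_distance(query_vector, vec)) / total_bits
    for doc_id, vec in doc_vectors.items()
}
```

Both sample documents differ from the query in 5 bits, so in this sketch they receive the same score of 19/24.
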

[[vector-functions-l2]]
====== L^2^ distance (Euclidean distance)

The `l2norm` function calculates L^2^ distance
(Euclidean distance) between a given query vector and
document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, 'my_dense_vector'))",
        "params": {
          "queryVector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------
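
For comparison with the L^1^ case, this Python sketch recomputes `1 / (1 + l2norm(...))` for the sample documents. As before, `l2_distance` is a hypothetical helper that mirrors the Euclidean distance formula, not the built-in `l2norm`.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query_vector = [4, 3.4, -0.2]
doc_vectors = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

# Invert the distance so that closer vectors score higher.
l2_scores = {
    doc_id: 1 / (1 + l2_distance(query_vector, vec))
    for doc_id, vec in doc_vectors.items()
}
```

Document `1` is at a distance of about 9.71 and document `2` at about 12.96, so document `1` again scores higher.
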

[[vector-functions-missing-values]]
====== Checking for missing values

If a document doesn't have a value for a vector field on which a vector function
is executed, an error will be thrown.

You can check whether a document is missing a value for the field `my_vector` with
`doc['my_vector'].size() == 0`. Your overall script can look like this:

[source,js]
--------------------------------------------------
"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')"
--------------------------------------------------
// NOTCONSOLE

[[vector-functions-accessing-vectors]]
====== Accessing vectors directly

You can access vector values directly through the following functions:

- `doc[<field>].vectorValue` – returns a vector's value as an array of floats

NOTE: For `bit` vectors, it still returns a `float[]`, where each element represents 8 bits.

- `doc[<field>].magnitude` – returns a vector's magnitude as a float.
For vectors created prior to version 7.5, the magnitude is not stored,
so this function calculates it anew every time it is called.

NOTE: For `bit` vectors, this is just the square root of the sum of `1` bits.

For example, the script below implements a cosine similarity using these
two functions:

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": """
          float[] v = doc['my_dense_vector'].vectorValue;
          float vm = doc['my_dense_vector'].magnitude;
          float dotProduct = 0;
          for (int i = 0; i < v.length; i++) {
            dotProduct += v[i] * params.queryVector[i];
          }
          return dotProduct / (vm * (float) params.queryVectorMag);
        """,
        "params": {
          "queryVector": [4, 3.4, -0.2],
          "queryVectorMag": 5.25357
        }
      }
    }
  }
}
--------------------------------------------------
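
The `queryVectorMag` parameter in the script above is the Euclidean magnitude of `queryVector`, computed once on the client so the script does not have to recompute it for every document. The Python sketch below shows one way to derive it and repeats the script's loop; the names `query_vector_mag` and `manual_cosine` are chosen for this illustration only.

```python
import math

query_vector = [4, 3.4, -0.2]

# sqrt(4^2 + 3.4^2 + (-0.2)^2) = sqrt(27.6)
query_vector_mag = math.sqrt(sum(x * x for x in query_vector))

def manual_cosine(doc_vector, doc_magnitude, query_vector, query_vector_mag):
    """Same steps as the script: accumulate the dot product, then normalize."""
    dot = 0.0
    for i in range(len(doc_vector)):
        dot += doc_vector[i] * query_vector[i]
    return dot / (doc_magnitude * query_vector_mag)

doc_vector = [0.5, 10, 6]  # "my_dense_vector" of document 1
doc_magnitude = math.sqrt(sum(x * x for x in doc_vector))
similarity = manual_cosine(doc_vector, doc_magnitude, query_vector, query_vector_mag)
```

Rounded to five decimal places, `query_vector_mag` matches the `5.25357` passed in the request.
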

[[vector-functions-bit-vectors]]
====== Bit vectors and vector functions

When using `bit` vectors, not all the vector functions are available. The supported functions are:

* <<vector-functions-hamming,`hamming`>> – calculates Hamming distance, the sum of the bitwise XOR of the two vectors
* <<vector-functions-l1,`l1norm`>> – calculates L^1^ distance, this is simply the `hamming` distance
* <<vector-functions-l2,`l2norm`>> – calculates L^2^ distance, this is the square root of the `hamming` distance
* <<vector-functions-dot-product,`dotProduct`>> – calculates dot product. When comparing two `bit` vectors,
this is the sum of the bitwise AND of the two vectors. If you provide a `float[]` or `byte[]` that has `dims` elements as the query vector, the `dotProduct` is
the sum of the floating point values using the stored `bit` vector as a mask.

Here is an example of using dot-product with bit vectors.

[source,console]
--------------------------------------------------
PUT my-index-bit-vectors
{
  "mappings": {
    "properties": {
      "my_dense_vector": {
        "type": "dense_vector",
        "index": false,
        "element_type": "bit",
        "dims": 40 <1>
      }
    }
  }
}

PUT my-index-bit-vectors/_doc/1
{
  "my_dense_vector": [8, 5, -15, 1, -7] <2>
}

PUT my-index-bit-vectors/_doc/2
{
  "my_dense_vector": [-1, 115, -3, 4, -128]
}

PUT my-index-bit-vectors/_doc/3
{
  "my_dense_vector": [2, 18, -5, 0, -124]
}

POST my-index-bit-vectors/_refresh
--------------------------------------------------
// TEST[continued]
<1> The number of dimensions or bits for the `bit` vector.
<2> This vector represents 5 bytes, or `5 * 8 = 40` bits, which equals the configured dimensions.

[source,console]
--------------------------------------------------
GET my-index-bit-vectors/_search
{
  "query": {
    "script_score": {
      "query" : {
        "match_all": {}
      },
      "script": {
        "source": "dotProduct(params.query_vector, 'my_dense_vector')",
        "params": {
          "query_vector": [8, 5, -15, 1, -7] <1>
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]
<1> This vector is 40 bits, and thus will compute a bitwise `&` operation with the stored vectors.

[source,console]
--------------------------------------------------
GET my-index-bit-vectors/_search
{
  "query": {
    "script_score": {
      "query" : {
        "match_all": {}
      },
      "script": {
        "source": "dotProduct(params.query_vector, 'my_dense_vector')",
        "params": {
          "query_vector": [0.23, 1.45, 3.67, 4.89, -0.56, 2.34, 3.21, 1.78, -2.45, 0.98, -0.12, 3.45, 4.56, 2.78, 1.23, 0.67, 3.89, 4.12, -2.34, 1.56, 0.78, 3.21, 4.12, 2.45, -1.67, 0.34, -3.45, 4.56, -2.78, 1.23, -0.67, 3.89, -4.34, 2.12, -1.56, 0.78, -3.21, 4.45, 2.12, 1.67] <1>
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]
<1> This vector is 40 individual dimensions, and thus will sum the floating point values using the stored `bit` vector as a mask.

Currently, the `cosineSimilarity` function is not supported for `bit` vectors.
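
The bit-vector semantics described in this section can be sketched in Python as follows. All helper names are hypothetical and exist only for this illustration. In particular, `masked_float_dot_product` assumes that bits within each stored byte are read most-significant bit first; that ordering is an assumption of this sketch, so verify it against your {es} version before relying on it.

```python
import math

def to_unsigned(byte_values):
    """Map signed byte values (-128..127) onto their 8-bit patterns."""
    return [b & 0xFF for b in byte_values]

def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")

def bit_dot_product(a, b):
    """Bit-to-bit dot product: count of bits set in both vectors (bitwise AND)."""
    return sum(popcount(x & y) for x, y in zip(to_unsigned(a), to_unsigned(b)))

def bit_hamming(a, b):
    """Hamming distance: count of differing bits (bitwise XOR)."""
    return sum(popcount(x ^ y) for x, y in zip(to_unsigned(a), to_unsigned(b)))

def bit_l2(a, b):
    """For bit vectors, the L2 distance is the square root of the Hamming distance."""
    return math.sqrt(bit_hamming(a, b))

def masked_float_dot_product(query_floats, stored_bytes):
    """Sum the query floats whose corresponding stored bit is 1."""
    total = 0.0
    for i, byte in enumerate(to_unsigned(stored_bytes)):
        for j in range(8):
            # ASSUMPTION: bit j of each byte is read most-significant bit first.
            if byte & (0x80 >> j):
                total += query_floats[i * 8 + j]
    return total

doc_1 = [8, 5, -15, 1, -7]
doc_2 = [-1, 115, -3, 4, -128]
doc_3 = [2, 18, -5, 0, -124]
```

With the sample documents, the bit-to-bit dot product of `doc_1` with itself is 15, with `doc_2` it is 8, and with `doc_3` it is 6.
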