[[docs-termvectors]]
== Term Vectors

Returns information and statistics on terms in the fields of a particular
document. The document could be stored in the index or artificially provided
by the user. Term vectors are <<realtime,realtime>> by default, not near
realtime. This can be changed by setting the `realtime` parameter to `false`.

[source,js]
--------------------------------------------------
GET /twitter/_doc/1/_termvectors
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]

Optionally, you can specify the fields for which the information is
retrieved, either with a parameter in the URL

[source,js]
--------------------------------------------------
GET /twitter/_doc/1/_termvectors?fields=message
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]

or by adding the requested fields in the request body (see
example below). Fields can also be specified with wildcards
in a similar way to the <<query-dsl-multi-match-query,multi match query>>.
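
To make the two ways of selecting fields concrete, here is a minimal Python
sketch that only builds the request path and body and sends nothing. The helper
name `termvectors_request` is hypothetical, and joining several field names with
commas in the URL is an assumption; the endpoint path and the `fields` key come
from the examples in this section.

```python
import json
from urllib.parse import urlencode


def termvectors_request(index, doc_id, fields, in_body=False):
    """Return (path, body) for a term vectors request.

    Hypothetical helper: body is a JSON string when the fields go in the
    request body, otherwise None and the fields are put in the query string.
    """
    path = "/{}/_doc/{}/_termvectors".format(index, doc_id)
    if in_body:
        return path, json.dumps({"fields": list(fields)})
    # Assumption: multiple fields are passed as one comma-separated value.
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return path + "?" + query, None


url_path, url_body = termvectors_request("twitter", 1, ["message"])
body_path, body_json = termvectors_request("twitter", 1, ["text", "fullname"], in_body=True)
```

Both calls describe the same kind of request; the second form is the one used
by the longer examples further down, where other options share the body.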
[WARNING]
Note that the usage of `/_termvector` was deprecated in 2.0 and replaced by `/_termvectors`.

[float]
=== Return values

Three types of values can be requested: _term information_, _term statistics_,
and _field statistics_. By default, all term information and field
statistics are returned for all fields, but no term statistics.

[float]
==== Term information

 * term frequency in the field (always returned)
 * term positions (`positions` : true)
 * start and end offsets (`offsets` : true)
 * term payloads (`payloads` : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be
computed on the fly if possible. Additionally, term vectors can be computed
for documents that do not exist in the index but are instead provided by the user.

[WARNING]
======
Start and end offsets assume UTF-16 encoding is being used. If you want to use
these offsets in order to get the original text that produced this token, you
should make sure that the string you are taking a sub-string of is also encoded
using UTF-16.
======
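
The warning above matters in languages whose strings are not indexed by UTF-16
code units. As a minimal sketch, the following Python function (the name
`slice_utf16` is hypothetical) converts the text to UTF-16 before slicing, so
that offsets from a term vectors response line up even when the text contains
characters outside the Basic Multilingual Plane.

```python
def slice_utf16(text, start_offset, end_offset):
    """Return the substring covered by UTF-16 code-unit offsets."""
    # "utf-16-le" writes exactly 2 bytes per code unit and no byte order mark.
    units = text.encode("utf-16-le")
    return units[2 * start_offset:2 * end_offset].decode("utf-16-le")


# The emoji occupies two UTF-16 code units but one Python string index.
text = "\U0001F600 twitter test"
token = slice_utf16(text, 3, 10)   # offsets counted in UTF-16 code units
naive = text[3:10]                 # plain slicing counts code points instead
```

Here `token` is the intended `twitter`, while `naive` is shifted by one
character because Python indexes strings by code point.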
[float]
==== Term statistics

Setting `term_statistics` to `true` (default is `false`) will
return

 * total term frequency (how often a term occurs in all documents)
 * document frequency (the number of documents containing the current
   term)

By default these values are not returned since term statistics can
have a serious performance impact.

[float]
==== Field statistics

Setting `field_statistics` to `false` (default is `true`) will
omit:

 * document count (how many documents contain this field)
 * sum of document frequencies (the sum of document frequencies for all
   terms in this field)
 * sum of total term frequencies (the sum of total term frequencies of
   each term in this field)
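
As a small worked example of how the three field statistics relate, the
following Python sketch derives two per-document averages from them. The
function name `field_averages` is hypothetical; the numbers are the
`field_statistics` values from the stored term vectors example in this section.

```python
def field_averages(stats):
    """Derive per-document averages from a field_statistics object."""
    docs = stats["doc_count"]
    return {
        # sum_ttf counts every token occurrence in the field.
        "avg_tokens_per_doc": stats["sum_ttf"] / docs,
        # sum_doc_freq counts each (term, document) pair once.
        "avg_distinct_terms_per_doc": stats["sum_doc_freq"] / docs,
    }


field_statistics = {"doc_count": 2, "sum_doc_freq": 6, "sum_ttf": 8}
averages = field_averages(field_statistics)
```

For the two sample documents this gives four tokens and three distinct terms
per document on average.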
[float]
==== Terms Filtering

With the parameter `filter`, the terms returned can also be filtered based
on their tf-idf scores. This can be useful in order to find a good
characteristic vector of a document. This feature works in a similar manner to
the <<mlt-query-term-selection,second phase>> of the
<<query-dsl-mlt-query,More Like This Query>>. See <<docs-termvectors-terms-filtering,example 5>>
for usage.

The following sub-parameters are supported:

[horizontal]
`max_num_terms`::
  Maximum number of terms to return per field. Defaults to `25`.
`min_term_freq`::
  Ignore words with less than this frequency in the source doc. Defaults to `1`.
`max_term_freq`::
  Ignore words with more than this frequency in the source doc. Defaults to unbounded.
`min_doc_freq`::
  Ignore terms which do not occur in at least this many docs. Defaults to `1`.
`max_doc_freq`::
  Ignore words which occur in more than this many docs. Defaults to unbounded.
`min_word_length`::
  The minimum word length below which words will be ignored. Defaults to `0`.
`max_word_length`::
  The maximum word length above which words will be ignored. Defaults to unbounded (`0`).
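
One plausible reading of how these sub-parameters combine is sketched below in
Python: drop terms outside the frequency and length bounds, then keep the
highest scoring ones. This is an illustration, not the server's implementation;
the function name `select_terms`, the use of `None` for "unbounded", the
tie-breaking, and the sample numbers are all assumptions.

```python
def select_terms(terms, max_num_terms=25, min_term_freq=1, max_term_freq=None,
                 min_doc_freq=1, max_doc_freq=None,
                 min_word_length=0, max_word_length=None):
    """Filter and rank terms.

    terms maps term -> {"term_freq": int, "doc_freq": int, "score": float}.
    None stands for "unbounded" in this sketch.
    """
    kept = []
    for term, info in terms.items():
        if info["term_freq"] < min_term_freq:
            continue
        if max_term_freq is not None and info["term_freq"] > max_term_freq:
            continue
        if info["doc_freq"] < min_doc_freq:
            continue
        if max_doc_freq is not None and info["doc_freq"] > max_doc_freq:
            continue
        if len(term) < min_word_length:
            continue
        if max_word_length is not None and len(term) > max_word_length:
            continue
        kept.append((info["score"], term))
    kept.sort(reverse=True)  # highest score first
    return [term for _, term in kept[:max_num_terms]]


# Hypothetical sample data, loosely modelled on a movie plot field.
sample = {
    "armored": {"term_freq": 1, "doc_freq": 27, "score": 9.7},
    "stark": {"term_freq": 1, "doc_freq": 44, "score": 9.2},
    "industrialist": {"term_freq": 1, "doc_freq": 88, "score": 8.5},
    "to": {"term_freq": 3, "doc_freq": 170000, "score": 3.1},
    "a": {"term_freq": 1, "doc_freq": 160000, "score": 1.1},
}
top = select_terms(sample, max_num_terms=3)
```

With these sample values, `top` contains the three rare terms, and very common
words such as `to` and `a` are crowded out by their low scores.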
[float]
=== Behaviour

The term and field statistics are not accurate: deleted documents
are not taken into account, and the information is only retrieved for the
shard on which the requested document resides.
The term and field statistics are therefore only useful as relative measures,
whereas the absolute numbers have no meaning in this context. By default,
when requesting term vectors of artificial documents, the shard from which to
get the statistics is selected randomly. Use `routing` only to hit a particular shard.

[float]
==== Example: Returning stored term vectors

First, we create an index that stores term vectors, payloads, and so on:

[source,js]
--------------------------------------------------
PUT /twitter/
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Second, we add some documents:

[source,js]
--------------------------------------------------
PUT /twitter/_doc/1
{
  "fullname" : "John Doe",
  "text" : "twitter test test test "
}

PUT /twitter/_doc/2
{
  "fullname" : "Jane Doe",
  "text" : "Another twitter test ..."
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

The following request returns all information and statistics for field
`text` in document `1` (John Doe):

[source,js]
--------------------------------------------------
GET /twitter/_doc/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

Response:

[source,js]
--------------------------------------------------
{
    "_id": "1",
    "_index": "twitter",
    "_type": "_doc",
    "_version": 1,
    "found": true,
    "took": 6,
    "term_vectors": {
        "text": {
            "field_statistics": {
                "doc_count": 2,
                "sum_doc_freq": 6,
                "sum_ttf": 8
            },
            "terms": {
                "test": {
                    "doc_freq": 2,
                    "term_freq": 3,
                    "tokens": [
                        {
                            "end_offset": 12,
                            "payload": "d29yZA==",
                            "position": 1,
                            "start_offset": 8
                        },
                        {
                            "end_offset": 17,
                            "payload": "d29yZA==",
                            "position": 2,
                            "start_offset": 13
                        },
                        {
                            "end_offset": 22,
                            "payload": "d29yZA==",
                            "position": 3,
                            "start_offset": 18
                        }
                    ],
                    "ttf": 4
                },
                "twitter": {
                    "doc_freq": 2,
                    "term_freq": 1,
                    "tokens": [
                        {
                            "end_offset": 7,
                            "payload": "d29yZA==",
                            "position": 0,
                            "start_offset": 0
                        }
                    ],
                    "ttf": 2
                }
            }
        }
    }
}
--------------------------------------------------
// TEST[continued]
// TESTRESPONSE[s/"took": 6/"took": "$body.took"/]
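
The response groups tokens by term, so recovering the original token order
takes one extra step. The following Python sketch (the function name
`tokens_in_order` is hypothetical) flattens the `terms` object of one field and
sorts by `position`, using a trimmed copy of the response above as input.

```python
def tokens_in_order(field_vector):
    """Flatten a field's "terms" object into sorted
    (position, term, start_offset, end_offset) tuples."""
    stream = []
    for term, info in field_vector["terms"].items():
        # "tokens" is only present when positions or offsets were requested.
        for tok in info.get("tokens", []):
            stream.append(
                (tok["position"], term, tok["start_offset"], tok["end_offset"])
            )
    return sorted(stream)


# Trimmed copy of the "text" entry from the response above.
text_vector = {
    "terms": {
        "test": {
            "term_freq": 3,
            "tokens": [
                {"position": 1, "start_offset": 8, "end_offset": 12},
                {"position": 2, "start_offset": 13, "end_offset": 17},
                {"position": 3, "start_offset": 18, "end_offset": 22},
            ],
        },
        "twitter": {
            "term_freq": 1,
            "tokens": [{"position": 0, "start_offset": 0, "end_offset": 7}],
        },
    }
}
ordered = tokens_in_order(text_vector)
```

Reading the second element of each tuple in `ordered` gives `twitter`, `test`,
`test`, `test`, which matches the indexed text of document `1`.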

[float]
==== Example: Generating term vectors on the fly

Term vectors that are not explicitly stored in the index are automatically
computed on the fly. The following request returns all information and statistics for the
fields in document `1`, even though the terms haven't been explicitly stored in the index.
Note that for the field `text`, the terms are not re-generated, because its
term vectors are already stored in the index.

[source,js]
--------------------------------------------------
GET /twitter/_doc/1/_termvectors
{
  "fields" : ["text", "some_field_without_term_vectors"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[[docs-termvectors-artificial-doc]]
[float]
==== Example: Artificial documents

Term vectors can also be generated for artificial documents,
that is, for documents not present in the index. For example, the following request would
return the same results as in example 1. The mapping used is determined by the
`index` and `type`.

*If dynamic mapping is turned on (default), the document fields not in the original
mapping will be dynamically created.*

[source,js]
--------------------------------------------------
GET /twitter/_doc/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[[docs-termvectors-per-field-analyzer]]
[float]
===== Per-field analyzer

Additionally, an analyzer different from the one configured for the field may
be provided by using the `per_field_analyzer` parameter. This is useful in order to
generate term vectors in any fashion, especially when using artificial
documents. When providing an analyzer for a field that already stores term
vectors, the term vectors will be re-generated.

[source,js]
--------------------------------------------------
GET /twitter/_doc/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

Response:

[source,js]
--------------------------------------------------
{
  "_index": "twitter",
  "_type": "_doc",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
       "field_statistics": {
          "sum_doc_freq": 2,
          "doc_count": 4,
          "sum_ttf": 4
       },
       "terms": {
          "John Doe": {
             "term_freq": 1,
             "tokens": [
                {
                   "position": 0,
                   "start_offset": 0,
                   "end_offset": 8
                }
             ]
          }
       }
    }
  }
}
--------------------------------------------------
// TEST[continued]
// TESTRESPONSE[s/"took": 6/"took": "$body.took"/]
// TESTRESPONSE[s/"sum_doc_freq": 2/"sum_doc_freq": "$body.term_vectors.fullname.field_statistics.sum_doc_freq"/]
// TESTRESPONSE[s/"doc_count": 4/"doc_count": "$body.term_vectors.fullname.field_statistics.doc_count"/]
// TESTRESPONSE[s/"sum_ttf": 4/"sum_ttf": "$body.term_vectors.fullname.field_statistics.sum_ttf"/]

[[docs-termvectors-terms-filtering]]
[float]
==== Example: Terms filtering

Finally, the terms returned can be filtered based on their tf-idf scores. In
the example below we obtain the three most "interesting" keywords from the
artificial document having the given "plot" field value. Notice
that neither the keyword "Tony" nor any stop words are part of the response, as
their tf-idf must be too low.

[source,js]
--------------------------------------------------
GET /imdb/_doc/_termvectors
{
    "doc": {
      "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
    },
    "term_statistics" : true,
    "field_statistics" : true,
    "positions": false,
    "offsets": false,
    "filter" : {
      "max_num_terms" : 3,
      "min_term_freq" : 1,
      "min_doc_freq" : 1
    }
}
--------------------------------------------------
// CONSOLE
// TEST[skip:no imdb test index]

Response:

[source,js]
--------------------------------------------------
{
   "_index": "imdb",
   "_type": "_doc",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}
--------------------------------------------------
// TESTRESPONSE
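
The `score` values in the response above are numerically consistent with a
classic tf-idf of the form `term_freq * (1 + ln(doc_count / (doc_freq + 1)))`.
The Python sketch below checks this against the three sample terms. Treat it as
an observation about these sample numbers only: the function name `tfidf_score`
is hypothetical, and the exact formula used by a given Elasticsearch version is
an assumption here.

```python
import math


def tfidf_score(term_freq, doc_freq, doc_count):
    """Classic tf-idf; an assumed formula that matches the sample scores."""
    return term_freq * (1.0 + math.log(doc_count / (doc_freq + 1)))


doc_count = 176214  # field_statistics.doc_count from the response above
# term -> (doc_freq, score reported in the response)
reported = {
    "armored": (27, 9.74725),
    "industrialist": (88, 8.590818),
    "stark": (44, 9.272792),
}
computed = {
    term: tfidf_score(1, doc_freq, doc_count)
    for term, (doc_freq, _) in reported.items()
}
```

Under this formula a term that appears in a large share of the documents gets
an idf close to 1, which is in line with stop words being filtered out of the
top three.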