- [[docs-termvectors]]
- === Term vectors API
- ++++
- <titleabbrev>Term vectors</titleabbrev>
- ++++
- Retrieves information and statistics for terms in the fields of a particular document.
- [source,console]
- --------------------------------------------------
- GET /twitter/_termvectors/1
- --------------------------------------------------
- // TEST[setup:twitter]
- [[docs-termvectors-api-request]]
- ==== {api-request-title}
- `GET /<index>/_termvectors/<_id>`
- [[docs-termvectors-api-desc]]
- ==== {api-description-title}
- You can retrieve term vectors for documents stored in the index or
- for _artificial_ documents passed in the body of the request.
- You can specify the fields you are interested in through the `fields` parameter,
- or by adding the fields to the request body.
- [source,console]
- --------------------------------------------------
- GET /twitter/_termvectors/1?fields=message
- --------------------------------------------------
- // TEST[setup:twitter]
- Fields can be specified using wildcards, similar to the <<query-dsl-multi-match-query,multi match query>>.
- Term vectors are <<realtime,real-time>> by default, not near real-time.
- This can be changed by setting the `realtime` parameter to `false`.
- You can request three types of values: _term information_, _term statistics_
- and _field statistics_. By default, all term information and field
- statistics are returned for all fields but term statistics are excluded.
- [[docs-termvectors-api-term-info]]
- ===== Term information
- * term frequency in the field (always returned)
- * term positions (`positions` : true)
- * start and end offsets (`offsets` : true)
- * term payloads (`payloads` : true), as base64 encoded bytes
- If the requested information wasn't stored in the index, it is
- computed on the fly if possible. Term vectors can also be computed
- for documents that do not exist in the index but are instead provided by the user.
- [WARNING]
- ======
- Start and end offsets assume UTF-16 encoding is being used. If you want to use
- these offsets in order to get the original text that produced this token, you
- should make sure that the string you are taking a sub-string of is also encoded
- using UTF-16.
- ======
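As a hedged illustration of the warning above, the Python sketch below recovers a token from its offsets by slicing the UTF-16 encoding of the text rather than the string itself. The helper name `slice_by_utf16_offsets` and the sample text are assumptions made for this example; they are not part of the API.

```python
def slice_by_utf16_offsets(text: str, start_offset: int, end_offset: int) -> str:
    """Return the substring addressed by UTF-16 code-unit offsets."""
    # Python strings are indexed by code point, so encode to UTF-16 first.
    # Little-endian without a BOM: every code unit is exactly 2 bytes.
    units = text.encode("utf-16-le")
    return units[2 * start_offset:2 * end_offset].decode("utf-16-le")


# The emoji lies outside the Basic Multilingual Plane and occupies two UTF-16
# code units, so in UTF-16 terms the word "test" spans offsets 3..7.
sample = "\U0001F600 test"
token = slice_by_utf16_offsets(sample, 3, 7)

# Slicing the Python string directly with the same offsets counts code points
# instead of code units and therefore lands one character too far right.
naive = sample[3:7]
```

For ASCII-only text the two approaches give the same result; they diverge only when the text contains characters that need a surrogate pair.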
- [[docs-termvectors-api-term-stats]]
- ===== Term statistics
- Setting `term_statistics` to `true` (default is `false`) returns:
- * total term frequency (how often a term occurs in all documents)
- * document frequency (the number of documents containing the current
- term)
- By default these values are not returned since term statistics can
- have a serious performance impact.
- [[docs-termvectors-api-field-stats]]
- ===== Field statistics
- Setting `field_statistics` to `false` (default is `true`) omits:
- * document count (how many documents contain this field)
- * sum of document frequencies (the sum of document frequencies for all
- terms in this field)
- * sum of total term frequencies (the sum of total term frequencies of
- each term in this field)
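To make these definitions concrete, the following Python sketch computes the three field statistics for a tiny in-memory corpus. It is only an illustration: the `field_statistics` helper is hypothetical, and lowercasing plus whitespace splitting stands in for whatever analyzer the field actually uses.

```python
from collections import Counter


def field_statistics(docs):
    """Compute doc_count, sum_doc_freq and sum_ttf for one field.

    `docs` holds the field value of each document (None if the field is absent).
    Analysis is approximated by lowercasing and splitting on whitespace.
    """
    doc_count = 0
    doc_freq = Counter()    # term -> number of documents containing it
    total_freq = Counter()  # term -> occurrences across all documents
    for value in docs:
        if value is None:
            continue
        doc_count += 1
        tokens = value.lower().split()
        total_freq.update(tokens)
        doc_freq.update(set(tokens))
    return {
        "doc_count": doc_count,
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(total_freq.values()),
    }


stats = field_statistics(["twitter test test test ", "Another twitter test ..."])
```

For these two sample values the sketch yields a document count of 2, a sum of document frequencies of 6, and a sum of total term frequencies of 8.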
- [[docs-termvectors-api-terms-filtering]]
- ===== Terms filtering
- With the `filter` parameter, the returned terms can also be filtered based
- on their tf-idf scores. This is useful for finding a good
- characteristic vector of a document. This feature works in a similar manner to
- the <<mlt-query-term-selection,second phase>> of the
- <<query-dsl-mlt-query,More Like This Query>>. See <<docs-termvectors-terms-filtering,example 5>>
- for usage.
- The following sub-parameters are supported:
- [horizontal]
- `max_num_terms`::
- Maximum number of terms that must be returned per field. Defaults to `25`.
- `min_term_freq`::
- Ignore words with less than this frequency in the source doc. Defaults to `1`.
- `max_term_freq`::
- Ignore words with more than this frequency in the source doc. Defaults to unbounded.
- `min_doc_freq`::
- Ignore terms which do not occur in at least this many docs. Defaults to `1`.
- `max_doc_freq`::
- Ignore words which occur in more than this many docs. Defaults to unbounded.
- `min_word_length`::
- The minimum word length below which words will be ignored. Defaults to `0`.
- `max_word_length`::
- The maximum word length above which words will be ignored. Defaults to unbounded (`0`).
- [[docs-termvectors-api-behavior]]
- ==== Behavior
- The term and field statistics are not exact. Deleted documents
- are not taken into account, and the information is retrieved only from the
- shard where the requested document resides.
- The term and field statistics are therefore useful only as relative measures;
- the absolute numbers have no meaning in this context. By default,
- when requesting term vectors of artificial documents, the shard to get the statistics
- from is selected randomly. Use `routing` only to hit a particular shard.
- [[docs-termvectors-api-path-params]]
- ==== {api-path-parms-title}
- `<index>`::
- (Required, string) Name of the index that contains the document.
- `<_id>`::
- (Optional, string) Unique identifier of the document.
- [[docs-termvectors-api-query-params]]
- ==== {api-query-parms-title}
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=fields]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=field_statistics]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=offsets]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=payloads]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=positions]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=preference]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=routing]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=realtime]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=term_statistics]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=version]
- include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=version_type]
- [[docs-termvectors-api-example]]
- ==== {api-examples-title}
- [[docs-termvectors-api-stored-termvectors]]
- ===== Returning stored term vectors
- First, we create an index that stores term vectors, payloads, etc.:
- [source,console]
- --------------------------------------------------
- PUT /twitter
- {
-   "mappings": {
-     "properties": {
-       "text": {
-         "type": "text",
-         "term_vector": "with_positions_offsets_payloads",
-         "store" : true,
-         "analyzer" : "fulltext_analyzer"
-       },
-       "fullname": {
-         "type": "text",
-         "term_vector": "with_positions_offsets_payloads",
-         "analyzer" : "fulltext_analyzer"
-       }
-     }
-   },
-   "settings" : {
-     "index" : {
-       "number_of_shards" : 1,
-       "number_of_replicas" : 0
-     },
-     "analysis": {
-       "analyzer": {
-         "fulltext_analyzer": {
-           "type": "custom",
-           "tokenizer": "whitespace",
-           "filter": [
-             "lowercase",
-             "type_as_payload"
-           ]
-         }
-       }
-     }
-   }
- }
- --------------------------------------------------
- Second, we add some documents:
- [source,console]
- --------------------------------------------------
- PUT /twitter/_doc/1
- {
- "fullname" : "John Doe",
- "text" : "twitter test test test "
- }
- PUT /twitter/_doc/2?refresh=wait_for
- {
- "fullname" : "Jane Doe",
- "text" : "Another twitter test ..."
- }
- --------------------------------------------------
- // TEST[continued]
- The following request returns all information and statistics for field
- `text` in document `1` (John Doe):
- [source,console]
- --------------------------------------------------
- GET /twitter/_termvectors/1
- {
- "fields" : ["text"],
- "offsets" : true,
- "payloads" : true,
- "positions" : true,
- "term_statistics" : true,
- "field_statistics" : true
- }
- --------------------------------------------------
- // TEST[continued]
- Response:
- [source,console-result]
- --------------------------------------------------
- {
- "_id": "1",
- "_index": "twitter",
- "_version": 1,
- "found": true,
- "took": 6,
- "term_vectors": {
- "text": {
- "field_statistics": {
- "doc_count": 2,
- "sum_doc_freq": 6,
- "sum_ttf": 8
- },
- "terms": {
- "test": {
- "doc_freq": 2,
- "term_freq": 3,
- "tokens": [
- {
- "end_offset": 12,
- "payload": "d29yZA==",
- "position": 1,
- "start_offset": 8
- },
- {
- "end_offset": 17,
- "payload": "d29yZA==",
- "position": 2,
- "start_offset": 13
- },
- {
- "end_offset": 22,
- "payload": "d29yZA==",
- "position": 3,
- "start_offset": 18
- }
- ],
- "ttf": 4
- },
- "twitter": {
- "doc_freq": 2,
- "term_freq": 1,
- "tokens": [
- {
- "end_offset": 7,
- "payload": "d29yZA==",
- "position": 0,
- "start_offset": 0
- }
- ],
- "ttf": 2
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[continued]
- // TESTRESPONSE[s/"took": 6/"took": "$body.took"/]
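The `payload` values in the response are base64-encoded bytes, and the offsets index into the original field value. As a small sketch, assuming the payload holds plain ASCII text, both can be checked with Python's standard library. The variable names are chosen for this example only.

```python
import base64

# Payload string copied from the response above.
encoded_payload = "d29yZA=="
decoded_payload = base64.b64decode(encoded_payload)

# Field value of document 1 and the (start_offset, end_offset) pairs
# reported in the response. The text is ASCII-only, so UTF-16 offsets
# coincide with Python string indices here.
text = "twitter test test test "
offsets = [(0, 7), (8, 12), (13, 17), (18, 22)]
tokens = [text[start:end] for start, end in offsets]
```

Here the decoded bytes are `b"word"`, which is consistent with the `type_as_payload` filter in the index settings storing each token's type as its payload, and the offsets recover `twitter` followed by three occurrences of `test`.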
- [[docs-termvectors-api-generate-termvectors]]
- ===== Generating term vectors on the fly
- Term vectors that are not explicitly stored in the index are automatically
- computed on the fly. The following request returns all information and statistics for the
- fields in document `1`, even though the terms haven't been explicitly stored in the index.
- Note that for the field `text`, the terms are not re-generated.
- [source,console]
- --------------------------------------------------
- GET /twitter/_termvectors/1
- {
- "fields" : ["text", "some_field_without_term_vectors"],
- "offsets" : true,
- "positions" : true,
- "term_statistics" : true,
- "field_statistics" : true
- }
- --------------------------------------------------
- // TEST[continued]
- [[docs-termvectors-artificial-doc]]
- ===== Artificial documents
- Term vectors can also be generated for artificial documents,
- that is, for documents not present in the index. For example, the following request
- returns the same results as in example 1. The mapping used is determined by the `index`.
- *If dynamic mapping is turned on (default), the document fields not in the original
- mapping will be dynamically created.*
- [source,console]
- --------------------------------------------------
- GET /twitter/_termvectors
- {
- "doc" : {
- "fullname" : "John Doe",
- "text" : "twitter test test test"
- }
- }
- --------------------------------------------------
- // TEST[continued]
- [[docs-termvectors-per-field-analyzer]]
- ====== Per-field analyzer
- Additionally, an analyzer different from the one configured for the field can be
- provided by using the `per_field_analyzer` parameter. This is useful for
- generating term vectors in any fashion, especially when using artificial
- documents. When providing an analyzer for a field that already stores term
- vectors, the term vectors will be re-generated.
- [source,console]
- --------------------------------------------------
- GET /twitter/_termvectors
- {
- "doc" : {
- "fullname" : "John Doe",
- "text" : "twitter test test test"
- },
- "fields": ["fullname"],
- "per_field_analyzer" : {
- "fullname": "keyword"
- }
- }
- --------------------------------------------------
- // TEST[continued]
- Response:
- [source,console-result]
- --------------------------------------------------
- {
- "_index": "twitter",
- "_version": 0,
- "found": true,
- "took": 6,
- "term_vectors": {
- "fullname": {
- "field_statistics": {
- "sum_doc_freq": 2,
- "doc_count": 4,
- "sum_ttf": 4
- },
- "terms": {
- "John Doe": {
- "term_freq": 1,
- "tokens": [
- {
- "position": 0,
- "start_offset": 0,
- "end_offset": 8
- }
- ]
- }
- }
- }
- }
- }
- --------------------------------------------------
- // TEST[continued]
- // TESTRESPONSE[s/"took": 6/"took": "$body.took"/]
- // TESTRESPONSE[s/"sum_doc_freq": 2/"sum_doc_freq": "$body.term_vectors.fullname.field_statistics.sum_doc_freq"/]
- // TESTRESPONSE[s/"doc_count": 4/"doc_count": "$body.term_vectors.fullname.field_statistics.doc_count"/]
- // TESTRESPONSE[s/"sum_ttf": 4/"sum_ttf": "$body.term_vectors.fullname.field_statistics.sum_ttf"/]
- [[docs-termvectors-terms-filtering]]
- ===== Terms filtering
- Finally, the terms returned can be filtered based on their tf-idf scores. In
- the example below we obtain the three most "interesting" keywords from the
- artificial document having the given "plot" field value. Notice
- that neither the keyword "Tony" nor any stop words are part of the response, as
- their tf-idf must be too low.
- [source,console]
- --------------------------------------------------
- GET /imdb/_termvectors
- {
- "doc": {
- "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
- },
- "term_statistics" : true,
- "field_statistics" : true,
- "positions": false,
- "offsets": false,
- "filter" : {
- "max_num_terms" : 3,
- "min_term_freq" : 1,
- "min_doc_freq" : 1
- }
- }
- --------------------------------------------------
- // TEST[skip:no imdb test index]
- Response:
- [source,console-result]
- --------------------------------------------------
- {
- "_index": "imdb",
- "_version": 0,
- "found": true,
- "term_vectors": {
- "plot": {
- "field_statistics": {
- "sum_doc_freq": 3384269,
- "doc_count": 176214,
- "sum_ttf": 3753460
- },
- "terms": {
- "armored": {
- "doc_freq": 27,
- "ttf": 27,
- "term_freq": 1,
- "score": 9.74725
- },
- "industrialist": {
- "doc_freq": 88,
- "ttf": 88,
- "term_freq": 1,
- "score": 8.590818
- },
- "stark": {
- "doc_freq": 44,
- "ttf": 47,
- "term_freq": 1,
- "score": 9.272792
- }
- }
- }
- }
- }
- --------------------------------------------------
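The `score` values above appear consistent with a classic tf-idf weighting, `term_freq * (1 + ln((doc_count + 1) / (doc_freq + 1)))`. The source does not state the formula, so treat the Python sketch below as an assumption checked only against this one response; `tf_idf` is a hypothetical helper.

```python
import math


def tf_idf(term_freq: int, doc_freq: int, doc_count: int) -> float:
    """Assumed scoring: term frequency times a smoothed inverse document frequency."""
    return term_freq * (1.0 + math.log((doc_count + 1) / (doc_freq + 1)))


doc_count = 176214  # field_statistics.doc_count from the response

# term -> (term_freq, doc_freq, score reported in the response)
reported = {
    "armored": (1, 27, 9.74725),
    "industrialist": (1, 88, 8.590818),
    "stark": (1, 44, 9.272792),
}

computed = {
    term: tf_idf(term_freq, doc_freq, doc_count)
    for term, (term_freq, doc_freq, _score) in reported.items()
}
```

Under this assumption the computed values agree with the reported scores to about three decimal places, and a rarer term such as `armored` scores higher than a more common one such as `industrialist`.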