123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173 |
- [[query-dsl-mlt-query]]
- === More Like This Query
- More like this query find documents that are "like" provided text by
- running it against one or more fields.
- [source,js]
- --------------------------------------------------
- {
- "more_like_this" : {
- "fields" : ["name.first", "name.last"],
- "like" : "text like this one",
- "min_term_freq" : 1,
- "max_query_terms" : 12
- }
- }
- --------------------------------------------------
- More Like This can find documents that are "like" a set of
- chosen documents. The syntax to specify one or more documents is similar to
- the <<docs-multi-get,Multi GET API>>.
- If only one document is specified, the query behaves the same as the
- <<search-more-like-this,More Like This API>>.
- [source,js]
- --------------------------------------------------
- {
- "more_like_this" : {
- "fields" : ["name.first", "name.last"],
- "like" : [
- {
- "_index" : "test",
- "_type" : "type",
- "_id" : "1"
- },
- {
- "_index" : "test",
- "_type" : "type",
- "_id" : "2"
- },
- "and also some text like this one!"
- ],
- "min_term_freq" : 1,
- "max_query_terms" : 12
- }
- }
- --------------------------------------------------
- Additionally, <<docs-termvectors-artificial-doc,artificial documents>> are also supported.
- This is useful in order to specify one or more documents not present in the index.
- [source,js]
- --------------------------------------------------
- {
- "more_like_this" : {
- "fields" : ["name.first", "name.last"],
- "like" : [
- {
- "_index" : "test",
- "_type" : "type",
- "doc" : {
- "name": {
- "first": "Ben",
- "last": "Grimm"
- },
- "tweet": "You got no idea what I'd... what I'd give to be invisible."
- }
- }
- },
- {
- "_index" : "test",
- "_type" : "type",
- "_id" : "2"
- }
- ],
- "min_term_freq" : 1,
- "max_query_terms" : 12
- }
- }
- --------------------------------------------------
- `more_like_this` can be shortened to `mlt`.
- Under the hood, `more_like_this` simply creates multiple `should` clauses in a `bool` query of
- interesting terms extracted from some provided text. The interesting terms are
- selected with respect to their tf-idf scores. These are controlled by
- `min_term_freq`, `min_doc_freq`, and `max_doc_freq`. The number of interesting
- terms is controlled by `max_query_terms`. While the minimum number of clauses
- that must be satisfied is controlled by `minimum_should_match`. The terms
- are extracted from the text in `like` and analyzed by the analyzer associated
- with the field, unless specified by `analyzer`. There are other parameters,
- such as `min_word_length`, `max_word_length` or `stop_words`, to control what
- terms should be considered as interesting. In order to give more weight to
- more interesting terms, each boolean clause associated with a term could be
- boosted by the term tf-idf score times some boosting factor `boost_terms`.
- When a search for multiple documents is issued, More Like This generates a
- `more_like_this` query per document field in `fields`. These `fields` are
- specified as a top level parameter or within each document request.
- IMPORTANT: The fields must be indexed and of type `string`. Additionally, when
- using `like` with documents, the fields must be either `stored`, store `term_vector`
- or `_source` must be enabled.
- The `more_like_this` top level parameters include:
- [cols="<,<",options="header",]
- |=======================================================================
- |Parameter |Description
- |`fields` |A list of the fields to run the more like this query against.
- Defaults to the `_all` field for text and to all possible fields
- for documents.
- |`like`|coming[2.0]
- Can either be some text, some documents or a combination of all, *required*.
- A document request follows the same syntax as the
- <<docs-multi-get,Multi Get API>> or <<docs-multi-termvectors,Multi Term Vectors API>>.
- In this case, the text is fetched from `fields` unless specified otherwise in each document request.
- The text is analyzed by the default analyzer at the field, unless overridden by the
- `per_field_analyzer` parameter of the <<docs-termvectors-per-field-analyzer,Term Vectors API>>.
- |`like_text` |deprecated[2.0,Replaced by `like`]
- The text to find documents like it, *required* if `ids` or `docs` are
- not specified.
- |`ids` or `docs` |deprecated[2.0,Replaced by `like`]
- A list of documents following the same syntax as the
- <<docs-multi-get,Multi GET API>> or <<docs-multi-termvectors,Multi termvectors API>>.
- The text is fetched from `fields` unless specified otherwise in each `doc`.
- The text is analyzed by the default analyzer at the field, unless specified by the
- `per_field_analyzer` parameter of the <<docs-termvectors-per-field-analyzer,Term Vectors API>>.
- |`include` |When using `like` with document requests, specifies whether the documents should be
- included from the search. Defaults to `false`.
- |`minimum_should_match`| From the generated query, the number of terms that
- must match following the <<query-dsl-minimum-should-match,minimum should
- syntax>>. (Defaults to `"30%"`).
- |`min_term_freq` |The frequency below which terms will be ignored in the
- source doc. The default frequency is `2`.
- |`max_query_terms` |The maximum number of query terms that will be
- included in any generated query. Defaults to `25`.
- |`stop_words` |An array of stop words. Any word in this set is
- considered "uninteresting" and ignored. Even if your Analyzer allows
- stopwords, you might want to tell the MoreLikeThis code to ignore them,
- as for the purposes of document similarity it seems reasonable to assume
- that "a stop word is never interesting".
- |`min_doc_freq` |The frequency at which words will be ignored which do
- not occur in at least this many docs. Defaults to `5`.
- |`max_doc_freq` |The maximum frequency in which words may still appear.
- Words that appear in more than this many docs will be ignored. Defaults
- to unbounded.
- |`min_word_length` |The minimum word length below which words will be
- ignored. Defaults to `0`.(Old name "min_word_len" is deprecated)
- |`max_word_length` |The maximum word length above which words will be
- ignored. Defaults to unbounded (`0`). (Old name "max_word_len" is deprecated)
- |`boost_terms` |Sets the boost factor to use when boosting terms.
- Defaults to deactivated (`0`). Any other value activates boosting with given
- boost factor.
- |`boost` |Sets the boost value of the query. Defaults to `1.0`.
- |`analyzer` |The analyzer that will be used to analyze the `like text`.
- Defaults to the analyzer associated with the first field in `fields`.
- |=======================================================================
|