[[query-dsl-mlt-query]]
=== More Like This Query

The More Like This query finds documents that are "like" the provided text by
running it against one or more fields.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "like_text" : "text like this one",
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}
--------------------------------------------------

Additionally, More Like This can find documents that are "like" a set of
chosen documents. The syntax to specify one or more documents is similar to
the <<docs-multi-get,Multi GET API>>, and supports the `ids` or `docs` array.
If only one document is specified, the query behaves the same as the
<<search-more-like-this,More Like This API>>.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "docs" : [
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "2"
        }
        ],
        "ids" : ["3", "4"],
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}
--------------------------------------------------

`more_like_this` can be shortened to `mlt`.

Under the hood, `more_like_this` simply creates multiple `should` clauses in a
`bool` query from interesting terms extracted from the provided text. The
interesting terms are selected with respect to their tf-idf scores, and which
terms are eligible is controlled by `min_term_freq`, `min_doc_freq`, and
`max_doc_freq`. The number of interesting terms is controlled by
`max_query_terms`, while the minimum number of clauses that must be satisfied
is controlled by `percent_terms_to_match`. The terms are extracted from
`like_text`, which is analyzed by the analyzer associated with the field
unless a different one is specified by `analyzer`. Other parameters, such as
`min_word_length`, `max_word_length` or `stop_words`, further control which
terms are considered interesting. To give more weight to more interesting
terms, each boolean clause associated with a term can be boosted by the
term's tf-idf score times a boosting factor, `boost_terms`.

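The selection described above can be illustrated with a minimal Python sketch. It is not the engine's implementation: the whitespace tokenizer stands in for a real analyzer, the score is a textbook `tf * log(N / df)`, and the helper names `interesting_terms` and `build_bool_query`, the `doc_freqs` mapping, and `num_docs` are hypothetical inputs introduced only for this example.

```python
import math
from collections import Counter


def interesting_terms(like_text, doc_freqs, num_docs, min_term_freq=2,
                      min_doc_freq=5, max_doc_freq=None, max_query_terms=25,
                      min_word_length=0, max_word_length=0, stop_words=()):
    """Return up to max_query_terms (term, score) pairs, best score first."""
    # Assumption: lowercase + whitespace split stands in for the field analyzer.
    term_freqs = Counter(like_text.lower().split())
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if term in stop_words or tf < min_term_freq:
            continue
        # max(..., 1) also skips terms unknown to the index (df == 0).
        if df < max(min_doc_freq, 1):
            continue
        if max_doc_freq is not None and df > max_doc_freq:
            continue
        if len(term) < min_word_length:
            continue
        if max_word_length and len(term) > max_word_length:  # 0 = unbounded
            continue
        # Assumption: textbook tf-idf; the engine's scoring may differ.
        scored.append((term, tf * math.log(num_docs / df)))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_query_terms]


def build_bool_query(field, terms, percent_terms_to_match=0.3, boost_terms=0):
    """Turn scored terms into a bool query with one should clause per term."""
    should = []
    for term, score in terms:
        clause = {"term": {field: {"value": term}}}
        if boost_terms:  # 0 means boosting is deactivated
            clause["term"][field]["boost"] = boost_terms * score
        should.append(clause)
    return {
        "bool": {
            "should": should,
            # Minimum number of clauses that must match, rounded down.
            "minimum_should_match": int(percent_terms_to_match * len(should)),
        }
    }
```

Calling `interesting_terms` on some text and passing the result to `build_bool_query` yields a plain dictionary that mirrors the `bool`/`should` structure the paragraph describes. The defaults are taken from the parameter table in this section.
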
When a search for multiple `docs` is issued, More Like This generates one
`more_like_this` query per document and per field in `fields`. The `fields`
are specified either as a top level parameter or within each `doc`.

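Reading the paragraph above as "one generated query per document and per field", the expansion can be sketched as follows. The helper name `expand_docs` is hypothetical, and the use of a `fields` key inside each doc is an assumption based on the Multi GET syntax, not a confirmed detail of the implementation.

```python
def expand_docs(docs, top_level_fields):
    """Return one (doc_id, field) pair per query that would be generated."""
    pairs = []
    for doc in docs:
        # Assumption: fields given inside a doc override the top-level list.
        for field in doc.get("fields", top_level_fields):
            pairs.append((doc["_id"], field))
    return pairs


docs = [
    {"_index": "test", "_type": "type", "_id": "1"},
    {"_index": "test", "_type": "type", "_id": "2", "fields": ["name.last"]},
]
pairs = expand_docs(docs, ["name.first", "name.last"])
```

Here document `1` falls back to the two top level fields, while document `2` contributes only the field it names itself.
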
The `more_like_this` top level parameters include:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`fields` |A list of the fields to run the more like this query against.
Defaults to the `_all` field.

|`like_text` |The text to find similar documents for. *Required* if neither
`ids` nor `docs` is specified.

|`ids` or `docs` |A list of documents following the same syntax as the
<<docs-multi-get,Multi GET API>>. This parameter is *required* if
`like_text` is not specified. The texts are fetched from `fields`, unless
overridden in each `doc`. These fields cannot be set to `_all`.

|`include` |When using `ids` or `docs`, specifies whether the documents should be
included in the search results. Defaults to `false`.

|`percent_terms_to_match` |The percentage of terms to match on (float
value). Defaults to `0.3` (30 percent).

|`min_term_freq` |The frequency below which terms will be ignored in the
source doc. Defaults to `2`.

|`max_query_terms` |The maximum number of query terms that will be
included in any generated query. Defaults to `25`.

|`stop_words` |An array of stop words. Any word in this set is
considered "uninteresting" and ignored. Even if your analyzer allows
stop words, you might want to tell the MoreLikeThis code to ignore them,
since for the purposes of document similarity it seems reasonable to assume
that "a stop word is never interesting".

|`min_doc_freq` |The minimum document frequency. Words that do not occur in
at least this many docs will be ignored. Defaults to `5`.

|`max_doc_freq` |The maximum document frequency. Words that appear in more
than this many docs will be ignored. Defaults to unbounded.

|`min_word_length` |The minimum word length below which words will be
ignored. Defaults to `0`. (The old name `min_word_len` is deprecated.)

|`max_word_length` |The maximum word length above which words will be
ignored. Defaults to unbounded (`0`). (The old name `max_word_len` is
deprecated.)

|`boost_terms` |Sets the boost factor to use when boosting terms.
Defaults to deactivated (`0`). Any other value activates boosting with the
given boost factor.

|`boost` |Sets the boost value of the query. Defaults to `1.0`.

|`analyzer` |The analyzer that will be used to analyze the text.
Defaults to the analyzer associated with the field.
|=======================================================================

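As an illustration of how the parameters in the table fit together, the following Python sketch assembles a request body and enforces the one rule the table states explicitly: `like_text` is required unless `ids` or `docs` is given. The function `more_like_this` and the `KNOWN_PARAMS` set are hypothetical helpers written for this example, not part of any client library.

```python
import json

# Optional parameter names taken from the table above.
KNOWN_PARAMS = {
    "include", "percent_terms_to_match", "min_term_freq", "max_query_terms",
    "stop_words", "min_doc_freq", "max_doc_freq", "min_word_length",
    "max_word_length", "boost_terms", "boost", "analyzer",
}


def more_like_this(fields=None, like_text=None, ids=None, docs=None, **params):
    """Build a more_like_this query body as a plain dict."""
    if like_text is None and not ids and not docs:
        raise ValueError("one of like_text, ids or docs is required")
    unknown = set(params) - KNOWN_PARAMS
    if unknown:
        raise ValueError("unknown parameters: %s" % ", ".join(sorted(unknown)))
    body = {}
    # Only emit what the caller set, so server-side defaults still apply.
    for key, value in (("fields", fields), ("like_text", like_text),
                       ("ids", ids), ("docs", docs)):
        if value is not None:
            body[key] = value
    body.update(params)
    return {"more_like_this": body}


query = more_like_this(
    fields=["name.first", "name.last"],
    like_text="text like this one",
    min_term_freq=1,
    max_query_terms=12,
)
request_body = json.dumps(query)
```

The resulting `request_body` is the JSON form of the first example in this section. Leaving unset parameters out of the body, rather than filling in the documented defaults, keeps the sketch independent of the defaults of any particular version.
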