[[analysis-index-search-time]]
=== Index and search analysis

Text analysis occurs at two times:

Index time::
When a document is indexed, any <<text,`text`>> field values are analyzed.

Search time::
When running a <<full-text-queries,full-text search>> on a `text` field,
the query string (the text the user is searching for) is analyzed.
+
Search time is also called _query time_.

The analyzer, or set of analysis rules, used at each time is called the _index
analyzer_ or _search analyzer_ respectively.
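
As a minimal sketch of where these analyzers come from (the index name
`my-index` and field name `message` are placeholders), the `analyzer` mapping
parameter sets the index analyzer for a `text` field. Unless a separate search
analyzer is configured, the same analyzer is also used at search time:

[source,console]
------
PUT /my-index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
------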
[[analysis-same-index-search-analyzer]]
==== How the index and search analyzer work together

In most cases, the same analyzer should be used at index and search time. This
ensures the values and query strings for a field are changed into the same form
of tokens. In turn, this ensures the tokens match as expected during a search.

.**Example**
[%collapsible]
====
A document is indexed with the following value in a `text` field:

[source,text]
------
The QUICK brown foxes jumped over the dog!
------

The index analyzer for the field converts the value into tokens and normalizes
them. In this case, each of the tokens represents a word:

[source,text]
------
[ quick, brown, fox, jump, over, dog ]
------

These tokens are then indexed.

Later, a user searches the same `text` field for:

[source,text]
------
"Quick fox"
------

The user expects this search to match the sentence indexed earlier,
`The QUICK brown foxes jumped over the dog!`.

However, the query string does not contain the exact words used in the
document's original text:

* `Quick` vs `QUICK`
* `fox` vs `foxes`

To account for this, the query string is analyzed using the same analyzer. This
analyzer produces the following tokens:

[source,text]
------
[ quick, fox ]
------

To execute the search, {es} compares these query string tokens to the tokens
indexed in the `text` field.

[options="header"]
|===
|Token | Query string | `text` field

|`quick` | X | X
|`brown` | | X
|`fox` | X | X
|`jump` | | X
|`over` | | X
|`dog` | | X
|===

Because the field value and query string were analyzed in the same way, they
created similar tokens. The tokens `quick` and `fox` are exact matches. This
means the search matches the document containing `"The QUICK brown foxes jumped
over the dog!"`, just as the user expects.
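
The example does not name a specific analyzer; the tokens shown above are close
to what the built-in `english` analyzer produces. As a quick sketch, you can
reproduce them with the `_analyze` API, once for the field value and once for
the query string:

[source,console]
------
GET /_analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the dog!"
}

GET /_analyze
{
  "analyzer": "english",
  "text": "Quick fox"
}
------

The first request should return the tokens `quick`, `brown`, `fox`, `jump`,
`over`, and `dog`; the second, `quick` and `fox`.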
====

[[different-analyzers]]
==== When to use a different search analyzer

While less common, it sometimes makes sense to use different analyzers at index
and search time. To enable this, {es} allows you to
<<specify-search-analyzer,specify a separate search analyzer>>.

Generally, a separate search analyzer should only be specified when using the
same form of tokens for field values and query strings would create unexpected
or irrelevant search matches.

[[different-analyzer-ex]]
.*Example*
[%collapsible]
====
{es} is used to create a search engine that matches only words that start with
a provided prefix. For instance, a search for `tr` should return `tram` or
`trope`, but never `taxi` or `bat`.

A document is added to the search engine's index; this document contains one
such word in a `text` field:

[source,text]
------
"Apple"
------

The index analyzer for the field converts the value into tokens and normalizes
them. In this case, each of the tokens represents a potential prefix for
the word:

[source,text]
------
[ a, ap, app, appl, apple ]
------

These tokens are then indexed.
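
Prefix tokens like these are typically produced by an `edge_ngram` token
filter. As a rough sketch, you can generate the same output with the `_analyze`
API and an ad hoc analysis chain (the `min_gram` and `max_gram` values here are
illustrative):

[source,console]
------
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 5 }
  ],
  "text": "Apple"
}
------

This request should return the five prefix tokens shown above.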
Later, a user searches the same `text` field for:

[source,text]
------
"appli"
------

The user expects this search to match only words that start with `appli`,
such as `appliance` or `application`. The search should not match `apple`.

However, if the index analyzer is used to analyze this query string, it would
produce the following tokens:

[source,text]
------
[ a, ap, app, appl, appli ]
------

When {es} compares these query string tokens to the ones indexed for `apple`,
it finds several matches.

[options="header"]
|===
|Token | `appli` | `apple`

|`a` | X | X
|`ap` | X | X
|`app` | X | X
|`appl` | X | X
|`appli` | X |
|===

This means the search would erroneously match `apple`. Not only that, it would
match any word starting with `a`.

To fix this, you can specify a different search analyzer for query strings used
on the `text` field.

In this case, you could specify a search analyzer that produces a single token
rather than a set of prefixes:

[source,text]
------
[ appli ]
------

This query string token would only match tokens for words that start with
`appli`, which better aligns with the user's search expectations.
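
A minimal sketch of how this could be wired up in a mapping (the index name,
field name, analyzer names, and `max_gram` value are placeholders): a custom
`edge_ngram`-based analyzer is set as the index analyzer, while the built-in
`standard` analyzer, which emits the query string as whole lowercased words, is
set through the `search_analyzer` mapping parameter:

[source,console]
------
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "prefixes": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "prefix_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "prefixes" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "prefix_index_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
------

With a mapping like this, `Apple` is indexed as the prefix tokens shown above,
while a query string such as `appli` is analyzed into the single token `appli`
at search time.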
====