[[analyzer]]
=== `analyzer`

The values of <<mapping-index,`analyzed`>> string fields are passed through an
<<analysis,analyzer>> to convert the string into a stream of _tokens_ or
_terms_. For instance, the string `"The quick Brown Foxes."` may, depending
on which analyzer is used, be analyzed to the tokens: `quick`, `brown`,
`fox`. These are the actual terms that are indexed for the field, which makes
it possible to search efficiently for individual words _within_ big blobs of
text.

This analysis process needs to happen not just at index time, but also at
query time: the query string needs to be passed through the same (or a
similar) analyzer so that the terms that it tries to find are in the same
format as those that exist in the index.

Elasticsearch ships with a number of <<analysis-analyzers,pre-defined analyzers>>,
which can be used without further configuration. It also ships with many
<<analysis-charfilters,character filters>>, <<analysis-tokenizers,tokenizers>>,
and <<analysis-tokenfilters,token filters>>, which can be combined to configure
custom analyzers per index.

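As a minimal sketch of how these building blocks combine, the request below
defines a custom analyzer from one built-in character filter, one tokenizer,
and two token filters. The index name `my_custom_index` and the analyzer name
`my_html_analyzer` are hypothetical and chosen only for illustration; the
components themselves (`html_strip`, `standard`, `lowercase`, `asciifolding`)
are built in.

[source,js]
--------------------------------------------------
PUT /my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": { <1>
          "type": "custom",
          "char_filter": [ "html_strip" ], <2>
          "tokenizer": "standard", <3>
          "filter": [ "lowercase", "asciifolding" ] <4>
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Hypothetical analyzer name, scoped to this index only.
<2> Character filters run first, on the raw string.
<3> The tokenizer splits the filtered string into tokens.
<4> Token filters are applied to the token stream in the order listed.
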
Analyzers can be specified per-query, per-field or per-index. At index time,
Elasticsearch will look for an analyzer in this order:

* The `analyzer` defined in the field mapping.
* An analyzer named `default` in the index settings.
* The <<analysis-standard-analyzer,`standard`>> analyzer.

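The index-time order above could be exercised with a request along these
lines. This is a sketch only: the index name `my_default_index` and the field
names `title` and `body` are hypothetical, and `simple` is just one possible
choice for the `default` analyzer.

[source,js]
--------------------------------------------------
PUT /my_default_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { <1>
          "type": "simple"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": { <2>
          "type": "text"
        },
        "body": { <3>
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> An analyzer named `default` in the index settings replaces the `standard` analyzer as the fallback for this index.
<2> The `title` field defines no `analyzer`, so it is indexed with the `default` analyzer.
<3> The `body` field defines its own `analyzer`, which takes precedence over `default`.
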
At query time, there are a few more layers:

* The `analyzer` defined in a <<full-text-queries,full-text query>>.
* The `search_analyzer` defined in the field mapping.
* The `analyzer` defined in the field mapping.
* An analyzer named `default_search` in the index settings.
* An analyzer named `default` in the index settings.
* The <<analysis-standard-analyzer,`standard`>> analyzer.

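To illustrate the query-time layers, the sketch below sets a `default_search`
analyzer in the index settings, a `search_analyzer` in a field mapping, and
then overrides both in a `match` query. The index name `my_search_index`, the
field names, and the particular analyzers are assumptions made for the
example.

[source,js]
--------------------------------------------------
PUT /my_search_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_search": { <1>
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "simple" <2>
        },
        "body": {
          "type": "text" <3>
        }
      }
    }
  }
}

GET /my_search_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "The quick Brown Foxes.",
        "analyzer": "english" <4>
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Fallback search analyzer for fields that define neither `search_analyzer` nor `analyzer`.
<2> Used for queries on `title` unless the query names its own analyzer.
<3> Queries on `body` fall through to `default_search`.
<4> The `analyzer` given in the query takes precedence over every other layer.
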
The easiest way to specify an analyzer for a particular field is to define it
in the field mapping, as follows:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": { <1>
          "type": "text",
          "fields": {
            "english": { <2>
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

GET my_index/_analyze <3>
{
  "field": "text",
  "text": "The quick Brown Foxes."
}

GET my_index/_analyze <4>
{
  "field": "text.english",
  "text": "The quick Brown Foxes."
}
--------------------------------------------------
// CONSOLE

<1> The `text` field uses the default `standard` analyzer.
<2> The `text.english` <<multi-fields,multi-field>> uses the `english` analyzer, which removes stop words and applies stemming.
<3> This returns the tokens: [ `the`, `quick`, `brown`, `foxes` ].
<4> This returns the tokens: [ `quick`, `brown`, `fox` ].

[[search-quote-analyzer]]
==== `search_quote_analyzer`

The `search_quote_analyzer` setting allows you to specify an analyzer for phrases.
This is particularly useful when you want to disable stop word removal for
phrase queries.

To disable stop word removal for phrases, a field needs three analyzer settings:

1. An `analyzer` setting for indexing all terms, including stop words.
2. A `search_analyzer` setting for non-phrase queries that removes stop words.
3. A `search_quote_analyzer` setting for phrase queries that does not remove stop words.

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        },
        "my_stop_analyzer": { <2>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer", <3>
          "search_analyzer": "my_stop_analyzer", <4>
          "search_quote_analyzer": "my_analyzer" <5>
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title": "The Quick Brown Fox"
}

PUT my_index/_doc/2
{
  "title": "A Quick Brown Fox"
}

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "\"the quick brown fox\"" <6>
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> The `my_analyzer` analyzer, which tokenizes all terms including stop words.
<2> The `my_stop_analyzer` analyzer, which removes stop words.
<3> The `analyzer` setting points to `my_analyzer`, which is used at index time.
<4> The `search_analyzer` setting points to `my_stop_analyzer`, which removes stop words from non-phrase queries.
<5> The `search_quote_analyzer` setting points to `my_analyzer`, which ensures that stop words are not removed from phrase queries.
<6> Because the query is wrapped in quotes, it is detected as a phrase query, so the `search_quote_analyzer` applies and
stop words are not removed from the query. The `my_analyzer` analyzer then returns the tokens [`the`, `quick`, `brown`, `fox`],
which match only one of the documents. Non-phrase queries, by contrast, are analyzed with `my_stop_analyzer`, which filters out
stop words, so a search for either `The quick brown fox` or `A quick brown fox` returns both documents, since both contain the
tokens [`quick`, `brown`, `fox`]. Without the `search_quote_analyzer`, exact matching for phrase queries would not be possible,
because the stop words would be removed from the phrase and both documents would match.

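For comparison, the non-phrase case described above can be sketched as the
same search without the surrounding quotes. The `default_field` parameter is
added here only to make the target field explicit; it is not part of the
original request.

[source,js]
--------------------------------------------------
GET my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "the quick brown fox" <1>
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

<1> With no quotes, the query is analyzed with the `search_analyzer` (`my_stop_analyzer`), so `the` is dropped and, as described above, both documents are expected to match.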