[[analyzer]]
=== `analyzer`

The values of <<mapping-index,`analyzed`>> string fields are passed through an
<<analysis,analyzer>> to convert the string into a stream of _tokens_ or
_terms_. For instance, the string `"The quick Brown Foxes."` may, depending
on which analyzer is used, be analyzed to the tokens: `quick`, `brown`,
`fox`. These are the actual terms that are indexed for the field, which makes
it possible to search efficiently for individual words _within_ big blobs of
text.

This analysis process needs to happen not just at index time, but also at
query time: the query string needs to be passed through the same (or a
similar) analyzer so that the terms that it tries to find are in the same
format as those that exist in the index.
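
To see this conversion directly, a string can be sent to the `_analyze` API together with the name of an analyzer. The request below is a minimal sketch that assumes a running cluster and uses the built-in `english` analyzer. For the example string it should return the tokens `quick`, `brown`, `fox`.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer": "english",
  "text": "The quick Brown Foxes."
}
--------------------------------------------------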

Elasticsearch ships with a number of <<analysis-analyzers,pre-defined analyzers>>,
which can be used without further configuration. It also ships with many
<<analysis-charfilters,character filters>>, <<analysis-tokenizers,tokenizers>>,
and <<analysis-tokenfilters,token filters>>, which can be combined to configure
custom analyzers per index.

Analyzers can be specified per-query, per-field or per-index. At index time,
Elasticsearch will look for an analyzer in this order:

* The `analyzer` defined in the field mapping.
* An analyzer named `default` in the index settings.
* The <<analysis-standard-analyzer,`standard`>> analyzer.
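
The index-time order above can be sketched with an index that defines an analyzer named `default` in its settings. The index name `my_default_index` and the field names `title` and `body` are hypothetical, chosen only for illustration. Under the order above, `body` should be indexed with its own `english` analyzer, while `title`, which declares no `analyzer`, should fall back to the `default` analyzer (here the built-in `simple` analyzer) rather than to `standard`.

[source,console]
--------------------------------------------------
PUT /my_default_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "body": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
--------------------------------------------------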

At query time, there are a few more layers, checked in this order:

* The `analyzer` defined in a <<full-text-queries,full-text query>>.
* The `search_analyzer` defined in the field mapping.
* The `analyzer` defined in the field mapping.
* An analyzer named `default_search` in the index settings.
* An analyzer named `default` in the index settings.
* The <<analysis-standard-analyzer,`standard`>> analyzer.
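
The query-time layers can be illustrated with a small sketch. The index name `my_search_index` and its fields are hypothetical, and the analyzers used (`stop`, `simple`, `whitespace`) are built-in analyzers picked only to make each layer visible. In the first request, `title` sets both `analyzer` and `search_analyzer`, while `body` sets neither, so searches on `body` should fall back to `default_search`. The second request overrides all of these by passing `analyzer` in the `match` query itself.

[source,console]
--------------------------------------------------
PUT /my_search_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_search": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      },
      "body": {
        "type": "text"
      }
    }
  }
}

GET /my_search_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "The quick Brown Foxes",
        "analyzer": "stop"
      }
    }
  }
}
--------------------------------------------------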

The easiest way to specify an analyzer for a particular field is to define it
in the field mapping, as follows:

[source,console]
--------------------------------------------------
PUT /my_index
{
  "mappings": {
    "properties": {
      "text": { <1>
        "type": "text",
        "fields": {
          "english": { <2>
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}

GET my_index/_analyze <3>
{
  "field": "text",
  "text": "The quick Brown Foxes."
}

GET my_index/_analyze <4>
{
  "field": "text.english",
  "text": "The quick Brown Foxes."
}
--------------------------------------------------

<1> The `text` field uses the default `standard` analyzer.
<2> The `text.english` <<multi-fields,multi-field>> uses the `english` analyzer, which removes stop words and applies stemming.
<3> This returns the tokens: [ `the`, `quick`, `brown`, `foxes` ].
<4> This returns the tokens: [ `quick`, `brown`, `fox` ].

[[search-quote-analyzer]]
==== `search_quote_analyzer`

The `search_quote_analyzer` setting allows you to specify an analyzer for phrases. This is particularly useful for disabling
stop word removal for phrase queries.

To disable stop word removal for phrases, a field needs three analyzer settings:

1. An `analyzer` setting for indexing all terms, including stop words
2. A `search_analyzer` setting for non-phrase queries that removes stop words
3. A `search_quote_analyzer` setting for phrase queries that does not remove stop words

[source,console]
--------------------------------------------------
PUT my_index
{
  "settings":{
    "analysis":{
      "analyzer":{
        "my_analyzer":{ <1>
          "type":"custom",
          "tokenizer":"standard",
          "filter":[
            "lowercase"
          ]
        },
        "my_stop_analyzer":{ <2>
          "type":"custom",
          "tokenizer":"standard",
          "filter":[
            "lowercase",
            "english_stop"
          ]
        }
      },
      "filter":{
        "english_stop":{
          "type":"stop",
          "stopwords":"_english_"
        }
      }
    }
  },
  "mappings":{
    "properties":{
      "title": {
        "type":"text",
        "analyzer":"my_analyzer", <3>
        "search_analyzer":"my_stop_analyzer", <4>
        "search_quote_analyzer":"my_analyzer" <5>
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title":"The Quick Brown Fox"
}

PUT my_index/_doc/2
{
  "title":"A Quick Brown Fox"
}

GET my_index/_search
{
  "query":{
    "query_string":{
      "query":"\"the quick brown fox\"" <6>
    }
  }
}
--------------------------------------------------

<1> The `my_analyzer` analyzer, which tokenizes all terms including stop words
<2> The `my_stop_analyzer` analyzer, which removes stop words
<3> The `analyzer` setting points to the `my_analyzer` analyzer, which is used at index time
<4> The `search_analyzer` setting points to the `my_stop_analyzer` analyzer, which removes stop words for non-phrase queries
<5> The `search_quote_analyzer` setting points to the `my_analyzer` analyzer, which ensures that stop words are not removed from phrase queries
<6> Since the query is wrapped in quotes, it is detected as a phrase query, so the `search_quote_analyzer` kicks in and ensures that the stop words
are not removed from the query. The `my_analyzer` analyzer then returns the tokens [`the`, `quick`, `brown`, `fox`], which match only one
of the documents. Meanwhile, non-phrase queries are analyzed with the `my_stop_analyzer` analyzer, which filters out stop words. So a search for either
`The quick brown fox` or `A quick brown fox` returns both documents, since both documents contain the tokens [`quick`, `brown`, `fox`].
Without the `search_quote_analyzer` it would not be possible to do exact matches for phrase queries, as the stop words would be
removed from phrase queries, resulting in both documents matching.
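
For contrast, the same search can be sent without the surrounding quotes. This is a sketch against the `my_index` example above, not an additional requirement of the setting. Because the query is no longer a phrase, the `search_analyzer` (`my_stop_analyzer`) should be applied, the stop word `the` should be dropped, and both documents are expected to match.

[source,console]
--------------------------------------------------
GET my_index/_search
{
  "query":{
    "query_string":{
      "query":"the quick brown fox"
    }
  }
}
--------------------------------------------------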