[[analyzer]]
=== `analyzer`

The values of <<mapping-index,`analyzed`>> string fields are passed through an
<<analysis,analyzer>> to convert the string into a stream of _tokens_ or
_terms_. For instance, the string `"The quick Brown Foxes."` may, depending
on which analyzer is used, be analyzed to the tokens: `quick`, `brown`,
`fox`. These are the actual terms that are indexed for the field, which makes
it possible to search efficiently for individual words _within_ big blobs of
text.

This analysis process needs to happen not just at index time, but also at
query time: the query string needs to be passed through the same (or a
similar) analyzer so that the terms that it tries to find are in the same
format as those that exist in the index.

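
As a rough illustration of this process, the `_analyze` API can show the terms
that an analyzer produces for a given string. The request below is a sketch
that assumes the request-body form of `_analyze` and the built-in `english`
analyzer; with that analyzer, the sample string should come back as the tokens
`quick`, `brown`, `fox`.

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer": "english",
  "text": "The quick Brown Foxes."
}
--------------------------------------------------

Running the same request with a different analyzer name is a quick way to
compare how index-time and query-time analysis would treat the same text.
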
Elasticsearch ships with a number of <<analysis-analyzers,pre-defined analyzers>>,
which can be used without further configuration. It also ships with many
<<analysis-charfilters,character filters>>, <<analysis-tokenizers,tokenizers>>,
and <<analysis-tokenfilters,token filters>>, which can be combined to configure
custom analyzers per index.

Analyzers can be specified per-query, per-field or per-index. At index time,
Elasticsearch will look for an analyzer in this order:

* The `analyzer` defined in the field mapping.
* An analyzer named `default` in the index settings.
* The <<analysis-standard-analyzer,`standard`>> analyzer.

At query time, there are a few more layers:

* The `analyzer` defined in a <<full-text-queries,full-text query>>.
* The `search_analyzer` defined in the field mapping.
* The `analyzer` defined in the field mapping.
* An analyzer named `default_search` in the index settings.
* An analyzer named `default` in the index settings.
* The <<analysis-standard-analyzer,`standard`>> analyzer.

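
The index-level and query-level layers in these lists can be sketched as
follows. The index name `my_defaults_index`, the field name `text`, and the
choice of the built-in `simple`, `whitespace` and `english` analyzers are
assumptions made purely for illustration.

[source,js]
--------------------------------------------------
PUT my_defaults_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { <1>
          "type": "simple"
        },
        "default_search": { <2>
          "type": "whitespace"
        }
      }
    }
  }
}

GET my_defaults_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "The quick Brown Foxes.",
        "analyzer": "english" <3>
      }
    }
  }
}
--------------------------------------------------
<1> The index-time fallback for any field whose mapping does not define an `analyzer`.
<2> The query-time fallback when neither the query nor the field mapping specifies an analyzer.
<3> An `analyzer` given in the query itself sits at the top of the query-time list, so it should take precedence over both index-level defaults.
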
The easiest way to specify an analyzer for a particular field is to define it
in the field mapping, as follows:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": { <1>
          "type": "text",
          "fields": {
            "english": { <2>
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

GET my_index/_analyze?field=text <3>
{
  "text": "The quick Brown Foxes."
}

GET my_index/_analyze?field=text.english <4>
{
  "text": "The quick Brown Foxes."
}
--------------------------------------------------
// AUTOSENSE
<1> The `text` field uses the default `standard` analyzer.
<2> The `text.english` <<multi-fields,multi-field>> uses the `english` analyzer, which removes stop words and applies stemming.
<3> This returns the tokens: [ `the`, `quick`, `brown`, `foxes` ].
<4> This returns the tokens: [ `quick`, `brown`, `fox` ].

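
The effect of the two analyzers on search can be sketched with the requests
below. They assume that `my_index` was created with the mapping shown above;
the document ID `1` and the `refresh=true` parameter (used so that the document
is searchable immediately) are choices made only for this illustration.

[source,js]
--------------------------------------------------
PUT my_index/my_type/1?refresh=true
{
  "text": "The quick Brown Foxes."
}

GET my_index/_search
{
  "query": {
    "match": { "text": "fox" } <1>
  }
}

GET my_index/_search
{
  "query": {
    "match": { "text.english": "fox" } <2>
  }
}
--------------------------------------------------
<1> The `text` field indexed the term `foxes`, so a search for `fox` is not expected to match.
<2> The `text.english` field indexed the stemmed term `fox`, so the same search is expected to find the document.
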
[[search-quote-analyzer]]
==== `search_quote_analyzer`

The `search_quote_analyzer` setting allows you to specify an analyzer for
phrases. This is particularly useful when stop word removal needs to be
disabled for phrase queries.

To disable stop word removal for phrases, a field needs three analyzer settings:

1. An `analyzer` setting for indexing all terms, including stop words.
2. A `search_analyzer` setting for non-phrase queries, which removes stop words.
3. A `search_quote_analyzer` setting for phrase queries, which does not remove stop words.

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        },
        "my_stop_analyzer": { <2>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer", <3>
          "search_analyzer": "my_stop_analyzer", <4>
          "search_quote_analyzer": "my_analyzer" <5>
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE

[source,js]
--------------------------------------------------
PUT my_index/my_type/1
{
  "title": "The Quick Brown Fox"
}

PUT my_index/my_type/2
{
  "title": "A Quick Brown Fox"
}

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "query": "\"the quick brown fox\"" <6>
    }
  }
}
--------------------------------------------------
<1> The `my_analyzer` analyzer, which tokenizes all terms, including stop words.
<2> The `my_stop_analyzer` analyzer, which removes stop words.
<3> The `analyzer` setting, which points to the `my_analyzer` analyzer used at index time.
<4> The `search_analyzer` setting, which points to the `my_stop_analyzer` analyzer and removes stop words for non-phrase queries.
<5> The `search_quote_analyzer` setting, which points to the `my_analyzer` analyzer and ensures that stop words are not removed from phrase queries.
<6> Since the query is wrapped in quotes, it is detected as a phrase query, so the `search_quote_analyzer` kicks in and ensures that stop words
are not removed from the query. The `my_analyzer` analyzer then returns the tokens [`the`, `quick`, `brown`, `fox`], which match one
of the documents. Non-phrase queries, meanwhile, are analyzed with the `my_stop_analyzer` analyzer, which filters out stop words. A search for either
`The quick brown fox` or `A quick brown fox` therefore returns both documents, since both documents contain the tokens [`quick`, `brown`, `fox`].
Without the `search_quote_analyzer`, exact matching for phrase queries would not be possible: the stop words would be removed from the phrase,
and both documents would match.

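
For contrast, the non-phrase case can be sketched with the requests below. They
assume the index and the two documents from the example above. The explicit
`default_field` is an addition made for this illustration, so that the query
targets the `title` field rather than whatever default field the
`query_string` query would otherwise use.

[source,js]
--------------------------------------------------
GET my_index/_analyze
{
  "analyzer": "my_stop_analyzer", <1>
  "text": "The Quick Brown Fox"
}

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_field": "title", <2>
      "query": "the quick brown fox" <3>
    }
  }
}
--------------------------------------------------
<1> With the stop filter applied, this should return the tokens [`quick`, `brown`, `fox`].
<2> Restricts the query to the `title` field, which carries the three analyzer settings.
<3> Without the surrounding quotes the query is not a phrase query, so the `search_analyzer` applies and both documents are expected to match.
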