[[analysis-stop-analyzer]]
=== Stop analyzer
++++
<titleabbrev>Stop</titleabbrev>
++++

The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analyzer>>
but adds support for removing stop words. It defaults to using the
`_english_` stop words.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 5
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 8
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
---------------------------

[float]
=== Configuration

The `stop` analyzer accepts the following parameters:

[horizontal]
`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_english_`.

`stopwords_path`::

    The path to a file containing stop words. This path is relative to the
    Elasticsearch `config` directory.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
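
For example, here is a minimal sketch of an analyzer that reads its stop words
from a file. The index name and file path are hypothetical; the file must
exist under the `config` directory on every node:

[source,console]
----------------------------
PUT my_stopword_file_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "stopwords/my_stopwords.txt" <1>
        }
      }
    }
  }
}
----------------------------
// TEST[skip:requires a stop words file on disk]

<1> A hypothetical path, relative to the `config` directory. The file should
contain one stop word per line.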

[float]
=== Example configuration

In this example, we configure the `stop` analyzer to use a specified list of
words as stop words:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 8
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

If you need to customize the `stop` analyzer beyond its configuration
parameters, you must recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following example recreates the
built-in `stop` analyzer, which you can use as a starting point for further
customization:

[source,console]
----------------------------------------------------
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" <1>
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]

<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> You'd add any token filters after `english_stop`.
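
For instance, you could append a `snowball` token filter after `english_stop`
to stem the tokens that survive stop word removal. This is an illustrative
sketch, not part of the built-in definition; the index, filter, and analyzer
names are made up:

[source,console]
----------------------------------------------------
PUT /stop_snowball_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_snowball": {
          "type": "snowball",
          "language": "English" <1>
        }
      },
      "analyzer": {
        "stop_with_stemming": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop",
            "english_snowball"
          ]
        }
      }
    }
  }
}
----------------------------------------------------

<1> An illustrative choice; any token filters can be appended after
`english_stop`. With this analyzer, `jumped` in the example sentence would be
stemmed to `jump`.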