[[analysis-stop-analyzer]]
=== Stop Analyzer

The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analyzer>>
but adds support for removing stop words. It defaults to using the
`_english_` stop words.

[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 5
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 8
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////
The above sentence would produce the following terms:

[source,text]
---------------------------
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
---------------------------
[float]
=== Configuration

The `stop` analyzer accepts the following parameters:

[horizontal]
`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_english_`.

`stopwords_path`::

    The path to a file containing stop words. This path is relative to the
    Elasticsearch `config` directory.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
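
As a sketch of the `stopwords_path` option, the request below assumes a
hypothetical file `analysis/my_stopwords.txt` (one stop word per line) has
been placed under the Elasticsearch `config` directory; the analyzer and
file names here are illustrative, not part of the built-in configuration:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}
----------------------------

Using a file is convenient for long stop word lists, but note that the file
must be present on every node in the cluster.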
[float]
=== Example configuration

In this example, we configure the `stop` analyzer to use a specified list of
words as stop words:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
// CONSOLE
/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 8
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////
The above example produces the following terms:

[source,text]
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------
[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

If you need to customize the `stop` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`stop` analyzer and you can use it as a starting point for further
customization:
[source,js]
----------------------------------------------------
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" <1>
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
<1> The default stopwords can be overridden with the `stopwords`
    or `stopwords_path` parameters.
<2> You'd add any token filters after `english_stop`.
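
For example, one way to extend the rebuilt analyzer is to append a stemming
step. The sketch below adds the built-in `stemmer` token filter after
`english_stop`; the index and filter names (`stop_stem_example`,
`english_stemmer`) are illustrative choices, not part of the built-in
`stop` analyzer:

[source,js]
----------------------------------------------------
PUT /stop_stem_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "stop_and_stem": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
----------------------------------------------------

Because stop word removal runs first, the stemmer only sees the remaining
tokens, which is the usual ordering for this kind of chain.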