[[analysis-standard-analyzer]]
=== Standard Analyzer

The `standard` analyzer is the default analyzer, used when none is
specified. It provides grammar-based tokenization (based on the Unicode Text
Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------
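
Because the tokenizer follows the Unicode word boundary rules rather than
simply splitting on whitespace, the `standard` analyzer also copes with
scripts that do not separate words with spaces. As an illustrative sketch
(this example is not part of the documented test suite, and the exact token
types depend on the underlying Lucene version), a run of Han ideographs is
emitted as one token per ideograph, each with type `<IDEOGRAPHIC>`:

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "standard",
  "text": "你好世界"
}
---------------------------

This produces the single-character terms `[ 你, 好, 世, 界 ]`. For serious
CJK search you would usually reach for a dedicated language analyzer or
analysis plugin instead.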

[float]
=== Configuration

The `standard` analyzer accepts the following parameters:

[horizontal]
`max_token_length`::

    The maximum token length. If a token is seen that exceeds this length then
    it is split at `max_token_length` intervals. Defaults to `255`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words, as sketched below.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
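
As a minimal sketch of `stopwords_path` (the index name, analyzer name, and
stop words file here are hypothetical), the file is expected to contain one
stop word per line, at a path that is absolute or relative to the
Elasticsearch config directory:

[source,js]
----------------------------
PUT my_stopword_file_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_analyzer": {
          "type": "standard",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}
----------------------------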

[float]
=== Example configuration

In this example, we configure the `standard` analyzer to have a
`max_token_length` of 5 (for demonstration purposes), and to use the
pre-defined list of English stop words:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumpe",
      "start_offset": 24,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "d",
      "start_offset": 29,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 11
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms. Note that `jumped` is split
at the five-character limit into `jumpe` and `d`, and that removing the stop
words `The` and `the` leaves gaps at positions `0` and `8`:

[source,text]
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

[float]
=== Definition

The `standard` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `standard` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. The following example recreates the
built-in `standard` analyzer, and you can use it as a starting point:

[source,js]
----------------------------------------------------
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase"       <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]

<1> You'd add any token filters after `lowercase`.
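
For example, here is a sketch of that kind of customization, appending the
<<analysis-stop-tokenfilter,Stop Token Filter>> (which defaults to the
`_english_` stop word list) after `lowercase`; the index and analyzer names
are hypothetical:

[source,js]
----------------------------------------------------
PUT /standard_with_stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------

Configured this way, the analyzer behaves like the `my_english_analyzer`
example above, minus the `max_token_length` limit of 5.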