[[analysis-standard-analyzer]]
=== Standard Analyzer

The `standard` analyzer is the default analyzer which is used if none is
specified. It provides grammar-based tokenization (based on the Unicode Text
Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
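
If you need to go beyond the configuration parameters described below, the
`standard` analyzer can be approximated as a `custom` analyzer built from the
same parts. The following is a minimal sketch; the index name
`standard_example` and analyzer name `rebuilt_standard` are illustrative:

[source,js]
----------------------------
PUT standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------

The built-in version also wires in the
<<analysis-stop-tokenfilter,Stop Token Filter>>, but as it is disabled by
default it is omitted from this sketch.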

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------

[float]
=== Configuration

The `standard` analyzer accepts the following parameters:

[horizontal]
`max_token_length`::

    The maximum token length. If a token is seen that exceeds this length then
    it is split at `max_token_length` intervals. Defaults to `255`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
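
For example, assuming a hypothetical stop words file at
`analysis/my_stopwords.txt` (one word per line, resolved relative to the
`config` directory), it could be referenced like this sketch:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_analyzer": {
          "type": "standard",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}
----------------------------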

[float]
=== Example configuration

In this example, we configure the `standard` analyzer to have a
`max_token_length` of 5 (for demonstration purposes), and to use the
pre-defined list of English stop words:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumpe",
      "start_offset": 24,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "d",
      "start_offset": 29,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 11
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------