[[analysis-pattern-tokenizer]]
=== Pattern Tokenizer

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

The default pattern is `\W+`, which splits text whenever it encounters
non-word characters.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "foo_bar_size",
      "start_offset": 4,
      "end_offset": 16,
      "type": "word",
      "position": 1
    },
    {
      "token": "s",
      "start_offset": 17,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "default",
      "start_offset": 19,
      "end_offset": 26,
      "type": "word",
      "position": 3
    },
    {
      "token": "is",
      "start_offset": 27,
      "end_offset": 29,
      "type": "word",
      "position": 4
    },
    {
      "token": "5",
      "start_offset": 30,
      "end_offset": 31,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ The, foo_bar_size, s, default, is, 5 ]
---------------------------

[float]
=== Configuration

The `pattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::

    A http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.

`flags`::

    Java regular expression http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
    Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`. See the sketch after this list.

`group`::

    Which capture group to extract as tokens. Defaults to `-1` (split).
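
The `flags` parameter is not exercised by the examples below, so here is a
minimal sketch of how it might be used (the index name, pattern, and sample
text are only illustrative): a tokenizer that splits on the letter `x`
regardless of case, by combining the pattern `x` with the `CASE_INSENSITIVE`
flag.

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "x",
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "AxBXC"
}
----------------------------
// CONSOLE

This request should return the terms `[ A, B, C ]`, since the
`CASE_INSENSITIVE` flag makes both `x` and `X` act as separators.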

[float]
=== Example configuration

In this example, we configure the `pattern` tokenizer to break text into
tokens when it encounters commas:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "comma",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "separated",
      "start_offset": 6,
      "end_offset": 15,
      "type": "word",
      "position": 1
    },
    {
      "token": "values",
      "start_offset": 16,
      "end_offset": 22,
      "type": "word",
      "position": 2
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ comma, separated, values ]
---------------------------

In the next example, we configure the `pattern` tokenizer to capture values
enclosed in double quotes (ignoring embedded escaped quotes `\"`). The regex
itself looks like this:

    "((?:\\"|[^"]|\\")+)"

And reads as follows:

* A literal `"`
* Start capturing:
** A literal `\"` OR any character except `"`
** Repeat until no more characters match
* A literal closing `"`

When the pattern is specified in JSON, the `"` and `\` characters need to be
escaped, so the pattern ends up looking like:

    \"((?:\\\\\"|[^\"]|\\\\\")+)\"

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "value",
      "start_offset": 1,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "value with embedded \\\" quote",
      "start_offset": 10,
      "end_offset": 38,
      "type": "word",
      "position": 1
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following two terms:

[source,text]
---------------------------
[ value, value with embedded \" quote ]
---------------------------