
[[analysis-pattern-tokenizer]]
=== Pattern Tokenizer

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

The default pattern is `\W+`, which splits text whenever it encounters
non-word characters.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "foo_bar_size",
      "start_offset": 4,
      "end_offset": 16,
      "type": "word",
      "position": 1
    },
    {
      "token": "s",
      "start_offset": 17,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "default",
      "start_offset": 19,
      "end_offset": 26,
      "type": "word",
      "position": 3
    },
    {
      "token": "is",
      "start_offset": 27,
      "end_offset": 29,
      "type": "word",
      "position": 4
    },
    {
      "token": "5",
      "start_offset": 30,
      "end_offset": 31,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ The, foo_bar_size, s, default, is, 5 ]
---------------------------

[float]
=== Configuration

The `pattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::

    A http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.

`flags`::

    Java regular expression http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
    Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`.

`group`::

    Which capture group to extract as tokens. Defaults to `-1` (split).
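
As a sketch of the `flags` parameter (the pattern and the sample text here
are purely illustrative, not part of the examples below), the following
configuration uses `CASE_INSENSITIVE` so that the separator pattern matches
the words `and` or `or` in any case:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\\s+(?:and|or)\\s+",
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "black AND white OR red"
}
----------------------------

This request should produce the terms `[ black, white, red ]`, since the
case-insensitive separator pattern consumes `AND` and `OR` along with the
surrounding whitespace.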

[float]
=== Example configuration

In this example, we configure the `pattern` tokenizer to break text into
tokens when it encounters commas:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "comma",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "separated",
      "start_offset": 6,
      "end_offset": 15,
      "type": "word",
      "position": 1
    },
    {
      "token": "values",
      "start_offset": 16,
      "end_offset": 22,
      "type": "word",
      "position": 2
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ comma, separated, values ]
---------------------------

In the next example, we configure the `pattern` tokenizer to capture values
enclosed in double quotes (ignoring embedded escaped quotes `\"`). The regex
itself looks like this:

    "((?:\\"|[^"]|\\")+)"

And reads as follows:

* A literal `"`
* Start capturing:
** A literal `\"` OR any character except `"`
** Repeat until no more characters match
* A literal closing `"`

When the pattern is specified in JSON, the `"` and `\` characters need to be
escaped, so the pattern ends up looking like:

    \"((?:\\\\\"|[^\"]|\\\\\")+)\"

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "value",
      "start_offset": 1,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "value with embedded \\\" quote",
      "start_offset": 10,
      "end_offset": 38,
      "type": "word",
      "position": 1
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following two terms:

[source,text]
---------------------------
[ value, value with embedded \" quote ]
---------------------------