
[[analysis-pattern-analyzer]]
=== Pattern Analyzer

The `pattern` analyzer uses a regular expression to split the text into terms.
The regular expression should match the *token separators*, not the tokens
themselves. The regular expression defaults to `\W+` (or all non-word characters).

[WARNING]
.Beware of Pathological Regular Expressions
========================================

The pattern analyzer uses
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].

A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.

Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].

========================================

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 8
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 9
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 11
    }
  ]
}
----------------------------
/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

[float]
=== Configuration

The `pattern` analyzer accepts the following parameters:

[horizontal]
`pattern`::

    A http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.

`flags`::

    Java regular expression http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
    Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.

`lowercase`::

    Should terms be lowercased or not. Defaults to `true`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
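
For instance, these parameters could be combined to build an analyzer that
splits on commas and removes English stop words. This is a minimal sketch only;
the `my_csv_index` index and `comma_stop_analyzer` analyzer names are made up
for illustration:

[source,console]
----------------------------
PUT my_csv_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_stop_analyzer": {
          "type": "pattern",
          "pattern": ",\\s*", <1>
          "lowercase": true,
          "stopwords": "_english_" <2>
        }
      }
    }
  }
}
----------------------------

<1> Split on a comma followed by optional whitespace.
<2> Remove English stop words after the terms have been lowercased.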

[float]
=== Example configuration

In this example, we configure the `pattern` analyzer to split email addresses
on non-word characters or on underscores (`\W|_`), and to lower-case the result:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_", <1>
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
----------------------------

<1> The backslashes in the pattern need to be escaped when specifying the
    pattern as a JSON string.

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "foo",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "bar",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "com",
      "start_offset": 19,
      "end_offset": 22,
      "type": "word",
      "position": 4
    }
  ]
}
----------------------------
/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ john, smith, foo, bar, com ]
---------------------------

[float]
==== CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

[source,console]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
--------------------------------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "moose",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ftp",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "class",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "2",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "beta",
      "start_offset": 18,
      "end_offset": 22,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ moose, x, ftp, class, 2, beta ]
---------------------------

The regex above is easier to understand as:

[source,regex]
--------------------------------------------------

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )

--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters, then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. The following example recreates the
built-in `pattern` analyzer, and you can use it as a starting point for
further customization:

[source,console]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]

<1> The default pattern is `\W+`, which splits on non-word characters,
    and this is where you'd change it.
<2> You'd add other token filters after `lowercase`.
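
As a sketch of <2>, the stop words support that the built-in `pattern` analyzer
ships with (disabled by default) could be re-enabled in the rebuilt version by
appending a `stop` token filter after `lowercase`. The `pattern_example_with_stop`
index and `english_stop` filter names below are made up for illustration:

[source,console]
----------------------------------------------------
PUT /pattern_example_with_stop
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_pattern_with_stop": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------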