[[analysis-pattern-analyzer]]
=== Pattern Analyzer

The `pattern` analyzer uses a regular expression to split the text into terms.
The regular expression should match the *token separators*, not the tokens
themselves. The regular expression defaults to `\W+` (or all non-word characters).

[WARNING]
.Beware of Pathological Regular Expressions
========================================

The pattern analyzer uses
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].

A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.

Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].

========================================

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 8
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 9
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 11
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

[float]
=== Configuration

The `pattern` analyzer accepts the following parameters:

[horizontal]
`pattern`::

    A http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.

`flags`::

    Java regular expression http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
    Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"` (a combined configuration is sketched after this list).

`lowercase`::

    Whether terms should be lowercased. Defaults to `true`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
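
The examples on this page only set `pattern` and `lowercase`. As a minimal
sketch of how the remaining parameters combine, `flags` and `stopwords` can be
passed alongside the pattern in the same analyzer definition (the index name
`flags_example` and analyzer name `my_flagged_analyzer` below are purely
illustrative):

[source,js]
----------------------------
PUT flags_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_flagged_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "flags": "CASE_INSENSITIVE|COMMENTS",
          "lowercase": true,
          "stopwords": "_english_"
        }
      }
    }
  }
}
----------------------------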

[float]
=== Example configuration

In this example, we configure the `pattern` analyzer to split email addresses
on non-word characters or on underscores (`\W|_`), and to lower-case the result:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_", <1>
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
----------------------------
// CONSOLE

<1> The backslashes in the pattern need to be escaped when specifying the
    pattern as a JSON string.

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "john",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "foo",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "bar",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "com",
      "start_offset": 19,
      "end_offset": 22,
      "type": "word",
      "position": 4
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ john, smith, foo, bar, com ]
---------------------------

[float]
==== CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
--------------------------------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "moose",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ftp",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "class",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "2",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "beta",
      "start_offset": 18,
      "end_offset": 22,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ moose, x, ftp, class, 2, beta ]
---------------------------

The regex above is easier to understand as:

[source,regex]
--------------------------------------------------

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )

--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. The following example recreates the
built-in `pattern` analyzer, which you can use as a starting point for
further customization:

[source,js]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]

<1> The default pattern is `\W+` which splits on non-word characters
and this is where you'd change it.
<2> You'd add other token filters after `lowercase`.
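
For example, to also remove English stop words you could append the built-in
`stop` token filter after `lowercase`. This is only a sketch; the index name
`pattern_stop_example` and the filter name `english_stop` are illustrative:

[source,js]
----------------------------------------------------
PUT /pattern_stop_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------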