[[analysis-pattern-analyzer]]
=== Pattern Analyzer

The `pattern` analyzer uses a regular expression to split the text into terms.
The regular expression should match the *token separators*, not the tokens
themselves. It defaults to `\W+`, which splits on all non-word characters.

[WARNING]
.Beware of Pathological Regular Expressions
========================================

The pattern analyzer uses
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].

A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.

Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].

========================================

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
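
If the built-in configuration parameters are not flexible enough, the same
behaviour can be recreated as a `custom` analyzer built from these pieces.
The following is a minimal sketch of one way to do so; the index name
`pattern_example` and the tokenizer and analyzer names are illustrative,
not part of the built-in analyzer:

[source,js]
----------------------------
PUT pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------
// CONSOLE

To enable the (disabled by default) stop word removal, add a `stop` token
filter after `lowercase` in the `filter` array.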

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 8
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 9
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 11
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

[float]
=== Configuration

The `pattern` analyzer accepts the following parameters:

[horizontal]
`pattern`::

    A http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.

`flags`::

    Java regular expression http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
    Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.

`lowercase`::

    Whether terms should be lowercased. Defaults to `true`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
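
For instance, the following sketch enables English stop word removal while
keeping the default pattern; the index and analyzer names here are
illustrative:

[source,js]
----------------------------
PUT my_stopword_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pattern_with_stopwords": {
          "type": "pattern",
          "stopwords": "_english_"
        }
      }
    }
  }
}
----------------------------
// CONSOLE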

[float]
=== Example configuration

In this example, we configure the `pattern` analyzer to split email addresses
on non-word characters or on underscores (`\W|_`), and to lower-case the result:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_", <1>
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
----------------------------
// CONSOLE

<1> The backslashes in the pattern need to be escaped when specifying the
    pattern as a JSON string.

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "john",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "foo",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "bar",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "com",
      "start_offset": 19,
      "end_offset": 22,
      "type": "word",
      "position": 4
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ john, smith, foo, bar, com ]
---------------------------

[float]
==== CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
--------------------------------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "moose",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ftp",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "class",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "2",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "beta",
      "start_offset": 18,
      "end_offset": 22,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ moose, x, ftp, class, 2, beta ]
---------------------------

The regex above is easier to understand as:

[source,js]
--------------------------------------------------
  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
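
Because the analyzer accepts Java regular expression `flags`, the commented
layout above can in principle be used directly by setting the `COMMENTS`
flag, which makes Java ignore whitespace and `#` comments inside the
pattern. The following is a sketch rather than a tested recipe: the
newlines must be encoded as `\n` inside the JSON string (since `#` comments
run to the end of a line), and the analyzer name is illustrative:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_commented": {
          "type": "pattern",
          "flags": "COMMENTS",
          "pattern": "([^\\p{L}\\d]+)  # swallow non letters and numbers,\n| (?<=\\D)(?=\\d)  # or non-number followed by number,\n| (?<=\\d)(?=\\D)  # or number followed by non-number,\n| (?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})  # or lower case followed by upper case,\n| (?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])  # or upper case followed by upper then lower case"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE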