[[analysis-nori]]
=== Korean (nori) analysis plugin

The Korean (nori) analysis plugin integrates the Lucene nori analysis
module into Elasticsearch. It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary]
to perform morphological analysis of Korean texts.

:plugin_name: analysis-nori
include::install_remove.asciidoc[]

[[analysis-nori-analyzer]]
==== `nori` analyzer

The `nori` analyzer consists of the following tokenizer and token filters:

* <<analysis-nori-tokenizer,`nori_tokenizer`>>
* <<analysis-nori-speech,`nori_part_of_speech`>> token filter
* <<analysis-nori-readingform,`nori_readingform`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `decompound_mode` and `user_dictionary` settings from
<<analysis-nori-tokenizer,`nori_tokenizer`>> and the `stoptags` setting from
<<analysis-nori-speech,`nori_part_of_speech`>>.
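
As a minimal, illustrative sketch of configuring the `nori` analyzer with these
settings (the index name `nori_korean_sample` and the analyzer name
`korean_analyzer` are arbitrary, and a `user_dictionary` pointing to a file in
the config directory could be added the same way):

[source,console]
--------------------------------------------------
PUT nori_korean_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "korean_analyzer": {
            "type": "nori",
            "decompound_mode": "mixed",
            "stoptags": [ "E", "J" ]
          }
        }
      }
    }
  }
}
--------------------------------------------------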

[[analysis-nori-tokenizer]]
==== `nori_tokenizer`

The `nori_tokenizer` accepts the following settings:

`decompound_mode`::
+
--
The decompound mode determines how the tokenizer handles compound tokens.
It can be set to:

`none`::
No decomposition for compounds. Example output:
+
    가거도항
    가곡역

`discard`::
Decomposes compounds and discards the original form (*default*). Example output:
+
    가곡역 => 가곡, 역

`mixed`::
Decomposes compounds and keeps the original form. Example output:
+
    가곡역 => 가곡역, 가곡, 역
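
To try the modes quickly, the tokenizer can also be defined inline in an
`_analyze` request. A minimal sketch using `mixed` mode, which emits the
original compound `가곡역` plus the components `가곡` and `역` (with `none` only
the original token is kept):

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": {
    "type": "nori_tokenizer",
    "decompound_mode": "mixed"
  },
  "text": "가곡역"
}
--------------------------------------------------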
--

`discard_punctuation`::
Whether punctuation should be discarded from the output. Defaults to `true`.

`lenient`::
Whether duplicate terms in the `user_dictionary` should be deduplicated.
Defaults to `false`, in which case duplicates generate an error.

`user_dictionary`::
+
--
The Nori tokenizer uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary] by default.
A `user_dictionary` with custom nouns (`NNG`) may be appended to the default dictionary.
The dictionary should have the following format:

[source,txt]
-----------------------
<token> [<token 1> ... <token n>]
-----------------------

The first token is mandatory and represents the custom noun that should be added in
the dictionary. For compound nouns the custom segmentation can be provided
after the first token (`[<token 1> ... <token n>]`). The segmentation of the
custom compound nouns is controlled by the `decompound_mode` setting.

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ko.txt`:

[source,txt]
-----------------------
c++                  <1>
C쁠쁠
세종
세종시 세종 시        <2>
-----------------------

<1> A simple noun
<2> A compound noun (`세종시`) followed by its decomposition: `세종` and `시`.

Then create an analyzer as follows:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ko.txt",
            "lenient": "true"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시"  <1>
}
--------------------------------------------------

<1> Sejong city

The above `analyze` request returns the following:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "세종시",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0,
    "positionLength" : 2    <1>
  }, {
    "token" : "세종",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "시",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }]
}
--------------------------------------------------

<1> This is a compound token that spans two positions (`mixed` mode).
--

`user_dictionary_rules`::
+
--
You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C쁠쁠", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
--------------------------------------------------
--
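
The inline rules behave the same as the file-based dictionary, so with
`decompound_mode` set to `mixed` the request below should return the compound
`세종시` together with its components `세종` and `시`, as in the file-based
example (a usage sketch reusing the `nori_sample` index and `my_analyzer`
names defined just above):

[source,console]
--------------------------------------------------
GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시"
}
--------------------------------------------------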

The `nori_tokenizer` sets a number of additional attributes per token that are
used by token filters to modify the stream.
You can view all these additional attributes with the following request:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는",   <1>
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
--------------------------------------------------

<1> A tree with deep roots

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "nori_tokenizer",
      "tokens": [
        {
          "token": "뿌리",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        },
        {
          "token": "가",
          "start_offset": 2,
          "end_offset": 3,
          "type": "word",
          "position": 1,
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "J(Ending Particle)"
        },
        {
          "token": "깊",
          "start_offset": 4,
          "end_offset": 5,
          "type": "word",
          "position": 2,
          "leftPOS": "VA(Adjective)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "VA(Adjective)"
        },
        {
          "token": "은",
          "start_offset": 5,
          "end_offset": 6,
          "type": "word",
          "position": 3,
          "leftPOS": "E(Verbal endings)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "E(Verbal endings)"
        },
        {
          "token": "나무",
          "start_offset": 7,
          "end_offset": 9,
          "type": "word",
          "position": 4,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        },
        {
          "token": "는",
          "start_offset": 9,
          "end_offset": 10,
          "type": "word",
          "position": 5,
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "J(Ending Particle)"
        }
      ]
    },
    "tokenfilters": []
  }
}
--------------------------------------------------

[[analysis-nori-speech]]
==== `nori_part_of_speech` token filter

The `nori_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. The list of supported tags and their meanings can be found here:
{lucene-core-javadoc}/../analysis/nori/org/apache/lucene/analysis/ko/POS.Tag.html[Part of speech tags]

It accepts the following setting:

`stoptags`::
An array of part-of-speech tags that should be removed.

It defaults to:

[source,js]
--------------------------------------------------
"stoptags": [
  "E",
  "IC",
  "J",
  "MAG", "MAJ", "MM",
  "SP", "SSC", "SSO", "SC", "SE",
  "XPN", "XSA", "XSN", "XSV",
  "UNA", "NA", "VSV"
]
--------------------------------------------------
// NOTCONSOLE

For example:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [
              "NR"   <1>
            ]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "여섯 용이"  <2>
}
--------------------------------------------------

<1> Korean numerals should be removed (`NR`)
<2> Six dragons

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "용",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------

[[analysis-nori-readingform]]
==== `nori_readingform` token filter

The `nori_readingform` token filter rewrites tokens written in Hanja to their Hangul form.

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": [ "nori_readingform" ]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "鄕歌"      <1>
}
--------------------------------------------------

<1> A token written in Hanja: Hyangga

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "향가",     <1>
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }]
}
--------------------------------------------------

<1> The Hanja form is replaced by the Hangul translation.

[[analysis-nori-number]]
==== `nori_number` token filter

The `nori_number` token filter normalizes Korean numbers
to regular Arabic decimal numbers in half-width characters.

Korean numbers are often written using a combination of Hangul and Arabic numbers with various kinds of punctuation.
For example, 3.2천 means 3200.
This filter performs this kind of normalization and allows a search for 3200 to match 3.2천 in text,
but it can also be used, for example, to build range aggregations on the normalized numbers.

[NOTE]
====
This filter uses a token composition scheme and relies on punctuation tokens
being present in the token stream.
Make sure your `nori_tokenizer` has `discard_punctuation` set to `false`.
If punctuation characters such as U+FF0E (.) are removed from the token stream,
this filter sees the input tokens 3 and 2천 and outputs 3 and 2000 instead of 3200,
which is likely not the intended result.

If you want to remove punctuation characters from your index that are not part of normalized numbers,
add a `stop` token filter with the punctuation you wish to remove after `nori_number` in your analyzer chain,
as sketched below.
====
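
A minimal sketch of such a chain, where the index name `nori_number_sample`,
the tokenizer and filter names, and the listed punctuation characters are
illustrative choices: `nori_number` runs first, so punctuation that is part of
a number (for example the dot in 3.2천) is consumed during normalization, and
the trailing `stop` filter only drops the remaining standalone punctuation
tokens.

[source,console]
--------------------------------------------------
PUT nori_number_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "punctuation_preserving_tokenizer",
            "filter": [ "nori_number", "punctuation_stop" ]
          }
        },
        "tokenizer": {
          "punctuation_preserving_tokenizer": {
            "type": "nori_tokenizer",
            "discard_punctuation": "false"
          }
        },
        "filter": {
          "punctuation_stop": {
            "type": "stop",
            "stopwords": [ ".", ",", "!" ]
          }
        }
      }
    }
  }
}
--------------------------------------------------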

Below are some examples of normalizations this filter supports.
The input is untokenized text and the result is the single term attribute emitted for the input.

- 영영칠 -> 7
- 일영영영 -> 1000
- 삼천2백2십삼 -> 3223
- 조육백만오천일 -> 1000006005001
- 3.2천 -> 3200
- 1.2만345.67 -> 12345.67
- 4,647.100 -> 4647.1
- 15,7 -> 157 (be aware of this weakness)

For example:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "tokenizer_discard_punctuation_false",
            "filter": [
              "part_of_speech_stop_sp", "nori_number"
            ]
          }
        },
        "tokenizer": {
          "tokenizer_discard_punctuation_false": {
            "type": "nori_tokenizer",
            "discard_punctuation": "false"
          }
        },
        "filter": {
          "part_of_speech_stop_sp": {
            "type": "nori_part_of_speech",
            "stoptags": ["SP"]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "십만이천오백과 3.2천"
}
--------------------------------------------------

Which results in:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [{
    "token" : "102500",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "과",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "3200",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "word",
    "position" : 2
  }]
}
--------------------------------------------------