[[analysis-nori]]
=== Korean (nori) Analysis Plugin

The Korean (nori) Analysis plugin integrates the Lucene nori analysis
module into Elasticsearch. It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary]
to perform morphological analysis of Korean texts.

:plugin_name: analysis-nori
include::install_remove.asciidoc[]

[[analysis-nori-analyzer]]
==== `nori` analyzer

The `nori` analyzer consists of the following tokenizer and token filters:

* <<analysis-nori-tokenizer,`nori_tokenizer`>>
* <<analysis-nori-speech,`nori_part_of_speech`>> token filter
* <<analysis-nori-readingform,`nori_readingform`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `decompound_mode` and `user_dictionary` settings from
<<analysis-nori-tokenizer,`nori_tokenizer`>> and the `stoptags` setting from
<<analysis-nori-speech,`nori_part_of_speech`>>.
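For instance, these settings can be passed directly to an analyzer of type `nori`.
The following is a minimal sketch rather than a tested snippet: the index name
`nori_sample` and the chosen `decompound_mode` and `stoptags` values are
illustrative assumptions.

[source,js]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_nori_analyzer": {
            "type": "nori",
            "decompound_mode": "mixed",  <1>
            "stoptags": ["J", "E"]       <2>
          }
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Forwarded to the underlying `nori_tokenizer`.
<2> Forwarded to the underlying `nori_part_of_speech` token filter.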
[[analysis-nori-tokenizer]]
==== `nori_tokenizer`

The `nori_tokenizer` accepts the following settings:

`decompound_mode`::
+
--

The decompound mode determines how the tokenizer handles compound tokens.
It can be set to:

`none`::

No decomposition for compounds. Example output:

    가거도항
    가곡역

`discard`::

Decomposes compounds and discards the original form (*default*). Example output:

    가곡역 => 가곡, 역

`mixed`::

Decomposes compounds and keeps the original form. Example output:

    가곡역 => 가곡역, 가곡, 역

An example request illustrating these modes follows the settings below.
--

`user_dictionary`::
+
--

The Nori tokenizer uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary] by default.
A `user_dictionary` with custom nouns (`NNG`) may be appended to the default dictionary.
The dictionary should have the following format:

[source,txt]
-----------------------
<token> [<token 1> ... <token n>]
-----------------------

The first token is mandatory and represents the custom noun that should be added to
the dictionary. For compound nouns, the custom segmentation can be provided
after the first token (`[<token 1> ... <token n>]`). The segmentation of the
custom compound nouns is controlled by the `decompound_mode` setting.
--
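The effect of `decompound_mode` can be checked directly with the `_analyze` API.
The following is a minimal sketch: the index name `nori_decompound_sample` and the
tokenizer name `nori_discard` are illustrative, and it assumes `가곡역` is present in
the default dictionary as a compound, as the example outputs above suggest.

[source,js]
--------------------------------------------------
PUT nori_decompound_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_discard": {
            "type": "nori_tokenizer",
            "decompound_mode": "discard"  <1>
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_discard"
          }
        }
      }
    }
  }
}

GET nori_decompound_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "가곡역"
}
--------------------------------------------------
// NOTCONSOLE
<1> `discard` is already the default; it is set explicitly here for illustration.

With `discard`, the request should return the tokens `가곡` and `역`; with `mixed`,
the original form `가곡역` would also be kept as a compound token spanning both positions.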
As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ko.txt`:

[source,txt]
-----------------------
c++ <1>
C샤프
세종
세종시 세종 시 <2>
-----------------------

<1> A simple noun
<2> A compound noun (`세종시`) followed by its decomposition: `세종` and `시`.

Then create an analyzer as follows:

[source,js]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시" <1>
}
--------------------------------------------------
// CONSOLE

<1> Sejong city

The above `analyze` request returns the following:
[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "세종시",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0,
    "positionLength" : 2    <1>
  }, {
    "token" : "세종",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "시",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }]
}
--------------------------------------------------
// TESTRESPONSE

<1> This is a compound token that spans two positions (`mixed` mode).
The `nori_tokenizer` sets a number of additional attributes per token that are used by token filters
to modify the stream.
You can view all these additional attributes with the following request:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는", <1>
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
--------------------------------------------------
// CONSOLE

<1> A tree with deep roots

Which responds with:

[source,js]
--------------------------------------------------
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "nori_tokenizer",
      "tokens": [
        {
          "token": "뿌리",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        },
        {
          "token": "가",
          "start_offset": 2,
          "end_offset": 3,
          "type": "word",
          "position": 1,
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "J(Ending Particle)"
        },
        {
          "token": "깊",
          "start_offset": 4,
          "end_offset": 5,
          "type": "word",
          "position": 2,
          "leftPOS": "VA(Adjective)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "VA(Adjective)"
        },
        {
          "token": "은",
          "start_offset": 5,
          "end_offset": 6,
          "type": "word",
          "position": 3,
          "leftPOS": "E(Verbal endings)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "E(Verbal endings)"
        },
        {
          "token": "나무",
          "start_offset": 7,
          "end_offset": 9,
          "type": "word",
          "position": 4,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        },
        {
          "token": "는",
          "start_offset": 9,
          "end_offset": 10,
          "type": "word",
          "position": 5,
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "J(Ending Particle)"
        }
      ]
    },
    "tokenfilters": []
  }
}
--------------------------------------------------
// TESTRESPONSE
[[analysis-nori-speech]]
==== `nori_part_of_speech` token filter

The `nori_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. The list of supported tags and their meanings can be found here:
{lucene-core-javadoc}/../analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html[Part of speech tags]

It accepts the following setting:

`stoptags`::

An array of part-of-speech tags that should be removed. It defaults to:

[source,js]
--------------------------------------------------
"stoptags": [
  "E",
  "IC",
  "J",
  "MAG", "MAJ", "MM",
  "SP", "SSC", "SSO", "SC", "SE",
  "XPN", "XSA", "XSN", "XSV",
  "UNA", "NA", "VSV"
]
--------------------------------------------------
// NOTCONSOLE
For example:

[source,js]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [
              "NR" <1>
            ]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "여섯 용이" <2>
}
--------------------------------------------------
// CONSOLE

<1> Korean numerals should be removed (`NR`)
<2> Six dragons

Which responds with:

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "용",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------
// TESTRESPONSE
[[analysis-nori-readingform]]
==== `nori_readingform` token filter

The `nori_readingform` token filter rewrites tokens written in Hanja to their Hangul form.

[source,js]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": ["nori_readingform"]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "鄕歌" <1>
}
--------------------------------------------------
// CONSOLE

<1> A token written in Hanja: Hyangga

Which responds with:

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "향가", <1>
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }]
}
--------------------------------------------------
// TESTRESPONSE

<1> The Hanja form is replaced by the Hangul translation.