[[analysis-kuromoji]]
=== Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis
module into Elasticsearch.
[[analysis-kuromoji-install]]
[float]
==== Installation

This plugin can be installed using the plugin manager:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin install analysis-kuromoji
----------------------------------------------------------------

The plugin must be installed on every node in the cluster, and each node must
be restarted after installation.
[[analysis-kuromoji-remove]]
[float]
==== Removal

The plugin can be removed with the following command:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin remove analysis-kuromoji
----------------------------------------------------------------

The node must be stopped before removing the plugin.
[[analysis-kuromoji-analyzer]]
==== `kuromoji` analyzer

The `kuromoji` analyzer consists of the following tokenizer and token filters:

* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
* <<analysis-kuromoji-stop,`ja_stop`>> token filter
* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `mode` and `user_dictionary` settings from
<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.
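A minimal sketch of passing one of these settings through the built-in analyzer
(the index and analyzer names here are illustrative):

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "mode": "search"
          }
        }
      }
    }
  }
}
--------------------------------------------------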
[[analysis-kuromoji-charfilter]]
==== `kuromoji_iteration_mark` character filter

The `kuromoji_iteration_mark` character filter normalizes Japanese horizontal
iteration marks (_odoriji_) to their expanded form. It accepts the following
settings:

`normalize_kanji`::

    Indicates whether kanji iteration marks should be normalized. Defaults to `true`.

`normalize_kana`::

    Indicates whether kana iteration marks should be normalized. Defaults to `true`.
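A minimal sketch of wiring the character filter into a custom analyzer (the
index and analyzer names are illustrative):

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "iteration_mark_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [
              "kuromoji_iteration_mark"
            ]
          }
        }
      }
    }
  }
}
--------------------------------------------------

With this analyzer, text such as 時々 is expanded to 時時 before tokenization.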
[[analysis-kuromoji-tokenizer]]
==== `kuromoji_tokenizer`

The `kuromoji_tokenizer` accepts the following settings:

`mode`::
+
--

The tokenization mode determines how the tokenizer handles compound and
unknown words. It can be set to:

`normal`::

    Normal segmentation, no decomposition for compounds. Example output:

        関西国際空港
        アブラカダブラ

`search`::

    Segmentation geared towards search. This includes a decompounding process
    for long nouns, also including the full compound token as a synonym.
    Example output:

        関西, 関西国際空港, 国際, 空港
        アブラカダブラ

`extended`::

    Extended mode outputs unigrams for unknown words. Example output:

        関西, 国際, 空港
        ア, ブ, ラ, カ, ダ, ブ, ラ
--
`discard_punctuation`::

    Whether punctuation should be discarded from the output. Defaults to `true`.

`user_dictionary`::
+
--

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
may be appended to the default dictionary. The dictionary should have the following CSV format:

[source,csv]
-----------------------
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
-----------------------
--

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ja.txt`:

[source,csv]
-----------------------
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
-----------------------
`nbest_cost`/`nbest_examples`::
+
--

Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
to include additional tokens that are most likely according to the statistical model.
If both parameters are used, the larger of the two values is applied.

`nbest_cost`::

    The `nbest_cost` parameter specifies an additional Viterbi cost.
    The KuromojiTokenizer will include all tokens in Viterbi paths that are
    within the `nbest_cost` value of the best path.

`nbest_examples`::

    The `nbest_examples` parameter can be used to find a `nbest_cost` value based on examples.
    For example, a value of /箱根山-箱根/成田空港-成田/ indicates that for the texts
    箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we'd like a cost that gives us
    箱根 (Hakone) and 成田 (Narita).
--
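As a sketch of how `nbest_cost` might be set on a custom tokenizer (the
tokenizer name and the cost value `2000` are purely illustrative; tune the
cost against your own corpus):

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_nbest": {
            "type": "kuromoji_tokenizer",
            "nbest_cost": "2000"
          }
        }
      }
    }
  }
}
--------------------------------------------------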
Then create an analyzer as follows:

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=東京スカイツリー
--------------------------------------------------
// CONSOLE
The above `analyze` request returns the following:

[source,json]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-baseform]]
==== `kuromoji_baseform` token filter

The `kuromoji_baseform` token filter replaces terms with their
BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=飲み
--------------------------------------------------
// CONSOLE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-speech]]
==== `kuromoji_part_of_speech` token filter

The `kuromoji_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. It accepts the following setting:

`stoptags`::

    An array of part-of-speech tags that should be removed. It defaults to the
    `stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=寿司がおいしいね
--------------------------------------------------
// CONSOLE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-readingform]]
==== `kuromoji_readingform` token filter

The `kuromoji_readingform` token filter replaces the token with its reading
form in either katakana or romaji. It accepts the following setting:

`use_romaji`::

    Whether romaji reading form should be output instead of katakana. Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["romaji_readingform"]
          },
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["katakana_readingform"]
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=katakana_analyzer&text=寿司 <1>
POST kuromoji_sample/_analyze?analyzer=romaji_analyzer&text=寿司 <2>
--------------------------------------------------
// CONSOLE

<1> Returns `スシ`.
<2> Returns `sushi`.
[[analysis-kuromoji-stemmer]]
==== `kuromoji_stemmer` token filter

The `kuromoji_stemmer` token filter normalizes common katakana spelling
variations ending in a long sound character (U+30FC) by removing that
character. Only full-width katakana characters are supported.

This token filter accepts the following setting:

`minimum_length`::

    Katakana words shorter than `minimum_length` are not stemmed (defaults
    to `4`).

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=コピー <1>
POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=サーバー <2>
--------------------------------------------------
// CONSOLE

<1> Returns `コピー`.
<2> Returns `サーバ`.
[[analysis-kuromoji-stop]]
==== `ja_stop` token filter

The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=analyzer_with_ja_stop&text=ストップは消える
--------------------------------------------------
// CONSOLE

The above request returns:

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 3
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-number]]
==== `kuromoji_number` token filter

The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=一〇〇〇
--------------------------------------------------
// CONSOLE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------