[[analysis-kuromoji]]
=== Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis
module into Elasticsearch.

:plugin_name: analysis-kuromoji
include::install_remove.asciidoc[]

[[analysis-kuromoji-analyzer]]
==== `kuromoji` analyzer

The `kuromoji` analyzer consists of the following tokenizer and token filters:

* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
* <<analysis-kuromoji-stop,`ja_stop`>> token filter
* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `mode` and `user_dictionary` settings from
<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.
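
For instance, here is a minimal sketch of configuring the `kuromoji` analyzer
with `search` mode and analyzing a sample text (the index and analyzer names
are illustrative only):

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "mode": "search"
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_kuromoji_analyzer",
  "text": "関西国際空港"
}
--------------------------------------------------

In `search` mode, a compound noun such as 関西国際空港 should be decompounded,
as described for the tokenizer's `mode` setting below.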

[[analysis-kuromoji-charfilter]]
==== `kuromoji_iteration_mark` character filter

The `kuromoji_iteration_mark` character filter normalizes Japanese horizontal
iteration marks (_odoriji_) to their expanded form. It accepts the following
settings:

`normalize_kanji`::
Indicates whether kanji iteration marks should be normalized. Defaults to `true`.

`normalize_kana`::
Indicates whether kana iteration marks should be normalized. Defaults to `true`.
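
As a minimal sketch of wiring this character filter into a custom analyzer
(names and sample text are illustrative only):

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": ["kuromoji_iteration_mark"]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "時々"
}
--------------------------------------------------

With `normalize_kanji` enabled (the default), the iteration mark in 時々
should be expanded to 時時 before tokenization.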

[[analysis-kuromoji-tokenizer]]
==== `kuromoji_tokenizer`

The `kuromoji_tokenizer` accepts the following settings:

`mode`::
+
--
The tokenization mode determines how the tokenizer handles compound and
unknown words. It can be set to:

`normal`::

Normal segmentation, no decomposition for compounds. Example output:

    関西国際空港
    アブラカダブラ

`search`::

Segmentation geared towards search. This includes a decompounding process
for long nouns, also including the full compound token as a synonym.
Example output:

    関西, 関西国際空港, 国際, 空港
    アブラカダブラ

`extended`::

Extended mode outputs unigrams for unknown words. Example output:

    関西, 国際, 空港
    ア, ブ, ラ, カ, ダ, ブ, ラ
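
You can compare the modes directly with the `_analyze` API by defining the
tokenizer inline; a minimal sketch:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "text": "関西国際空港"
}
--------------------------------------------------

In `search` mode this request should return 関西, 関西国際空港, 国際, and 空港;
switching `mode` to `normal` should return the single token 関西国際空港.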

--

`discard_punctuation`::

Whether punctuation should be discarded from the output. Defaults to `true`.

`user_dictionary`::
+
--
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
may be appended to the default dictionary. The dictionary should have the following CSV format:

[source,csv]
-----------------------
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
-----------------------
--

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ja.txt`:

[source,csv]
-----------------------
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
-----------------------

You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary_rules": ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
--------------------------------------------------

`nbest_cost`/`nbest_examples`::
+
--
Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
to include additional tokens that are most likely according to the statistical model.
If both parameters are set, the larger of the two resulting cost values is used.

`nbest_cost`::

The `nbest_cost` parameter specifies an additional Viterbi cost.
The KuromojiTokenizer will include all tokens in Viterbi paths that are
within the `nbest_cost` value of the best path.

`nbest_examples`::

The `nbest_examples` parameter can be used to find a `nbest_cost` value based on
examples. For example, a value of /箱根山-箱根/成田空港-成田/ indicates that in
the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we would like a cost
that gives us 箱根 (Hakone) and 成田 (Narita).
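
As a minimal sketch of the `nbest_examples` setting, again using an inline
tokenizer definition with the `_analyze` API (sample text illustrative only):

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "nbest_examples": "/箱根山-箱根/成田空港-成田/"
  },
  "text": "成田空港"
}
--------------------------------------------------

With the derived cost, the output should include 成田 in addition to the
tokens of the best path.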

--

Then create an analyzer that uses the `userdict_ja.txt` file saved above:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "東京スカイツリー"
}
--------------------------------------------------

The above `analyze` request returns the following:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-baseform]]
==== `kuromoji_baseform` token filter

The `kuromoji_baseform` token filter replaces terms with their
BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives. Example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "飲み"
}
--------------------------------------------------

which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-speech]]
==== `kuromoji_part_of_speech` token filter

The `kuromoji_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. It accepts the following setting:

`stoptags`::

An array of part-of-speech tags that should be removed. It defaults to the
`stoptags.txt` file embedded in the `lucene-analyzers-kuromoji.jar`.

For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "寿司がおいしいね"
}
--------------------------------------------------

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-readingform]]
==== `kuromoji_readingform` token filter

The `kuromoji_readingform` token filter replaces the token with its reading
form in either katakana or romaji. It accepts the following setting:

`use_romaji`::

Whether romaji reading form should be output instead of katakana. Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["romaji_readingform"]
          },
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["katakana_readingform"]
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司" <1>
}

GET kuromoji_sample/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司" <2>
}
--------------------------------------------------
<1> Returns `スシ`.
<2> Returns `sushi`.

[[analysis-kuromoji-stemmer]]
==== `kuromoji_stemmer` token filter

The `kuromoji_stemmer` token filter normalizes common katakana spelling
variations ending in a long sound character (U+30FC) by removing that
character. Only full-width katakana characters are supported.

This token filter accepts the following setting:

`minimum_length`::

Katakana words shorter than `minimum_length` are not stemmed (defaults
to `4`).

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コピー" <1>
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "サーバー" <2>
}
--------------------------------------------------
<1> Returns `コピー`.
<2> Returns `サーバ`.

[[analysis-kuromoji-stop]]
==== `ja_stop` token filter

The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "analyzer_with_ja_stop",
  "text": "ストップは消える"
}
--------------------------------------------------

The above request returns:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-number]]
==== `kuromoji_number` token filter

The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters. For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "一〇〇〇"
}
--------------------------------------------------

Which results in:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------