[[analysis-kuromoji]]
=== Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis
module into Elasticsearch.
[[analysis-kuromoji-install]]
[float]
==== Installation

This plugin can be installed using the plugin manager:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin install analysis-kuromoji
----------------------------------------------------------------

The plugin must be installed on every node in the cluster, and each node must
be restarted after installation.
[[analysis-kuromoji-remove]]
[float]
==== Removal

The plugin can be removed with the following command:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin remove analysis-kuromoji
----------------------------------------------------------------

The node must be stopped before removing the plugin.
[[analysis-kuromoji-analyzer]]
==== `kuromoji` analyzer

The `kuromoji` analyzer consists of the following tokenizer and token filters:

* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
* <<analysis-kuromoji-stop,`ja_stop`>> token filter
* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `mode` and `user_dictionary` settings from
<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.
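A minimal sketch of passing one of these settings through the built-in analyzer
(the index and analyzer names here are illustrative):

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "mode": "search"
          }
        }
      }
    }
  }
}
--------------------------------------------------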
[[analysis-kuromoji-charfilter]]
==== `kuromoji_iteration_mark` character filter

The `kuromoji_iteration_mark` character filter normalizes Japanese horizontal
iteration marks (_odoriji_) to their expanded form. It accepts the following
settings:

`normalize_kanji`::

    Indicates whether kanji iteration marks should be normalized. Defaults to `true`.

`normalize_kana`::

    Indicates whether kana iteration marks should be normalized. Defaults to `true`.
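A minimal sketch of wiring the character filter into a custom analyzer (the
index and analyzer names are illustrative):

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "iteration_mark_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [
              "kuromoji_iteration_mark"
            ]
          }
        }
      }
    }
  }
}
--------------------------------------------------

With this analyzer, text such as 時々 is expanded to 時時 before tokenization.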
[[analysis-kuromoji-tokenizer]]
==== `kuromoji_tokenizer`

The `kuromoji_tokenizer` accepts the following settings:

`mode`::
+
--

The tokenization mode determines how the tokenizer handles compound and
unknown words. It can be set to:

`normal`::

    Normal segmentation, no decomposition for compounds. Example output:

        関西国際空港
        アブラカダブラ

`search`::

    Segmentation geared towards search. This includes a decompounding process
    for long nouns, also including the full compound token as a synonym.
    Example output:

        関西, 関西国際空港, 国際, 空港
        アブラカダブラ

`extended`::

    Extended mode outputs unigrams for unknown words. Example output:

        関西, 国際, 空港
        ア, ブ, ラ, カ, ダ, ブ, ラ
--
`discard_punctuation`::

    Whether punctuation should be discarded from the output. Defaults to `true`.

`user_dictionary`::
+
--

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
may be appended to the default dictionary. The dictionary should have the following CSV format:

[source,csv]
-----------------------
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
-----------------------
--

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ja.txt`:

[source,csv]
-----------------------
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
-----------------------
`nbest_cost`/`nbest_examples`::
+
--

Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
to include additional tokens that are most likely according to the statistical model.
If both parameters are used, the larger of the two values is applied.

`nbest_cost`::

    The `nbest_cost` parameter specifies an additional Viterbi cost.
    The KuromojiTokenizer will include all tokens in Viterbi paths that are
    within the `nbest_cost` value of the best path.

`nbest_examples`::

    The `nbest_examples` parameter can be used to find a `nbest_cost` value based on examples.
    For example, a value of /箱根山-箱根/成田空港-成田/ indicates that for the texts
    箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we'd like a cost that gives us
    箱根 (Hakone) and 成田 (Narita).
--
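As a sketch of how `nbest_cost` might be set on a custom tokenizer (the
tokenizer name and the cost value `2000` are purely illustrative; tune the
cost against your own corpus):

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_nbest": {
            "type": "kuromoji_tokenizer",
            "nbest_cost": "2000"
          }
        }
      }
    }
  }
}
--------------------------------------------------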
Then create an analyzer as follows:

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=東京スカイツリー
--------------------------------------------------
// CONSOLE
The above `analyze` request returns the following:

[source,json]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-baseform]]
==== `kuromoji_baseform` token filter

The `kuromoji_baseform` token filter replaces terms with their
BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=飲み
--------------------------------------------------
// CONSOLE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-speech]]
==== `kuromoji_part_of_speech` token filter

The `kuromoji_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. It accepts the following setting:

`stoptags`::

    An array of part-of-speech tags that should be removed. It defaults to the
    `stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=寿司がおいしいね
--------------------------------------------------
// CONSOLE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-readingform]]
==== `kuromoji_readingform` token filter

The `kuromoji_readingform` token filter replaces the token with its reading
form in either katakana or romaji. It accepts the following setting:

`use_romaji`::

    Whether romaji reading form should be output instead of katakana. Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["romaji_readingform"]
          },
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["katakana_readingform"]
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=katakana_analyzer&text=寿司 <1>
POST kuromoji_sample/_analyze?analyzer=romaji_analyzer&text=寿司 <2>
--------------------------------------------------
// CONSOLE

<1> Returns `スシ`.
<2> Returns `sushi`.
[[analysis-kuromoji-stemmer]]
==== `kuromoji_stemmer` token filter

The `kuromoji_stemmer` token filter normalizes common katakana spelling
variations ending in a long sound character (U+30FC) by removing that
character. Only full-width katakana characters are supported.

This token filter accepts the following setting:

`minimum_length`::

    Katakana words shorter than `minimum_length` are not stemmed (defaults
    to `4`).

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=コピー <1>
POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=サーバー <2>
--------------------------------------------------
// CONSOLE

<1> Returns `コピー`.
<2> Returns `サーバ`.
[[analysis-kuromoji-stop]]
==== `ja_stop` token filter

The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=analyzer_with_ja_stop&text=ストップは消える
--------------------------------------------------
// CONSOLE

The above request returns:

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 3
  } ]
}
--------------------------------------------------
[[analysis-kuromoji-number]]
==== `kuromoji_number` token filter

The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=一〇〇〇
--------------------------------------------------
// CONSOLE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------