[[analysis-icu]]
=== ICU Analysis Plugin

The ICU Analysis plugin integrates the Lucene ICU module into {es},
adding extended Unicode support using the http://site.icu-project.org/[ICU]
libraries, including better analysis of Asian languages, Unicode
normalization, Unicode-aware case folding, collation support, and
transliteration.
[IMPORTANT]
.ICU analysis and backwards compatibility
================================================
From time to time, the ICU library receives updates such as adding new
characters and emojis, and improving collation (sort) orders. These changes
may or may not affect search and sort orders, depending on which character
sets you are using.

While we restrict ICU upgrades to major versions, you may find that an index
created in the previous major version will need to be reindexed in order to
return correct (and correctly ordered) results, and to take advantage of new
characters.
================================================
:plugin_name: analysis-icu
include::install_remove.asciidoc[]

[[analysis-icu-analyzer]]
==== ICU Analyzer
The `icu_analyzer` analyzer performs basic normalization, tokenization and
character folding, using the `icu_normalizer` character filter, the
`icu_tokenizer` tokenizer, and the `icu_folding` token filter.

The following parameters are accepted:

[horizontal]
`method`:: Normalization method. Accepts `nfkc`, `nfc` or `nfkc_cf` (default).

`mode`:: Normalization mode. Accepts `compose` (default) or `decompose`.
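For example, here is a minimal sketch of an index that configures the
analyzer with explicit `method` and `mode` values (the analyzer name is an
illustrative placeholder):

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": { <1>
            "type": "icu_analyzer",
            "method": "nfkc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> An illustrative analyzer name; `method` and `mode` accept the values
listed above.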
[[analysis-icu-normalization-charfilter]]
==== ICU Normalization Character Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here].
It registers itself as the `icu_normalizer` character filter, which is
available to all indices without any further configuration. The type of
normalization can be specified with the `name` parameter, which accepts `nfc`,
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
convert `nfc` to `nfd` or `nfkc` to `nfkd` respectively.

Which letters are normalized can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

Here are two examples, the default usage and a customized character filter:
[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfd_normalizer` character filter, which is set to use
`nfc` normalization with decomposition.
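The examples above do not use `unicode_set_filter`. As a sketch of how it
might look, the following character filter leaves the Swedish characters
untouched while normalizing everything else (the filter and analyzer names
are illustrative, and the character set mirrors the folding example later on
this page):

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "filtered_normalized": {
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "swedish_preserving_normalizer"
            ]
          }
        },
        "char_filter": {
          "swedish_preserving_normalizer": { <1>
            "type": "icu_normalizer",
            "name": "nfkc_cf",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> An illustrative name; the `unicode_set_filter` here excludes the Swedish
characters `åäöÅÄÖ` from normalization.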
[[analysis-icu-tokenizer]]
==== ICU Tokenizer

Tokenizes text into words on word boundaries, as defined in
https://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].
It behaves much like the {ref}/analysis-standard-tokenizer.html[`standard` tokenizer],
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
--------------------------------------------------
===== Rules customization

experimental[This functionality is marked as experimental in Lucene]

You can customize the `icu_tokenizer` behavior by specifying per-script rule
files. See the
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[RBBI rules syntax reference]
for a more detailed explanation.

To add ICU tokenizer rules, set the `rule_files` setting, which should contain a
comma-separated list of `code:rulefile` pairs in the following format:
https://unicode.org/iso15924/iso15924-codes.html[four-letter ISO 15924 script code],
followed by a colon, then a rule file name. Rule files are placed in the
`ES_HOME/config` directory.

As a demonstration of how the rule files can be used, save the following user
file to `$ES_HOME/config/KeywordTokenizer.rbbi`:
[source,text]
-----------------------
.+ {200};
-----------------------

Then create an analyzer to use this rule file as follows:

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
--------------------------------------------------
The above `analyze` request returns the following:

[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
--------------------------------------------------
[[analysis-icu-normalization]]
==== ICU Normalization Token Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It registers
itself as the `icu_normalizer` token filter, which is available to all indices
without any further configuration. The type of normalization can be specified
with the `name` parameter, which accepts `nfc`, `nfkc`, and `nfkc_cf`
(default).

Which letters are normalized can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

You should probably prefer the <<analysis-icu-normalization-charfilter,Normalization character filter>>.

Here are two examples, the default usage and a customized token filter:
[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfc_normalizer` token filter, which is set to use `nfc` normalization.
[[analysis-icu-folding]]
==== ICU Folding Token Filter

Case folding of Unicode characters based on `UTR#30`, like the
{ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter]
on steroids. It registers itself as the `icu_folding` token filter and is
available to all indices:
[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
The ICU folding token filter already does Unicode normalization, so there is
no need to use the normalization character or token filter as well.

Which letters are folded can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

The following example exempts Swedish characters from folding. It is important
to note that both upper and lowercase forms should be specified, and that
these filtered characters are not lowercased, which is why we add the
`lowercase` filter as well:
[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
--------------------------------------------------
[[analysis-icu-collation]]
==== ICU Collation Token Filter

[WARNING]
======
This token filter has been deprecated since Lucene 5.0. Please use the
<<analysis-icu-collation-keyword-field, ICU Collation Keyword Field>> instead.
======
[[analysis-icu-collation-keyword-field]]
==== ICU Collation Keyword Field

Collations are used for sorting documents in a language-specific word order.
The `icu_collation_keyword` field type is available to all indices and will encode
the terms directly as bytes in a doc values field and a single indexed token just
like a standard {ref}/keyword.html[Keyword Field].

It defaults to using {defguide}/sorting-collations.html#uca[DUCET collation],
which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in
``phonebook'' order:
[source,console]
--------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "name": { <1>
        "type": "text",
        "fields": {
          "sort": { <2>
            "type": "icu_collation_keyword",
            "index": false,
            "language": "de",
            "country": "DE",
            "variant": "@collation=phonebook"
          }
        }
      }
    }
  }
}

GET /my-index-000001/_search <3>
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}
--------------------------
<1> The `name` field uses the `standard` analyzer, and so supports full text queries.
<2> The `name.sort` field is an `icu_collation_keyword` field that preserves the name as
a single token in doc values, and applies the German ``phonebook'' order.
<3> An example query which searches the `name` field and sorts on the `name.sort` field.
===== Parameters for ICU Collation Keyword Fields

The following parameters are accepted by `icu_collation_keyword` fields
(a combined example follows the list):

[horizontal]
`doc_values`::

    Should the field be stored on disk in a column-stride fashion, so that it
    can later be used for sorting, aggregations, or scripting? Accepts `true`
    (default) or `false`.

`index`::

    Should the field be searchable? Accepts `true` (default) or `false`.

`null_value`::

    Accepts a string value which is substituted for any explicit `null`
    values. Defaults to `null`, which means the field is treated as missing.

{ref}/ignore-above.html[`ignore_above`]::

    Strings longer than the `ignore_above` setting will be ignored.
    Checking is performed on the original string before the collation.
    The `ignore_above` setting can be updated on existing fields
    using the {ref}/indices-put-mapping.html[PUT mapping API].
    By default, there is no limit and all values will be indexed.

`store`::

    Whether the field value should be stored and retrievable separately from
    the {ref}/mapping-source-field.html[`_source`] field. Accepts `true` or `false`
    (default).

`fields`::

    Multi-fields allow the same string value to be indexed in multiple ways for
    different purposes, such as one field for search and a multi-field for
    sorting and aggregations.
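Here is a hedged sketch that combines several of these parameters in one
mapping (the index name, field name, and the `null_value` placeholder string
are illustrative, not part of the plugin):

[source,console]
--------------------------
PUT my-index-000002
{
  "mappings": {
    "properties": {
      "name_sort": { <1>
        "type": "icu_collation_keyword",
        "index": false,
        "doc_values": true,
        "null_value": "BOTTOM",
        "ignore_above": 256,
        "language": "de"
      }
    }
  }
}
--------------------------
<1> An illustrative sort field: not searchable, kept in doc values, with
explicit `null` values replaced by the placeholder string `BOTTOM`, and
values longer than 256 characters ignored.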
===== Collation options

These options can be set on an `icu_collation_keyword` field; a combined
example follows the list.

`strength`::

    The strength property determines the minimum level of difference considered
    significant during comparison. Possible values are: `primary`, `secondary`,
    `tertiary`, `quaternary` or `identical`. See the
    https://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
    for a more detailed explanation for each value. Defaults to `tertiary`
    unless otherwise specified in the collation.

`decomposition`::

    Possible values: `no` (default, but collation-dependent) or `canonical`.
    Setting this decomposition property to `canonical` allows the Collator to
    handle unnormalized text properly, producing the same results as if the text
    were normalized. If `no` is set, it is the user's responsibility to ensure
    that all text is already in the appropriate form before a comparison or before
    getting a CollationKey. Adjusting decomposition mode allows the user to select
    between faster and more complete collation behavior. Since a great many of the
    world's languages do not require text normalization, most locales set `no` as
    the default decomposition mode.

The following options are expert only:

`alternate`::

    Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for
    strength `quaternary` to be either shifted or non-ignorable, which boils down
    to whether punctuation and whitespace are ignored.

`case_level`::

    Possible values: `true` or `false` (default). Whether case level sorting is
    required. When strength is set to `primary` this will ignore accent
    differences.

`case_first`::

    Possible values: `lower` or `upper`. Useful to control which case is sorted
    first when case is not ignored for strength `tertiary`. The default depends on
    the collation.

`numeric`::

    Possible values: `true` or `false` (default). Whether digits are sorted
    according to their numeric representation. For example, the value `egg-9` is
    sorted before the value `egg-21`.

`variable_top`::

    Single character or contraction. Controls what is variable for `alternate`.

`hiragana_quaternary_mode`::

    Possible values: `true` or `false`. Whether to distinguish between Katakana
    and Hiragana characters at `quaternary` strength.
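For instance, here is a sketch of a field tuned for case-insensitive,
numeric-aware sorting (the index and field names are illustrative):

[source,console]
--------------------------
PUT my-index-000003
{
  "mappings": {
    "properties": {
      "version_sort": {
        "type": "icu_collation_keyword",
        "index": false,
        "strength": "primary", <1>
        "numeric": true <2>
      }
    }
  }
}
--------------------------
<1> `primary` strength treats case (and accent) differences as insignificant
when sorting.
<2> With `numeric` enabled, `egg-9` sorts before `egg-21`, as described above.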
[[analysis-icu-transform]]
==== ICU Transform Token Filter

Transforms are used to process Unicode text in many different ways, such as
case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the `id` parameter
(defaults to `Null`), and specify text direction with the `dir` parameter
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
rulesets are not yet supported.

For example:
[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" <1>
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "你好" <2>
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "здравствуйте" <3>
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "こんにちは" <4>
}
--------------------------------------------------
<1> This transform transliterates characters to the Latin script, separates
accents from their base characters, removes the accents, and then puts the
remaining text into an unaccented form.
<2> Returns `ni hao`.
<3> Returns `zdravstvujte`.
<4> Returns `kon'nichiha`.

For more documentation, please see the http://userguide.icu-project.org/transforms/general[user guide of ICU Transform].
|