- [[analysis-icu-plugin]]
- == ICU Analysis Plugin
- The http://icu-project.org/[ICU] analysis plugin provides Unicode
- normalization, collation and folding. The plugin is called
- https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].
- The plugin includes the following analysis components:
- [float]
- [[icu-normalization]]
- === ICU Normalization
- Normalizes characters as explained
- http://userguide.icu-project.org/transforms/normalization[here]. The
- token filter is registered by default under the names `icu_normalizer` and
- `icuNormalizer`, using the default settings. An optional `name` parameter
- selects the normalization form; accepted values are `nfc`, `nfkc`, and `nfkc_cf`.
- Here are sample settings:
- [source,js]
- --------------------------------------------------
- {
- "index" : {
- "analysis" : {
- "analyzer" : {
- "normalization" : {
- "tokenizer" : "keyword",
- "filter" : ["icu_normalizer"]
- }
- }
- }
- }
- }
- --------------------------------------------------
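- To select a specific normalization form, a custom filter can set the `name` parameter described above. The following is a minimal sketch; the filter name `my_nfc_normalizer` and the analyzer name `nfc_normalization` are illustrative and not defined by the plugin:
- [source,js]
- --------------------------------------------------
- {
-     "index" : {
-         "analysis" : {
-             "analyzer" : {
-                 "nfc_normalization" : {
-                     "tokenizer" : "keyword",
-                     "filter" : ["my_nfc_normalizer"]
-                 }
-             },
-             "filter" : {
-                 "my_nfc_normalizer" : {
-                     "type" : "icu_normalizer",
-                     "name" : "nfc"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------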
- [float]
- [[icu-folding]]
- === ICU Folding
- Folds Unicode characters according to `UTR#30`. The token filter is registered
- under the names `icu_folding` and `icuFolding`.
- The filter also lowercases, so the `lowercase` filter can
- normally be left out. Sample settings:
- [source,js]
- --------------------------------------------------
- {
- "index" : {
- "analysis" : {
- "analyzer" : {
- "folding" : {
- "tokenizer" : "keyword",
- "filter" : ["icu_folding"]
- }
- }
- }
- }
- }
- --------------------------------------------------
- [float]
- [[icu-filtering]]
- ==== Filtering
- Folding can be restricted to a set of Unicode characters with the
- `unicodeSetFilter` parameter. This is useful for a non-internationalized
- search engine that needs to retain national characters which are
- primary letters in a specific language. See the UnicodeSet syntax
- http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].
- The following example exempts Swedish characters from folding. Note
- that the exempted characters are NOT lowercased, which is why the
- `lowercase` filter is added after the folding filter.
- [source,js]
- --------------------------------------------------
- {
- "index" : {
- "analysis" : {
- "analyzer" : {
- "folding" : {
- "tokenizer" : "standard",
- "filter" : ["my_icu_folding", "lowercase"]
- }
- },
- "filter" : {
- "my_icu_folding" : {
- "type" : "icu_folding",
- "unicodeSetFilter" : "[^åäöÅÄÖ]"
- }
- }
- }
- }
- }
- --------------------------------------------------
- [float]
- [[icu-collation]]
- === ICU Collation
- Uses the collation token filter. The collation can be specified either through
- custom rules (defined
- http://www.icu-project.org/userguide/Collate_Customization.html[here])
- with the `rules` parameter (the value can be the rules themselves or point to a
- file location, which may be relative to the config location), or through the
- `language` parameter (further specialized by country and variant). By
- default the filter is registered under `icu_collation` and `icuCollation` and uses the
- default locale.
- Here are sample settings:
- [source,js]
- --------------------------------------------------
- {
- "index" : {
- "analysis" : {
- "analyzer" : {
- "collation" : {
- "tokenizer" : "keyword",
- "filter" : ["icu_collation"]
- }
- }
- }
- }
- }
- --------------------------------------------------
- And here is a sample of a custom collation filter for a specific language:
- [source,js]
- --------------------------------------------------
- {
- "index" : {
- "analysis" : {
- "analyzer" : {
- "collation" : {
- "tokenizer" : "keyword",
- "filter" : ["myCollator"]
- }
- },
- "filter" : {
- "myCollator" : {
- "type" : "icu_collation",
- "language" : "en"
- }
- }
- }
- }
- }
- --------------------------------------------------
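- The `language` setting can be narrowed further by country, as mentioned above. The following sketch assumes the parameter is named `country` and takes a two-letter country code; the filter name `swedishCollator` and the analyzer name are illustrative:
- [source,js]
- --------------------------------------------------
- {
-     "index" : {
-         "analysis" : {
-             "analyzer" : {
-                 "swedish_collation" : {
-                     "tokenizer" : "keyword",
-                     "filter" : ["swedishCollator"]
-                 }
-             },
-             "filter" : {
-                 "swedishCollator" : {
-                     "type" : "icu_collation",
-                     "language" : "sv",
-                     "country" : "SE"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------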
- [float]
- ==== Options
- [horizontal]
- `strength`::
- The strength property determines the minimum level of difference considered significant during comparison.
- The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
- Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`.
- +
- See http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
- explanation for the specific values.
- `decomposition`::
- Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property with
- `canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
- normalized. If `no` is set, it is the user's responsibility to ensure that all text is already in the appropriate form
- before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between
- faster and more complete collation behavior. Since a great many of the world's languages do not require text
- normalization, most locales set `no` as the default decomposition mode.
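- As an illustration of these two options, the following sketch defines a collation filter that compares at `primary` strength and applies canonical decomposition. The filter name `primaryCollator` and the analyzer name are illustrative, and the option values are passed as plain strings on the assumption that this is how the filter reads them:
- [source,js]
- --------------------------------------------------
- {
-     "index" : {
-         "analysis" : {
-             "analyzer" : {
-                 "primary_collation" : {
-                     "tokenizer" : "keyword",
-                     "filter" : ["primaryCollator"]
-                 }
-             },
-             "filter" : {
-                 "primaryCollator" : {
-                     "type" : "icu_collation",
-                     "language" : "en",
-                     "strength" : "primary",
-                     "decomposition" : "canonical"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------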
- [float]
- ==== Expert options:
- [horizontal]
- `alternate`::
- Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
- to be either shifted or non-ignorable. In practice this determines whether punctuation and whitespace are ignored.
- `caseLevel`::
- Possible values: `true` or `false`. Defaults to `false`. Whether case level sorting is required. When
- strength is set to `primary`, enabling this ignores accent differences while still taking case into account.
- `caseFirst`::
- Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
- for strength `tertiary`.
- `numeric`::
- Possible values: `true` or `false`. Whether digits are sorted according to numeric representation. For
- example the value `egg-9` is sorted before the value `egg-21`. Defaults to `false`.
- `variableTop`::
- Single character or contraction. Controls what is variable for `alternate`.
- `hiraganaQuaternaryMode`::
- Possible values: `true` or `false`. Defaults to `false`. Whether to distinguish between Katakana and
- Hiragana characters at `quaternary` strength.
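- The expert options are set on the filter in the same way as the basic ones. The following sketch shows two independent filters: one that enables `numeric` so that digit sequences are compared by value, and one that uses `alternate` at `quaternary` strength. The filter names are illustrative, and writing the boolean as a JSON `true` rather than a string is an assumption:
- [source,js]
- --------------------------------------------------
- {
-     "index" : {
-         "analysis" : {
-             "filter" : {
-                 "numericCollator" : {
-                     "type" : "icu_collation",
-                     "language" : "en",
-                     "numeric" : true
-                 },
-                 "shiftedCollator" : {
-                     "type" : "icu_collation",
-                     "language" : "en",
-                     "strength" : "quaternary",
-                     "alternate" : "shifted"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------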
- [float]
- === ICU Tokenizer
- Breaks text into words according to http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].
- [source,js]
- --------------------------------------------------
- {
- "index" : {
- "analysis" : {
- "analyzer" : {
- "collation" : {
- "tokenizer" : "icu_tokenizer"
- }
- }
- }
- }
- }
- --------------------------------------------------
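- The tokenizer can be combined with the token filters described earlier. As a sketch, the following analyzer (the name `icu_text` is illustrative) splits text with `icu_tokenizer` and then folds each token with `icu_folding`:
- [source,js]
- --------------------------------------------------
- {
-     "index" : {
-         "analysis" : {
-             "analyzer" : {
-                 "icu_text" : {
-                     "tokenizer" : "icu_tokenizer",
-                     "filter" : ["icu_folding"]
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------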
- [float]
- === ICU Normalization CharFilter
- Normalizes characters as explained http://userguide.icu-project.org/transforms/normalization[here].
- The char filter is registered by default under the names `icu_normalizer` and `icuNormalizer`, using the default settings.
- An optional `name` parameter selects the normalization form; accepted values are `nfc`, `nfkc`, and `nfkc_cf`.
- An optional `mode` parameter accepts `compose` or `decompose`.
- Use `decompose` with `nfc` or `nfkc` to get `nfd` or `nfkd`, respectively.
- Here are sample settings:
- [source,js]
- --------------------------------------------------
- {
- "index" : {
- "analysis" : {
- "analyzer" : {
- "collation" : {
- "tokenizer" : "keyword",
- "char_filter" : ["icu_normalizer"]
- }
- }
- }
- }
- }
- --------------------------------------------------
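- To obtain `nfd` normalization as described above, a custom char filter can combine `name` and `mode`. This is a minimal sketch; the char filter name `nfd_normalizer` and the analyzer name `nfd_normalized` are illustrative:
- [source,js]
- --------------------------------------------------
- {
-     "index" : {
-         "analysis" : {
-             "analyzer" : {
-                 "nfd_normalized" : {
-                     "tokenizer" : "keyword",
-                     "char_filter" : ["nfd_normalizer"]
-                 }
-             },
-             "char_filter" : {
-                 "nfd_normalizer" : {
-                     "type" : "icu_normalizer",
-                     "name" : "nfc",
-                     "mode" : "decompose"
-                 }
-             }
-         }
-     }
- }
- --------------------------------------------------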