[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin provides Unicode
normalization, collation and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
[[icu-normalization]]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It
registers itself by default under `icu_normalizer` or `icuNormalizer`
using the default settings. The `name` parameter selects the
normalization form and accepts one of the following values: `nfc`,
`nfkc`, and `nfkc_cf`.

Here are sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-folding]]
=== ICU Folding

Folds Unicode characters based on `UTR#30`. It registers itself
under the names `icu_folding` and `icuFolding`.

The filter also does lowercasing, which means the lowercase filter can
normally be left out.

Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-filtering]]
==== Filtering

Folding can be restricted to a set of Unicode characters with the
parameter `unicodeSetFilter`. This is useful for a non-internationalized
search engine that needs to retain a set of national characters which are
primary letters in a specific language. The UnicodeSet syntax is described
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from the folding. Note
that the exempted characters are NOT lowercased, which is why the
`lowercase` filter is added after the folding filter.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-collation]]
=== ICU Collation

Provides a collation token filter. The collation can be specified either
through custom rules (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
using the `rules` parameter, which can hold the rules directly in the
settings or point to a file location (the location can be relative to the
config location), or through the `language` parameter (further specialized
by country and variant). By default the filter registers under
`icu_collation` or `icuCollation` and uses the default locale.

Here are sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

And here is a sample of custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Options

[horizontal]
`strength`::
    The strength property determines the minimum level of difference considered significant during comparison.
    The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
    Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`. +
    See the http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
    explanation of the specific values.

`decomposition`::
    Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property to
    `canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
    normalized. If `no` is set, it is the user's responsibility to ensure that all text is already in the appropriate form
    before a comparison or before getting a CollationKey. Adjusting the decomposition mode allows the user to select between
    faster and more complete collation behavior. Since a great many of the world's languages do not require text
    normalization, most locales set `no` as the default decomposition mode.

[float]
==== Expert options:

[horizontal]
`alternate`::
    Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
    to be either shifted or non-ignorable, which boils down to whether punctuation and whitespace are ignored.

`caseLevel`::
    Possible values: `true` or `false`. Default is `false`. Whether case level sorting is required. When
    strength is set to `primary` this will ignore accent differences.

`caseFirst`::
    Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
    for strength `tertiary`.

`numeric`::
    Possible values: `true` or `false`. Defaults to `false`. Whether digits are sorted according to their numeric
    representation. For example, when set to `true` the value `egg-9` is sorted before the value `egg-21`.

`variableTop`::
    Single character or contraction. Controls what is variable for `alternate`.

`hiraganaQuaternaryMode`::
    Possible values: `true` or `false`. Defaults to `false`. Whether to distinguish between Katakana and
    Hiragana characters at `quaternary` strength.

[float]
=== ICU Tokenizer

Breaks text into words according to
http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "tokenized" : {
                    "tokenizer" : "icu_tokenizer"
                }
            }
        }
    }
}
--------------------------------------------------
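
The ICU tokenizer can be combined with the token filters described in the
earlier sections. As a minimal sketch, the settings below pair
`icu_tokenizer` with the `icu_folding` filter in a single analyzer. The
analyzer name `icu_text` is hypothetical and chosen only for illustration;
the tokenizer and filter names are the ones the plugin registers by default.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "icu_text" : {
                    "tokenizer" : "icu_tokenizer",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

Because `icu_folding` also lowercases, this sketch does not list a separate
`lowercase` filter. If `unicodeSetFilter` is used to exempt characters from
folding, a `lowercase` filter would still be needed, as in the filtering
example.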