[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin provides Unicode
normalization, collation and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It
registers itself by default under `icu_normalizer` or `icuNormalizer`
using the default settings. The `name` parameter selects the
normalization form and accepts the following values: `nfc`, `nfkc`, and
`nfkc_cf`. Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
=== ICU Folding

Folds Unicode characters based on `UTR#30`. It registers itself
under the names `icu_folding` and `icuFolding`.

The filter also lowercases its input, which means the lowercase filter
can normally be left out. Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Filtering

The folding can be restricted to a set of Unicode characters with the
`unicodeSetFilter` parameter. This is useful for a non-internationalized
search engine that needs to retain a set of national characters which are
primary letters in a specific language. The UnicodeSet syntax is described
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from folding. Note
that the exempted characters are NOT lowercased, which is why the
`lowercase` filter is added after the folding filter.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
=== ICU Collation

Provides a collation token filter. The collation can be configured in one
of two ways: with the `rules` parameter, which specifies custom collation
rules (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
either inline in the settings or as a file location (which may be relative
to the config location); or with the `language` parameter, which can be
further specialized by country and variant. By default, the filter
registers itself under `icu_collation` or `icuCollation` and uses the
default locale.

Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

A sample of custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------
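
The `language` parameter can be further specialized by country, as noted
in the collation description. The following is a minimal sketch of such a
configuration. It assumes the country is supplied through a `country`
setting next to `language`; that parameter name is not spelled out on this
page and should be checked against the plugin version in use. The analyzer
name `german_collation` and the filter name `germanCollator` are
illustrative only.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "german_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["germanCollator"]
                }
            },
            "filter" : {
                "germanCollator" : {
                    "type" : "icu_collation",
                    "language" : "de",
                    "country" : "DE"
                }
            }
        }
    }
}
--------------------------------------------------

As in the samples above, the `keyword` tokenizer keeps the whole field
value as a single token, so the collation filter produces one sort key per
value.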