[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin allows for Unicode
normalization, collation and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
[[icu-normalization]]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It
registers itself by default under `icu_normalizer` or `icuNormalizer`
using the default settings. The `name` parameter selects the
normalization form and accepts the following values: `nfc`, `nfkc`,
and `nfkc_cf`. Here is a sample setting:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------
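
As a sketch of how the `name` parameter might be used, the settings below
define a custom filter of type `icu_normalizer` that applies `nfkc`
normalization. The filter name `my_nfkc_normalizer` and the analyzer name
`nfkc_normalization` are illustrative choices, not names registered by the
plugin.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "nfkc_normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_nfkc_normalizer"]
                }
            },
            "filter" : {
                "my_nfkc_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc"
                }
            }
        }
    }
}
--------------------------------------------------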

[float]
[[icu-folding]]
=== ICU Folding

Folds Unicode characters based on `UTR#30`. It registers itself
under the names `icu_folding` and `icuFolding`.

The filter also does lowercasing, which means the lowercase filter can
normally be left out. Sample setting:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-filtering]]
==== Filtering

The folding can be restricted to a set of Unicode characters with the
parameter `unicodeSetFilter`. This is useful for a non-internationalized
search engine that needs to retain a set of national characters which are
primary letters in a specific language. See the syntax for the
UnicodeSet
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from the folding. Note
that the filtered characters are NOT lowercased, which is why the
`lowercase` filter is added after the folding filter.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-collation]]
=== ICU Collation

Uses the collation token filter. The collation can be specified either
with the `rules` parameter, which takes custom collation rules (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
given inline in the settings or as a location of a rules file (the
location can be relative to the config location), or with the
`language` parameter (further specialized by country and variant). By
default, the filter registers under `icu_collation` or `icuCollation` and
uses the default locale.

Here is a sample setting:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

And here is a sample of custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Options

[horizontal]
`strength`::
The strength property determines the minimum level of difference considered significant during comparison.
The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`.
+
See the http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
explanation of the specific values.

`decomposition`::
Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property to
`canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
normalized. If `no` is set, it is the user's responsibility to ensure that all text is already in the appropriate form
before a comparison or before getting a CollationKey. Adjusting the decomposition mode allows the user to select between
faster and more complete collation behavior. Since a great many of the world's languages do not require text
normalization, most locales set `no` as the default decomposition mode.
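
As a sketch of how these options might be combined, the filter below compares
at `primary` strength and enables `canonical` decomposition so that
un-normalized input is handled by the Collator. The filter name
`my_primary_collator` and the analyzer name `primary_collation` are
illustrative, and `"language" : "en"` is carried over from the custom
collation sample above.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "primary_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_primary_collator"]
                }
            },
            "filter" : {
                "my_primary_collator" : {
                    "type" : "icu_collation",
                    "language" : "en",
                    "strength" : "primary",
                    "decomposition" : "canonical"
                }
            }
        }
    }
}
--------------------------------------------------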

[float]
==== Expert options

[horizontal]
`alternate`::
Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
to be either shifted or non-ignorable. This boils down to whether punctuation and whitespace are ignored.

`caseLevel`::
Possible values: `true` or `false`. Defaults to `false`. Whether case level sorting is required. When
combined with strength `primary`, accent differences are ignored while case differences are still taken into account.

`caseFirst`::
Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
for strength `tertiary`.

`numeric`::
Possible values: `true` or `false`. Defaults to `false`. Whether digits are sorted according to their numeric
value. For example, with `true` the value `egg-9` is sorted before the value `egg-21`.

`variableTop`::
Single character or contraction. Controls what is variable for `alternate`.

`hiraganaQuaternaryMode`::
Possible values: `true` or `false`. Defaults to `false`. Whether to distinguish between Katakana and
Hiragana characters at `quaternary` strength.
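
As a sketch of one expert option in use, the filter below turns on `numeric`
so that, per the description above, `egg-9` sorts before `egg-21`. The filter
name `my_numeric_collator` and the analyzer name `numeric_collation` are
illustrative, and `"language" : "en"` is carried over from the custom
collation sample.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "numeric_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_numeric_collator"]
                }
            },
            "filter" : {
                "my_numeric_collator" : {
                    "type" : "icu_collation",
                    "language" : "en",
                    "numeric" : true
                }
            }
        }
    }
}
--------------------------------------------------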

[float]
=== ICU Tokenizer

Breaks text into words according to
http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "tokenized" : {
                    "tokenizer" : "icu_tokenizer"
                }
            }
        }
    }
}
--------------------------------------------------
  188. --------------------------------------------------