[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin allows for Unicode
normalization, collation and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It
registers itself by default under `icu_normalizer` or `icuNormalizer`
using the default settings. The `name` parameter selects the
normalization form and accepts the following values: `nfc`, `nfkc`,
and `nfkc_cf`.

Here is a sample of the settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------
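
To select one of the forms listed above explicitly, a custom filter of
type `icu_normalizer` can be declared with the `name` parameter. The
following is only a minimal sketch: the filter name `my_normalizer` and
the analyzer name `nfkc_normalized` are made up for illustration and are
not registered by the plugin.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "nfkc_normalized" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_normalizer"]
                }
            },
            "filter" : {
                "my_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc"
                }
            }
        }
    }
}
--------------------------------------------------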

[float]
=== ICU Folding

Folding of Unicode characters based on `UTR#30`. It registers itself
under the `icu_folding` and `icuFolding` names.
The filter also does lowercasing, which means the lowercase filter can
normally be left out. Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Filtering

The folding can be restricted to a set of Unicode characters with the
`unicodeSetFilter` parameter. This is useful for a non-internationalized
search engine that needs to retain a set of national characters which are
primary letters in a specific language. See the syntax for the
UnicodeSet
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from the folding. Note
that the filtered characters are NOT lowercased, which is why the
`lowercase` filter is added after it.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
=== ICU Collation

Uses the collation token filter. You can either specify the collation
rules (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
using the `rules` parameter (which can point to a location or be expressed
directly in the settings; a location can be relative to the config
location), or use the `language` parameter (further specialized by country
and variant). By default, it registers under `icu_collation` or
`icuCollation` and uses the default locale.

Here is a sample of the settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

And here is a sample of custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------
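
As a further sketch, the `language` parameter can be narrowed to a
specific country. This assumes the country is passed in a parameter named
`country`, which the description above implies but does not spell out, so
check it against the plugin version in use. The names `german_collation`
and `germanCollator` are illustrative only.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "german_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["germanCollator"]
                }
            },
            "filter" : {
                "germanCollator" : {
                    "type" : "icu_collation",
                    "language" : "de",
                    "country" : "DE"
                }
            }
        }
    }
}
--------------------------------------------------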