[[analysis-icu]]
=== ICU Analysis Plugin

The ICU Analysis plugin integrates the Lucene ICU module into elasticsearch,
adding extended Unicode support using the http://site.icu-project.org/[ICU]
libraries, including better analysis of Asian languages, Unicode
normalization, Unicode-aware case folding, collation support, and
transliteration.

[[analysis-icu-install]]
[float]
==== Installation

This plugin can be installed using the plugin manager:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin install analysis-icu
----------------------------------------------------------------

The plugin must be installed on every node in the cluster, and each node must
be restarted after installation.

[[analysis-icu-remove]]
[float]
==== Removal

The plugin can be removed with the following command:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin remove analysis-icu
----------------------------------------------------------------

The node must be stopped before removing the plugin.

[[analysis-icu-normalization-charfilter]]
==== ICU Normalization Character Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here].
It registers itself as the `icu_normalizer` character filter, which is
available to all indices without any further configuration. The type of
normalization can be specified with the `name` parameter, which accepts `nfc`,
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
convert `nfc` to `nfd`, or `nfkc` to `nfkd`, respectively.

Here are two examples, the default usage and a customized character filter:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfd_normalizer` character filter, which is set to use `nfc` normalization with decomposition.
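
To see the filter in action, you can run the `nfkc_cf_normalized` analyzer
defined above against some text. As an illustrative (not verified) example,
`nfkc_cf` normalization folds both compatibility characters and case, so the
Roman numeral `Ⅸ` (U+2168) should come out as the plain token `ix`:

[source,js]
--------------------------------------------------
POST icu_sample/_analyze?analyzer=nfkc_cf_normalized&text=Ⅸ
--------------------------------------------------
// CONSOLE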

[[analysis-icu-tokenizer]]
==== ICU Tokenizer

Tokenizes text into words on word boundaries, as defined in
http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].
It behaves much like the {ref}/analysis-standard-tokenizer.html[`standard` tokenizer],
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
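
As an illustration of the dictionary-based approach, analyzing the Chinese
compound 向日葵 (``sunflower'') with the analyzer defined above should keep
the word together as a single token, where the `standard` tokenizer would emit
one token per ideograph. The exact output depends on the dictionary bundled
with the ICU version in use:

[source,js]
--------------------------------------------------
POST icu_sample/_analyze?analyzer=my_icu_analyzer&text=向日葵
--------------------------------------------------
// CONSOLE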

===== Rules customization

experimental[]

You can customize the `icu_tokenizer` behavior by specifying per-script rule files. See the
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[RBBI rules syntax reference]
for a more detailed explanation.

To add icu tokenizer rules, set the `rule_files` setting, which should contain a
comma-separated list of `code:rulefile` pairs in the following format:
http://unicode.org/iso15924/iso15924-codes.html[four-letter ISO 15924 script code],
followed by a colon, then a rule file name. Rule files are placed in the
`ES_HOME/config` directory.

As a demonstration of how the rule files can be used, save the following rule
file to `$ES_HOME/config/KeywordTokenizer.rbbi`:

[source,text]
-----------------------
.+ {200};
-----------------------

Then create an analyzer to use this rule file as follows:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST icu_sample/_analyze?analyzer=my_analyzer&text=Elasticsearch. Wow!
--------------------------------------------------
// CONSOLE

The above `analyze` request returns the following:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
--------------------------------------------------
// TESTRESPONSE

[[analysis-icu-normalization]]
==== ICU Normalization Token Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It registers
itself as the `icu_normalizer` token filter, which is available to all indices
without any further configuration. The type of normalization can be specified
with the `name` parameter, which accepts `nfc`, `nfkc`, and `nfkc_cf`
(default).

You should probably prefer the <<analysis-icu-normalization-charfilter,Normalization character filter>>.

Here are two examples, the default usage and a customized token filter:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfc_normalizer` token filter, which is set to use `nfc` normalization.

[[analysis-icu-folding]]
==== ICU Folding Token Filter

Case folding of Unicode characters based on `UTR#30`, like the
{ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter]
on steroids. It registers itself as the `icu_folding` token filter and is
available to all indices:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

The ICU folding token filter already does Unicode normalization, so there is
no need to use the normalization character or token filter as well.
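
For example, with the `folded` analyzer defined above, analyzing a mixed-case
accented word should fold away both the accents and the case. The token
`resume` below is the expected output, not a verified response:

[source,js]
--------------------------------------------------
POST icu_sample/_analyze?analyzer=folded&text=Résumé
--------------------------------------------------
// CONSOLE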

Which letters are folded can be controlled by specifying the
`unicodeSetFilter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

The following example exempts Swedish characters from folding. It is important
to note that both upper and lowercase forms should be specified, and that
these filtered characters are not lowercased, which is why we add the
`lowercase` filter as well:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
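
As a quick illustration (expected rather than verified output), analyzing the
city name `Västerås` with the `swedish_analyzer` above should produce the
token `västerås`: the exempted `ä` and `å` survive folding, and the
`lowercase` filter takes care of the initial capital:

[source,js]
--------------------------------------------------
POST icu_sample/_analyze?analyzer=swedish_analyzer&text=Västerås
--------------------------------------------------
// CONSOLE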

[[analysis-icu-collation]]
==== ICU Collation Token Filter

Collations are used for sorting documents in a language-specific word order.
The `icu_collation` token filter is available to all indices and defaults to
using the
https://www.elastic.co/guide/en/elasticsearch/guide/current/sorting-collations.html#uca[DUCET collation],
which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in
``phonebook'' order:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_phonebook": {
          "type": "icu_collation",
          "language": "de",
          "country": "DE",
          "variant": "@collation=phonebook"
        }
      },
      "analyzer": {
        "german_phonebook": {
          "tokenizer": "keyword",
          "filter": [ "german_phonebook" ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": { <1>
          "type": "text",
          "fields": {
            "sort": { <2>
              "type": "text",
              "fielddata": true,
              "analyzer": "german_phonebook"
            }
          }
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET _search <3>
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}
--------------------------------------------------
// CONSOLE

<1> The `name` field uses the `standard` analyzer, and so supports full text queries.
<2> The `name.sort` field uses the `keyword` tokenizer to preserve the name as
a single token, and applies the `german_phonebook` token filter to index
the value in German phonebook sort order.
<3> An example query which searches the `name` field and sorts on the `name.sort` field.

===== Collation options

`strength`::

The strength property determines the minimum level of difference considered
significant during comparison. Possible values are: `primary`, `secondary`,
`tertiary`, `quaternary` or `identical`. See the
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
for a more detailed explanation for each value. Defaults to `tertiary`
unless otherwise specified in the collation.

`decomposition`::

Possible values: `no` (default, but collation-dependent) or `canonical`.
Setting this decomposition property to `canonical` allows the Collator to
handle unnormalized text properly, producing the same results as if the text
were normalized. If `no` is set, it is the user's responsibility to ensure
that all text is already in the appropriate form before a comparison or before
getting a CollationKey. Adjusting decomposition mode allows the user to select
between faster and more complete collation behavior. Since a great many of the
world's languages do not require text normalization, most locales set `no` as
the default decomposition mode.

The following options are expert only:

`alternate`::

Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for
strength `quaternary` to be either shifted or non-ignorable, which boils down
to ignoring punctuation and whitespace.

`caseLevel`::

Possible values: `true` or `false` (default). Whether case level sorting is
required. When strength is set to `primary` this will ignore accent
differences.

`caseFirst`::

Possible values: `lower` or `upper`. Useful to control which case is sorted
first when case is not ignored for strength `tertiary`. The default depends on
the collation.

`numeric`::

Possible values: `true` or `false` (default). Whether digits are sorted
according to their numeric representation. For example the value `egg-9` is
sorted before the value `egg-21`.

`variableTop`::

Single character or contraction. Controls what is variable for `alternate`.

`hiraganaQuaternaryMode`::

Possible values: `true` or `false`. Distinguishes between Katakana and
Hiragana characters in `quaternary` strength.
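
None of the examples above set these options explicitly. As a minimal sketch
of how they fit together (the index, filter, and analyzer names here are made
up for illustration), the following filter combines `strength` and `numeric`
to get a case-insensitive sort in which `egg-9` sorts before `egg-21`:

[source,js]
--------------------------------------------------
PUT icu_sortable_sample
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "natural_sort": {
            "type": "icu_collation",
            "language": "en",
            "strength": "primary",
            "numeric": true
          }
        },
        "analyzer": {
          "natural_sort": {
            "tokenizer": "keyword",
            "filter": [ "natural_sort" ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

As with the `german_phonebook` example, this analyzer would typically be
applied to a `fields` sub-field that is used only for sorting.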

[[analysis-icu-transform]]
==== ICU Transform Token Filter

Transforms are used to process Unicode text in many different ways, such as
case mapping, normalization, transliteration and bidirectional text handling.
You can define which transformation you want to apply with the `id` parameter
(defaults to `Null`), and specify text direction with the `dir` parameter
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
rulesets are not yet supported.

For example:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" <1>
          }
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET icu_sample/_analyze?analyzer=latin
{
  "text": "你好" <2>
}

GET icu_sample/_analyze?analyzer=latin
{
  "text": "здравствуйте" <3>
}

GET icu_sample/_analyze?analyzer=latin
{
  "text": "こんにちは" <4>
}
--------------------------------------------------
// CONSOLE

<1> This transform transliterates characters to Latin, separates accents
from their base characters, removes the accents, and then puts the
remaining text into an unaccented form.
<2> Returns `ni hao`.
<3> Returns `zdravstvujte`.
<4> Returns `kon'nichiha`.

For more documentation, please see the http://userguide.icu-project.org/transforms/general[user guide of ICU Transform].