[[analysis-icu]]
=== ICU Analysis Plugin

The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch,
adding extended Unicode support using the http://site.icu-project.org/[ICU]
libraries, including better analysis of Asian languages, Unicode
normalization, Unicode-aware case folding, collation support, and
transliteration.

[IMPORTANT]
.ICU analysis and backwards compatibility
================================================

From time to time, the ICU library receives updates such as adding new
characters and emojis, and improving collation (sort) orders. These changes
may or may not affect search and sort orders, depending on which character
sets you are using.

While we restrict ICU upgrades to major versions, you may find that an index
created in the previous major version will need to be reindexed in order to
return correct (and correctly ordered) results, and to take advantage of new
characters.

================================================

:plugin_name: analysis-icu
include::install_remove.asciidoc[]

[[analysis-icu-analyzer]]
==== ICU Analyzer

The `icu_analyzer` analyzer performs basic normalization, tokenization and character folding, using the
`icu_normalizer` char filter, `icu_tokenizer` and `icu_normalizer` token filter.

The following parameters are accepted:

[horizontal]

`method`::

    Normalization method. Accepts `nfkc`, `nfc` or `nfkc_cf` (default).

`mode`::

    Normalization mode. Accepts `compose` (default) or `decompose`.

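
As a rough illustration of the two parameters above, the following sketch
defines an analyzer of type `icu_analyzer` that applies `nfkc` normalization
in `decompose` mode. The analyzer name `nfkc_decomposed` is hypothetical and
chosen only for this example:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_decomposed": {
            "type": "icu_analyzer",
            "method": "nfkc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Omitting both parameters should give the defaults listed above, `nfkc_cf`
with `compose`.
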
[[analysis-icu-normalization-charfilter]]
==== ICU Normalization Character Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here].
It registers itself as the `icu_normalizer` character filter, which is
available to all indices without any further configuration. The type of
normalization can be specified with the `name` parameter, which accepts `nfc`,
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
convert `nfc` to `nfd` or `nfkc` to `nfkd` respectively.

Which letters are normalized can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

Here are two examples, the default usage and a customised character filter:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfd_normalizer` character filter, which is set to use `nfc` normalization with decomposition.

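
The `unicode_set_filter` parameter described above can be sketched as
follows. This is a minimal example, not a recommended configuration: the names
`nfkc_cf_keep_swedish` and `swedish_normalized` are hypothetical, and the set
`[^åäöÅÄÖ]` is borrowed from the folding example later on this page so that
the listed Swedish letters are left out of normalization:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_normalized": {
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfkc_cf_keep_swedish"
            ]
          }
        },
        "char_filter": {
          "nfkc_cf_keep_swedish": {
            "type": "icu_normalizer",
            "name": "nfkc_cf",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
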
[[analysis-icu-tokenizer]]
==== ICU Tokenizer

Tokenizes text into words on word boundaries, as defined in
http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].
It behaves much like the {ref}/analysis-standard-tokenizer.html[`standard` tokenizer],
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

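
To see the dictionary-based segmentation described above, one option is to
call the `_analyze` API with the tokenizer directly. The request below is a
sketch that assumes the plugin is installed on the node handling the request;
the Thai sample text is arbitrary. Replacing `icu_tokenizer` with `standard`
in the same request gives a side-by-side comparison:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "สวัสดี ผมมาจากกรุงเทพฯ"
}
--------------------------------------------------
// CONSOLE
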
===== Rules customization

experimental[This functionality is marked as experimental in Lucene]

You can customize the `icu_tokenizer` behavior by specifying per-script rule files. See the
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[RBBI rules syntax reference]
for a more detailed explanation.

To add ICU tokenizer rules, set the `rule_files` setting, which should contain a comma-separated list of
`code:rulefile` pairs in the following format:
http://unicode.org/iso15924/iso15924-codes.html[four-letter ISO 15924 script code],
followed by a colon, then a rule file name. Rule files are placed in the `$ES_HOME/config` directory.

As a demonstration of how the rule files can be used, save the following user file to `$ES_HOME/config/KeywordTokenizer.rbbi`:

[source,text]
-----------------------
.+ {200};
-----------------------

Then create an analyzer to use this rule file as follows:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
--------------------------------------------------
// CONSOLE

The above `analyze` request returns the following:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
--------------------------------------------------
// TESTRESPONSE

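
The example above registers a rule file for a single script. Since
`rule_files` takes a comma-separated list of `code:rulefile` pairs, a
configuration covering two scripts might look like the sketch below. It
assumes the same `KeywordTokenizer.rbbi` file is reused for both the `Latn`
and `Cyrl` scripts, and the tokenizer name `icu_two_scripts` is hypothetical:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_two_scripts": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi,Cyrl:KeywordTokenizer.rbbi"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
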
[[analysis-icu-normalization]]
==== ICU Normalization Token Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It registers
itself as the `icu_normalizer` token filter, which is available to all indices
without any further configuration. The type of normalization can be specified
with the `name` parameter, which accepts `nfc`, `nfkc`, and `nfkc_cf`
(default).

Which letters are normalized can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

You should probably prefer the <<analysis-icu-normalization-charfilter,Normalization character filter>>.

Here are two examples, the default usage and a customised token filter:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfc_normalizer` token filter, which is set to use `nfc` normalization.

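
A quick way to try a normalization form without creating an index is to pass
an inline filter definition to the `_analyze` API. The request below is a
sketch that assumes the plugin is installed on the node handling the request.
It uses the `keyword` tokenizer so that the whole input stays a single token.
With `nfkc`, the ligature `ﬁ` (U+FB01) in the sample text is expected to be
expanded to the two letters `fi`:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "icu_normalizer",
      "name": "nfkc"
    }
  ],
  "text": "ﬁnance"
}
--------------------------------------------------
// CONSOLE
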
[[analysis-icu-folding]]
==== ICU Folding Token Filter

Case folding of Unicode characters based on `UTR#30`, like the
{ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter]
on steroids. It registers itself as the `icu_folding` token filter and is
available to all indices:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

The ICU folding token filter already does Unicode normalization, so there is
no need to use the normalization character filter or token filter as well.

Which letters are folded can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

The following example exempts Swedish characters from folding. It is important
to note that both upper and lowercase forms should be specified, and that
these filtered characters are not lowercased, which is why we add the
`lowercase` filter as well:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

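
Assuming the `icu_sample` index with the `swedish_analyzer` above has been
created, the effect of the exemption can be checked with an `_analyze`
request such as the sketch below. The sample text is arbitrary. The letters
`ä`, `ö` and `å` should survive in the first token, while the `É` in the
second word, which is not in the exempted set, should be folded:

[source,js]
--------------------------------------------------
GET icu_sample/_analyze
{
  "analyzer": "swedish_analyzer",
  "text": "Räksmörgås Émile"
}
--------------------------------------------------
// CONSOLE
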
[[analysis-icu-collation]]
==== ICU Collation Token Filter

[WARNING]
======
This token filter has been deprecated since Lucene 5.0. Please use
<<analysis-icu-collation-keyword-field, ICU Collation Keyword Field>>.
======

[[analysis-icu-collation-keyword-field]]
==== ICU Collation Keyword Field

Collations are used for sorting documents in a language-specific word order.
The `icu_collation_keyword` field type is available to all indices and will encode
the terms directly as bytes in a doc values field and a single indexed token just
like a standard {ref}/keyword.html[Keyword Field].

Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation],
which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in
``phonebook'' order:

[source,js]
--------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "name": { <1>
        "type": "text",
        "fields": {
          "sort": { <2>
            "type": "icu_collation_keyword",
            "index": false,
            "language": "de",
            "country": "DE",
            "variant": "@collation=phonebook"
          }
        }
      }
    }
  }
}

GET _search <3>
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}

--------------------------
// CONSOLE

<1> The `name` field uses the `standard` analyzer, and so supports full-text queries.
<2> The `name.sort` field is an `icu_collation_keyword` field that will preserve the name as
    a single token doc values, and applies the German ``phonebook'' order.
<3> An example query which searches the `name` field and sorts on the `name.sort` field.

===== Parameters for ICU Collation Keyword Fields

The following parameters are accepted by `icu_collation_keyword` fields:

[horizontal]

`doc_values`::

    Should the field be stored on disk in a column-stride fashion, so that it
    can later be used for sorting, aggregations, or scripting? Accepts `true`
    (default) or `false`.

`index`::

    Should the field be searchable? Accepts `true` (default) or `false`.

`null_value`::

    Accepts a string value which is substituted for any explicit `null`
    values. Defaults to `null`, which means the field is treated as missing.

{ref}/ignore-above.html[`ignore_above`]::

    Strings longer than the `ignore_above` setting will be ignored.
    Checking is performed on the original string before the collation.
    The `ignore_above` setting can be updated on existing fields
    using the {ref}/indices-put-mapping.html[PUT mapping API].
    By default, there is no limit and all values will be indexed.

`store`::

    Whether the field value should be stored and retrievable separately from
    the {ref}/mapping-source-field.html[`_source`] field. Accepts `true` or `false`
    (default).

`fields`::

    Multi-fields allow the same string value to be indexed in multiple ways for
    different purposes, such as one field for search and a multi-field for
    sorting and aggregations.

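
As an illustration of how some of these parameters combine, the mapping
sketch below defines a sort-only field that skips long values and substitutes
a placeholder for explicit `null` values. The field name `title_sort`, the
limit of `256` and the placeholder string `NULL` are all hypothetical choices
made for this example:

[source,js]
--------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "title_sort": {
        "type": "icu_collation_keyword",
        "index": false,
        "ignore_above": 256,
        "null_value": "NULL"
      }
    }
  }
}
--------------------------
// CONSOLE
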
===== Collation options

`strength`::

The strength property determines the minimum level of difference considered
significant during comparison. Possible values are: `primary`, `secondary`,
`tertiary`, `quaternary` or `identical`. See the
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
for a more detailed explanation for each value. Defaults to `tertiary`
unless otherwise specified in the collation.

`decomposition`::

Possible values: `no` (default, but collation-dependent) or `canonical`.
Setting this decomposition property to `canonical` allows the Collator to
handle unnormalized text properly, producing the same results as if the text
were normalized. If `no` is set, it is the user's responsibility to ensure
that all text is already in the appropriate form before a comparison or before
getting a CollationKey. Adjusting decomposition mode allows the user to select
between faster and more complete collation behavior. Since a great many of the
world's languages do not require text normalization, most locales set `no` as
the default decomposition mode.

The following options are expert only:

`alternate`::

Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for
strength `quaternary` to be either shifted or non-ignorable, which boils down
to ignoring punctuation and whitespace.

`case_level`::

Possible values: `true` or `false` (default). Whether case level sorting is
required. When strength is set to `primary` this will ignore accent
differences.

`case_first`::

Possible values: `lower` or `upper`. Useful to control which case is sorted
first when case is not ignored for strength `tertiary`. The default depends on
the collation.

`numeric`::

Possible values: `true` or `false` (default). Whether digits are sorted
according to their numeric representation. For example, the value `egg-9` is
sorted before the value `egg-21`.

`variable_top`::

Single character or contraction. Controls what is variable for `alternate`.

`hiragana_quaternary_mode`::

Possible values: `true` or `false`. Whether to distinguish between Katakana and
Hiragana characters in `quaternary` strength.

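
The collation options above are set directly on the field mapping. The sketch
below combines `strength` and `numeric` so that values such as `egg-9` and
`egg-21` would sort by the numeric value of their digits, with case and accent
differences ignored at `primary` strength. The field name `part_number` and
the choice of `en` as the language are assumptions made for this example:

[source,js]
--------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "part_number": {
        "type": "icu_collation_keyword",
        "index": false,
        "language": "en",
        "strength": "primary",
        "numeric": true
      }
    }
  }
}
--------------------------
// CONSOLE
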
[[analysis-icu-transform]]
==== ICU Transform Token Filter

Transforms are used to process Unicode text in many different ways, such as
case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the `id` parameter
(defaults to `Null`), and specify text direction with the `dir` parameter
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
rulesets are not yet supported.

For example:

[source,js]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" <1>
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "你好" <2>
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "здравствуйте" <3>
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "こんにちは" <4>
}

--------------------------------------------------
// CONSOLE

<1> This transform transliterates characters to Latin, separates accents
    from their base characters, removes the accents, and then puts the
    remaining text into an unaccented form.

<2> Returns `ni hao`.
<3> Returns `zdravstvujte`.
<4> Returns `kon'nichiha`.

For more documentation, please see the http://userguide.icu-project.org/transforms/general[user guide of ICU Transform].

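
Other compound transform IDs can be tried in the same way without creating an
index, by passing an inline filter definition to the `_analyze` API. The
request below is a sketch that assumes the plugin is installed on the node
handling the request. The ID `Any-Latin; Latin-ASCII; Lower` is an
illustrative choice built from standard ICU transform names, and the Greek
sample text is arbitrary:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "icu_transform",
      "id": "Any-Latin; Latin-ASCII; Lower"
    }
  ],
  "text": "Ελληνικά"
}
--------------------------------------------------
// CONSOLE
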