[[analysis-ngram-tokenizer]]
=== NGram Tokenizer

A tokenizer of type `nGram`.

The following are settings that can be set for a `nGram` tokenizer type:

[cols="<,<,<",options="header",]
|=======================================================================
|Setting |Description |Default value
|`min_gram` |Minimum size in codepoints of a single n-gram |`1`
|`max_gram` |Maximum size in codepoints of a single n-gram |`2`
|`token_chars` |Character classes to keep in the
tokens. Elasticsearch will split on characters that don't belong to any
of these classes. |`[]` (keep all characters)
|=======================================================================
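
With the defaults above (`min_gram` of `1`, `max_gram` of `2` and no `token_chars` restriction), the tokenizer emits every one- and two-character substring of the input. As a minimal sketch, the built-in `nGram` tokenizer can be exercised directly through the `_analyze` API, assuming a node listening on `localhost:9200`:

[source,js]
--------------------------------------------------
curl 'localhost:9200/_analyze?pretty=1&tokenizer=nGram' -d 'quick'
# q, qu, u, ui, i, ic, c, ck, k
--------------------------------------------------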

`token_chars` accepts the following character classes:

[horizontal]
`letter`:: for example `a`, `b`, `ï` or `京`
`digit`:: for example `3` or `7`
`whitespace`:: for example `" "` or `"\n"`
`punctuation`:: for example `!` or `"`
`symbol`:: for example `$` or `√`
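
Any character that does not match one of the configured classes acts as a separator. As a hedged sketch, a tokenizer that keeps only the `digit` class splits `ab12cd345` on the letters and n-grams only the digit runs (the index and analyzer names here are made up for illustration, and the token list follows from these settings):

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/digits' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "digit_ngram_analyzer" : {
                    "tokenizer" : "digit_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "digit_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "2",
                    "max_gram" : "3",
                    "token_chars" : [ "digit" ]
                }
            }
        }
    }
}'

curl 'localhost:9200/digits/_analyze?pretty=1&analyzer=digit_ngram_analyzer' -d 'ab12cd345'
# 12, 34, 345, 45
--------------------------------------------------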

[float]
==== Example

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "2",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit" ]
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
--------------------------------------------------
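
The spaces in `FC Schalke 04` match neither the `letter` nor the `digit` class, so the input is first split into `FC`, `Schalke` and `04` and each fragment is n-grammed separately. The same mechanism silently drops fragments shorter than `min_gram`; a hedged sketch against the same `test` index (the expected output follows from the settings above):

[source,js]
--------------------------------------------------
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'A-B 12'
# A and B are split off by the hyphen and the space, but are shorter
# than min_gram (2) and therefore produce no tokens:
# 12
--------------------------------------------------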