[[analysis-ngram-tokenizer]]
=== N-gram tokenizer
++++
<titleabbrev>N-gram</titleabbrev>
++++

The `ngram` tokenizer first breaks text down into words whenever it encounters
one of a list of specified characters, then it emits
{wikipedia}/N-gram[N-grams] of each word of the specified length.

N-grams are like a sliding window that moves across the word - a continuous
sequence of characters of the specified length. They are useful for querying
languages that don't use spaces or that have long compound words, like German.

[discrete]
=== Example output

With the default settings, the `ngram` tokenizer treats the initial text as a
single token and produces N-grams with minimum length `1` and maximum length
`2`:

[source,console]
----------------------------
POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "Q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "Qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "u",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "ui",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "i",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "ic",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    {
      "token": "c",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "ck",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 7
    },
    {
      "token": "k",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 8
    },
    {
      "token": "k ",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 9
    },
    {
      "token": " ",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 10
    },
    {
      "token": " F",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 11
    },
    {
      "token": "F",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 12
    },
    {
      "token": "Fo",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 13
    },
    {
      "token": "o",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 14
    },
    {
      "token": "ox",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 15
    },
    {
      "token": "x",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 16
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
----------------------------
[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
----------------------------

[discrete]
=== Configuration

The `ngram` tokenizer accepts the following parameters:

[horizontal]
`min_gram`::
    Minimum length of characters in a gram.  Defaults to `1`.

`max_gram`::
    Maximum length of characters in a gram.  Defaults to `2`.

`token_chars`::

    Character classes that should be included in a token.  Elasticsearch
    will split on characters that don't belong to the classes specified.
    Defaults to `[]` (keep all characters).
+
Character classes may be any of the following:
+
* `letter` --      for example `a`, `b`, `ï` or `京`
* `digit` --       for example `3` or `7`
* `whitespace` --  for example `" "` or `"\n"`
* `punctuation` -- for example `!` or `"`
* `symbol` --      for example `$` or `√`
* `custom` --      custom characters which need to be set using the
`custom_token_chars` setting.

`custom_token_chars`::

    Custom characters that should be treated as part of a token. For example,
    setting this to `+-_` will make the tokenizer treat the plus, minus and
    underscore sign as part of a token.

TIP: It usually makes sense to set `min_gram` and `max_gram` to the same
value.  The smaller the length, the more documents will match but the lower
the quality of the matches.  The longer the length, the more specific the
matches.  A tri-gram (length `3`) is a good place to start.

The index level setting `index.max_ngram_diff` controls the maximum allowed
difference between `max_gram` and `min_gram`.

[discrete]
=== Example configuration

In this example, we configure the `ngram` tokenizer to treat letters and
digits as tokens, and to produce tri-grams (grams of length `3`):

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "Qui",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "uic",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ick",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "Fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 3
    },
    {
      "token": "oxe",
      "start_offset": 9,
      "end_offset": 12,
      "type": "word",
      "position": 4
    },
    {
      "token": "xes",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
----------------------------
[ Qui, uic, ick, Fox, oxe, xes ]
----------------------------
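
If `min_gram` and `max_gram` need to differ by more than `1`, the index level
setting `index.max_ngram_diff` mentioned above (which defaults to `1`) must be
raised at index creation time. A sketch of how that might look; the index name
`my-index-000002` and the gram lengths `2` and `4` are purely illustrative:

[source,console]
----------------------------
PUT my-index-000002
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "analyzer": {
        "my_wide_analyzer": {
          "tokenizer": "my_wide_tokenizer"
        }
      },
      "tokenizer": {
        "my_wide_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      }
    }
  }
}
----------------------------

Each word then emits grams of length `2`, `3` and `4`, which broadens matching
but also grows the index.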