[[analysis-tokenizers]]
== Tokenizers

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term (used for phrase and word proximity queries) and the start and end
_character offsets_ of the original word which the term represents (used for
highlighting search snippets).

Elasticsearch has a number of built in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.

[float]
=== Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into
individual words:

<<analysis-standard-tokenizer,Standard Tokenizer>>::

The `standard` tokenizer divides text into terms on word boundaries, as
defined by the Unicode Text Segmentation algorithm. It removes most
punctuation symbols. It is the best choice for most languages.

<<analysis-letter-tokenizer,Letter Tokenizer>>::

The `letter` tokenizer divides text into terms whenever it encounters a
character which is not a letter.

<<analysis-lowercase-tokenizer,Lowercase Tokenizer>>::

The `lowercase` tokenizer, like the `letter` tokenizer, divides text into
terms whenever it encounters a character which is not a letter, but it also
lowercases all terms.

<<analysis-whitespace-tokenizer,Whitespace Tokenizer>>::

The `whitespace` tokenizer divides text into terms whenever it encounters any
whitespace character.

<<analysis-uaxurlemail-tokenizer,UAX URL Email Tokenizer>>::

The `uax_url_email` tokenizer is like the `standard` tokenizer except that it
recognises URLs and email addresses as single tokens.

<<analysis-classic-tokenizer,Classic Tokenizer>>::

The `classic` tokenizer is a grammar based tokenizer for the English Language.

<<analysis-thai-tokenizer,Thai Tokenizer>>::

The `thai` tokenizer segments Thai text into words.

[float]
=== Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word
matching:

<<analysis-ngram-tokenizer,N-Gram Tokenizer>>::

The `ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word: a sliding window of continuous letters, e.g. `quick` ->
`[qu, ui, ic, ck]`.

<<analysis-edgengram-tokenizer,Edge N-Gram Tokenizer>>::

The `edge_ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word which are anchored to the start of the word, e.g. `quick` ->
`[q, qu, qui, quic, quick]`.

[float]
=== Structured Text Tokenizers

The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:

<<analysis-keyword-tokenizer,Keyword Tokenizer>>::

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

<<analysis-pattern-tokenizer,Pattern Tokenizer>>::

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

<<analysis-simplepattern-tokenizer,Simple Pattern Tokenizer>>::

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. It uses a restricted subset of regular expression features
and is generally faster than the `pattern` tokenizer.

<<analysis-chargroup-tokenizer,Char Group Tokenizer>>::

The `char_group` tokenizer is configurable through sets of characters to split
on, which is usually less expensive than running regular expressions.

<<analysis-simplepatternsplit-tokenizer,Simple Pattern Split Tokenizer>>::

The `simple_pattern_split` tokenizer uses the same restricted regular expression
subset as the `simple_pattern` tokenizer, but splits the input at matches rather
than returning the matches as terms.

<<analysis-pathhierarchy-tokenizer,Path Tokenizer>>::

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree, e.g. `/foo/bar/baz` -> `[/foo, /foo/bar, /foo/bar/baz ]`.

include::tokenizers/standard-tokenizer.asciidoc[]

include::tokenizers/letter-tokenizer.asciidoc[]

include::tokenizers/lowercase-tokenizer.asciidoc[]

include::tokenizers/whitespace-tokenizer.asciidoc[]

include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

include::tokenizers/classic-tokenizer.asciidoc[]

include::tokenizers/thai-tokenizer.asciidoc[]

include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/edgengram-tokenizer.asciidoc[]

include::tokenizers/keyword-tokenizer.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/chargroup-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]

include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]
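The token examples given in the tokenizer overview (`"Quick brown fox!"`, `quick`, and `/foo/bar/baz`) can be reproduced with a few lines of plain Python. This is only an illustrative sketch of the described behaviour, not how Elasticsearch implements these tokenizers. The function names are hypothetical, and the gram lengths are chosen to match the examples in the overview rather than any Elasticsearch defaults.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace, recording each token's position and character offsets."""
    return [
        {
            "token": match.group(),
            "position": position,
            "start_offset": match.start(),
            "end_offset": match.end(),
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


def ngrams(word, min_gram=2, max_gram=2):
    """Sliding window of contiguous characters over a single word."""
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


def edge_ngrams(word, min_gram=1, max_gram=5):
    """N-grams anchored to the start of the word."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


def path_hierarchy(path, delimiter="/"):
    """Emit one term per level of a hierarchical path."""
    parts = path.split(delimiter)
    terms = []
    current = parts[0]
    if current:
        # Relative paths start with a real component rather than an empty string.
        terms.append(current)
    for part in parts[1:]:
        current = current + delimiter + part
        terms.append(current)
    return terms


if __name__ == "__main__":
    for token in whitespace_tokens("Quick brown fox!"):
        print(token)
    print(ngrams("quick"))
    print(edge_ngrams("quick"))
    print(path_hierarchy("/foo/bar/baz"))
```

The `whitespace_tokens` helper returns positions and offsets alongside each token, which is the extra information a tokenizer records for phrase queries and highlighting.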