[[analysis-tokenizers]]
== Tokenizer reference

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.
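For illustration, a tokenizer can be tried out directly with the `_analyze`
API, without creating an index first. A request along the following lines
applies the `whitespace` tokenizer to the text above:

[source,console]
----
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
----

The response should list the terms `[Quick, brown, fox!]`, along with the
position and offset information recorded for each term.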
The tokenizer is also responsible for recording the following:

* Order or _position_ of each term (used for phrase and word proximity queries).

* Start and end _character offsets_ of the original word which the term
represents (used for highlighting search snippets).

* _Token type_, a classification of each term produced, such as `<ALPHANUM>`,
`<HANGUL>`, or `<NUM>`. Simpler analyzers only produce the `word` token type.
Elasticsearch has a number of built-in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.

[float]
=== Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into
individual words:
<<analysis-standard-tokenizer,Standard Tokenizer>>::

The `standard` tokenizer divides text into terms on word boundaries, as
defined by the Unicode Text Segmentation algorithm. It removes most
punctuation symbols. It is the best choice for most languages. (See the
example following this list.)

<<analysis-letter-tokenizer,Letter Tokenizer>>::

The `letter` tokenizer divides text into terms whenever it encounters a
character which is not a letter.

<<analysis-lowercase-tokenizer,Lowercase Tokenizer>>::

The `lowercase` tokenizer, like the `letter` tokenizer, divides text into
terms whenever it encounters a character which is not a letter, but it also
lowercases all terms.

<<analysis-whitespace-tokenizer,Whitespace Tokenizer>>::

The `whitespace` tokenizer divides text into terms whenever it encounters any
whitespace character.

<<analysis-uaxurlemail-tokenizer,UAX URL Email Tokenizer>>::

The `uax_url_email` tokenizer is like the `standard` tokenizer except that it
recognises URLs and email addresses as single tokens.

<<analysis-classic-tokenizer,Classic Tokenizer>>::

The `classic` tokenizer is a grammar-based tokenizer for the English language.

<<analysis-thai-tokenizer,Thai Tokenizer>>::

The `thai` tokenizer segments Thai text into words.
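As an illustrative sketch, running the `standard` tokenizer over the same
sample text as in the introduction shows how it differs from the
`whitespace` tokenizer:

[source,console]
----
POST _analyze
{
  "tokenizer": "standard",
  "text": "Quick brown fox!"
}
----

Because the `standard` tokenizer strips most punctuation, the response should
contain the terms `[Quick, brown, fox]` rather than `[Quick, brown, fox!]`.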
[float]
=== Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word
matching:

<<analysis-ngram-tokenizer,N-Gram Tokenizer>>::

The `ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word: a sliding window of contiguous letters, e.g. `quick` ->
`[qu, ui, ic, ck]`. (The configuration behind this example is sketched after
this list.)

<<analysis-edgengram-tokenizer,Edge N-Gram Tokenizer>>::

The `edge_ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word which are anchored to the start of the word, e.g. `quick` ->
`[q, qu, qui, quic, quick]`.
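The `quick` -> `[qu, ui, ic, ck]` example above assumes an `ngram` tokenizer
with `min_gram` and `max_gram` both set to `2` (the defaults are `1` and `2`).
One way to try this out, assuming the `_analyze` API is given an inline
tokenizer definition rather than a tokenizer name, is a request along these
lines:

[source,console]
----
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2
  },
  "text": "quick"
}
----

The response should contain the terms `[qu, ui, ic, ck]`. For use in an index,
the same settings would normally be applied by configuring the tokenizer as
part of a <<analysis-custom-analyzer,custom analyzer>> in the index settings.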
[float]
=== Structured Text Tokenizers

The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:

<<analysis-keyword-tokenizer,Keyword Tokenizer>>::

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

<<analysis-pattern-tokenizer,Pattern Tokenizer>>::

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

<<analysis-simplepattern-tokenizer,Simple Pattern Tokenizer>>::

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. It uses a restricted subset of regular expression features
and is generally faster than the `pattern` tokenizer.

<<analysis-chargroup-tokenizer,Char Group Tokenizer>>::

The `char_group` tokenizer is configurable through sets of characters to split
on, which is usually less expensive than running regular expressions.

<<analysis-simplepatternsplit-tokenizer,Simple Pattern Split Tokenizer>>::

The `simple_pattern_split` tokenizer uses the same restricted regular expression
subset as the `simple_pattern` tokenizer, but splits the input at matches rather
than returning the matches as terms.

<<analysis-pathhierarchy-tokenizer,Path Hierarchy Tokenizer>>::

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree, e.g. `/foo/bar/baz` -> `[/foo, /foo/bar, /foo/bar/baz]`.
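As a final illustration, the hierarchical expansion performed by the
`path_hierarchy` tokenizer can be reproduced with a request like the
following:

[source,console]
----
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/foo/bar/baz"
}
----

The response should contain the terms `[/foo, /foo/bar, /foo/bar/baz]`, which
makes it possible to match a document by any prefix of its path.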
include::tokenizers/chargroup-tokenizer.asciidoc[]

include::tokenizers/classic-tokenizer.asciidoc[]

include::tokenizers/edgengram-tokenizer.asciidoc[]

include::tokenizers/keyword-tokenizer.asciidoc[]

include::tokenizers/letter-tokenizer.asciidoc[]

include::tokenizers/lowercase-tokenizer.asciidoc[]

include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer-examples.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]

include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]

include::tokenizers/standard-tokenizer.asciidoc[]

include::tokenizers/thai-tokenizer.asciidoc[]

include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

include::tokenizers/whitespace-tokenizer.asciidoc[]