[[analysis-tokenizers]]
== Tokenizers

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term (used for phrase and word proximity queries) and the start and end
_character offsets_ of the original word which the term represents (used for
highlighting search snippets).
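For the `whitespace` request above, this information appears on each token in
the response. An abridged sketch of the response (offsets and positions shown
for illustration):

[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "Quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "fox!",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 2
    }
  ]
}
--------------------------------------------------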

Elasticsearch has a number of built-in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.

[float]
=== Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into
individual words:
  18. <<analysis-standard-tokenizer,Standard Tokenizer>>::
  19. The `standard` tokenizer divides text into terms on word boundaries, as
  20. defined by the Unicode Text Segmentation algorithm. It removes most
  21. punctuation symbols. It is the best choice for most languages.
  22. <<analysis-letter-tokenizer,Letter Tokenizer>>::
  23. The `letter` tokenizer divides text into terms whenever it encounters a
  24. character which is not a letter.
  25. <<analysis-lowercase-tokenizer,Lowercase Tokenizer>>::
  26. The `lowercase` tokenizer, like the `letter` tokenizer, divides text into
  27. terms whenever it encounters a character which is not a letter, but it also
  28. lowercases all terms.
  29. <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>::
  30. The `whitespace` tokenizer divides text into terms whenever it encounters any
  31. whitespace character.
  32. <<analysis-uaxurlemail-tokenizer,UAX URL Email Tokenizer>>::
  33. The `uax_url_email` tokenizer is like the `standard` tokenizer except that it
  34. recognises URLs and email addresses as single tokens.
  35. <<analysis-classic-tokenizer,Classic Tokenizer>>::
  36. The `classic` tokenizer is a grammar based tokenizer for the English Language.
  37. <<analysis-thai-tokenizer,Thai Tokenizer>>::
  38. The `thai` tokenizer segments Thai text into words.
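
For comparison with the `whitespace` example earlier, this minimal sketch runs
the `standard` tokenizer over the same text; because it strips most
punctuation, it produces the terms `[Quick, brown, fox]`:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "text": "Quick brown fox!"
}
--------------------------------------------------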

[float]
=== Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word
matching:
  43. <<analysis-ngram-tokenizer,N-Gram Tokenizer>>::
  44. The `ngram` tokenizer can break up text into words when it encounters any of
  45. a list of specified characters (e.g. whitespace or punctuation), then it returns
  46. n-grams of each word: a sliding window of continuous letters, e.g. `quick` ->
  47. `[qu, ui, ic, ck]`.
  48. <<analysis-edgengram-tokenizer,Edge N-Gram Tokenizer>>::
  49. The `edge_ngram` tokenizer can break up text into words when it encounters any of
  50. a list of specified characters (e.g. whitespace or punctuation), then it returns
  51. n-grams of each word which are anchored to the start of the word, e.g. `quick` ->
  52. `[q, qu, qui, quic, quick]`.
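
As a sketch of partial word tokenization, the request below defines an
`edge_ngram` tokenizer inline in the `_analyze` request (the `min_gram` and
`max_gram` values are chosen purely for illustration) and should produce
`[q, qu, qui, quic, quick]` for the word `quick`:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 5,
    "token_chars": [ "letter" ]
  },
  "text": "quick"
}
--------------------------------------------------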

[float]
=== Structured Text Tokenizers

The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:
  58. <<analysis-keyword-tokenizer,Keyword Tokenizer>>::
  59. The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
  60. is given and outputs the exact same text as a single term. It can be combined
  61. with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
  62. normalise the analysed terms.
  63. <<analysis-pattern-tokenizer,Pattern Tokenizer>>::
  64. The `pattern` tokenizer uses a regular expression to either split text into
  65. terms whenever it matches a word separator, or to capture matching text as
  66. terms.
  67. <<analysis-simplepattern-tokenizer,Simple Pattern Tokenizer>>::
  68. The `simple_pattern` tokenizer uses a regular expression to capture matching
  69. text as terms. It uses a restricted subset of regular expression features
  70. and is generally faster than the `pattern` tokenizer.
  71. <<analysis-chargroup-tokenizer,Char Group Tokenizer>>::
  72. The `char_group` tokenizer is configurable through sets of characters to split
  73. on, which is usually less expensive than running regular expressions.
  74. <<analysis-simplepatternsplit-tokenizer,Simple Pattern Split Tokenizer>>::
  75. The `simple_pattern_split` tokenizer uses the same restricted regular expression
  76. subset as the `simple_pattern` tokenizer, but splits the input at matches rather
  77. than returning the matches as terms.
  78. <<analysis-pathhierarchy-tokenizer,Path Tokenizer>>::
  79. The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
  80. path, splits on the path separator, and emits a term for each component in the
  81. tree, e.g. `/foo/bar/baz` -> `[/foo, /foo/bar, /foo/bar/baz ]`.
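
For instance, this minimal sketch runs the `path_hierarchy` tokenizer (with
its default `/` delimiter) over a filesystem-style path and should return the
terms `[/foo, /foo/bar, /foo/bar/baz]`:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/foo/bar/baz"
}
--------------------------------------------------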

include::tokenizers/standard-tokenizer.asciidoc[]

include::tokenizers/letter-tokenizer.asciidoc[]

include::tokenizers/lowercase-tokenizer.asciidoc[]

include::tokenizers/whitespace-tokenizer.asciidoc[]

include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

include::tokenizers/classic-tokenizer.asciidoc[]

include::tokenizers/thai-tokenizer.asciidoc[]

include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/edgengram-tokenizer.asciidoc[]

include::tokenizers/keyword-tokenizer.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/chargroup-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]

include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer-examples.asciidoc[]