[[analyzer-anatomy]]
=== Anatomy of an analyzer

An _analyzer_ -- whether built-in or custom -- is just a package which
contains three lower-level building blocks: _character filters_,
_tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
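
You can watch an analyzer at work by running text through the `_analyze`
API. The request below is a minimal sketch: the built-in `standard` analyzer
is just one choice, and the sample text is purely illustrative.

[source,console]
----
# Try the built-in standard analyzer on some illustrative text
POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes."
}
----

The response lists the tokens the analyzer produced (`the`, `quick`,
`brown`, `foxes`), each with its position and character offsets.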
==== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.
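
A character filter can be tried on its own with the `_analyze` API. In this
sketch the built-in `html_strip` character filter removes the markup, and the
`keyword` tokenizer is used only so that the filtered text comes back as a
single token:

[source,console]
----
# Strip HTML markup before tokenization
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "keyword",
  "text": "<b>Quick</b> brown fox"
}
----

The response contains the single token `Quick brown fox`, with the `<b>`
tags already stripped before tokenization.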
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
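
The `whitespace` example above can be reproduced directly, again with the
`_analyze` API:

[source,console]
----
# Tokenize on whitespace only
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
----

The response shows the terms `Quick`, `brown`, and `fox!`, and for each one
the `position` and the `start_offset`/`end_offset` that the tokenizer
recorded.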
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.

Token filters are not allowed to change the position or character offsets of
each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
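
Token filters can also be combined with a tokenizer in an ad-hoc `_analyze`
request. This sketch chains the `lowercase` and `stop` filters, in that
order, behind the `whitespace` tokenizer:

[source,console]
----
# Filters run in the order listed: lowercase first, then stop
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "stop" ],
  "text": "The QUICK brown foxes"
}
----

`The` is first lowercased and then removed as a stop word, leaving the tokens
`quick`, `brown`, and `foxes` with their original positions and offsets
intact.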
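
Finally, the same building blocks can be packaged up as a new `custom`
analyzer in the index settings. In the sketch below, the index name
`my-index` and the analyzer name `my_custom_analyzer` are placeholders; the
components are the built-in ones used above:

[source,console]
----
# "my-index" and "my_custom_analyzer" are placeholder names
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}
----

Mapping a text field with `"analyzer": "my_custom_analyzer"` then runs every
indexed value through exactly this pipeline: an optional set of character
filters, exactly one tokenizer, and an optional chain of token filters.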