[[analyzer-anatomy]]
=== Anatomy of an analyzer

An _analyzer_ -- whether built-in or custom -- is just a package which
contains three lower-level building blocks: _character filters_,
_tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
[[analyzer-anatomy-character-filters]]
==== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.
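
As a sketch, the effect of the built-in `html_strip` character filter can be
seen with the `_analyze` API (a `keyword` tokenizer is used here so that the
filtered text passes through unbroken):

[source,console]
----
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer":   "keyword",
  "text":        "<b>Quick</b> brown fox"
}
----

The `<b>` and `</b>` tags are removed from the character stream before the
tokenizer ever sees them.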
[[analyzer-anatomy-tokenizer]]
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
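
The `whitespace` example above can be tried directly with the `_analyze` API:

[source,console]
----
POST _analyze
{
  "tokenizer": "whitespace",
  "text":      "Quick brown fox!"
}
----

The response lists each token along with its `position` and its
`start_offset` and `end_offset` into the original text.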
[[analyzer-anatomy-token-filters]]
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.

Token filters are not allowed to change the position or character offsets of
each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
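
As a sketch, chaining the `lowercase` and `stop` token filters after a
`whitespace` tokenizer in an `_analyze` request looks like this:

[source,console]
----
POST _analyze
{
  "tokenizer": "whitespace",
  "filter":    [ "lowercase", "stop" ],
  "text":      "The QUICK brown fox"
}
----

Because the filters run in order, `The` is first lowercased to `the` and then
removed as a stop word, leaving the terms `[quick, brown, fox]`.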