[[analyzer-anatomy]]
== Anatomy of an analyzer

An _analyzer_ -- whether built-in or custom -- is just a package which
contains three lower-level building blocks: _character filters_,
_tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
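
The `_analyze` API is a convenient way to see what an analyzer produces. As a
minimal sketch, the following request runs the built-in `standard` analyzer
over a short piece of sample text (the text itself is illustrative):

[source,console]
----
POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes!"
}
----

The response lists the resulting tokens, here `[the, quick, brown, foxes]`,
along with the position and character offsets of each one.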
[float]
=== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Arabic numerals
(٠١٢٣٤٥٦٧٨٩) into their Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.
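
A character filter can be tried out on its own with the `_analyze` API. A
sketch using the `html_strip` character filter, paired with the `keyword`
tokenizer so that the filtered text comes back as a single token:

[source,console]
----
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "I'm so <b>happy</b>!"
}
----

The `<b>` tags are removed from the character stream before tokenization,
leaving the single token `I'm so happy!`.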
[float]
=== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of tokens. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
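
The `whitespace` example above can be reproduced directly with the `_analyze`
API:

[source,console]
----
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
----

The response contains the terms `[Quick, brown, fox!]`, each with its
`position` in the token stream and its `start_offset` and `end_offset` in the
original text.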
[float]
=== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.

Token filters are not allowed to change the position or character offsets of
each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
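
Token filters can likewise be tested in combination with a tokenizer. A sketch
using the `standard` tokenizer followed by the `lowercase` and `stop` filters:

[source,console]
----
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "The QUICK brown foxes"
}
----

The filters run in order: `lowercase` first converts `The` to `the`, which the
`stop` filter then removes, leaving the tokens `[quick, brown, foxes]`.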