[[analyzer-anatomy]]
== Anatomy of an analyzer

An _analyzer_ -- whether built-in or custom -- is just a package which
contains three lower-level building blocks: _character filters_,
_tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.

[float]
=== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip
HTML elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.

[float]
=== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.

[float]
=== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens.
For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.

Token filters are not allowed to change the position or character offsets of
each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
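To make the flow of these three building blocks concrete, here is a minimal
Python sketch of the pipeline described above -- zero or more character
filters, exactly one tokenizer, then zero or more token filters, each applied
in order. The function names (`strip_html`, `whitespace_tokenizer`,
`lowercase_filter`, `analyze`) are hypothetical stand-ins, not Elasticsearch
APIs, and for simplicity the offsets refer to the character-filtered text
rather than being mapped back to the original input as Elasticsearch does:

```python
import re

def strip_html(text):
    """Character filter (sketch): remove HTML elements like <b>."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    """Tokenizer (sketch): split on whitespace, recording each token's
    position and start/end character offsets."""
    return [
        {
            "token": match.group(),
            "position": position,
            "start_offset": match.start(),
            "end_offset": match.end(),
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]

def lowercase_filter(tokens):
    """Token filter (sketch): lowercase each token. Positions and
    offsets are left untouched, as token filters may not change them."""
    return [{**t, "token": t["token"].lower()} for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    """Apply zero or more char filters, exactly one tokenizer,
    and zero or more token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>Quick</b> brown fox!",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter],
)
print([t["token"] for t in tokens])  # ['quick', 'brown', 'fox!']
```

The ordering matters: because `strip_html` runs before the tokenizer, the
`<b>` tags never become part of any token, and because `lowercase_filter`
runs after tokenization, it sees whole tokens rather than raw characters.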