[[analyzer-anatomy]]
=== Anatomy of an analyzer

An _analyzer_ -- whether built-in or custom -- is just a package which
contains three lower-level building blocks: _character filters_,
_tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
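
You can watch an analyzer at work by running text through the `_analyze`
API. The request below is a minimal sketch: the built-in `standard` analyzer
is just one choice, and the sample text is purely illustrative.

[source,console]
----
# Try the built-in standard analyzer on some illustrative text
POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes."
}
----

The response lists the tokens the analyzer produced (`the`, `quick`,
`brown`, `foxes`), each with its position and character offsets.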
==== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.
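
A character filter can be tried on its own with the `_analyze` API. In this
sketch the built-in `html_strip` character filter removes the markup, and the
`keyword` tokenizer is used only so that the filtered text comes back as a
single token:

[source,console]
----
# Strip HTML markup before tokenization
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "keyword",
  "text": "<b>Quick</b> brown fox"
}
----

The response contains the single token `Quick brown fox`, with the `<b>`
tags already stripped before tokenization.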
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
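
The `whitespace` example above can be reproduced directly, again with the
`_analyze` API:

[source,console]
----
# Tokenize on whitespace only
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
----

The response shows the terms `Quick`, `brown`, and `fox!`, and for each one
the `position` and the `start_offset`/`end_offset` that the tokenizer
recorded.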
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.

Token filters are not allowed to change the position or character offsets of
each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
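
Token filters can also be combined with a tokenizer in an ad-hoc `_analyze`
request. This sketch chains the `lowercase` and `stop` filters, in that
order, behind the `whitespace` tokenizer:

[source,console]
----
# Filters run in the order listed: lowercase first, then stop
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "stop" ],
  "text": "The QUICK brown foxes"
}
----

`The` is first lowercased and then removed as a stop word, leaving the tokens
`quick`, `brown`, and `foxes` with their original positions and offsets
intact.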
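
Finally, the same building blocks can be packaged up as a new `custom`
analyzer in the index settings. In the sketch below, the index name
`my-index` and the analyzer name `my_custom_analyzer` are placeholders; the
components are the built-in ones used above:

[source,console]
----
# "my-index" and "my_custom_analyzer" are placeholder names
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}
----

Mapping a text field with `"analyzer": "my_custom_analyzer"` then runs every
indexed value through exactly this pipeline: an optional set of character
filters, exactly one tokenizer, and an optional chain of token filters.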