[[analysis-overview]]
== Text analysis overview
++++
<titleabbrev>Overview</titleabbrev>
++++

Text analysis enables {es} to perform full-text search, where the search returns
all _relevant_ results rather than just exact matches.

If you search for `Quick fox jumps`, you probably want the document that
contains `A quick brown fox jumps over the lazy dog`, and you might also want
documents that contain related words like `fast fox` or `foxes leap`.

[discrete]
[[tokenization]]
=== Tokenization

Analysis makes full-text search possible through _tokenization_: breaking a text
down into smaller chunks, called _tokens_. In most cases, these tokens are
individual words.

If you index the phrase `the quick brown fox jumps` as a single string and the
user searches for `quick fox`, it isn't considered a match. However, if you
tokenize the phrase and index each word separately, the terms in the query
string can be looked up individually. This means they can be matched by searches
for `quick fox`, `fox brown`, or other variations.
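
For example, you can see tokenization in action with the `_analyze` API. The
following request is a minimal sketch that runs the `standard` tokenizer on the
example phrase:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "text": "the quick brown fox jumps"
}
----

The response lists the tokens `the`, `quick`, `brown`, `fox`, and `jumps`, each
with its position and character offsets. These individual tokens are what the
index stores and what query terms are matched against.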

[discrete]
[[normalization]]
=== Normalization

Tokenization enables matching on individual terms, but each token is still
matched literally. This means:

* A search for `Quick` would not match `quick`, even though you likely want
either term to match the other.
* Although `fox` and `foxes` share the same root word, a search for `foxes`
would not match `fox` or vice versa.
* A search for `jumps` would not match `leaps`. While they don't share a root
word, they are synonyms and have a similar meaning.

To solve these problems, text analysis can _normalize_ these tokens into a
standard format. This allows you to match tokens that are not exactly the same
as the search terms, but similar enough to still be relevant. For example:

* `Quick` can be lowercased: `quick`.
* `foxes` can be _stemmed_, or reduced to its root word: `fox`.
* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.

To ensure search terms match these words as intended, you can apply the same
tokenization and normalization rules to the query string. For example, a search
for `Foxes leap` can be normalized to a search for `fox jump`.
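
For example, you can sketch this normalization with the `_analyze` API by
chaining the `lowercase` and `stemmer` token filters with an inline `synonym`
filter. The `leap => jump` rule below is purely illustrative; real synonym sets
are typically larger and maintained separately:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "english" },
    { "type": "synonym", "synonyms": [ "leap => jump" ] }
  ],
  "text": "Foxes leap"
}
----

The response contains the tokens `fox` and `jump`. A query string analyzed with
the same chain produces the same tokens, so a search for `Foxes leap` matches
documents indexed as `fox jump`.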

[discrete]
[[analysis-customization]]
=== Customize text analysis

Text analysis is performed by an <<analyzer-anatomy,_analyzer_>>, a set of rules
that govern the entire process.

{es} includes a default analyzer, called the
<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
cases right out of the box.

If you want to tailor your search experience, you can choose a different
<<analysis-analyzers,built-in analyzer>> or even
<<analysis-custom-analyzer,configure a custom one>>. A custom analyzer gives you
control over each step of the analysis process, including:

* Changes to the text _before_ tokenization
* How text is converted to tokens
* Normalization changes made to tokens before indexing or search
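
For example, the following sketch defines a custom analyzer on a hypothetical
index named `my-index`. Each setting corresponds to one of the steps above:

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ], <1>
          "tokenizer": "standard", <2>
          "filter": [ "lowercase", "stemmer" ] <3>
        }
      }
    }
  }
}
----
<1> Changes the text _before_ tokenization, in this case by stripping HTML markup.
<2> Converts the text to tokens.
<3> Normalizes the tokens before indexing or search.

The analyzer name `my_custom_analyzer` is arbitrary. Once the index exists, you
can reference the analyzer in a field mapping, or test it by passing
`"analyzer": "my_custom_analyzer"` to an `_analyze` request against the index.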