@@ -0,0 +1,78 @@
+
+== Text analysis overview
+++++
+<titleabbrev>Overview</titleabbrev>
+++++
+
+Text analysis enables {es} to perform full-text search, where the search returns
+all _relevant_ results rather than just exact matches.
+
+If you search for `Quick fox jumps`, you probably want the document that
+contains `A quick brown fox jumps over the lazy dog`, and you might also want
+documents that contain related words like `fast fox` or `foxes leap`.
+
+[discrete]
+[[tokenization]]
+=== Tokenization
+
+Analysis makes full-text search possible through _tokenization_: breaking a text
+down into smaller chunks, called _tokens_. In most cases, these tokens are
+individual words.
+
+If you index the phrase `the quick brown fox jumps` as a single string and the
+user searches for `quick fox`, it isn't considered a match. However, if you
+tokenize the phrase and index each word separately, the terms in the query
+string can be looked up individually. This means the phrase can be matched by searches
+for `quick fox`, `fox brown`, or other variations.
+
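The effect of tokenization described above can be sketched with a toy inverted index. This is a minimal illustration in plain Python, not how {es} implements indexing: the `tokenize`, `build_index`, and `search` helpers are hypothetical names, the word-splitting regex is a simplification of a real tokenizer, and requiring every query term to be present is an assumption made to keep the example short.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into word tokens (a simplification of a real tokenizer)."""
    return re.findall(r"\w+", text)


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every token in the query.

    Requiring all terms is an assumption of this sketch.
    """
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


docs = {1: "the quick brown fox jumps"}
index = build_index(docs)

print(search(index, "quick fox"))  # terms are looked up individually
print(search(index, "fox brown"))  # word order does not matter here
print(search(index, "Quick"))      # tokens are still matched literally
```

Because each token is matched literally, the last query finds nothing, which is the gap that normalization addresses.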
+[discrete]
+[[normalization]]
+=== Normalization
+
+Tokenization enables matching on individual terms, but each token is still
+matched literally. This means:
+
+* A search for `Quick` would not match `quick`, even though you likely want
+either term to match the other.
+
+* Although `fox` and `foxes` share the same root word, a search for `foxes`
+would not match `fox` or vice versa.
+
+* A search for `jumps` would not match `leaps`. While they don't share a root
+word, they are synonyms and have a similar meaning.
+
+To solve these problems, text analysis can _normalize_ these tokens into a
+standard format. This allows you to match tokens that are not exactly the same
+as the search terms, but similar enough to still be relevant. For example:
+
+* `Quick` can be lowercased: `quick`.
+
+* `foxes` can be _stemmed_, or reduced to its root word: `fox`.
+
+* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.
+
+To ensure search terms match these words as intended, you can apply the same
+tokenization and normalization rules to the query string. For example, a search
+for `Foxes leap` can be normalized to a search for `fox jump`.
+
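The three normalization steps above (lowercasing, stemming, and synonym mapping) can be sketched as follows. This is a rough illustration only: the `stem` function is a deliberately naive suffix stripper rather than a real stemming algorithm, and the `SYNONYMS` map and `analyze` helper are hypothetical names chosen for this example.

```python
import re

# Hypothetical synonym map: each key is indexed as its value.
SYNONYMS = {"leap": "jump"}


def stem(token):
    """Naively strip a plural suffix. Real stemmers are far more careful."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, then lowercase, stem, and map synonyms, in that order."""
    tokens = re.findall(r"\w+", text)
    tokens = [token.lower() for token in tokens]
    tokens = [stem(token) for token in tokens]
    return [SYNONYMS.get(token, token) for token in tokens]


print(analyze("Foxes leap"))  # ['fox', 'jump']
print(analyze("A quick brown fox jumps"))
```

Applying the same `analyze` function to both the indexed text and the query string is what lets `Foxes leap` match a document that contains `fox jumps`.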
+[discrete]
+[[analysis-customization]]
+=== Customize text analysis
+
+Text analysis is performed by an <<analyzer-anatomy,_analyzer_>>, a set of rules
+that govern the entire process.
+
+{es} includes a default analyzer, called the
+<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
+cases right out of the box.
+
+If you want to tailor your search experience, you can choose a different
+<<analysis-analyzers,built-in analyzer>> or even
+<<analysis-custom-analyzer,configure a custom one>>. A custom analyzer gives you
+control over each step of the analysis process, including:
+
+* Changes to the text _before_ tokenization
+
+* How text is converted to tokens
+
+* Normalization changes made to tokens before indexing or search
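
The three stages listed above can be pictured as a small pipeline. The sketch below is a conceptual model in plain Python and does not reflect the {es} configuration syntax: the `Analyzer` class and the `strip_html`, `split_words`, and `lowercase` functions are hypothetical stand-ins for whatever rules a custom analyzer would apply at each stage.

```python
import re


class Analyzer:
    """Toy three-stage analyzer: text changes, tokenization, token changes."""

    def __init__(self, text_filters, tokenizer, token_filters):
        self.text_filters = text_filters    # run on the raw text first
        self.tokenizer = tokenizer          # converts text into tokens
        self.token_filters = token_filters  # normalize the resulting tokens

    def analyze(self, text):
        for text_filter in self.text_filters:
            text = text_filter(text)
        tokens = self.tokenizer(text)
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


def strip_html(text):
    """Remove anything that looks like an HTML tag before tokenization."""
    return re.sub(r"<[^>]+>", "", text)


def split_words(text):
    """Convert text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Normalize tokens by lowercasing them."""
    return [token.lower() for token in tokens]


analyzer = Analyzer([strip_html], split_words, [lowercase])
print(analyzer.analyze("<b>Quick</b> Fox"))  # ['quick', 'fox']
```

Keeping each stage as an independent function mirrors the idea that every step of the process can be swapped out separately.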