5 years ago · 495ce1add0
--- a/docs/reference/analysis.asciidoc
+++ b/docs/reference/analysis.asciidoc
@@ -1,11 +1,11 @@
 
				 [[analysis]]
			
 
				-= Analysis
			
 
				+= Text analysis
			
 
				 
			
 
				 [partintro]
			
 
				 --
			
 
				 
			
 
				-_Analysis_ is the process of converting text, like the body of any email, into
			
 
				-_tokens_ or _terms_ which are added to the inverted index for searching.
			
 
				+_Text analysis_ is the process of converting text, like the body of any email,
			
 
				+into _tokens_ or _terms_ which are added to the inverted index for searching.
			
 
				 Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
			
 
				 either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
			
 
				 defined per index.
			
@@ -142,6 +142,8 @@ looking for:
 
				 
			
 
				 --
			
 
				 
			
 
				+include::analysis/overview.asciidoc[]
			
 
				+
			
 
				 include::analysis/anatomy.asciidoc[]
			
 
				 
			
 
				 include::analysis/testing.asciidoc[]
			
--- a/docs/reference/analysis/overview.asciidoc
+++ b/docs/reference/analysis/overview.asciidoc
@@ -0,0 +1,78 @@
 
				+
			
 
				+== Text analysis overview
			
 
				+++++
			
 
				+<titleabbrev>Overview</titleabbrev>
			
 
				+++++
			
 
				+
			
 
				+Text analysis enables {es} to perform full-text search, where the search returns
			
 
				+all _relevant_ results rather than just exact matches.
			
 
				+
			
 
				+If you search for `Quick fox jumps`, you probably want the document that
			
 
				+contains `A quick brown fox jumps over the lazy dog`, and you might also want
			
 
				+documents that contain related words like `fast fox` or `foxes leap`.
			
 
				+
			
 
				+[discrete]
			
 
				+[[tokenization]]
			
 
				+=== Tokenization
			
 
				+
			
 
				+Analysis makes full-text search possible through _tokenization_: breaking a text
			
 
				+down into smaller chunks, called _tokens_. In most cases, these tokens are
			
 
				+individual words.
			
 
				+
			
 
				+If you index the phrase `the quick brown fox jumps` as a single string and the
			
 
				+user searches for `quick fox`, it isn't considered a match. However, if you
			
 
				+tokenize the phrase and index each word separately, the terms in the query
			
 
				+string can be looked up individually. This means the phrase can be matched by searches
			
 
				+for `quick fox`, `fox brown`, or other variations.
			
 
				+
			
 
				+[discrete]
			
 
				+[[normalization]]
			
 
				+=== Normalization
			
 
				+
			
 
				+Tokenization enables matching on individual terms, but each token is still
			
 
				+matched literally. This means:
			
 
				+
			
 
				+* A search for `Quick` would not match `quick`, even though you likely want
			
 
				+either term to match the other.
			
 
				+
			
 
				+* Although `fox` and `foxes` share the same root word, a search for `foxes`
			
 
				+would not match `fox` or vice versa.
			
 
				+
			
 
				+* A search for `jumps` would not match `leaps`. While they don't share a root
			
 
				+word, they are synonyms.
			
 
				+
			
 
				+To solve these problems, text analysis can _normalize_ these tokens into a
			
 
				+standard format. This allows you to match tokens that are not exactly the same
			
 
				+as the search terms, but similar enough to still be relevant. For example:
			
 
				+
			
 
				+* `Quick` can be lowercased: `quick`.
			
 
				+
			
 
				+* `foxes` can be _stemmed_, or reduced to its root word: `fox`.
			
 
				+
			
 
				+* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.
			
 
				+
			
 
				+To ensure search terms match these words as intended, you can apply the same
			
 
				+tokenization and normalization rules to the query string. For example, a search
			
 
				+for `Foxes leap` can be normalized to a search for `fox jump`.
			
 
				+
			
 
				+[discrete]
			
 
				+[[analysis-customization]]
			
 
				+=== Customize text analysis
			
 
				+
			
 
				+Text analysis is performed by an <<analyzer-anatomy,_analyzer_>>, a set of rules
			
 
				+that govern the entire process.
			
 
				+
			
 
				+{es} includes a default analyzer, called the
			
 
				+<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
			
 
				+cases right out of the box.
			
 
				+
			
 
				+If you want to tailor your search experience, you can choose a different
			
 
				+<<analysis-analyzers,built-in analyzer>> or even
			
 
				+<<analysis-custom-analyzer,configure a custom one>>. A custom analyzer gives you
			
 
				+control over each step of the analysis process, including:
			
 
				+
			
 
				+* Changes to the text _before_ tokenization
			
 
				+
			
 
				+* How text is converted to tokens
			
 
				+
			
 
				+* Normalization changes made to tokens before indexing or search