5 years ago · 8d5478f56c
--- a/docs/reference/analysis/anatomy.asciidoc
+++ b/docs/reference/analysis/anatomy.asciidoc
@@ -10,6 +10,7 @@ blocks into analyzers suitable for different languages and types of text.
 
				 Elasticsearch also exposes the individual building blocks so that they can be
			
 
				 combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
			
 
				 
			
 
				+[[analyzer-anatomy-character-filters]]
			
 
				 ==== Character filters
			
 
				 
			
 
				 A _character filter_ receives the original text as a stream of characters and
			
@@ -21,6 +22,7 @@ elements like `<b>` from the stream.
 
				 An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
			
 
				 which are applied in order.
			
 
				 
			
 
				+[[analyzer-anatomy-tokenizer]]
			
 
				 ==== Tokenizer
			
 
				 
			
 
				 A _tokenizer_  receives a stream of characters, breaks it up into individual
			
@@ -35,6 +37,7 @@ the term represents.
 
				 
			
 
				 An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
			
 
				 
			
 
				+[[analyzer-anatomy-token-filters]]
			
 
				 ==== Token filters
			
 
				 
			
 
				 A _token filter_ receives the token stream and may add, remove, or change
			
--- a/docs/reference/analysis/concepts.asciidoc
+++ b/docs/reference/analysis/concepts.asciidoc
@@ -8,6 +8,8 @@ This section explains the fundamental concepts of text analysis in {es}.
 
				 
			
 
				 * <<analyzer-anatomy>>
			
 
				 * <<analysis-index-search-time>>
			
 
				+* <<token-graphs>>
			
 
				 
			
 
				 include::anatomy.asciidoc[]
			
 
				-include::index-search-time.asciidoc[]
			
 
				+include::index-search-time.asciidoc[]
			
 
				+include::token-graphs.asciidoc[]
			
--- a/docs/reference/analysis/token-graphs.asciidoc
+++ b/docs/reference/analysis/token-graphs.asciidoc
@@ -0,0 +1,104 @@
 
				+[[token-graphs]]
			
 
				+=== Token graphs
			
 
				+
			
 
				+When a <<analyzer-anatomy-tokenizer,tokenizer>> converts a text into a stream of
			
 
				+tokens, it also records the following:
			
 
				+
			
 
				+* The `position` of each token in the stream
			
 
				+* The `positionLength`, the number of positions that a token spans
			
 
				+
			
 
				+Using these, you can create a
			
 
				+https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph],
			
 
				+called a _token graph_, for a stream. In a token graph, each position represents
			
 
				+a node. Each token represents an edge or arc, pointing to the next position.
			
 
				+
			
 
				+image::images/analysis/token-graph-qbf-ex.svg[align="center"]
			
 
				+
			
 
				+[[token-graphs-synonyms]]
			
 
				+==== Synonyms
			
 
				+
			
 
				+Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like
			
 
				+synonyms, to an existing token stream. These synonyms often span the same
			
 
				+positions as existing tokens.
			
 
				+
			
 
				+In the following graph, `quick` and its synonym `fast` both have a position of
			
 
				+`0`. They span the same positions.
			
 
				+
			
 
				+image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"]
			
 
				+
			
 
				+[[token-graphs-multi-position-tokens]]
			
 
				+==== Multi-position tokens
			
 
				+
			
 
				+Some token filters can add tokens that span multiple positions. These can
			
 
				+include tokens for multi-word synonyms, such as using "atm" as a synonym for
			
 
				+"automatic teller machine."
			
 
				+
			
 
				+However, only some token filters, known as _graph token filters_, accurately
			
 
				+record the `positionLength` for multi-position tokens. These filters include:
			
 
				+
			
 
				+* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>>
			
 
				+* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>>
			
 
				+
			
 
				+In the following graph, `domain name system` and its synonym, `dns`, both have a
			
 
				+position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in
			
 
				+the graph have a default `positionLength` of `1`.
			
 
				+
			
 
				+image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]
			
 
				+
			
 
				+[[token-graphs-token-graphs-search]]
			
 
				+===== Using token graphs for search
			
 
				+
			
 
				+<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute
			
 
				+and does not support token graphs containing multi-position tokens.
			
 
				+
			
 
				+However, queries, such as the <<query-dsl-match-query,`match`>> or
			
 
				+<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to
			
 
				+generate multiple sub-queries from a single query string.
			
 
				+
			
 
				+.*Example*
			
 
				+[%collapsible]
			
 
				+====
			
 
				+
			
 
				+A user runs a search for the following phrase using the `match_phrase` query:
			
 
				+
			
 
				+`domain name system is fragile`
			
 
				+
			
 
				+During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for
			
 
				+`domain name system`, is added to the query string's token stream. The `dns`
			
 
				+token has a `positionLength` of `3`.
			
 
				+
			
 
				+image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]
			
 
				+
			
 
				+The `match_phrase` query uses this graph to generate sub-queries for the
			
 
				+following phrases:
			
 
				+
			
 
				+[source,text]
			
 
				+------
			
 
				+dns is fragile
			
 
				+domain name system is fragile
			
 
				+------
			
 
				+
			
 
				+This means the query matches documents containing either `dns is fragile` _or_
			
 
				+`domain name system is fragile`.
			
 
				+====
			
 
				+
			
 
				+[[token-graphs-invalid-token-graphs]]
			
 
				+===== Invalid token graphs
			
 
				+
			
 
				+The following token filters can add tokens that span multiple positions but
			
 
				+only record a default `positionLength` of `1`:
			
 
				+
			
 
				+* <<analysis-synonym-tokenfilter,`synonym`>>
			
 
				+* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>>
			
 
				+
			
 
				+This means these filters will produce invalid token graphs for streams
			
 
				+containing such tokens.
			
 
				+
			
 
				+In the following graph, `dns` is a multi-position synonym for `domain name
			
 
				+system`. However, `dns` has the default `positionLength` value of `1`, resulting
			
 
				+in an invalid graph.
			
 
				+
			
 
				+image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"]
			
 
				+
			
 
				+Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
			
 
				+search results.
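
Review note on the new `token-graphs.asciidoc` page: the graph model it describes (positions as nodes, tokens as arcs spanning `positionLength` positions) can be sketched in a few lines of Python. This is a simplified illustration only, not how Elasticsearch or Lucene actually builds queries; the `Token` type and the `phrases` helper are hypothetical names introduced here. It enumerates every path through the graph as a phrase, which mirrors the `match_phrase` example in the page, and shows one way a default `positionLength` of `1` on a multi-position synonym can lead to an unintended phrase.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token record: a term plus its graph attributes."""
    term: str
    position: int
    position_length: int = 1  # default: the token spans one position


def phrases(tokens):
    """Return every path through the token graph, joined as a phrase."""
    # Group tokens (arcs) by the position (node) they start from.
    arcs = defaultdict(list)
    for tok in tokens:
        arcs[tok.position].append(tok)
    # The final node is the furthest position any token reaches.
    end = max(tok.position + tok.position_length for tok in tokens)

    def walk(node):
        if node == end:
            yield []
            return
        for tok in arcs[node]:
            # Follow the arc to the node where this token ends.
            for rest in walk(tok.position + tok.position_length):
                yield [tok.term] + rest

    return sorted(" ".join(path) for path in walk(0))


# Valid graph: `dns` spans three positions, like a graph token filter records.
valid = [
    Token("domain", 0), Token("dns", 0, 3), Token("name", 1),
    Token("system", 2), Token("is", 3), Token("fragile", 4),
]

# Invalid graph: `dns` keeps the default positionLength of 1.
invalid = [
    Token("domain", 0), Token("dns", 0), Token("name", 1),
    Token("system", 2), Token("is", 3), Token("fragile", 4),
]

print(phrases(valid))
# ['dns is fragile', 'domain name system is fragile']
print(phrases(invalid))
# ['dns name system is fragile', 'domain name system is fragile']
```

In this sketch the invalid graph drops the intended `dns is fragile` phrase and produces `dns name system is fragile` instead, which is one plausible reading of the "unexpected search results" the page warns about.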
			
--- a/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
@@ -8,8 +8,8 @@ The `synonym_graph` token filter allows to easily handle synonyms,
 
				 including multi-word synonyms correctly during the analysis process.
			
 
				 
			
 
				 In order to properly handle multi-word synonyms this token filter
			
 
				-creates a "graph token stream" during processing.  For more information
			
 
				-on this topic and its various complexities, please read the
			
 
				+creates a <<token-graphs,graph token stream>> during processing.  For more
			
 
				+information on this topic and its various complexities, please read the
			
 
				 http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html[Lucene's TokenStreams are actually graphs] blog post.
			
 
				 
			
 
				 ["NOTE",id="synonym-graph-index-note"]
			
--- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc
@@ -440,8 +440,8 @@ that span multiple positions when any of the following parameters are `true`:
 
				 
			
 
				 However, only the `word_delimiter_graph` filter assigns multi-position tokens a
			
 
				 `positionLength` attribute, which indicates the number of positions a token
			
 
				-spans. This ensures the `word_delimiter_graph` filter always produces valid token
			
 
				-https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs].
			
 
				+spans. This ensures the `word_delimiter_graph` filter always produces valid
			
 
				+<<token-graphs,token graphs>>.
			
 
				 
			
 
				 The `word_delimiter` filter does not assign multi-position tokens a
			
 
				 `positionLength` attribute. This means it produces invalid graphs for streams
			
--- a/docs/reference/images/analysis/token-graph-dns-ex.svg
+++ b/docs/reference/images/analysis/token-graph-dns-ex.svg
--- a/docs/reference/images/analysis/token-graph-dns-invalid-ex.svg
+++ b/docs/reference/images/analysis/token-graph-dns-invalid-ex.svg
--- a/docs/reference/images/analysis/token-graph-dns-synonym-ex.svg
+++ b/docs/reference/images/analysis/token-graph-dns-synonym-ex.svg
--- a/docs/reference/images/analysis/token-graph-qbf-ex.svg
+++ b/docs/reference/images/analysis/token-graph-qbf-ex.svg
--- a/docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg
+++ b/docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg