@@ -4,105 +4,496 @@
 <titleabbrev>Word delimiter graph</titleabbrev>
 ++++
 
-experimental[This functionality is marked as experimental in Lucene]
+Splits tokens at non-alphanumeric characters. The `word_delimiter_graph` filter
+also performs optional token normalization based on a set of rules. By default,
+the filter uses the following rules:
 
-Named `word_delimiter_graph`, it splits words into subwords and performs
-optional transformations on subword groups. Words are split into
-subwords with the following rules:
+* Split tokens at non-alphanumeric characters.
+  The filter uses these characters as delimiters.
+  For example: `Super-Duper` -> `Super`, `Duper`
+* Remove leading or trailing delimiters from each token.
+  For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
+* Split tokens at letter case transitions.
+  For example: `PowerShot` -> `Power`, `Shot`
+* Split tokens at letter-number transitions.
+  For example: `XL500` -> `XL`, `500`
+* Remove the English possessive (`'s`) from the end of each token.
+  For example: `Neil's` -> `Neil`
 
-* split on intra-word delimiters (by default, all non alpha-numeric
-characters).
-* "Wi-Fi" -> "Wi", "Fi"
-* split on case transitions: "PowerShot" -> "Power", "Shot"
-* split on letter-number transitions: "SD500" -> "SD", "500"
-* leading and trailing intra-word delimiters on each subword are
-ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
-* trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil"
+The `word_delimiter_graph` filter uses Lucene's
+{lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter].
 
-Unlike the `word_delimiter`, this token filter correctly handles positions for
-multi terms expansion at search-time when any of the following options
-are set to true:
+[TIP]
+====
+The `word_delimiter_graph` filter was designed to remove punctuation from
+complex identifiers, such as product IDs or part numbers. For these use cases,
+we recommend using the `word_delimiter_graph` filter with the
+<<analysis-keyword-tokenizer,`keyword`>> tokenizer.
 
- * `preserve_original`
- * `catenate_numbers`
- * `catenate_words`
- * `catenate_all`
+Avoid using the `word_delimiter_graph` filter to split hyphenated words, such as
+`wi-fi`. Because users often search for these words both with and without
+hyphens, we recommend using the
+<<analysis-synonym-graph-tokenfilter,`synonym_graph`>> filter instead.
+====
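+
+For example, the following analyze request is a minimal sketch of that
+combination. The `keyword` tokenizer emits the whole identifier as a single
+token, which the `word_delimiter_graph` filter then splits at its delimiters,
+producing `XL`, `42`, and `Autocoder`:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "filter": [ "word_delimiter_graph" ],
+  "text": "XL---42+'Autocoder'"
+}
+----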
 
-Parameters include:
+[[analysis-word-delimiter-graph-tokenfilter-analyze-ex]]
+==== Example
 
-`generate_word_parts`::
-    If `true` causes parts of words to be
-    generated: "PowerShot" -> "Power" "Shot". Defaults to `true`.
+The following <<indices-analyze,analyze API>> request uses the
+`word_delimiter_graph` filter to split `Neil's Super-Duper-XL500--42+AutoCoder`
+into normalized tokens using the filter's default rules:
 
-`generate_number_parts`::
-    If `true` causes number subwords to be
-    generated: "500-42" -> "500" "42". Defaults to `true`.
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "whitespace",
+  "filter": [ "word_delimiter_graph" ],
+  "text": "Neil's Super-Duper-XL500--42+AutoCoder"
+}
+----
 
-`catenate_words`::
-    If `true` causes maximum runs of word parts to be
-    catenated: "wi-fi" -> "wifi". Defaults to `false`.
+The filter produces the following tokens:
 
-`catenate_numbers`::
-    If `true` causes maximum runs of number parts to
-    be catenated: "500-42" -> "50042". Defaults to `false`.
+[source,txt]
+----
+[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
+----
+
+////
+[source,console-result]
+----
+{
+  "tokens" : [
+    {
+      "token" : "Neil",
+      "start_offset" : 0,
+      "end_offset" : 4,
+      "type" : "word",
+      "position" : 0
+    },
+    {
+      "token" : "Super",
+      "start_offset" : 7,
+      "end_offset" : 12,
+      "type" : "word",
+      "position" : 1
+    },
+    {
+      "token" : "Duper",
+      "start_offset" : 13,
+      "end_offset" : 18,
+      "type" : "word",
+      "position" : 2
+    },
+    {
+      "token" : "XL",
+      "start_offset" : 19,
+      "end_offset" : 21,
+      "type" : "word",
+      "position" : 3
+    },
+    {
+      "token" : "500",
+      "start_offset" : 21,
+      "end_offset" : 24,
+      "type" : "word",
+      "position" : 4
+    },
+    {
+      "token" : "42",
+      "start_offset" : 26,
+      "end_offset" : 28,
+      "type" : "word",
+      "position" : 5
+    },
+    {
+      "token" : "Auto",
+      "start_offset" : 29,
+      "end_offset" : 33,
+      "type" : "word",
+      "position" : 6
+    },
+    {
+      "token" : "Coder",
+      "start_offset" : 33,
+      "end_offset" : 38,
+      "type" : "word",
+      "position" : 7
+    }
+  ]
+}
+----
+////
+
+[[analysis-word-delimiter-graph-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`word_delimiter_graph` filter to configure a new
+<<analysis-custom-analyzer,custom analyzer>>.
+
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "word_delimiter_graph" ]
+        }
+      }
+    }
+  }
+}
+----
+
+[WARNING]
+====
+Avoid using the `word_delimiter_graph` filter with tokenizers that remove
+punctuation, such as the <<analysis-standard-tokenizer,`standard`>> tokenizer.
+This could prevent the `word_delimiter_graph` filter from splitting tokens
+correctly. It can also interfere with the filter's configurable parameters, such
+as <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>> or
+<<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>. We
+recommend using the <<analysis-keyword-tokenizer,`keyword`>> or
+<<analysis-whitespace-tokenizer,`whitespace`>> tokenizer instead.
+====
+
+[[word-delimiter-graph-tokenfilter-configure-parms]]
+==== Configurable parameters
 
+[[word-delimiter-graph-tokenfilter-adjust-offsets]]
+`adjust_offsets`::
++
+--
+(Optional, boolean)
+If `true`, the filter adjusts the offsets of split or catenated tokens to better
+reflect their actual position in the token stream. Defaults to `true`.
+
+[WARNING]
+====
+Set `adjust_offsets` to `false` if your analyzer uses filters, such as the
+<<analysis-trim-tokenfilter,`trim`>> filter, that change the length of tokens
+without changing their offsets. Otherwise, the `word_delimiter_graph` filter
+could produce tokens with illegal offsets.
+====
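+
+For example, the following request is a sketch of that workaround: the
+analyzer applies the <<analysis-trim-tokenfilter,`trim`>> filter before a
+custom `word_delimiter_graph` filter that has `adjust_offsets` disabled. The
+index and filter names are illustrative:
+
+[source,console]
+----
+PUT /my-adjust-offsets-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "keyword",
+          "filter": [ "trim", "my_word_delimiter_graph" ]
+        }
+      },
+      "filter": {
+        "my_word_delimiter_graph": {
+          "type": "word_delimiter_graph",
+          "adjust_offsets": false
+        }
+      }
+    }
+  }
+}
+----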
+--
+
+[[word-delimiter-graph-tokenfilter-catenate-all]]
 `catenate_all`::
-    If `true` causes all subword parts to be catenated:
-    "wi-fi-4000" -> "wifi4000". Defaults to `false`.
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphanumeric
+characters separated by non-alphabetic delimiters. For example:
+`super-duper-xl-500` -> [ **`superduperxl500`**, `super`, `duper`, `xl`, `500` ].
+Defaults to `false`.
 
-`split_on_case_change`::
-    If `true` causes "PowerShot" to be two tokens;
-    ("Power-Shot" remains two parts regards). Defaults to `true`.
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
+
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
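+
+For example, the following request sketches the `flatten_graph` approach
+described above: an index analyzer runs a custom `word_delimiter_graph` filter
+with `catenate_all` enabled, then flattens the resulting graph so the stream
+can be indexed. The index and filter names are illustrative:
+
+[source,console]
+----
+PUT /my-catenate-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_index_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_catenating_filter", "flatten_graph" ]
+        }
+      },
+      "filter": {
+        "my_catenating_filter": {
+          "type": "word_delimiter_graph",
+          "catenate_all": true
+        }
+      }
+    }
+  }
+}
+----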
+--
+
+[[word-delimiter-graph-tokenfilter-catenate-numbers]]
+`catenate_numbers`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of numeric characters
+separated by non-alphabetic delimiters. For example: `01-02-03` ->
+[ **`010203`**, `01`, `02`, `03` ]. Defaults to `false`.
+
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
 
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+[[word-delimiter-graph-tokenfilter-catenate-words]]
+`catenate_words`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphabetical
+characters separated by non-alphabetic delimiters. For example: `super-duper-xl`
+-> [ **`superduperxl`**, `super`, `duper`, `xl` ]. Defaults to `false`.
+
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
+
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+`generate_number_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only numeric characters in
+the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
+
+`generate_word_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only alphabetical characters
+in the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
+
+[[word-delimiter-graph-tokenfilter-preserve-original]]
 `preserve_original`::
-    If `true` includes original words in subwords:
-    "500-42" -> "500-42" "500" "42". Defaults to `false`.
++
+--
+(Optional, boolean)
+If `true`, the filter includes the original version of any split tokens in the
+output. This original version includes non-alphanumeric delimiters. For example:
+`super-duper-xl-500` -> [ **`super-duper-xl-500`**, `super`, `duper`, `xl`, `500`
+]. Defaults to `false`.
+
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
+====
+--
+
+`protected_words`::
+(Optional, array of strings)
+Array of tokens the filter won't split.
+
+`protected_words_path`::
++
+--
+(Optional, string)
+Path to a file that contains a list of tokens the filter won't split.
+
+This path must be absolute or relative to the `config` location, and the file
+must be UTF-8 encoded. Each token in the file must be separated by a line
+break.
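+
+For example, the following request is a sketch of a filter that reads its
+protected tokens from such a file. The index name, filter name, and the
+`analysis/protected-words.txt` path are illustrative:
+
+[source,console]
+----
+PUT /my-protected-words-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_protecting_filter" ]
+        }
+      },
+      "filter": {
+        "my_protecting_filter": {
+          "type": "word_delimiter_graph",
+          "protected_words_path": "analysis/protected-words.txt"
+        }
+      }
+    }
+  }
+}
+----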
+--
+
+`split_on_case_change`::
+(Optional, boolean)
+If `true`, the filter splits tokens at letter case transitions. For example:
+`camelCase` -> [ `camel`, `Case` ]. Defaults to `true`.
 
 `split_on_numerics`::
-    If `true` causes "j2se" to be three tokens; "j"
-    "2" "se". Defaults to `true`.
+(Optional, boolean)
+If `true`, the filter splits tokens at letter-number transitions. For example:
+`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.
 
 `stem_english_possessive`::
-    If `true` causes trailing "'s" to be
-    removed for each subword: "O'Neil's" -> "O", "Neil". Defaults to `true`.
+(Optional, boolean)
+If `true`, the filter removes the English possessive (`'s`) from the end of each
+token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.
 
-Advance settings include:
+`type_table`::
++
+--
+(Optional, array of strings)
+Array of custom type mappings for characters. This allows you to map
+non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
+those characters.
 
-`protected_words`::
-    A list of protected words from being delimiter.
-    Either an array, or also can set `protected_words_path` which resolved
-    to a file configured with protected words (one on each line).
-    Automatically resolves to `config/` based location if exists.
+For example, the following array maps the plus (`+`) and hyphen (`-`) characters
+as alphanumeric, which means they won't be treated as delimiters:
 
-`adjust_offsets`::
-    By default, the filter tries to output subtokens with adjusted offsets
-    to reflect their actual position in the token stream.  However, when
-    used in combination with other filters that alter the length or starting
-    position of tokens without changing their offsets
-    (e.g. <<analysis-trim-tokenfilter,`trim`>>) this can cause tokens with
-    illegal offsets to be emitted.  Setting `adjust_offsets` to false will
-    stop `word_delimiter_graph` from adjusting these internal offsets.
+`["+ => ALPHA", "- => ALPHA"]`
 
-`type_table`::
-    A custom type mapping table, for example (when configured
-    using `type_table_path`):
-
-[source,type_table]
---------------------------------------------------
-    # Map the $, %, '.', and ',' characters to DIGIT
-    # This might be useful for financial data.
-    $ => DIGIT
-    % => DIGIT
-    . => DIGIT
-    \\u002C => DIGIT
-
-    # in some cases you might not want to split on ZWJ
-    # this also tests the case where we need a bigger byte[]
-    # see http://en.wikipedia.org/wiki/Zero-width_joiner
-    \\u200D => ALPHANUM
---------------------------------------------------
-
-NOTE: Using a tokenizer like the `standard` tokenizer may interfere with
-the `catenate_*` and `preserve_original` parameters, as the original
-string may already have lost punctuation during tokenization.  Instead,
-you may want to use the `whitespace` tokenizer.
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
+--
+
+`type_table_path`::
++
+--
+(Optional, string)
+Path to a file that contains custom type mappings for characters. This allows
+you to map non-alphanumeric characters as numeric or alphanumeric to avoid
+splitting on those characters.
+
+For example, the contents of this file may contain the following:
+
+[source,txt]
+----
+# Map the $, %, '.', and ',' characters to DIGIT
+# This might be useful for financial data.
+$ => DIGIT
+% => DIGIT
+. => DIGIT
+\\u002C => DIGIT
+
+# in some cases you might not want to split on ZWJ
+# this also tests the case where we need a bigger byte[]
+# see http://en.wikipedia.org/wiki/Zero-width_joiner
+\\u200D => ALPHANUM
+----
+
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
+
+This file path must be absolute or relative to the `config` location, and the
+file must be UTF-8 encoded. Each mapping in the file must be separated by a line
+break.
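+
+For example, assuming the mappings above are saved to an
+`analysis/type-table.txt` file in the `config` location (an illustrative
+path), the following sketch defines a filter that reads them:
+
+[source,console]
+----
+PUT /my-type-table-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_type_table_filter" ]
+        }
+      },
+      "filter": {
+        "my_type_table_filter": {
+          "type": "word_delimiter_graph",
+          "type_table_path": "analysis/type-table.txt"
+        }
+      }
+    }
+  }
+}
+----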
+--
+
+[[analysis-word-delimiter-graph-tokenfilter-customize]]
+==== Customize
+
+To customize the `word_delimiter_graph` filter, duplicate it to create the basis
+for a new custom token filter. You can modify the filter using its configurable
+parameters.
+
+For example, the following request creates a `word_delimiter_graph`
+filter that uses the following rules:
+
+* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
+  character.
+* Remove leading or trailing delimiters from each token.
+* Do _not_ split tokens at letter case transitions.
+* Do _not_ split tokens at letter-number transitions.
+* Remove the English possessive (`'s`) from the end of each token.
+
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_custom_word_delimiter_graph_filter" ]
+        }
+      },
+      "filter": {
+        "my_custom_word_delimiter_graph_filter": {
+          "type": "word_delimiter_graph",
+          "type_table": [ "- => ALPHA" ],
+          "split_on_case_change": false,
+          "split_on_numerics": false,
+          "stem_english_possessive": true
+        }
+      }
+    }
+  }
+}
+----
+
+[[analysis-word-delimiter-graph-differences]]
+==== Differences between `word_delimiter_graph` and `word_delimiter`
+
+Both the `word_delimiter_graph` and
+<<analysis-word-delimiter-tokenfilter,`word_delimiter`>> filters produce tokens
+that span multiple positions when any of the following parameters are `true`:
+
+ * <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
+ * <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
+
+However, only the `word_delimiter_graph` filter assigns multi-position tokens a
+`positionLength` attribute, which indicates the number of positions a token
+spans. This ensures the `word_delimiter_graph` filter always produces valid token
+https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs].
+
+The `word_delimiter` filter does not assign multi-position tokens a
+`positionLength` attribute. This means it produces invalid graphs for streams
+including these tokens.
+
+While indexing does not support token graphs containing multi-position tokens,
+queries, such as the <<query-dsl-match-query-phrase,`match_phrase`>> query, can
+use these graphs to generate multiple sub-queries from a single query string.
+
+To see how token graphs produced by the `word_delimiter` and
+`word_delimiter_graph` filters differ, check out the following example.
+
+.*Example*
+[%collapsible]
+====
+
+[[analysis-word-delimiter-graph-basic-token-graph]]
+*Basic token graph*
+
+Both the `word_delimiter` and `word_delimiter_graph` filters produce the
+following token graph for `PowerShot2000` when the following parameters are
+`false`:
+
+ * <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
+ * <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
+
+This graph does not contain multi-position tokens. All tokens span only one
+position.
+
+image::images/analysis/token-graph-basic.svg[align="center"]
+
+[[analysis-word-delimiter-graph-wdg-token-graph]]
+*`word_delimiter_graph` graph with a multi-position token*
+
+The `word_delimiter_graph` filter produces the following token graph for
+`PowerShot2000` when `catenate_words` is `true`.
+
+This graph correctly indicates the catenated `PowerShot` token spans two
+positions.
+
+image::images/analysis/token-graph-wdg.svg[align="center"]
+
+[[analysis-word-delimiter-graph-wd-token-graph]]
+*`word_delimiter` graph with a multi-position token*
+
+When `catenate_words` is `true`, the `word_delimiter` filter produces
+the following token graph for `PowerShot2000`.
+
+Note that the catenated `PowerShot` token should span two positions but only
+spans one in the token graph, making it invalid.
+
+image::images/analysis/token-graph-wd.svg[align="center"]
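+
+To experiment with these graphs yourself, the following analyze request is a
+sketch that runs the `word_delimiter_graph` filter on `PowerShot2000` with
+`catenate_words` enabled, using an inline filter definition. Swapping the
+`type` to `word_delimiter` lets you compare the two filters' output for the
+same input:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "whitespace",
+  "filter": [
+    {
+      "type": "word_delimiter_graph",
+      "catenate_words": true
+    }
+  ],
+  "text": "PowerShot2000"
+}
+----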
+
+====