@@ -4,105 +4,496 @@
 <titleabbrev>Word delimiter graph</titleabbrev>
 ++++
 
-experimental[This functionality is marked as experimental in Lucene]
+Splits tokens at non-alphanumeric characters. The `word_delimiter_graph` filter
+also performs optional token normalization based on a set of rules. By default,
+the filter uses the following rules:
 
-Named `word_delimiter_graph`, it splits words into subwords and performs
-optional transformations on subword groups. Words are split into
-subwords with the following rules:
+* Split tokens at non-alphanumeric characters.
+  The filter uses these characters as delimiters.
+  For example: `Super-Duper` -> `Super`, `Duper`
+* Remove leading or trailing delimiters from each token.
+  For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
+* Split tokens at letter case transitions.
+  For example: `PowerShot` -> `Power`, `Shot`
+* Split tokens at letter-number transitions.
+  For example: `XL500` -> `XL`, `500`
+* Remove the English possessive (`'s`) from the end of each token.
+  For example: `Neil's` -> `Neil`
 
-* split on intra-word delimiters (by default, all non alpha-numeric
-characters).
-* "Wi-Fi" -> "Wi", "Fi"
-* split on case transitions: "PowerShot" -> "Power", "Shot"
-* split on letter-number transitions: "SD500" -> "SD", "500"
-* leading and trailing intra-word delimiters on each subword are
-ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
-* trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil"
+The `word_delimiter_graph` filter uses Lucene's
+{lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter].
 
-Unlike the `word_delimiter`, this token filter correctly handles positions for
-multi terms expansion at search-time when any of the following options
-are set to true:
+[TIP]
+====
+The `word_delimiter_graph` filter was designed to remove punctuation from
+complex identifiers, such as product IDs or part numbers. For these use cases,
+we recommend using the `word_delimiter_graph` filter with the
+<<analysis-keyword-tokenizer,`keyword`>> tokenizer.
 
- * `preserve_original`
- * `catenate_numbers`
- * `catenate_words`
- * `catenate_all`
+Avoid using the `word_delimiter_graph` filter to split hyphenated words, such as
+`wi-fi`. Because users often search for these words both with and without
+hyphens, we recommend using the
+<<analysis-synonym-graph-tokenfilter,`synonym_graph`>> filter instead.
+====
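+
+For example, the following analyze request is a minimal sketch of that
+combination. The `keyword` tokenizer emits the whole identifier as a single
+token, which the `word_delimiter_graph` filter then splits at its delimiters,
+producing `XL`, `42`, and `Autocoder`:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "filter": [ "word_delimiter_graph" ],
+  "text": "XL---42+'Autocoder'"
+}
+----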
 
-Parameters include:
+[[analysis-word-delimiter-graph-tokenfilter-analyze-ex]]
+==== Example
 
-`generate_word_parts`::
-    If `true` causes parts of words to be
-    generated: "PowerShot" -> "Power" "Shot". Defaults to `true`.
+The following <<indices-analyze,analyze API>> request uses the
+`word_delimiter_graph` filter to split `Neil's Super-Duper-XL500--42+AutoCoder`
+into normalized tokens using the filter's default rules:
 
-`generate_number_parts`::
-    If `true` causes number subwords to be
-    generated: "500-42" -> "500" "42". Defaults to `true`.
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "whitespace",
+  "filter": [ "word_delimiter_graph" ],
+  "text": "Neil's Super-Duper-XL500--42+AutoCoder"
+}
+----
 
-`catenate_words`::
-    If `true` causes maximum runs of word parts to be
-    catenated: "wi-fi" -> "wifi". Defaults to `false`.
+The filter produces the following tokens:
 
-`catenate_numbers`::
-    If `true` causes maximum runs of number parts to
-    be catenated: "500-42" -> "50042". Defaults to `false`.
+[source,txt]
+----
+[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
+----
+
+////
+[source,console-result]
+----
+{
+  "tokens" : [
+    {
+      "token" : "Neil",
+      "start_offset" : 0,
+      "end_offset" : 4,
+      "type" : "word",
+      "position" : 0
+    },
+    {
+      "token" : "Super",
+      "start_offset" : 7,
+      "end_offset" : 12,
+      "type" : "word",
+      "position" : 1
+    },
+    {
+      "token" : "Duper",
+      "start_offset" : 13,
+      "end_offset" : 18,
+      "type" : "word",
+      "position" : 2
+    },
+    {
+      "token" : "XL",
+      "start_offset" : 19,
+      "end_offset" : 21,
+      "type" : "word",
+      "position" : 3
+    },
+    {
+      "token" : "500",
+      "start_offset" : 21,
+      "end_offset" : 24,
+      "type" : "word",
+      "position" : 4
+    },
+    {
+      "token" : "42",
+      "start_offset" : 26,
+      "end_offset" : 28,
+      "type" : "word",
+      "position" : 5
+    },
+    {
+      "token" : "Auto",
+      "start_offset" : 29,
+      "end_offset" : 33,
+      "type" : "word",
+      "position" : 6
+    },
+    {
+      "token" : "Coder",
+      "start_offset" : 33,
+      "end_offset" : 38,
+      "type" : "word",
+      "position" : 7
+    }
+  ]
+}
+----
+////
+
+[[analysis-word-delimiter-graph-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`word_delimiter_graph` filter to configure a new
+<<analysis-custom-analyzer,custom analyzer>>.
+
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "word_delimiter_graph" ]
+        }
+      }
+    }
+  }
+}
+----
+
+[WARNING]
+====
+Avoid using the `word_delimiter_graph` filter with tokenizers that remove
+punctuation, such as the <<analysis-standard-tokenizer,`standard`>> tokenizer.
+This could prevent the `word_delimiter_graph` filter from splitting tokens
+correctly. It can also interfere with the filter's configurable parameters, such
+as <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>> or
+<<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>. We
+recommend using the <<analysis-keyword-tokenizer,`keyword`>> or
+<<analysis-whitespace-tokenizer,`whitespace`>> tokenizer instead.
+====
+
+[[word-delimiter-graph-tokenfilter-configure-parms]]
+==== Configurable parameters
 
+[[word-delimiter-graph-tokenfilter-adjust-offsets]]
+`adjust_offsets`::
++
+--
+(Optional, boolean)
+If `true`, the filter adjusts the offsets of split or catenated tokens to better
+reflect their actual position in the token stream. Defaults to `true`.
+
+[WARNING]
+====
+Set `adjust_offsets` to `false` if your analyzer uses filters, such as the
+<<analysis-trim-tokenfilter,`trim`>> filter, that change the length of tokens
+without changing their offsets. Otherwise, the `word_delimiter_graph` filter
+could produce tokens with illegal offsets.
+====
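+
+For example, the following request is a sketch of that workaround: the
+analyzer applies the <<analysis-trim-tokenfilter,`trim`>> filter before a
+custom `word_delimiter_graph` filter that has `adjust_offsets` disabled. The
+index and filter names are illustrative:
+
+[source,console]
+----
+PUT /my-adjust-offsets-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "keyword",
+          "filter": [ "trim", "my_word_delimiter_graph" ]
+        }
+      },
+      "filter": {
+        "my_word_delimiter_graph": {
+          "type": "word_delimiter_graph",
+          "adjust_offsets": false
+        }
+      }
+    }
+  }
+}
+----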
+--
+
+[[word-delimiter-graph-tokenfilter-catenate-all]]
 `catenate_all`::
-    If `true` causes all subword parts to be catenated:
-    "wi-fi-4000" -> "wifi4000". Defaults to `false`.
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphanumeric
+characters separated by non-alphabetic delimiters. For example:
+`super-duper-xl-500` -> [ **`superduperxl500`**, `super`, `duper`, `xl`, `500` ].
+Defaults to `false`.
 
-`split_on_case_change`::
-    If `true` causes "PowerShot" to be two tokens;
-    ("Power-Shot" remains two parts regards). Defaults to `true`.
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
+
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
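+
+For example, the following request sketches the `flatten_graph` approach
+described above: an index analyzer runs a custom `word_delimiter_graph` filter
+with `catenate_all` enabled, then flattens the resulting graph so the stream
+can be indexed. The index and filter names are illustrative:
+
+[source,console]
+----
+PUT /my-catenate-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_index_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_catenating_filter", "flatten_graph" ]
+        }
+      },
+      "filter": {
+        "my_catenating_filter": {
+          "type": "word_delimiter_graph",
+          "catenate_all": true
+        }
+      }
+    }
+  }
+}
+----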
+--
+
+[[word-delimiter-graph-tokenfilter-catenate-numbers]]
+`catenate_numbers`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of numeric characters
+separated by non-alphabetic delimiters. For example: `01-02-03` ->
+[ **`010203`**, `01`, `02`, `03` ]. Defaults to `false`.
+
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
 
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+[[word-delimiter-graph-tokenfilter-catenate-words]]
+`catenate_words`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphabetical
+characters separated by non-alphabetic delimiters. For example: `super-duper-xl`
+-> [ **`superduperxl`**, `super`, `duper`, `xl` ]. Defaults to `false`.
+
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
+
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+`generate_number_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only numeric characters in
+the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
+
+`generate_word_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only alphabetical characters
+in the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
+
+[[word-delimiter-graph-tokenfilter-preserve-original]]
 `preserve_original`::
-    If `true` includes original words in subwords:
-    "500-42" -> "500-42" "500" "42". Defaults to `false`.
++
+--
+(Optional, boolean)
+If `true`, the filter includes the original version of any split tokens in the
+output. This original version includes non-alphanumeric delimiters. For example:
+`super-duper-xl-500` -> [ **`super-duper-xl-500`**, `super`, `duper`, `xl`, `500`
+]. Defaults to `false`.
+
+[WARNING]
+====
+Setting this parameter to `true` produces multi-position tokens, which are not
+supported by indexing.
+
+If this parameter is `true`, avoid using this filter in an index analyzer or
+use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
+this filter to make the token stream suitable for indexing.
+====
+--
+
+`protected_words`::
+(Optional, array of strings)
+Array of tokens the filter won't split.
+
+`protected_words_path`::
++
+--
+(Optional, string)
+Path to a file that contains a list of tokens the filter won't split.
+
+This path must be absolute or relative to the `config` location, and the file
+must be UTF-8 encoded. Each token in the file must be separated by a line
+break.
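+
+For example, the following request is a sketch of a filter that reads its
+protected tokens from such a file. The index name, filter name, and the
+`analysis/protected-words.txt` path are illustrative:
+
+[source,console]
+----
+PUT /my-protected-words-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_protecting_filter" ]
+        }
+      },
+      "filter": {
+        "my_protecting_filter": {
+          "type": "word_delimiter_graph",
+          "protected_words_path": "analysis/protected-words.txt"
+        }
+      }
+    }
+  }
+}
+----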
+--
+
+`split_on_case_change`::
+(Optional, boolean)
+If `true`, the filter splits tokens at letter case transitions. For example:
+`camelCase` -> [ `camel`, `Case` ]. Defaults to `true`.
 
 `split_on_numerics`::
-    If `true` causes "j2se" to be three tokens; "j"
-    "2" "se". Defaults to `true`.
+(Optional, boolean)
+If `true`, the filter splits tokens at letter-number transitions. For example:
+`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.
 
 `stem_english_possessive`::
-    If `true` causes trailing "'s" to be
-    removed for each subword: "O'Neil's" -> "O", "Neil". Defaults to `true`.
+(Optional, boolean)
+If `true`, the filter removes the English possessive (`'s`) from the end of each
+token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.
 
-Advance settings include:
+`type_table`::
++
+--
+(Optional, array of strings)
+Array of custom type mappings for characters. This allows you to map
+non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
+those characters.
 
-`protected_words`::
-    A list of protected words from being delimiter.
-    Either an array, or also can set `protected_words_path` which resolved
-    to a file configured with protected words (one on each line).
-    Automatically resolves to `config/` based location if exists.
+For example, the following array maps the plus (`+`) and hyphen (`-`) characters
+as alphanumeric, which means they won't be treated as delimiters:
 
-`adjust_offsets`::
-    By default, the filter tries to output subtokens with adjusted offsets
-    to reflect their actual position in the token stream.  However, when
-    used in combination with other filters that alter the length or starting
-    position of tokens without changing their offsets
-    (e.g. <<analysis-trim-tokenfilter,`trim`>>) this can cause tokens with
-    illegal offsets to be emitted.  Setting `adjust_offsets` to false will
-    stop `word_delimiter_graph` from adjusting these internal offsets.
+`["+ => ALPHA", "- => ALPHA"]`
 
-`type_table`::
-    A custom type mapping table, for example (when configured
-    using `type_table_path`):
-
-[source,type_table]
---------------------------------------------------
-    # Map the $, %, '.', and ',' characters to DIGIT
-    # This might be useful for financial data.
-    $ => DIGIT
-    % => DIGIT
-    . => DIGIT
-    \\u002C => DIGIT
-
-    # in some cases you might not want to split on ZWJ
-    # this also tests the case where we need a bigger byte[]
-    # see http://en.wikipedia.org/wiki/Zero-width_joiner
-    \\u200D => ALPHANUM
---------------------------------------------------
-
-NOTE: Using a tokenizer like the `standard` tokenizer may interfere with
-the `catenate_*` and `preserve_original` parameters, as the original
-string may already have lost punctuation during tokenization.  Instead,
-you may want to use the `whitespace` tokenizer.
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
+--
+
+`type_table_path`::
++
+--
+(Optional, string)
+Path to a file that contains custom type mappings for characters. This allows
+you to map non-alphanumeric characters as numeric or alphanumeric to avoid
+splitting on those characters.
+
+For example, the contents of this file may contain the following:
+
+[source,txt]
+----
+# Map the $, %, '.', and ',' characters to DIGIT
+# This might be useful for financial data.
+$ => DIGIT
+% => DIGIT
+. => DIGIT
+\\u002C => DIGIT
+
+# in some cases you might not want to split on ZWJ
+# this also tests the case where we need a bigger byte[]
+# see http://en.wikipedia.org/wiki/Zero-width_joiner
+\\u200D => ALPHANUM
+----
+
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
+
+This file path must be absolute or relative to the `config` location, and the
+file must be UTF-8 encoded. Each mapping in the file must be separated by a line
+break.
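+
+For example, assuming the mappings above are saved to an
+`analysis/type-table.txt` file in the `config` location (an illustrative
+path), the following sketch defines a filter that reads them:
+
+[source,console]
+----
+PUT /my-type-table-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_type_table_filter" ]
+        }
+      },
+      "filter": {
+        "my_type_table_filter": {
+          "type": "word_delimiter_graph",
+          "type_table_path": "analysis/type-table.txt"
+        }
+      }
+    }
+  }
+}
+----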
+--
+
+[[analysis-word-delimiter-graph-tokenfilter-customize]]
+==== Customize
+
+To customize the `word_delimiter_graph` filter, duplicate it to create the basis
+for a new custom token filter. You can modify the filter using its configurable
+parameters.
+
+For example, the following request creates a `word_delimiter_graph`
+filter that uses the following rules:
+
+* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
+  character.
+* Remove leading or trailing delimiters from each token.
+* Do _not_ split tokens at letter case transitions.
+* Do _not_ split tokens at letter-number transitions.
+* Remove the English possessive (`'s`) from the end of each token.
+
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "my_custom_word_delimiter_graph_filter" ]
+        }
+      },
+      "filter": {
+        "my_custom_word_delimiter_graph_filter": {
+          "type": "word_delimiter_graph",
+          "type_table": [ "- => ALPHA" ],
+          "split_on_case_change": false,
+          "split_on_numerics": false,
+          "stem_english_possessive": true
+        }
+      }
+    }
+  }
+}
+----
+
+[[analysis-word-delimiter-graph-differences]]
+==== Differences between `word_delimiter_graph` and `word_delimiter`
+
+Both the `word_delimiter_graph` and
+<<analysis-word-delimiter-tokenfilter,`word_delimiter`>> filters produce tokens
+that span multiple positions when any of the following parameters are `true`:
+
+ * <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
+ * <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
+
+However, only the `word_delimiter_graph` filter assigns multi-position tokens a
+`positionLength` attribute, which indicates the number of positions a token
+spans. This ensures the `word_delimiter_graph` filter always produces valid token
+https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs].
+
+The `word_delimiter` filter does not assign multi-position tokens a
+`positionLength` attribute. This means it produces invalid graphs for streams
+including these tokens.
+
+While indexing does not support token graphs containing multi-position tokens,
+queries, such as the <<query-dsl-match-query-phrase,`match_phrase`>> query, can
+use these graphs to generate multiple sub-queries from a single query string.
+
+To see how token graphs produced by the `word_delimiter` and
+`word_delimiter_graph` filters differ, check out the following example.
+
+.*Example*
+[%collapsible]
+====
+
+[[analysis-word-delimiter-graph-basic-token-graph]]
+*Basic token graph*
+
+Both the `word_delimiter` and `word_delimiter_graph` filters produce the
+following token graph for `PowerShot2000` when the following parameters are
+`false`:
+
+ * <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
+ * <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
+ * <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
+
+This graph does not contain multi-position tokens. All tokens span only one
+position.
+
+image::images/analysis/token-graph-basic.svg[align="center"]
+
+[[analysis-word-delimiter-graph-wdg-token-graph]]
+*`word_delimiter_graph` graph with a multi-position token*
+
+The `word_delimiter_graph` filter produces the following token graph for
+`PowerShot2000` when `catenate_words` is `true`.
+
+This graph correctly indicates the catenated `PowerShot` token spans two
+positions.
+
+image::images/analysis/token-graph-wdg.svg[align="center"]
+
+[[analysis-word-delimiter-graph-wd-token-graph]]
+*`word_delimiter` graph with a multi-position token*
+
+When `catenate_words` is `true`, the `word_delimiter` filter produces
+the following token graph for `PowerShot2000`.
+
+Note that the catenated `PowerShot` token should span two positions but only
+spans one in the token graph, making it invalid.
+
+image::images/analysis/token-graph-wd.svg[align="center"]
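+
+To experiment with these graphs yourself, the following analyze request is a
+sketch that runs the `word_delimiter_graph` filter on `PowerShot2000` with
+`catenate_words` enabled, using an inline filter definition. Swapping the
+`type` to `word_delimiter` lets you compare the two filters' output for the
+same input:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "whitespace",
+  "filter": [
+    {
+      "type": "word_delimiter_graph",
+      "catenate_words": true
+    }
+  ],
+  "text": "PowerShot2000"
+}
+----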
+
+====