[[analysis-simplepatternsplit-tokenizer]]
=== Simple Pattern Split Tokenizer

experimental[This functionality is marked as experimental in Lucene]

The `simple_pattern_split` tokenizer uses a regular expression to split the
input into terms at pattern matches. The set of regular expression features it
supports is more limited than that of the <<analysis-pattern-tokenizer,`pattern`>>
tokenizer, but the tokenization is generally faster.

This tokenizer does not produce terms from the matches themselves. To produce
terms from matches using patterns in the same restricted regular expression
subset, see the <<analysis-simplepattern-tokenizer,`simple_pattern`>>
tokenizer.

This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces one term containing the
full input. This tokenizer should always be configured with a non-default
pattern.

[float]
=== Configuration

The `simple_pattern_split` tokenizer accepts the following parameters:

[horizontal]
`pattern`::
    A {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.

[float]
=== Example configuration

This example configures the `simple_pattern_split` tokenizer to split the input
text on underscores.

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "an_underscored_phrase"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens" : [
    {
      "token" : "an",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "underscored",
      "start_offset" : 3,
      "end_offset" : 14,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "phrase",
      "start_offset" : 15,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ an, underscored, phrase ]
---------------------------
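
Because the restricted syntax still includes character classes and the `+`
repetition operator, a single pattern can split on runs of more than one
separator character. The request below is an illustrative sketch, not part of
the example above: it defines the tokenizer inline in the `_analyze` API and
uses the pattern `[_-]+` to split on runs of underscores or hyphens.

[source,console]
----------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "[_-]+"
  },
  "text": "an_under-scored__phrase"
}
----------------------------

This should produce the terms `an`, `under`, `scored`, and `phrase`; the
doubled underscore yields no empty term because the whole run matches the
pattern.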