123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104 |
- [[analysis-simplepattern-tokenizer]]
- === Simple Pattern Tokenizer
- experimental[This functionality is marked as experimental in Lucene]
- The `simple_pattern` tokenizer uses a regular expression to capture matching
- text as terms. The set of regular expression features it supports is more
- limited than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but the
- tokenization is generally faster.
- This tokenizer does not support splitting the input on a pattern match, unlike
- the <<analysis-pattern-tokenizer,`pattern`>> tokenizer. To split on pattern
- matches using the same restricted regular expression subset, see the
- <<analysis-simplepatternsplit-tokenizer,`simple_pattern_split`>> tokenizer.
- This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
- For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.
- The default pattern is the empty string, which produces no terms. This
- tokenizer should always be configured with a non-default pattern.
- [float]
- === Configuration
- The `simple_pattern` tokenizer accepts the following parameters:
- [horizontal]
- `pattern`::
- {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
- [float]
- === Example configuration
- This example configures the `simple_pattern` tokenizer to produce terms that are
- three-digit numbers
- [source,js]
- ----------------------------
- PUT my_index
- {
- "settings": {
- "analysis": {
- "analyzer": {
- "my_analyzer": {
- "tokenizer": "my_tokenizer"
- }
- },
- "tokenizer": {
- "my_tokenizer": {
- "type": "simple_pattern",
- "pattern": "[0123456789]{3}"
- }
- }
- }
- }
- }
- POST my_index/_analyze
- {
- "analyzer": "my_analyzer",
- "text": "fd-786-335-514-x"
- }
- ----------------------------
- // CONSOLE
- /////////////////////
- [source,console-result]
- ----------------------------
- {
- "tokens" : [
- {
- "token" : "786",
- "start_offset" : 3,
- "end_offset" : 6,
- "type" : "word",
- "position" : 0
- },
- {
- "token" : "335",
- "start_offset" : 7,
- "end_offset" : 10,
- "type" : "word",
- "position" : 1
- },
- {
- "token" : "514",
- "start_offset" : 11,
- "end_offset" : 14,
- "type" : "word",
- "position" : 2
- }
- ]
- }
- ----------------------------
- /////////////////////
- The above example produces these terms:
- [source,text]
- ---------------------------
- [ 786, 335, 514 ]
- ---------------------------
|