1234567891011121314151617181920212223242526272829303132333435363738 |
- [[analysis-pattern-tokenizer]]
- === Pattern Tokenizer
- A tokenizer of type `pattern` that can flexibly separate text into terms
- via a regular expression. Accepts the following settings:
- [cols="<,<",options="header",]
- |======================================================================
- |Setting |Description
- |`pattern` |The regular expression pattern, defaults to `\W+`.
- |`flags` |The regular expression flags.
- |`group` |Which group to extract into tokens. Defaults to `-1` (split).
- |======================================================================
- *IMPORTANT*: The regular expression should match the *token separators*,
- not the tokens themselves.
- *********************************************
- Note that you may need to escape `pattern` string literal according to
- your client language rules. For example, in many programming languages
- a string literal for `\W+` pattern is written as `"\\W+"`.
- There is nothing special about `pattern` (you may have to escape other
- string literals as well); escaping `pattern` is common just because it
- often contains characters that should be escaped.
- *********************************************
- `group` set to `-1` (the default) is equivalent to "split". Using group
- >= 0 selects the matching group as the token. For example, if you have:
- ------------------------
- pattern = '([^']+)'
- group = 0
- input = aaa 'bbb' 'ccc'
- ------------------------
- the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
- marks). With the same input but using group=1, the output would be:
- `bbb` and `ccc` (no `'` marks).
|