[[analysis-pattern-tokenizer]]
=== Pattern Tokenizer
A tokenizer of type `pattern` that can flexibly separate text into terms
via a regular expression. It accepts the following settings:
[cols="<,<",options="header",]
|======================================================================
|Setting |Description
|`pattern` |The regular expression pattern, defaults to `\W+`.
|`flags` |The regular expression flags, pipe-separated, e.g. `CASE_INSENSITIVE\|COMMENTS`.
|`group` |Which capture group to extract into tokens. Defaults to `-1` (split).
|======================================================================
*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
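The separator-matching behavior is the same as a regex split in most languages. As a rough Python sketch of what the default `\W+` pattern does (an analogy only; the tokenizer itself runs on Java's `Pattern` class):

```python
import re

# The regex matches the separators (runs of non-word characters),
# and the text between matches becomes the tokens.
text = "foo bar, baz!"
tokens = [t for t in re.split(r"\W+", text) if t]  # drop empty strings
print(tokens)  # ['foo', 'bar', 'baz']
```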
*********************************************
Note that you may need to escape the `pattern` string literal according
to your client language's rules. For example, in many programming
languages a string literal for the `\W+` pattern is written as `"\\W+"`.
There is nothing special about `pattern` (you may have to escape other
string literals as well); escaping `pattern` is common just because it
often contains characters that should be escaped.
*********************************************
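For instance, when the settings are sent as a JSON request body, the backslash must be doubled. A minimal sketch of an index configuration (the analyzer and tokenizer names here are illustrative):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer"
        }
      }
    }
  }
}
```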
`group` set to `-1` (the default) is equivalent to "split". Using a
group >= 0 selects the matching group as the token. For example, if you
have:

------------------------
pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'
------------------------

the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
marks). With the same input but using `group=1`, the output would be:
`bbb` and `ccc` (no `'` marks).
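The group selection above can be sketched with Python's `re` module, which uses the same group-numbering convention as Java's `Pattern` (group 0 is the whole match):

```python
import re

text = "aaa 'bbb' 'ccc'"
pattern = re.compile(r"'([^']+)'")

# group=0 keeps the whole match, including the quote marks
group0_tokens = [m.group(0) for m in pattern.finditer(text)]
print(group0_tokens)  # ["'bbb'", "'ccc'"]

# group=1 keeps only the first capture group, without the quotes
group1_tokens = [m.group(1) for m in pattern.finditer(text)]
print(group1_tokens)  # ['bbb', 'ccc']
```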