[[analysis-simplepattern-tokenizer]]
=== Simple Pattern Tokenizer

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. The set of regular expression features it supports is more
limited than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but the
tokenization is generally faster.

This tokenizer does not support splitting the input on a pattern match, unlike
the <<analysis-pattern-tokenizer,`pattern`>> tokenizer. To split on pattern
matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simple_pattern_split`>> tokenizer.
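For contrast, here is a minimal sketch of the `simple_pattern_split` tokenizer
splitting the same sample input on a delimiter. The index, analyzer, and
tokenizer names are illustrative:

[source,console]
----------------------------
PUT my_split_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_split_analyzer": {
          "tokenizer": "my_split_tokenizer"
        }
      },
      "tokenizer": {
        "my_split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "-"
        }
      }
    }
  }
}

POST my_split_index/_analyze
{
  "analyzer": "my_split_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------

This emits the text between matches, `[ fd, 786, 335, 514, x ]`, whereas
`simple_pattern` emits only the matched text itself.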

This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces no terms. This
tokenizer should always be configured with a non-default pattern.

[float]
=== Configuration

The `simple_pattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::

    {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.

[float]
=== Example configuration

This example configures the `simple_pattern` tokenizer to produce terms that are
three-digit numbers:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens" : [
    {
      "token" : "786",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "335",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "514",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ 786, 335, 514 ]
---------------------------
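
As a further, hypothetical usage sketch (the `code` field and the index name
`my_index_with_mapping` are illustrative, not part of the example above), an
analyzer defined this way can be applied to a field through the index mapping:

[source,console]
----------------------------
PUT my_index_with_mapping
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
----------------------------

Documents indexed into the `code` field would then be tokenized into their
embedded three-digit numbers at index time.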