[[analysis-simplepattern-tokenizer]]
=== Simple Pattern Tokenizer

experimental[This functionality is marked as experimental in Lucene]

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. It supports a more limited set of regular expression features
than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but tokenization
is generally faster.

Unlike the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, this tokenizer
does not support splitting the input on a pattern match. To split on pattern
matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simple_pattern_split`>> tokenizer.
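
The difference between capturing matches and splitting on matches can be
sketched outside Elasticsearch. The snippet below is only an analogy: it uses
Python's `re` module, not Lucene regular expressions, and the function names
`match_terms` and `split_terms`, the sample text, and the choice to drop empty
pieces when splitting are all assumptions made for illustration.

```python
import re

# Hypothetical sample input, chosen only for illustration.
TEXT = "fd-786-335-514-x"


def match_terms(pattern, text):
    """Keep the text that matches the pattern (the simple_pattern idea)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


def split_terms(pattern, text):
    """Keep the text between matches (the simple_pattern_split idea).

    Dropping empty pieces is an assumption of this sketch.
    """
    return [piece for piece in re.split(pattern, text) if piece]


if __name__ == "__main__":
    print(match_terms(r"[0-9]{3}", TEXT))  # the matched runs become the terms
    print(split_terms(r"-", TEXT))         # the text between hyphens becomes the terms
```

With the same input, the first call keeps only the three-digit runs, while the
second keeps everything between the hyphens, including `fd` and `x`.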

This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces no terms. This
tokenizer should always be configured with a non-default pattern.

[float]
=== Configuration

The `simple_pattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::
    {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.

[float]
=== Example configuration

This example configures the `simple_pattern` tokenizer to produce terms that are
three-digit numbers:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens" : [
    {
      "token" : "786",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "335",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "514",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ 786, 335, 514 ]
---------------------------
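
To check a pattern against sample text before creating an index, the token
list can be approximated locally. The following is a minimal sketch, assuming
Python's `re` module behaves like the Lucene pattern for this simple case
(other patterns may behave differently); the helper name `analyze` is
hypothetical and only mirrors the shape of the `_analyze` response fields.

```python
import re


def analyze(pattern, text):
    """Approximate the tokens for a pattern, using Python's re as a stand-in.

    Offsets are Python string indices, which match the offsets reported by
    Elasticsearch for plain ASCII input such as the example text.
    """
    tokens = []
    for position, match in enumerate(re.finditer(pattern, text)):
        tokens.append({
            "token": match.group(0),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    # Same pattern and text as the example request.
    for token in analyze(r"[0123456789]{3}", "fd-786-335-514-x"):
        print(token)
```

For the example input, the sketch yields the terms `786`, `335`, and `514`
with start offsets 3, 7, and 11.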