
[[analysis-simplepattern-tokenizer]]
=== Simple Pattern Tokenizer

experimental[This functionality is marked as experimental in Lucene]

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. It supports a more limited set of regular expression features
than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but tokenization
is generally faster.

Unlike the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, this tokenizer
does not support splitting the input on a pattern match. To split on pattern
matches using the same restricted regular expression subset, use the
<<analysis-simplepatternsplit-tokenizer,`simple_pattern_split`>> tokenizer,
as in the sketch below.
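
To illustrate the match-versus-split distinction, here is a minimal sketch,
not one of the official examples: with the pattern `-`, `simple_pattern`
would emit the hyphens themselves as terms, while `simple_pattern_split`
treats each match as a separator. The index and analyzer names below are
illustrative.

[source,console]
----------------------------
PUT split_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_split_analyzer": {
          "tokenizer": "my_split_tokenizer"
        }
      },
      "tokenizer": {
        "my_split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "-"
        }
      }
    }
  }
}

POST split_example/_analyze
{
  "analyzer": "my_split_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------

This request should produce the terms `[ fd, 786, 335, 514, x ]`, whereas a
`simple_pattern` tokenizer with the same pattern would capture only the `-`
matches themselves.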

This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces no terms. This
tokenizer should always be configured with a non-default pattern.
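
For example, a `simple_pattern` tokenizer left at the default pattern matches
nothing, so `_analyze` returns an empty token list. This is a minimal sketch;
the index and analyzer names are illustrative:

[source,console]
----------------------------
PUT default_pattern_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern"
        }
      }
    }
  }
}

POST default_pattern_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "some text"
}
----------------------------

The `tokens` array in the response should be empty.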

[float]
=== Configuration

The `simple_pattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::

    {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.

[float]
=== Example configuration

This example configures the `simple_pattern` tokenizer to produce terms that
are three-digit numbers:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens" : [
    {
      "token" : "786",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "335",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "514",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ 786, 335, 514 ]
---------------------------