[[analysis-simplepattern-tokenizer]]
=== Simple pattern tokenizer
++++
<titleabbrev>Simple pattern</titleabbrev>
++++

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. The set of regular expression features it supports is more
limited than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but the
tokenization is generally faster.

This tokenizer does not support splitting the input on a pattern match, unlike
the <<analysis-pattern-tokenizer,`pattern`>> tokenizer. To split on pattern
matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simple_pattern_split`>> tokenizer.

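For comparison, the following is a minimal sketch of splitting on pattern
matches with the `simple_pattern_split` tokenizer. The index, analyzer, and
tokenizer names here are illustrative:

[source,console]
----------------------------
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_split_analyzer": {
          "tokenizer": "my_split_tokenizer"
        }
      },
      "tokenizer": {
        "my_split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "-"
        }
      }
    }
  }
}
----------------------------

With this configuration, analyzing `fd-786-335-514-x` would produce the text
between the matches (`fd`, `786`, `335`, `514`, `x`) rather than the matched
text itself.
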
This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces no terms. This
tokenizer should always be configured with a non-default pattern.

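As a sketch of that behavior (the index, analyzer, and tokenizer names here
are illustrative), a `simple_pattern` tokenizer created without a `pattern`
returns no tokens from the `_analyze` API:

[source,console]
----------------------------
PUT my-index-000003
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_default_analyzer": {
          "tokenizer": "my_default_tokenizer"
        }
      },
      "tokenizer": {
        "my_default_tokenizer": {
          "type": "simple_pattern"
        }
      }
    }
  }
}

POST my-index-000003/_analyze
{
  "analyzer": "my_default_analyzer",
  "text": "no terms are produced"
}
----------------------------

Because the default pattern matches nothing, the `tokens` array in the
response is empty.
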
[discrete]
=== Configuration

The `simple_pattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::
    A {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression]. Defaults to the empty string.

[discrete]
=== Example configuration

This example configures the `simple_pattern` tokenizer to produce terms that
are three-digit numbers:

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens" : [
    {
      "token" : "786",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "335",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "514",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ 786, 335, 514 ]
---------------------------
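
Lucene regular expressions support character ranges inside character classes,
so the pattern in this example could equally be written as `[0-9]{3}`.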