[[analysis-chargroup-tokenizer]]
=== Character group tokenizer
++++
<titleabbrev>Character group</titleabbrev>
++++

The `char_group` tokenizer breaks text into terms whenever it encounters a
character that is in a defined set. It is mostly useful for cases where simple
custom tokenization is desired, and the overhead of the
<<analysis-pattern-tokenizer,`pattern` tokenizer>> is not acceptable.
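
To use the tokenizer for indexing, reference it from a custom analyzer in the
index settings. The request below is a minimal sketch; the index, analyzer,
and tokenizer names (`my-index-000001`, `my_analyzer`, `my_tokenizer`) are
placeholders:

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-"
          ]
        }
      }
    }
  }
}
---------------------------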

[float]
=== Configuration

The `char_group` tokenizer accepts the following parameters:

[horizontal]
`tokenize_on_chars`::
A list of characters to tokenize the string on. Whenever a character from
this list is encountered, a new token is started. This accepts either single
characters such as `-`, or character groups: `whitespace`, `letter`, `digit`,
`punctuation`, `symbol`.

`max_token_length`::
The maximum token length. If a token exceeds this length, it is split at
`max_token_length` intervals. Defaults to `255`. An illustrative request
follows this list.
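
As an illustration of `max_token_length`, the sketch below tokenizes on
whitespace and caps tokens at 5 characters. Because over-long tokens are split
at `max_token_length` intervals, a 13-character word such as `Elasticsearch`
should come back as the consecutive chunks `Elast`, `icsea`, and `rch` (the
text and limit here are illustrative, not from the original example):

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace"
    ],
    "max_token_length": 5
  },
  "text": "tokenize Elasticsearch"
}
---------------------------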

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
---------------------------

returns

[source,console-result]
---------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
---------------------------