[[analysis-chargroup-tokenizer]]
=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters a
character which is in a defined set. It is mostly useful for cases where a simple
custom tokenization is desired, and the overhead of using the
<<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable.

[float]
=== Configuration

The `char_group` tokenizer accepts the following parameters:

[horizontal]
`tokenize_on_chars`::

    A list of characters to tokenize the string on. Whenever a character
    from this list is encountered, a new token is started. This accepts either single
    characters such as `-`, or character groups: `whitespace`, `letter`, `digit`,
    `punctuation`, `symbol`.

`max_token_length`::

    The maximum token length. If a token is seen that exceeds this length then
    it is split at `max_token_length` intervals. Defaults to `255`.

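A custom analyzer built on this tokenizer can be declared in the index
settings along the following lines. This is a minimal sketch: the index name
`my-index` and the `my_analyzer`/`my_tokenizer` names are illustrative
placeholders, not fixed names.

[source,console]
---------------------------
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-"
          ],
          "max_token_length": 20
        }
      }
    }
  }
}
---------------------------
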
[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
---------------------------

returns

[source,console-result]
---------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
---------------------------
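
The effect of `max_token_length` can be seen by lowering it below the length
of an input word. In the hypothetical request below the limit is `5`, so,
given that oversized tokens are split at `max_token_length` intervals, the
single word `Elasticsearch` should come back as the chunks `Elast`, `icsea`,
and `rch`:

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace"
    ],
    "max_token_length": 5
  },
  "text": "Elasticsearch"
}
---------------------------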