
[[analysis-chargroup-tokenizer]]
=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters a
character which is in a defined set. It is mostly useful for cases where simple
custom tokenization is desired, and the overhead of using the
<<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable.

[float]
=== Configuration

The `char_group` tokenizer accepts one parameter:

[horizontal]
`tokenize_on_chars`::

    A list containing the characters to tokenize the string on. Whenever a character
    from this list is encountered, a new token is started. This accepts either single
    characters such as `-`, or character groups: `whitespace`, `letter`, `digit`,
    `punctuation`, `symbol`. See the sketch below for how the tokenizer is
    typically wired into an analyzer.
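
As a sketch of how this is usually put to work, the tokenizer can be referenced
from a custom analyzer in the index settings. The index name `my_index` and the
names `my_analyzer` and `my_tokenizer` below are placeholders chosen for this
example, not predefined values:

[source,js]
---------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-"
          ]
        }
      }
    }
  }
}
---------------------------
// CONSOLE
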
[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
---------------------------
// CONSOLE

returns

[source,console-result]
---------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
---------------------------
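
The above example produces the following terms:

[source,text]
---------------------------
[ The, QUICK, brown, fox ]
---------------------------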