
[[analysis-chargroup-tokenizer]]
=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters a
character that is in a defined set. It is mostly useful for cases where a
simple custom tokenization is desired, and the overhead of the
<<analysis-pattern-tokenizer,`pattern` tokenizer>> is not acceptable.
[float]
=== Configuration

The `char_group` tokenizer accepts one parameter:

[horizontal]
`tokenize_on_chars`::

    A list of characters to tokenize the string on. Whenever a character
    from this list is encountered, a new token is started. This accepts
    either single characters, e.g. `-`, or character groups: `whitespace`,
    `letter`, `digit`, `punctuation`, `symbol`.
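
In practice the tokenizer is usually registered as part of a custom analyzer
in the index settings. The following is a minimal sketch of one way to do
this; the index, analyzer, and tokenizer names are illustrative placeholders,
not part of this reference:

[source,console]
---------------------------
// "my-index", "my_analyzer", and "my_char_group_tokenizer" are placeholder names.
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_char_group_tokenizer"
        }
      },
      "tokenizer": {
        "my_char_group_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-"
          ]
        }
      }
    }
  }
}
---------------------------

Fields analyzed with `my_analyzer` would then be tokenized on whitespace and
hyphens.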
[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
---------------------------
returns

[source,console-result]
---------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
---------------------------
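
The above example produces the following terms:

[source,text]
---------------------------
[ The, QUICK, brown, fox ]
---------------------------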