[[analysis-chargroup-tokenizer]]
=== Char Group Tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters a
character which is in a defined set. It is mostly useful for cases where a simple
custom tokenization is desired, and the overhead of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>> is not acceptable.
[float]
=== Configuration

The `char_group` tokenizer accepts one parameter:

[horizontal]
`tokenize_on_chars`::
    A list containing the characters to tokenize the string on. Whenever a character
    from this list is encountered, a new token is started. The list accepts either single
    characters, such as `-`, or character groups: `whitespace`, `letter`, `digit`,
    `punctuation`, `symbol`.
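
As a minimal sketch of how this parameter is typically used, a `char_group`
tokenizer can be registered in the index settings and referenced from a custom
analyzer. The names `my_index`, `my_analyzer` and `my_tokenizer` below are
placeholders, not part of the example in the next section:

[source,js]
---------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-",
            "\n"
          ]
        }
      }
    }
  }
}
---------------------------
// CONSOLE

An analyzer configured this way splits text on whitespace, `-` and `\n`, which
is the same behaviour shown by the `_analyze` request below.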
[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
---------------------------
// CONSOLE

returns
[source,console-result]
---------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
---------------------------