[[analysis-keyword-tokenizer]]
=== Keyword tokenizer
++++
<titleabbrev>Keyword</titleabbrev>
++++

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters to normalise output, e.g. lower-casing email addresses.
[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "New York",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The above text would produce the following term:

[source,text]
---------------------------
[ New York ]
---------------------------
[discrete]
[[analysis-keyword-tokenizer-token-filters]]
=== Combine with token filters

You can combine the `keyword` tokenizer with token filters to normalise
structured data, such as product IDs or email addresses.

For example, the following <<indices-analyze,analyze API>> request uses the
`keyword` tokenizer and <<analysis-lowercase-tokenfilter,`lowercase`>> filter to
convert an email address to lowercase.
[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "john.SMITH@example.COM"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john.smith@example.com",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The request produces the following token:

[source,text]
---------------------------
[ john.smith@example.com ]
---------------------------
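
To apply the same normalisation at index time, the `keyword` tokenizer and
`lowercase` filter can be wired into a custom analyzer in the index settings.
The sketch below assumes placeholder names, `my-index-000001` and
`my_email_analyzer`, which are not part of this reference.

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": { <1>
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
---------------------------
<1> `my_email_analyzer` is an illustrative name. The analyzer emits the whole
input as a single lower-cased term, e.g. `john.SMITH@example.COM` becomes
`[ john.smith@example.com ]`.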
[float]
=== Configuration

The `keyword` tokenizer accepts the following parameters:

[horizontal]
`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `256`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.
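
The sketch below shows the syntax for setting `buffer_size` on a custom
`keyword` tokenizer; the index, analyzer, and tokenizer names
(`my-index-000001`, `my_analyzer`, `my_tokenizer`) are illustrative
placeholders, and the value shown simply restates the default.

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "keyword",
          "buffer_size": 256 <1>
        }
      }
    }
  }
}
---------------------------
<1> `256` is the default; a different value only changes how the internal term
buffer grows while the text is consumed, not the tokenizer's output.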