
[[analysis-keyword-tokenizer]]
=== Keyword Tokenizer

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters to normalise output, e.g. lower-casing email addresses.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "New York",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The above request would produce the following term:

[source,text]
---------------------------
[ New York ]
---------------------------
[discrete]
[[analysis-keyword-tokenizer-token-filters]]
=== Combine with token filters

You can combine the `keyword` tokenizer with token filters to normalise
structured data, such as product IDs or email addresses.

For example, the following <<indices-analyze,analyze API>> request uses the
`keyword` tokenizer and <<analysis-lowercase-tokenfilter,`lowercase`>> filter to
convert an email address to lowercase.

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "john.SMITH@example.COM"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john.smith@example.com",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The request produces the following token:

[source,text]
---------------------------
[ john.smith@example.com ]
---------------------------
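
To apply the same normalisation at index time, the `keyword` tokenizer and
`lowercase` filter can be wired into a custom analyzer in the index settings.
The index, analyzer, and field names below are illustrative examples, not part
of the `keyword` tokenizer itself:

[source,console]
---------------------------
PUT my-email-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": { <1>
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "email_analyzer"
      }
    }
  }
}
---------------------------
<1> `email_analyzer`, `my-email-index`, and the `email` field are example names;
any names may be used.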
[float]
=== Configuration

The `keyword` tokenizer accepts the following parameters:

[horizontal]
`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `256`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.
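
If you do need to set `buffer_size`, it is configured on a custom tokenizer
definition in the index settings. The index, tokenizer, and analyzer names
below are illustrative, and `buffer_size` is shown at its default value:

[source,console]
---------------------------
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_keyword_tokenizer": { <1>
          "type": "keyword",
          "buffer_size": 256
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_keyword_tokenizer"
        }
      }
    }
  }
}
---------------------------
<1> `my_keyword_tokenizer`, `my_analyzer`, and `my-index` are example names;
`buffer_size: 256` simply restates the default.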