
[[analysis-keyword-tokenizer]]
=== Keyword Tokenizer

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters to normalise output, e.g. lower-casing email addresses.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "New York",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The above request would produce the following term:

[source,text]
---------------------------
[ New York ]
---------------------------
[discrete]
[[analysis-keyword-tokenizer-token-filters]]
=== Combine with token filters

You can combine the `keyword` tokenizer with token filters to normalise
structured data, such as product IDs or email addresses.

For example, the following <<indices-analyze,analyze API>> request uses the
`keyword` tokenizer and <<analysis-lowercase-tokenfilter,`lowercase`>> filter to
convert an email address to lowercase.

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "john.SMITH@example.COM"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john.smith@example.com",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The request produces the following token:

[source,text]
---------------------------
[ john.smith@example.com ]
---------------------------
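
To apply the same normalisation at index time, the `keyword` tokenizer and
`lowercase` filter can be wired into a custom analyzer in the index settings.
The index, analyzer, and field names below are illustrative examples, not part
of the `keyword` tokenizer itself:

[source,console]
---------------------------
PUT my-email-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": { <1>
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "email_analyzer"
      }
    }
  }
}
---------------------------
<1> `email_analyzer`, `my-email-index`, and the `email` field are example names;
any names may be used.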
[float]
=== Configuration

The `keyword` tokenizer accepts the following parameters:

[horizontal]
`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `256`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.
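
If you do need to set `buffer_size`, it is configured on a custom tokenizer
definition in the index settings. The index, tokenizer, and analyzer names
below are illustrative, and `buffer_size` is shown at its default value:

[source,console]
---------------------------
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_keyword_tokenizer": { <1>
          "type": "keyword",
          "buffer_size": 256
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_keyword_tokenizer"
        }
      }
    }
  }
}
---------------------------
<1> `my_keyword_tokenizer`, `my_analyzer`, and `my-index` are example names;
`buffer_size: 256` simply restates the default.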