
[[analysis-keyword-tokenizer]]
=== Keyword tokenizer
++++
<titleabbrev>Keyword</titleabbrev>
++++

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters to normalise output, e.g. lower-casing email addresses.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
---------------------------

/////////////////////
[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "New York",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------
/////////////////////

The above request produces the following term:

[source,text]
---------------------------
[ New York ]
---------------------------
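
For comparison, the <<analysis-standard-tokenizer,`standard`>> tokenizer splits
the same text on word boundaries and produces the two terms `[ New, York ]`
rather than a single term. A minimal sketch of the same request with the
`standard` tokenizer:

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "standard",
  "text": "New York"
}
---------------------------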

[discrete]
[[analysis-keyword-tokenizer-token-filters]]
=== Combine with token filters

You can combine the `keyword` tokenizer with token filters to normalise
structured data, such as product IDs or email addresses.

For example, the following <<indices-analyze,analyze API>> request uses the
`keyword` tokenizer and <<analysis-lowercase-tokenfilter,`lowercase`>> filter to
convert an email address to lowercase.

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "john.SMITH@example.COM"
}
---------------------------

/////////////////////
[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john.smith@example.com",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------
/////////////////////

The request produces the following token:

[source,text]
---------------------------
[ john.smith@example.com ]
---------------------------
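
To apply this normalisation at index time rather than through the analyze API,
you can register the same combination as a custom analyzer in the index
settings. The following is a minimal sketch; the index name `my-index-000001`
and the analyzer name `my_email_analyzer` are placeholders.

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
---------------------------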

[float]
=== Configuration

The `keyword` tokenizer accepts the following parameters:

[horizontal]
`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `256`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.
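
Because `buffer_size` is a tokenizer parameter, setting it means defining a
custom tokenizer of type `keyword` in the index settings. The sketch below is
illustrative only, with placeholder names `my-index-000001`,
`my_keyword_analyzer`, and `my_keyword_tokenizer`; as noted above, it is
advisable to leave the value at its default of `256`.

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_analyzer": {
          "type": "custom",
          "tokenizer": "my_keyword_tokenizer"
        }
      },
      "tokenizer": {
        "my_keyword_tokenizer": {
          "type": "keyword",
          "buffer_size": 256
        }
      }
    }
  }
}
---------------------------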