
[[analysis-lowercase-tokenizer]]
=== Lowercase tokenizer
++++
<titleabbrev>Lowercase</titleabbrev>
++++

The `lowercase` tokenizer, like the
<<analysis-letter-tokenizer, `letter` tokenizer>>, breaks text into terms
whenever it encounters a character which is not a letter, but it also
lowercases all terms. It is functionally equivalent to the
<<analysis-letter-tokenizer, `letter` tokenizer>> combined with the
<<analysis-lowercase-tokenfilter, `lowercase` token filter>>, but is more
efficient as it performs both steps in a single pass.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 6
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 8
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

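For comparison, the equivalence described above can be checked directly with
the `_analyze` API by pairing the `letter` tokenizer with the `lowercase`
token filter. The request below is only a sketch that reuses the sample
sentence; it produces the same terms, but less efficiently than the
`lowercase` tokenizer, which does both steps in one pass.

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "letter",
  "filter": [ "lowercase" ],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
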
[discrete]
=== Configuration

The `lowercase` tokenizer is not configurable.
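
Although it takes no parameters, the tokenizer can still be referenced by
name from a custom analyzer. The sketch below is an illustration only; the
index name `my-index-000001` and the analyzer name `my_lowercase_analyzer`
are placeholders.

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "lowercase"
        }
      }
    }
  }
}
---------------------------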