[[analysis-lowercase-tokenizer]]
=== Lowercase Tokenizer

The `lowercase` tokenizer, like the
<<analysis-letter-tokenizer, `letter` tokenizer>>, breaks text into terms
whenever it encounters a character which is not a letter, but it also
lowercases all terms. It is functionally equivalent to the
<<analysis-letter-tokenizer, `letter` tokenizer>> combined with the
<<analysis-lowercase-tokenfilter, `lowercase` token filter>>, but is more
efficient as it performs both steps in a single pass.
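
As a rough illustration of that equivalence, the following sketch defines a
custom analyzer from the `letter` tokenizer and the `lowercase` token filter;
the index name `my-index` and the analyzer name `letter_lowercase` are
placeholders, not part of this reference:

[source,js]
---------------------------
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "letter_lowercase": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
---------------------------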

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 6
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 8
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

[float]
=== Configuration

The `lowercase` tokenizer is not configurable.
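
Since there are no parameters to set, the tokenizer is simply referenced by
name wherever a tokenizer is expected, for example in a custom analyzer
definition. The sketch below is illustrative only; `my-index` and
`my_lowercase_analyzer` are placeholder names:

[source,js]
---------------------------
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "lowercase"
        }
      }
    }
  }
}
---------------------------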