[[analysis-whitespace-analyzer]]
=== Whitespace analyzer
++++
<titleabbrev>Whitespace</titleabbrev>
++++

The `whitespace` analyzer breaks text into terms whenever it encounters a
whitespace character.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "Brown-Foxes",
      "start_offset": 12,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 6
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "word",
      "position": 8
    },
    {
      "token": "bone.",
      "start_offset": 51,
      "end_offset": 56,
      "type": "word",
      "position": 9
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
---------------------------

[discrete]
=== Configuration

The `whitespace` analyzer is not configurable.

[discrete]
=== Definition

It consists of:

Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>
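
Because the analyzer consists of this tokenizer alone, naming the tokenizer
directly in an `_analyze` request is one way to check the definition. The
request below is an illustrative sketch, not part of the built-in analyzer's
configuration. It should return the same terms as the `whitespace` analyzer
does for the example sentence.

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
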

To customize the `whitespace` analyzer, recreate it as a `custom` analyzer
and modify it, usually by adding token filters. The following request
recreates the built-in `whitespace` analyzer, and you can use it as a
starting point for further customization:

[source,console]
----------------------------------------------------
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": [         <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: whitespace_example, first: whitespace, second: rebuilt_whitespace}\nendyaml\n/]

<1> You'd add any token filters here.
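
As a sketch of one such customization, the following requests create a
variant that lowercases every term and then run the example sentence through
it. The index name `whitespace_lowercase_example` and the analyzer name
`whitespace_lowercase` are hypothetical and chosen only for illustration.
`lowercase` is the built-in lowercase token filter.

[source,console]
----------------------------------------------------
PUT /whitespace_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

POST /whitespace_lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------------------------------

With the `lowercase` filter in place, the sentence should produce these
terms. Punctuation and hyphens are still kept, because the tokenizer splits
only on whitespace:

[source,text]
---------------------------
[ the, 2, quick, brown-foxes, jumped, over, the, lazy, dog's, bone. ]
---------------------------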