
[[test-analyzer]]
=== Test an analyzer

The <<indices-analyze,`analyze` API>> is an invaluable tool for viewing the
terms produced by an analyzer. A built-in analyzer can be specified inline in
the request:
[source,console]
-------------------------------------
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The quick brown fox."
}
-------------------------------------
The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox.",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
-------------------------------------
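The `whitespace` analyzer simply splits on runs of whitespace, which is why the trailing period stays attached to `fox.`. As an illustration only (this is not Elasticsearch code), the tokens and offsets above can be reproduced with a few lines of Python:

```python
import re

def whitespace_analyze(text):
    """Mimic the whitespace analyzer: split on runs of whitespace,
    recording each token's character offsets and position."""
    return [
        {
            "token": m.group(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "type": "word",
            "position": pos,
        }
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]

for t in whitespace_analyze("The quick brown fox."):
    print(t)
```

Running this prints the same four tokens, offsets, and positions as the response above, including `fox.` at offsets 16-20.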
You can also test combinations of:

* A tokenizer
* Zero or more token filters
* Zero or more character filters

[source,console]
-------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding" ],
  "text": "Is this déja vu?"
}
-------------------------------------
The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------
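Here the `lowercase` filter downcases each token and `asciifolding` strips diacritics, turning `déja` into `deja`. As a rough approximation (not the actual Lucene implementation), the effect of this filter chain can be sketched in Python using Unicode decomposition:

```python
import re
import unicodedata

def ascii_fold(token):
    """Approximate the asciifolding filter: decompose characters
    (NFKD), then drop combining marks so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def analyze(text):
    # Rough stand-in for the standard tokenizer: keep word characters
    # only, so punctuation such as '?' is discarded.
    tokens = [m.group() for m in re.finditer(r"\w+", text)]
    # Apply the token filters in order: lowercase, then asciifolding.
    return [ascii_fold(t.lower()) for t in tokens]

print(analyze("Is this déja vu?"))  # ['is', 'this', 'deja', 'vu']
```

The order matters: filters run in the order listed in the `filter` array, each consuming the previous filter's output.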
.Positions and character offsets
*********************************************************

As can be seen from the output of the `analyze` API, analyzers not only
convert words into terms, they also record the order or relative _positions_
of each term (used for phrase queries or word proximity queries), and the
start and end _character offsets_ of each term in the original text (used for
highlighting search snippets).

*********************************************************
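The character offsets are exactly what a highlighter needs to wrap a matched term in markup without re-tokenizing the original text. A minimal sketch (a hypothetical helper, not an Elasticsearch API):

```python
def highlight(text, start_offset, end_offset, tag="em"):
    """Wrap the span identified by a token's character offsets
    in the given tag, as a search-snippet highlighter would."""
    return (
        text[:start_offset]
        + f"<{tag}>" + text[start_offset:end_offset] + f"</{tag}>"
        + text[end_offset:]
    )

# 'quick' occupies offsets 4-9 in the whitespace example above.
print(highlight("The quick brown fox.", 4, 9))
# The <em>quick</em> brown fox.
```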
Alternatively, a <<analysis-custom-analyzer,`custom` analyzer>> can be
referred to when running the `analyze` API on a specific index:

[source,console]
-------------------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" <2>
      }
    }
  }
}

GET my-index-000001/_analyze <3>
{
  "analyzer": "std_folded", <4>
  "text": "Is this déjà vu?"
}

GET my-index-000001/_analyze <3>
{
  "field": "my_text", <5>
  "text": "Is this déjà vu?"
}
-------------------------------------
The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------

<1> Define a `custom` analyzer called `std_folded`.
<2> The field `my_text` uses the `std_folded` analyzer.
<3> To refer to this analyzer, the `analyze` API must specify the index name.
<4> Refer to the analyzer by name.
<5> Refer to the analyzer used by field `my_text`.