[[indices-analyze]]
== Analyze

Performs the analysis process on a text and returns the tokens breakdown
of the text.

Can be used without specifying an index against one of the many built-in
analyzers:

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
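
For reference, a response of roughly the following shape is returned, with one
entry per emitted token (this sample is illustrative and, unlike the requests
above, is not a tested example):

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "a",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
--------------------------------------------------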

If the `text` parameter is provided as an array of strings, it is analyzed
as a multi-valued field.

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}
--------------------------------------------------
// CONSOLE

Or by building a custom transient analyzer out of tokenizers,
token filters and char filters. Token filters can use the shorter `filter`
parameter name:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------
// CONSOLE
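
Because the `keyword` tokenizer emits the whole input as a single token, the
`html_strip` char filter removes the markup and `lowercase` leaves the text
unchanged here, so a single-token response of roughly this shape would be
expected; note that the offsets refer back to the original text, including the
stripped tags (illustrative, not a tested example):

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "this is a test",
    "start_offset" : 0,
    "end_offset" : 21,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------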

deprecated[5.0.0, Use `filter`/`char_filter` instead of `filters`/`char_filters` and `token_filters` has been removed]

Custom tokenizers, token filters, and character filters can be specified in the
request body as follows:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
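
With the `whitespace` tokenizer, the `lowercase` filter, and the inline `stop`
filter above, only `test` survives. A response of roughly this shape would be
expected; the token keeps its original position because the removed stopwords
still count toward position increments (illustrative, not a tested example):

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "word",
    "position" : 3
  } ]
}
--------------------------------------------------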

It can also run against a specific index:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

The above will run an analysis on the "this is a test" text, using the
default index analyzer associated with the `analyze_sample` index. An `analyzer`
can also be provided to use a different analyzer:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Also, the analyzer can be derived based on a field mapping, for example:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Will cause the analysis to happen based on the analyzer configured in the
mapping for `obj1.field1` (and if not, the default index analyzer).

A `normalizer` can be provided for a keyword field with a normalizer
associated with the `analyze_sample` index.

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Or by building a custom transient normalizer out of token filters and char filters.

[source,js]
--------------------------------------------------
GET _analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE
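
A transient normalizer behaves like a `keyword` tokenizer combined with the
listed filters, so the whole text comes back as a single lowercased token
(illustrative, not a tested example):

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "bar",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------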

=== Explain Analyze

If you want to get more advanced details, set `explain` to `true` (defaults
to `false`). It will output all token attributes for each token.
You can filter the token attributes you want to output by setting the
`attributes` option.

NOTE: The format of the additional detail information is labelled as
experimental in Lucene and it may change in the future.

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] <1>
}
--------------------------------------------------
// CONSOLE
<1> Set "keyword" to output the "keyword" attribute only

The request returns the following result:

[source,js]
--------------------------------------------------
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false <1>
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false <1>
      } ]
    } ]
  }
}
--------------------------------------------------
// TESTRESPONSE
<1> Only the "keyword" attribute is output, since "attributes" was specified in the request.