[[indices-analyze]]
=== Analyze

Performs the analysis process on a text and returns the tokens breakdown
of the text.

Can be used without specifying an index against one of the many built-in
analyzers:

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
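
The response lists every token the analyzer produced, together with its
character offsets, type, and position. For the request above, the `standard`
analyzer yields a breakdown along these lines (illustrative, not a tested
response):

[source,js]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
--------------------------------------------------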

If the `text` parameter is provided as an array of strings, it is analyzed
as a multi-valued field.

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}
--------------------------------------------------
// CONSOLE

Or by building a custom transient analyzer out of tokenizers,
token filters and char filters. Token filters can use the shorter `filter`
parameter name:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------
// CONSOLE

deprecated[5.0.0, Use `filter`/`char_filter` instead of `filters`/`char_filters` and `token_filters` has been removed]

Custom tokenizers, token filters, and character filters can be specified
in the request body as follows:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
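
Since the custom `stop` filter above removes the words "a", "is", and "this",
only a single token should survive, keeping its original offsets and position.
A response would look roughly like the following (a sketch, not a tested
response):

[source,js]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "test",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    }
  ]
}
--------------------------------------------------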

It can also run against a specific index:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

The above will run an analysis on the "this is a test" text, using the
default index analyzer associated with the `analyze_sample` index. An `analyzer`
can also be provided to use a different analyzer:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Also, the analyzer can be derived based on a field mapping, for example:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Will cause the analysis to happen based on the analyzer configured in the
mapping for `obj1.field1` (and if not configured, the default index analyzer).

A `normalizer` can be provided for a keyword field with a normalizer
associated with the `analyze_sample` index.

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Or by building a custom transient normalizer out of token filters and char filters.

[source,js]
--------------------------------------------------
GET _analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE
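
A normalizer emits the whole input as a single token, so the request above
should return the lowercased text as one token, roughly as follows
(illustrative, not a tested response):

[source,js]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "bar",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}
--------------------------------------------------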

[[explain-analyze-api]]
==== Explain Analyze

If you want to get more advanced details, set `explain` to `true` (defaults to `false`).
It will output all token attributes for each token.
You can filter the token attributes you want to output by setting the `attributes` option.

NOTE: The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] <1>
}
--------------------------------------------------
// CONSOLE
<1> Set "keyword" to output only the "keyword" attribute

The request returns the following result:

[source,js]
--------------------------------------------------
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false <1>
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false <1>
      } ]
    } ]
  }
}
--------------------------------------------------
// TESTRESPONSE
<1> Output only the "keyword" attribute, since "attributes" was specified in the request

[[tokens-limit-settings]]
[float]
=== Settings to prevent token explosion

Generating an excessive amount of tokens may cause a node to run out of memory.
The following setting allows you to limit the number of tokens that can be produced:

`index.analyze.max_token_count`::
    The maximum number of tokens that can be produced using the `_analyze` API.
    The default value is `10000`. If more tokens than this limit are generated,
    an error will be thrown. The `_analyze` endpoint without a specified index
    will always use `10000` as the limit. This setting allows you to control
    the limit for a specific index:

[source,js]
--------------------------------------------------
PUT analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]