[[indices-analyze]]
=== Analyze API
++++
<titleabbrev>Analyze</titleabbrev>
++++

Performs <<analysis,analysis>> on a text string
and returns the resulting tokens.

[source,js]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "Quick Brown Foxes!"
}
--------------------------------------------------
// CONSOLE

[[analyze-api-request]]
==== {api-request-title}

`GET /_analyze`

`POST /_analyze`

`GET /<index>/_analyze`

`POST /<index>/_analyze`

[[analyze-api-path-params]]
==== {api-path-parms-title}

`<index>`::
+
--
(Optional, string)
Index used to derive the analyzer.

If specified,
the `analyzer` or `field` parameter overrides this value.

If no analyzer or field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--

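
The fallback order described above can be sketched as a small pure function. This is only an illustration of the documented precedence, not how Elasticsearch implements it; `resolve_analyzer` and its three arguments are hypothetical names introduced here.

```python
from typing import Optional

STANDARD = "standard"  # final fallback named in the text above


def resolve_analyzer(
    analyzer: Optional[str] = None,
    field_analyzer: Optional[str] = None,
    index_default: Optional[str] = None,
) -> str:
    """Return the analyzer name chosen by the documented precedence.

    Hypothetical inputs, for illustration only:
    - analyzer: value of the `analyzer` request parameter, if given
    - field_analyzer: analyzer from the mapping of the requested `field`
    - index_default: default analyzer of `<index>`, if it has one
    """
    # The first value that is set wins; otherwise fall back to standard.
    for candidate in (analyzer, field_analyzer, index_default):
        if candidate:
            return candidate
    return STANDARD
```

For example, `resolve_analyzer(index_default="english")` picks the index default, while passing `analyzer="keyword"` as well makes the explicit analyzer win.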
[[analyze-api-query-params]]
==== {api-query-parms-title}

`analyzer`::
+
--
(Optional, string or <<analysis-custom-analyzer,custom analyzer object>>)
Analyzer to apply to the provided `text`.

See <<analysis-analyzers>> for a list of built-in analyzers.
You can also provide a <<analysis-custom-analyzer,custom analyzer>>.

If this parameter is not specified,
the analyze API uses the analyzer defined in the field's mapping.

If no field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified,
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--

`attributes`::
(Optional, array of strings)
Array of token attributes used to filter the output of the `explain` parameter.

`char_filter`::
(Optional, array of strings)
Array of character filters used to preprocess characters before the tokenizer.
See <<analysis-charfilters>> for a list of character filters.

`explain`::
(Optional, boolean)
If `true`, the response includes token attributes and additional details.
Defaults to `false`.
experimental:[The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.]

`field`::
+
--
(Optional, string)
Field used to derive the analyzer.
To use this parameter,
you must specify an index.

If specified,
the `analyzer` parameter overrides this value.

If no field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--

`filter`::
(Optional, array of strings)
Array of token filters to apply after the tokenizer.
See <<analysis-tokenfilters>> for a list of token filters.

`normalizer`::
(Optional, string)
Normalizer to use to convert text into a single token.
See <<analysis-normalizers>> for a list of normalizers.

`text`::
(Required, string or array of strings)
Text to analyze.
If an array of strings is provided, it is analyzed as a multi-value field.

`tokenizer`::
(Optional, string)
Tokenizer to use to convert text into tokens.
See <<analysis-tokenizers>> for a list of tokenizers.

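
As a rough sketch, the constraints listed above (`text` is required, `field` needs an index) can be checked on the client before a request is sent. `build_analyze_request` is a hypothetical helper introduced here; it assumes the parameters travel in the JSON request body, as they do in the examples on this page, and it only assembles the request without sending anything.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Parameter names taken from the list above (besides `text`).
ALLOWED = {"analyzer", "attributes", "char_filter", "explain",
           "field", "filter", "normalizer", "tokenizer"}


def build_analyze_request(
    text: Any, index: Optional[str] = None, **params: Any
) -> Tuple[str, str, str]:
    """Return (method, path, JSON body) for an analyze request.

    Hypothetical helper for illustration; nothing is sent.
    """
    if not text:
        raise ValueError("`text` is required")
    if "field" in params and index is None:
        raise ValueError("`field` requires an index")
    unknown = set(params) - ALLOWED
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body: Dict[str, Any] = {"text": text, **params}
    path = f"/{index}/_analyze" if index else "/_analyze"
    return "GET", path, json.dumps(body)
```

Calling `build_analyze_request("this is a test", index="analyze_sample", field="obj1.field1")` yields the path `/analyze_sample/_analyze`, while the same call without `index` raises `ValueError`.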
[[analyze-api-example]]
==== {api-examples-title}

[[analyze-api-no-index-ex]]
===== No index specified

You can apply any of the built-in analyzers to the text string without
specifying an index.

[source,js]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

[[analyze-api-text-array-ex]]
===== Array of text strings

If the `text` parameter is provided as an array of strings, it is analyzed as a multi-value field.

[source,js]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}
--------------------------------------------------
// CONSOLE

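
Outside the console, the same request can be prepared with Python's standard library. This is a minimal sketch: the address `http://localhost:9200` and the helper name `prepare_analyze` are assumptions, and the request uses the `POST /_analyze` form listed earlier. The block only builds the request; the commented lines show how it might be sent to a running cluster.

```python
import json
import urllib.request

# Assumption: a node reachable at this address; adjust for your cluster.
BASE_URL = "http://localhost:9200"


def prepare_analyze(text, analyzer="standard", base_url=BASE_URL):
    """Prepare (but do not send) a POST /_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# An array of strings is passed through unchanged as the `text` value.
req = prepare_analyze(["this is a test", "the second text"])

# To send it against a running cluster:
#   with urllib.request.urlopen(req) as resp:
#       tokens = json.load(resp)["tokens"]
```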
[[analyze-api-custom-analyzer-ex]]
===== Custom analyzer

You can use the analyze API to test a custom transient analyzer built from
tokenizers, token filters, and char filters. Token filters use the `filter`
parameter:

[source,js]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------
// CONSOLE

deprecated[5.0.0, Use `filter`/`char_filter` instead of `filters`/`char_filters`. `token_filters` has been removed]

Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

[source,js]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

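
To make the order of the three stages concrete, the following sketch imitates the custom analyzer requests above in plain Python. It is a simplified stand-in, not Elasticsearch's implementation: every function name is hypothetical, the tag stripping is a crude approximation of `html_strip`, and real tokens also carry offsets, positions, and types, which are ignored here.

```python
import re
from typing import Callable, Iterable, List

TokenFilter = Callable[[List[str]], List[str]]


def strip_tags(text: str) -> str:
    # Crude stand-in for `html_strip`: drops anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", text)


def keyword_tokenizer(text: str) -> List[str]:
    # Like `keyword`: the whole input becomes a single token.
    return [text]


def whitespace_tokenizer(text: str) -> List[str]:
    # Like `whitespace`: split on runs of whitespace.
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def stop(stopwords: Iterable[str]) -> TokenFilter:
    # Build a filter that drops the given stopwords.
    words = set(stopwords)
    return lambda tokens: [t for t in tokens if t not in words]


def analyze(text, tokenizer, char_filters=(), token_filters=()):
    for cf in char_filters:    # 1. character filters rewrite the raw text
        text = cf(text)
    tokens = tokenizer(text)   # 2. the tokenizer splits it into tokens
    for tf in token_filters:   # 3. token filters run in the listed order
        tokens = tf(tokens)
    return tokens
```

Under these simplifications, `analyze("this is a test", whitespace_tokenizer, (), [lowercase, stop(["a", "is", "this"])])` leaves only `test`, which mirrors the inline `stop` filter in the last request.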
[[analyze-api-specific-index-ex]]
===== Specific index

You can also run the analyze API against a specific index:

[source,js]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

The above request analyzes the text "this is a test" using the
default analyzer of the `analyze_sample` index. You can also provide an
`analyzer` to use a different analyzer:

[source,js]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

[[analyze-api-field-ex]]
===== Derive analyzer from a field mapping

The analyzer can be derived from a field mapping, for example:

[source,js]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

This request analyzes the text with the analyzer configured in the
mapping for `obj1.field1`. If the mapping does not configure one, the default
index analyzer is used.

[[analyze-api-normalizer-ex]]
===== Normalizer

You can provide a `normalizer` for a keyword field, using a normalizer associated with the `analyze_sample` index.

[source,js]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

You can also build a custom transient normalizer out of token filters and char filters.

[source,js]
--------------------------------------------------
GET /_analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE

[[explain-analyze-api]]
===== Explain analyze

For more detailed output, set `explain` to `true` (defaults to `false`). The response then includes all token attributes for each token.
To limit which token attributes are returned, set the `attributes` option.

NOTE: The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

[source,js]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] <1>
}
--------------------------------------------------
// CONSOLE
<1> Set "keyword" to output only the "keyword" attribute.

The request returns the following result:

[source,console-result]
--------------------------------------------------
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false <1>
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false <1>
      } ]
    } ]
  }
}
--------------------------------------------------
<1> Only the "keyword" attribute is output, because "attributes" was specified in the request.

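
One way to read such a response is to list the token text produced at each stage. The sketch below copies the result shown above into a Python dict (callout markers removed) and walks it. `tokens_by_stage` is a hypothetical helper, and it assumes the custom analyzer shape shown here, with `tokenizer` and `tokenfilters` keys under `detail`.

```python
# The explain result shown above, copied as a Python dict.
response = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {"token": "detailed", "start_offset": 0, "end_offset": 8,
                 "type": "<ALPHANUM>", "position": 0},
                {"token": "output", "start_offset": 9, "end_offset": 15,
                 "type": "<ALPHANUM>", "position": 1},
            ],
        },
        "tokenfilters": [
            {
                "name": "snowball",
                "tokens": [
                    {"token": "detail", "start_offset": 0, "end_offset": 8,
                     "type": "<ALPHANUM>", "position": 0, "keyword": False},
                    {"token": "output", "start_offset": 9, "end_offset": 15,
                     "type": "<ALPHANUM>", "position": 1, "keyword": False},
                ],
            }
        ],
    }
}


def tokens_by_stage(resp):
    """Return [(stage name, [token text, ...]), ...] in pipeline order."""
    detail = resp["detail"]
    # The tokenizer runs first, then each token filter in order.
    stages = [detail["tokenizer"]] + detail.get("tokenfilters", [])
    return [(s["name"], [t["token"] for t in s["tokens"]]) for s in stages]


for name, tokens in tokens_by_stage(response):
    print(name, tokens)
# standard ['detailed', 'output']
# snowball ['detail', 'output']
```

The two printed lines show where `detailed` was stemmed to `detail`: after the `snowball` filter, not in the tokenizer.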
[[tokens-limit-settings]]
===== Setting a token limit

Generating an excessive number of tokens may cause a node to run out of memory.
The following setting limits the number of tokens that can be produced:

`index.analyze.max_token_count`::
	The maximum number of tokens that the `_analyze` API can produce.
	The default value is `10000`. If more tokens than this limit are
	generated, an error is returned. An `_analyze` request without a specified
	index always uses `10000` as the limit. This setting lets you control
	the limit for a specific index:

[source,js]
--------------------------------------------------
PUT /analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]
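
The effect of the limit can be pictured with a short sketch that stops as soon as too many tokens have been produced. It is illustrative only: the real check happens inside Elasticsearch, and `collect_tokens` and its error message are hypothetical.

```python
from typing import Iterable, List

DEFAULT_MAX_TOKEN_COUNT = 10000  # default value stated above


def collect_tokens(
    tokens: Iterable[str],
    max_token_count: int = DEFAULT_MAX_TOKEN_COUNT,
) -> List[str]:
    """Collect tokens, failing as soon as the limit is exceeded."""
    collected: List[str] = []
    for token in tokens:
        if len(collected) >= max_token_count:
            # One token too many: report an error instead of continuing.
            raise ValueError(
                f"more than {max_token_count} tokens were produced"
            )
        collected.append(token)
    return collected
```

With the default, a stream of exactly 10000 tokens passes and the 10001st raises an error. Passing `max_token_count=20000` corresponds to raising the limit for one index, as in the settings request above.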