[[indices-analyze]]
=== Analyze API
++++
<titleabbrev>Analyze</titleabbrev>
++++

Performs <<analysis,analysis>> on a text string
and returns the resulting tokens.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "Quick Brown Foxes!"
}
--------------------------------------------------

[[analyze-api-request]]
==== {api-request-title}

`GET /_analyze`

`POST /_analyze`

`GET /<index>/_analyze`

`POST /<index>/_analyze`

[[analyze-api-path-params]]
==== {api-path-parms-title}

`<index>`::
+
--
(Optional, string)
Index used to derive the analyzer.

If specified,
the `analyzer` or `<field>` parameter overrides this value.

If no analyzer or field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--
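
The default analyzer for an index is whichever analyzer is registered under
the name `default` in the index's analysis settings. As a minimal sketch of
such a configuration (the index name `my-index` and the `whitespace` type are
illustrative, not part of the API above):

[source,console]
--------------------------------------------------
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "whitespace"
        }
      }
    }
  }
}
--------------------------------------------------
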
[[analyze-api-query-params]]
==== {api-query-parms-title}

`analyzer`::
+
--
(Optional, string or <<analysis-custom-analyzer,custom analyzer object>>)
Analyzer used to analyze the provided `text`.
See <<analysis-analyzers>> for a list of built-in analyzers.
You can also provide a <<analysis-custom-analyzer,custom analyzer>>.

If this parameter is not specified,
the analyze API uses the analyzer defined in the field's mapping.

If no field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--
`attributes`::
(Optional, array of strings)
Array of token attributes used to filter the output of the `explain` parameter.

`char_filter`::
(Optional, array of strings)
Array of character filters used to preprocess characters before the tokenizer.
See <<analysis-charfilters>> for a list of character filters.

`explain`::
(Optional, boolean)
If `true`, the response includes token attributes and additional details.
Defaults to `false`.
experimental:[The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.]
`field`::
+
--
(Optional, string)
Field used to derive the analyzer.
To use this parameter,
you must specify an index.

If specified,
the `analyzer` parameter overrides this value.

If no field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--
`filter`::
(Optional, array of strings)
Array of token filters to apply after the tokenizer.
See <<analysis-tokenfilters>> for a list of token filters.

`normalizer`::
(Optional, string)
Normalizer used to convert text into a single token.
See <<analysis-normalizers>> for a list of normalizers.

`text`::
(Required, string or array of strings)
Text to analyze.
If an array of strings is provided, it is analyzed as a multi-value field.

`tokenizer`::
(Optional, string)
Tokenizer used to convert text into tokens.
See <<analysis-tokenizers>> for a list of tokenizers.
[[analyze-api-example]]
==== {api-examples-title}

[[analyze-api-no-index-ex]]
===== No index specified

You can apply any of the built-in analyzers to the text string without
specifying an index.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
--------------------------------------------------
[[analyze-api-text-array-ex]]
===== Array of text strings

If the `text` parameter is provided as an array of strings, it is analyzed as a multi-value field.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}
--------------------------------------------------
[[analyze-api-custom-analyzer-ex]]
===== Custom analyzer

You can use the analyze API to test a custom transient analyzer built from
tokenizers, token filters, and char filters. Token filters use the `filter`
parameter:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
--------------------------------------------------

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------
deprecated[5.0.0, Use `filter`/`char_filter` instead of `filters`/`char_filters`. The `token_filters` parameter has been removed.]

Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
--------------------------------------------------
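
Character filters can be defined inline in the same way. As a sketch, a
transient <<analysis-mapping-charfilter,`mapping`>> character filter might be
tested like this (the mapping rule and sample text are illustrative):

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "char_filter" : [{"type": "mapping", "mappings": ["- => _"]}],
  "text" : "quick-brown fox"
}
--------------------------------------------------
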
[[analyze-api-specific-index-ex]]
===== Specific index

You can also run the analyze API against a specific index:

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]

The above will run an analysis on the "this is a test" text, using the
default index analyzer associated with the `analyze_sample` index. An `analyzer`
can also be provided to use a different analyzer:

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]
[[analyze-api-field-ex]]
===== Derive analyzer from a field mapping

The analyzer can be derived based on a field mapping, for example:

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]

This will run the analysis using the analyzer configured in the
mapping for `obj1.field1` (or the default index analyzer if none is configured).
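
For the request above to pick up a field-specific analyzer, the field's
mapping must set one. A minimal sketch of such a mapping (the index name and
the `whitespace` analyzer are illustrative; the `analyze_sample` test setup
may differ):

[source,console]
--------------------------------------------------
PUT /my-index
{
  "mappings": {
    "properties": {
      "obj1": {
        "properties": {
          "field1": { "type": "text", "analyzer": "whitespace" }
        }
      }
    }
  }
}
--------------------------------------------------
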
[[analyze-api-normalizer-ex]]
===== Normalizer

A `normalizer` can be provided for a keyword field with a normalizer associated with the `analyze_sample` index.

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
--------------------------------------------------
// TEST[setup:analyze_sample]
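
A named normalizer such as `my_normalizer` is defined in the index's analysis
settings. As a minimal sketch (the index name and filter chain are
illustrative; the actual `analyze_sample` setup may differ):

[source,console]
--------------------------------------------------
PUT /my-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
--------------------------------------------------
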
Or you can build a custom transient normalizer out of token filters and char filters.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
--------------------------------------------------
[[explain-analyze-api]]
===== Explain analyze

If you want to get more advanced details, set `explain` to `true` (defaults to `false`). It will output all token attributes for each token.
You can filter the token attributes you want to output by setting the `attributes` option.

NOTE: The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] <1>
}
--------------------------------------------------
<1> Set `attributes` to output only the `keyword` attribute.

The request returns the following result:
[source,console-result]
--------------------------------------------------
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false <1>
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false <1>
      } ]
    } ]
  }
}
--------------------------------------------------
<1> Only the `keyword` attribute is output, since `attributes` was specified in the request.
[[tokens-limit-settings]]
===== Setting a token limit

Generating an excessive amount of tokens may cause a node to run out of memory.
The following setting allows you to limit the number of tokens that can be produced:

`index.analyze.max_token_count`::
The maximum number of tokens that can be produced using the `_analyze` API.
The default value is `10000`. If more tokens than this limit are
generated, an error is thrown. The `_analyze` endpoint without a specified
index always uses `10000` as the limit. This setting allows you to control
the limit for a specific index:

[source,console]
--------------------------------------------------
PUT /analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}
--------------------------------------------------

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]
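
The limit can also be raised on an existing index via the update index
settings API, assuming `index.analyze.max_token_count` is dynamic in your
version (worth verifying against the index-module settings for your release):

[source,console]
--------------------------------------------------
PUT /analyze_sample/_settings
{
  "index.analyze.max_token_count" : 30000
}
--------------------------------------------------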