- [[analyzer]]
- === `analyzer`
- The values of <<mapping-index,`analyzed`>> string fields are passed through an
- <<analysis,analyzer>> to convert the string into a stream of _tokens_ or
- _terms_. For instance, the string `"The quick Brown Foxes."` may, depending
- on which analyzer is used, be analyzed to the tokens: `quick`, `brown`,
- `fox`. These are the actual terms that are indexed for the field, which makes
- it possible to search efficiently for individual words _within_ big blobs of
- text.
- This analysis process needs to happen not just at index time, but also at
- query time: the query string needs to be passed through the same (or a
- similar) analyzer so that the terms that it tries to find are in the same
- format as those that exist in the index.
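- As a rough illustration of why both sides need the same analysis, the sketch below builds a tiny inverted index in Python. The `toy_analyze` function, its stop-word list and its plural-stripping rule are hypothetical stand-ins, not the behavior of any real Elasticsearch analyzer.

```python
import re

# Hypothetical stop-word list; real analyzers use much larger ones.
STOP_WORDS = {"a", "an", "the"}

def strip_plural(token):
    # Crude stand-in for stemming: "foxes" -> "fox", "dogs" -> "dog".
    if token.endswith("es") and len(token) > 3:
        return token[:-2]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def toy_analyze(text):
    """Lowercase, split on non-letters, drop stop words, strip plural suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_plural(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in toy_analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Analyze the query with the same analyzer, then intersect the postings."""
    terms = toy_analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "The quick Brown Foxes.", 2: "A lazy dog."}
index = build_index(docs)

print(toy_analyze(docs[1]))   # ['quick', 'brown', 'fox']
print(search(index, "FOX"))   # {1}
print("Foxes" in index)       # False: the raw word was never indexed
```

- The raw string `Foxes` does not exist in the index, so a query only finds the document once it has been reduced to the same term, `fox`.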
- Elasticsearch ships with a number of <<analysis-analyzers,pre-defined analyzers>>,
- which can be used without further configuration. It also ships with many
- <<analysis-charfilters,character filters>>, <<analysis-tokenizers,tokenizers>>,
- and <<analysis-tokenfilters,token filters>>, which can be combined to configure
- custom analyzers per index.
- Analyzers can be specified per-query, per-field or per-index. At index time,
- Elasticsearch will look for an analyzer in this order:
- * The `analyzer` defined in the field mapping.
- * An analyzer named `default` in the index settings.
- * The <<analysis-standard-analyzer,`standard`>> analyzer.
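- The index-time lookup order above can be sketched as a simple fallback chain. This is only a model of the documented order: the function name and the dictionary shapes used for the field mapping and the index settings are assumptions made for illustration.

```python
def resolve_index_analyzer(field_mapping, index_analyzers):
    """Return the name of the analyzer used at index time.

    field_mapping:   dict of mapping parameters for one field (hypothetical shape)
    index_analyzers: dict of analyzers defined in the index settings (hypothetical shape)
    """
    # 1. The analyzer defined in the field mapping.
    if "analyzer" in field_mapping:
        return field_mapping["analyzer"]
    # 2. An analyzer named "default" in the index settings.
    if "default" in index_analyzers:
        return "default"
    # 3. Fall back to the standard analyzer.
    return "standard"

print(resolve_index_analyzer({"type": "text", "analyzer": "english"}, {}))  # english
print(resolve_index_analyzer({"type": "text"}, {"default": {"type": "simple"}}))  # default
print(resolve_index_analyzer({"type": "text"}, {}))  # standard
```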
- At query time, there are a few more layers:
- * The `analyzer` defined in a <<full-text-queries,full-text query>>.
- * The `search_analyzer` defined in the field mapping.
- * The `analyzer` defined in the field mapping.
- * An analyzer named `default_search` in the index settings.
- * An analyzer named `default` in the index settings.
- * The <<analysis-standard-analyzer,`standard`>> analyzer.
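- The query-time order has the same shape with more layers. The sketch below models it as a list of candidates checked from most to least specific; the function name and the dictionary shapes are hypothetical and only mirror the list above.

```python
def resolve_search_analyzer(query, field_mapping, index_analyzers):
    """Return the name of the analyzer used at query time.

    query:           dict of parameters of a full-text query (hypothetical shape)
    field_mapping:   dict of mapping parameters for one field (hypothetical shape)
    index_analyzers: dict of analyzers defined in the index settings (hypothetical shape)
    """
    candidates = [
        query.get("analyzer"),                 # analyzer given in the query itself
        field_mapping.get("search_analyzer"),  # search_analyzer in the field mapping
        field_mapping.get("analyzer"),         # analyzer in the field mapping
        "default_search" if "default_search" in index_analyzers else None,
        "default" if "default" in index_analyzers else None,
        "standard",                            # final fallback
    ]
    # The first candidate that is actually set wins.
    return next(name for name in candidates if name is not None)

mapping = {"type": "text", "analyzer": "english", "search_analyzer": "simple"}
print(resolve_search_analyzer({"analyzer": "whitespace"}, mapping, {}))  # whitespace
print(resolve_search_analyzer({}, mapping, {}))                          # simple
print(resolve_search_analyzer({}, {"type": "text"}, {"default": {}}))    # default
```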
- The easiest way to specify an analyzer for a particular field is to define it
- in the field mapping, as follows:
- [source,js]
- --------------------------------------------------
- PUT my_index
- {
-   "mappings": {
-     "my_type": {
-       "properties": {
-         "text": { <1>
-           "type": "text",
-           "fields": {
-             "english": { <2>
-               "type": "text",
-               "analyzer": "english"
-             }
-           }
-         }
-       }
-     }
-   }
- }
- GET my_index/_analyze?field=text <3>
- {
-   "text": "The quick Brown Foxes."
- }
- GET my_index/_analyze?field=text.english <4>
- {
-   "text": "The quick Brown Foxes."
- }
- --------------------------------------------------
- // AUTOSENSE
- <1> The `text` field uses the default `standard` analyzer.
- <2> The `text.english` <<multi-fields,multi-field>> uses the `english` analyzer, which removes stop words and applies stemming.
- <3> This returns the tokens: [ `the`, `quick`, `brown`, `foxes` ].
- <4> This returns the tokens: [ `quick`, `brown`, `fox` ].
- [[search-quote-analyzer]]
- ==== `search_quote_analyzer`
- The `search_quote_analyzer` setting allows you to specify an analyzer for phrases. This is particularly useful when you want to disable
- stop word removal for phrase queries.
- To disable stop word removal for phrases, a field needs three analyzer settings:
- 1. An `analyzer` setting for indexing all terms including stop words
- 2. A `search_analyzer` setting for non-phrase queries that will remove stop words
- 3. A `search_quote_analyzer` setting for phrase queries that will not remove stop words
- [source,js]
- --------------------------------------------------
- PUT my_index
- {
-   "settings": {
-     "analysis": {
-       "analyzer": {
-         "my_analyzer": { <1>
-           "type": "custom",
-           "tokenizer": "standard",
-           "filter": [
-             "lowercase"
-           ]
-         },
-         "my_stop_analyzer": { <2>
-           "type": "custom",
-           "tokenizer": "standard",
-           "filter": [
-             "lowercase",
-             "english_stop"
-           ]
-         }
-       },
-       "filter": {
-         "english_stop": {
-           "type": "stop",
-           "stopwords": "_english_"
-         }
-       }
-     }
-   },
-   "mappings": {
-     "my_type": {
-       "properties": {
-         "title": {
-           "type": "text",
-           "analyzer": "my_analyzer", <3>
-           "search_analyzer": "my_stop_analyzer", <4>
-           "search_quote_analyzer": "my_analyzer" <5>
-         }
-       }
-     }
-   }
- }
- --------------------------------------------------
- // AUTOSENSE
- [source,js]
- --------------------------------------------------
- PUT my_index/my_type/1
- {
-   "title": "The Quick Brown Fox"
- }
- PUT my_index/my_type/2
- {
-   "title": "A Quick Brown Fox"
- }
- GET my_index/my_type/_search
- {
-   "query": {
-     "query_string": {
-       "query": "\"the quick brown fox\"" <6>
-     }
-   }
- }
- --------------------------------------------------
- <1> `my_analyzer` analyzer which tokenizes all terms, including stop words
- <2> `my_stop_analyzer` analyzer which removes stop words
- <3> `analyzer` setting that points to the `my_analyzer` analyzer which will be used at index time
- <4> `search_analyzer` setting that points to the `my_stop_analyzer` and removes stop words for non-phrase queries
- <5> `search_quote_analyzer` setting that points to the `my_analyzer` analyzer and ensures that stop words are not removed from phrase queries
- <6> Because the query is wrapped in quotes, it is detected as a phrase query, so the `search_quote_analyzer` kicks in and ensures that stop words
- are not removed from the query. The `my_analyzer` analyzer then returns the tokens [`the`, `quick`, `brown`, `fox`], which match only one
- of the documents. Non-phrase queries, meanwhile, are analyzed with the `my_stop_analyzer` analyzer, which filters out stop words, so a search for either
- `The quick brown fox` or `A quick brown fox` returns both documents, since both contain the tokens [`quick`, `brown`, `fox`].
- Without the `search_quote_analyzer`, exact matching of phrase queries would not be possible: the stop words would be
- removed from the phrase, and both documents would match.
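- The effect described in the last callout can be reproduced with a small simulation. It is a simplified model, not how Lucene implements phrase matching: the two Python functions only borrow the names `my_analyzer` and `my_stop_analyzer`, they split on whitespace instead of using the `standard` tokenizer, and the two-word stop list is a stand-in for `_english_`.

```python
STOP_WORDS = {"a", "the"}  # tiny stand-in for the _english_ stop list

def my_analyzer(text):
    """Lowercase and keep every term, recording its position."""
    return list(enumerate(text.lower().split()))

def my_stop_analyzer(text):
    """Like my_analyzer, but drop stop words; remaining terms keep their positions."""
    return [(pos, term) for pos, term in my_analyzer(text) if term not in STOP_WORDS]

def phrase_matches(doc_tokens, query_tokens):
    """True if the query terms occur in the document at the same relative positions."""
    if not query_tokens:
        return False
    doc_positions = set(doc_tokens)
    first_pos, first_term = query_tokens[0]
    for doc_pos, doc_term in doc_tokens:
        if doc_term != first_term:
            continue
        shift = doc_pos - first_pos
        if all((pos + shift, term) in doc_positions for pos, term in query_tokens):
            return True
    return False

# Documents are always indexed with my_analyzer, so stop words are kept.
docs = {1: "The Quick Brown Fox", 2: "A Quick Brown Fox"}
indexed = {doc_id: my_analyzer(title) for doc_id, title in docs.items()}

def phrase_search(query, analyzer):
    """Analyze the phrase with the given analyzer and return matching document ids."""
    query_tokens = analyzer(query)
    return sorted(d for d, tokens in indexed.items() if phrase_matches(tokens, query_tokens))

# With search_quote_analyzer = my_analyzer, the stop word stays in the phrase.
print(phrase_search("the quick brown fox", my_analyzer))       # [1]
# Without it, the phrase is analyzed by my_stop_analyzer and "the" is dropped.
print(phrase_search("the quick brown fox", my_stop_analyzer))  # [1, 2]
```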