[[analysis-index-search-time]]
=== Index and search analysis

Text analysis occurs at two times:

Index time::
When a document is indexed, any <<text,`text`>> field values are analyzed.

Search time::
When running a <<full-text-queries,full-text search>> on a `text` field,
the query string (the text the user is searching for) is analyzed.
+
Search time is also called _query time_.

The analyzer, or set of analysis rules, used at each time is called the _index
analyzer_ or _search analyzer_ respectively.
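
To see concretely what analysis produces at either point, you can pass sample
text to the `_analyze` API. The request below is only an illustration; it uses
the built-in `standard` analyzer and an arbitrary sentence:

[source,console]
----
GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes jumped over the dog!"
}
----

The response lists the tokens the analyzer produced, in this case the
lowercased words `the`, `quick`, `brown`, `foxes`, `jumped`, `over`, `the`, and
`dog`, along with their positions and character offsets.
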
[[analysis-same-index-search-analyzer]]
==== How the index and search analyzer work together

In most cases, the same analyzer should be used at index and search time. This
ensures the values and query strings for a field are changed into the same form
of tokens. In turn, this ensures the tokens match as expected during a search.
.*Example*
[%collapsible]
====
A document is indexed with the following value in a `text` field:

[source,text]
----
The QUICK brown foxes jumped over the dog!
----

The index analyzer for the field converts the value into tokens and normalizes
them. In this case, each of the tokens represents a word:

[source,text]
----
[ quick, brown, fox, jump, over, dog ]
----

These tokens are then indexed.

Later, a user searches the same `text` field for:

[source,text]
----
"Quick fox"
----

The user expects this search to match the sentence indexed earlier,
`The QUICK brown foxes jumped over the dog!`.

However, the query string does not contain the exact words used in the
document's original text:

* `quick` vs `QUICK`
* `fox` vs `foxes`

To account for this, the query string is analyzed using the same analyzer. This
analyzer produces the following tokens:

[source,text]
----
[ quick, fox ]
----

To execute the search, {es} compares these query string tokens to the tokens
indexed in the `text` field.

[options="header"]
|===
|Token | Query string | `text` field

|`quick` | X | X
|`brown` | | X
|`fox` | X | X
|`jump` | | X
|`over` | | X
|`dog` | | X
|===
Because the field value and query string were analyzed in the same way, they
created similar tokens. The tokens `quick` and `fox` are exact matches. This
means the search matches the document containing `"The QUICK brown foxes jumped
over the dog!"`, just as the user expects.
====
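
In practice, using the same analyzer at index and search time requires no
special setup: the analyzer configured in the field mapping is applied to both.
The sketch below is illustrative only; the index name, field name, and the
choice of the built-in `english` analyzer are assumptions, chosen because
`english` lowercases, removes stop words such as `the`, and stems words, which
yields tokens like those in the example above:

[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "english" <1>
      }
    }
  }
}

PUT my-index/_doc/1?refresh
{
  "body": "The QUICK brown foxes jumped over the dog!"
}

GET my-index/_search
{
  "query": {
    "match": {
      "body": "Quick fox" <2>
    }
  }
}
----
<1> With no separate search analyzer specified, `english` is used both to index `body` values and to analyze query strings run against `body`.
<2> The `match` query analyzes `"Quick fox"` into `[ quick, fox ]`, which matches the tokens indexed for the document.
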
[[different-analyzers]]
==== When to use a different search analyzer

While less common, it sometimes makes sense to use different analyzers at index
and search time. To enable this, {es} allows you to
<<specify-search-analyzer,specify a separate search analyzer>>.

Generally, a separate search analyzer should only be specified when using the
same form of tokens for field values and query strings would create unexpected
or irrelevant search matches.
[[different-analyzer-ex]]
.*Example*
[%collapsible]
====
{es} is used to create a search engine that matches only words that start with
a provided prefix. For instance, a search for `tr` should return `tram` or
`trope`—but never `taxi` or `bat`.

A document is added to the search engine's index; this document contains one
such word in a `text` field:

[source,text]
----
"Apple"
----

The index analyzer for the field converts the value into tokens and normalizes
them. In this case, each of the tokens represents a potential prefix for
the word:

[source,text]
----
[ a, ap, app, appl, apple ]
----

These tokens are then indexed.

Later, a user searches the same `text` field for:

[source,text]
----
"appli"
----

The user expects this search to match only words that start with `appli`,
such as `appliance` or `application`. The search should not match `apple`.

However, if the index analyzer is used to analyze this query string, it would
produce the following tokens:

[source,text]
----
[ a, ap, app, appl, appli ]
----

When {es} compares these query string tokens to the ones indexed for `apple`,
it finds several matches.
[options="header"]
|===
|Token | `appli` query string | `apple` `text` field

|`a` | X | X
|`ap` | X | X
|`app` | X | X
|`appl` | X | X
|`appli` | X |
|===
This means the search would erroneously match `apple`. Not only that, it would
match any word starting with `a`.

To fix this, you can specify a different search analyzer for query strings used
on the `text` field.

In this case, you could specify a search analyzer that produces a single token
rather than a set of prefixes:

[source,text]
----
[ appli ]
----

This query string token would only match tokens for words that start with
`appli`, which better aligns with the user's search expectations.
====
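
One way to implement the prefix matching described in the example above is an
index analyzer built around the `edge_ngram` token filter, paired with a plain
search analyzer. The sketch below is illustrative rather than prescriptive: the
index name, field name, and the `autocomplete` analyzer name are arbitrary, and
you may want different `min_gram`/`max_gram` settings for your data.

[source,console]
----
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete", <1>
        "search_analyzer": "standard" <2>
      }
    }
  }
}
----
<1> Field values are indexed with the custom `autocomplete` analyzer, which lowercases each word and then emits its prefixes, so `apple` becomes `[ a, ap, app, appl, apple ]`.
<2> Query strings are analyzed with the `standard` analyzer, so `appli` remains a single token and only matches documents whose indexed prefix tokens include it.
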