123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125 |
- [[analysis]]
- = Analysis
- [partintro]
- --
- _Analysis_ is the process of converting text, like the body of any email, into
- _tokens_ or _terms_ which are added to the inverted index for searching.
- Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
- either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
- defined per index.
- [float]
- == Index time analysis
- For instance, at index time the built-in <<english-analyzer,`english`>> _analyzer_
- will first convert the sentence:
- [source,text]
- ------
- "The QUICK brown foxes jumped over the lazy dog!"
- ------
- into distinct tokens. It will then lowercase each token, remove frequent
- stopwords ("the") and reduce the terms to their word stems (foxes -> fox,
- jumped -> jump, lazy -> lazi). In the end, the following terms will be added
- to the inverted index:
- [source,text]
- ------
- [ quick, brown, fox, jump, over, lazi, dog ]
- ------
- [float]
- === Specifying an index time analyzer
- Each <<text,`text`>> field in a mapping can specify its own
- <<analyzer,`analyzer`>>:
- [source,js]
- -------------------------
- PUT my_index
- {
- "mappings": {
- "_doc": {
- "properties": {
- "title": {
- "type": "text",
- "analyzer": "standard"
- }
- }
- }
- }
- }
- -------------------------
- // CONSOLE
- At index time, if no `analyzer` has been specified, it looks for an analyzer
- in the index settings called `default`. Failing that, it defaults to using
- the <<analysis-standard-analyzer,`standard` analyzer>>.
- [float]
- == Search time analysis
- This same analysis process is applied to the query string at search time in
- <<full-text-queries,full text queries>> like the
- <<query-dsl-match-query,`match` query>>
- to convert the text in the query string into terms of the same form as those
- that are stored in the inverted index.
- For instance, a user might search for:
- [source,text]
- ------
- "a quick fox"
- ------
- which would be analysed by the same `english` analyzer into the following terms:
- [source,text]
- ------
- [ quick, fox ]
- ------
- Even though the exact words used in the query string don't appear in the
- original text (`quick` vs `QUICK`, `fox` vs `foxes`), because we have applied
- the same analyzer to both the text and the query string, the terms from the
- query string exactly match the terms from the text in the inverted index,
- which means that this query would match our example document.
- [float]
- === Specifying a search time analyzer
- Usually the same analyzer should be used both at
- index time and at search time, and <<full-text-queries,full text queries>>
- like the <<query-dsl-match-query,`match` query>> will use the mapping to look
- up the analyzer to use for each field.
- The analyzer to use to search a particular field is determined by
- looking for:
- * An `analyzer` specified in the query itself.
- * The <<search-analyzer,`search_analyzer`>> mapping parameter.
- * The <<analyzer,`analyzer`>> mapping parameter.
- * An analyzer in the index settings called `default_search`.
- * An analyzer in the index settings called `default`.
- * The `standard` analyzer.
- --
- include::analysis/anatomy.asciidoc[]
- include::analysis/testing.asciidoc[]
- include::analysis/analyzers.asciidoc[]
- include::analysis/normalizers.asciidoc[]
- include::analysis/tokenizers.asciidoc[]
- include::analysis/tokenfilters.asciidoc[]
- include::analysis/charfilters.asciidoc[]
|