[[analysis-edgengram-tokenizer]]
=== Edge NGram Tokenizer

The `edge_ngram` tokenizer first breaks text down into words whenever it
encounters one of a list of specified characters, then it emits
https://en.wikipedia.org/wiki/N-gram[N-grams] of each word where the start of
the N-gram is anchored to the beginning of the word.

Edge N-Grams are useful for _search-as-you-type_ queries.

TIP: When you need _search-as-you-type_ for text which has a widely known
order, such as movie or song titles, the
<<completion-suggester,completion suggester>> is a much more efficient
choice than edge N-grams. Edge N-grams have the advantage when trying to
autocomplete words that can appear in any order.
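
For contrast, a completion-suggester setup looks like the following. This is
only a minimal sketch for comparison; the index name `film_titles`, the field
name `title_suggest`, and the suggestion name `title_suggestions` are
placeholders, not names used elsewhere on this page:

[source,js]
-----------------------------------
PUT film_titles
{
  "mappings": {
    "properties": {
      "title_suggest": {
        "type": "completion" <1>
      }
    }
  }
}

POST film_titles/_search
{
  "suggest": {
    "title_suggestions": {
      "prefix": "quick", <2>
      "completion": {
        "field": "title_suggest"
      }
    }
  }
}
-----------------------------------
// CONSOLE

<1> A `completion` field stores the titles in an in-memory structure
optimized for fast prefix lookups.
<2> The suggester returns titles whose beginning matches the prefix the user
has typed so far.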
[float]
=== Example output

With the default settings, the `edge_ngram` tokenizer treats the initial text as a
single token and produces N-grams with minimum length `1` and maximum length
`2`:

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}
---------------------------
// CONSOLE
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "Q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "Qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}
----------------------------

/////////////////////
The above sentence would produce the following terms:

[source,text]
---------------------------
[ Q, Qu ]
---------------------------

NOTE: These default gram lengths are almost entirely useless. You need to
configure the `edge_ngram` tokenizer before using it.
[float]
=== Configuration

The `edge_ngram` tokenizer accepts the following parameters:

[horizontal]
`min_gram`::
    Minimum length of characters in a gram. Defaults to `1`.

`max_gram`::
    Maximum length of characters in a gram. Defaults to `2`.

`token_chars`::

    Character classes that should be included in a token. Elasticsearch
    will split on characters that don't belong to the classes specified.
    Defaults to `[]` (keep all characters); see the sketch after this list.
+
Character classes may be any of the following:
+
* `letter` -- for example `a`, `b`, `ï` or `京`
* `digit` -- for example `3` or `7`
* `whitespace` -- for example `" "` or `"\n"`
* `punctuation` -- for example `!` or `"`
* `symbol` -- for example `$` or `√`
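
To see what the `[]` default means in practice, here is a sketch (not part of
the original examples, and assuming your Elasticsearch version supports inline
tokenizer definitions in the `_analyze` API) that raises `max_gram` so the
grams grow past the first word. Because no character classes are specified,
the whole input is treated as a single token:

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 8
  },
  "text": "Quick Fox"
}
---------------------------
// CONSOLE

This would produce grams such as `Quick F` and `Quick Fo`, with the space
included. Contrast this with the next example, where `token_chars` makes the
tokenizer split on the space instead.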
[float]
=== Example configuration

In this example, we configure the `edge_ngram` tokenizer to treat letters and
digits as tokens, and to produce grams with minimum length `2` and maximum
length `10`:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
----------------------------
// CONSOLE
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "Qu",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "Qui",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "Quic",
      "start_offset": 2,
      "end_offset": 6,
      "type": "word",
      "position": 2
    },
    {
      "token": "Quick",
      "start_offset": 2,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "Fo",
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 4
    },
    {
      "token": "Fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 5
    },
    {
      "token": "Foxe",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 6
    },
    {
      "token": "Foxes",
      "start_offset": 8,
      "end_offset": 13,
      "type": "word",
      "position": 7
    }
  ]
}
----------------------------

/////////////////////
The above example produces the following terms:

[source,text]
---------------------------
[ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ]
---------------------------

Usually we recommend using the same `analyzer` at index time and at search
time. In the case of the `edge_ngram` tokenizer, the advice is different. It
only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
that partial words are available for matching in the index. At search time,
just search for the terms the user has typed in, for instance: `Quick Fo`.

Below is an example of how to set up a field for _search-as-you-type_:
[source,js]
-----------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title": "Quick Foxes" <1>
}

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo", <2>
        "operator": "and"
      }
    }
  }
}
-----------------------------------
// CONSOLE

<1> The `autocomplete` analyzer indexes the terms `[qu, qui, quic, quick, fo, fox, foxe, foxes]`.
<2> The `autocomplete_search` analyzer searches for the terms `[quick, fo]`, both of which appear in the index.
/////////////////////

[source,js]
----------------------------
{
  "took": $body.took,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total" : {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "title": "Quick Foxes"
        }
      }
    ]
  }
}
----------------------------
// TESTRESPONSE[s/"took".*/"took": "$body.took",/]

/////////////////////
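
Note that `max_gram` is `10` in this example, so a query term longer than ten
characters cannot match any indexed gram. One way to guard against this,
sketched below with the placeholder names `my_index_truncated` and
`autocomplete_truncate`, is to add a `truncate` token filter to the search
analyzer so that search terms never exceed the `max_gram` length. Only the
search analyzer is shown; the index-time setup is unchanged from the example
above:

[source,js]
-----------------------------------
PUT my_index_truncated
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_truncate": {
          "type": "truncate", <1>
          "length": 10
        }
      },
      "analyzer": {
        "autocomplete_search": {
          "tokenizer": "lowercase",
          "filter": [
            "autocomplete_truncate"
          ]
        }
      }
    }
  }
}
-----------------------------------
// CONSOLE

<1> The `truncate` token filter cuts search terms down to `10` characters,
matching the `max_gram` of the index-time tokenizer.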
|