[[analysis-edgengram-tokenizer]]
=== Edge NGram Tokenizer

A tokenizer of type `edgeNGram`.

This tokenizer is very similar to the `nGram` tokenizer, but it only keeps
n-grams that start at the beginning of a token.
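
The difference between the two tokenizers can be sketched in a few lines of Python. This is an illustration only, not Elasticsearch's implementation: the function names `ngrams` and `edge_ngrams` are hypothetical, and the `min_gram` and `max_gram` arguments simply mirror the settings of the same name.

```python
def ngrams(token, min_gram, max_gram):
    """All n-grams of `token`, starting at every position."""
    return [
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(token)
    ]


def edge_ngrams(token, min_gram, max_gram):
    """Only the n-grams anchored at the first character of `token`."""
    return [
        token[:size]
        for size in range(min_gram, min(max_gram, len(token)) + 1)
    ]


print(ngrams("Schalke", 2, 3))
print(edge_ngrams("Schalke", 2, 5))  # ['Sc', 'Sch', 'Scha', 'Schal']
```

Python strings are indexed by code point, which matches the unit used by `min_gram` and `max_gram`. A token shorter than `min_gram` yields no n-grams in this sketch.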

An `edgeNGram` tokenizer accepts the following settings:

[cols="<,<,<",options="header",]
|=======================================================================
|Setting |Description |Default value
|`min_gram` |Minimum size in codepoints of a single n-gram |`1`
|`max_gram` |Maximum size in codepoints of a single n-gram |`2`
|`token_chars` |Character classes to keep in the tokens. Elasticsearch
splits on characters that don't belong to any of these classes.
|`[]` (Keep all characters)
|=======================================================================

`token_chars` accepts the following character classes:

[horizontal]
`letter`::      for example `a`, `b`, `ï` or `京`
`digit`::       for example `3` or `7`
`whitespace`::  for example `" "` or `"\n"`
`punctuation`:: for example `!` or `"`
`symbol`::      for example `$` or `√`
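
As a rough illustration of how `token_chars` drives the splitting, the sketch below maps each character to one of the classes above using Unicode general categories. The mapping is an assumption chosen for this example (categories starting with `L`, `Nd`, whitespace, `P` and `S`), and the exact rules Elasticsearch applies may differ at the edges. The helper names `char_class` and `split_on_token_chars` are hypothetical.

```python
import unicodedata


def char_class(ch):
    """Approximate token_chars class of a single character, or None."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category == "Nd":
        return "digit"
    if ch.isspace():
        return "whitespace"
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("S"):
        return "symbol"
    return None


def split_on_token_chars(text, token_chars):
    """Split `text` wherever a character falls outside `token_chars`.

    An empty `token_chars` keeps every character, like the default `[]`.
    """
    if not token_chars:
        return [text] if text else []
    tokens, current = [], []
    for ch in text:
        if char_class(ch) in token_chars:
            current.append(ch)
        elif current:
            # A character outside the kept classes ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(split_on_token_chars("FC Schalke 04", ["letter", "digit"]))
# ['FC', 'Schalke', '04']
```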

[float]
==== Example

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "my_edge_ngram_analyzer" : {
                        "tokenizer" : "my_edge_ngram_tokenizer"
                    }
                },
                "tokenizer" : {
                    "my_edge_ngram_tokenizer" : {
                        "type" : "edgeNGram",
                        "min_gram" : "2",
                        "max_gram" : "5",
                        "token_chars": [ "letter", "digit" ]
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
    # FC, Sc, Sch, Scha, Schal, 04
--------------------------------------------------
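
The output of the example can be traced by hand with a small Python emulation. This is a sketch under stated assumptions, not the tokenizer's actual code: `edge_ngram_tokenize` is a hypothetical name, and `str.isalnum` stands in for `"token_chars": [ "letter", "digit" ]`, which is only an approximation of those two classes.

```python
from itertools import groupby


def edge_ngram_tokenize(text, min_gram=2, max_gram=5):
    """Emulate the example: keep letters and digits, emit edge n-grams."""
    grams = []
    # str.isalnum approximates token_chars ["letter", "digit"] (assumption).
    for keep, chars in groupby(text, key=str.isalnum):
        if not keep:
            continue  # run of characters outside the kept classes
        token = "".join(chars)
        # Tokens shorter than min_gram produce no n-grams here.
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:size])
    return grams


print(edge_ngram_tokenize("FC Schalke 04"))
# ['FC', 'Sc', 'Sch', 'Scha', 'Schal', '04']
```

`FC` and `04` are exactly `min_gram` long, so each contributes a single n-gram, while `Schalke` is cut off at `max_gram` characters.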

[float]
==== `side` deprecated

There used to be a `side` parameter up to `0.90.1`, but it is now deprecated. To
emulate the behavior of `"side" : "BACK"`, use a
<<analysis-reverse-tokenfilter,`reverse` token filter>> together
with the <<analysis-edgengram-tokenfilter,`edgeNGram` token filter>>. The
`edgeNGram` filter must be enclosed in `reverse` filters like this:

[source,js]
--------------------------------------------------
    "filter" : ["reverse", "edgeNGram", "reverse"]
--------------------------------------------------

which essentially reverses the token, builds front `EdgeNGrams` and reverses
each n-gram again. This has the same effect as the previous `"side" : "BACK"` setting.
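
The effect of the `reverse`, `edgeNGram`, `reverse` chain on a single token can be sketched in Python. The function names are hypothetical and the code only mimics the three filter steps. It is not how the filters are implemented.

```python
def front_edge_ngrams(token, min_gram, max_gram):
    """N-grams anchored at the start of `token`."""
    return [
        token[:size]
        for size in range(min_gram, min(max_gram, len(token)) + 1)
    ]


def back_edge_ngrams(token, min_gram, max_gram):
    """reverse -> front edge n-grams -> reverse each n-gram."""
    reversed_token = token[::-1]
    return [
        gram[::-1]
        for gram in front_edge_ngrams(reversed_token, min_gram, max_gram)
    ]


print(back_edge_ngrams("Schalke", 2, 5))
# ['ke', 'lke', 'alke', 'halke']
```

The result is the set of n-grams anchored at the end of the token.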