
[[analysis-ngram-tokenizer]]
=== NGram Tokenizer

A tokenizer of type `nGram`. It breaks text into n-grams of configurable minimum and maximum length.

The following settings can be configured for a `nGram` tokenizer type:

[cols="<,<,<",options="header",]
|=======================================================================
|Setting |Description |Default value
|`min_gram` |Minimum size in codepoints of a single n-gram |`1`.
|`max_gram` |Maximum size in codepoints of a single n-gram |`2`.
|`token_chars` |Character classes to keep in the
tokens. Elasticsearch will split on characters that don't belong to any
of these classes. |`[]` (Keep all characters)
|=======================================================================
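
With the defaults (`min_gram` of `1`, `max_gram` of `2`, and no `token_chars` restriction), every
1- and 2-codepoint substring of the input becomes a token. As a quick, illustrative sketch (it
assumes a node listening on `localhost:9200` and uses the `tokenizer` parameter of the `_analyze`
API), analyzing the text `FC` would be expected to yield the grams `F`, `FC` and `C`:

[source,js]
--------------------------------------------------
# illustrative sketch, not part of the original example
curl 'localhost:9200/_analyze?pretty=1&tokenizer=nGram' -d 'FC'
# F, FC, C
--------------------------------------------------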

`token_chars` accepts the following character classes:

[horizontal]
`letter`:: for example `a`, `b`, `ï` or `京`
`digit`:: for example `3` or `7`
`whitespace`:: for example `" "` or `"\n"`
`punctuation`:: for example `!` or `"`
`symbol`:: for example `$` or `√`

[float]
==== Example

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "2",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit" ]
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
--------------------------------------------------
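
Because `token_chars` keeps only the `letter` and `digit` classes, the space between `Schalke`
and `04` acts as a split point: no gram spans the whitespace, and `04` is emitted as a single
2-gram.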