
[[stemming]]
=== Stemming

_Stemming_ is the process of reducing a word to its root form. This ensures
variants of a word match during a search.

For example, `walking` and `walked` can be stemmed to the same root word:
`walk`. Once stemmed, an occurrence of either word would match the other in a
search.

Stemming is language-dependent but often involves removing prefixes and
suffixes from words.

In some cases, the root form of a stemmed word may not be a real word. For
example, `jumping` and `jumpiness` can both be stemmed to `jumpi`. While `jumpi`
isn't a real English word, it doesn't matter for search; if all variants of a
word are reduced to the same root form, they will match correctly.

[[stemmer-token-filters]]
==== Stemmer token filters

In {es}, stemming is handled by stemmer <<analyzer-anatomy-token-filters,token
filters>>. These token filters can be categorized based on how they stem words:

* <<algorithmic-stemmers,Algorithmic stemmers>>, which stem words based on a set
of rules
* <<dictionary-stemmers,Dictionary stemmers>>, which stem words by looking them
up in a dictionary

Because stemming changes tokens, we recommend using the same stemmer token
filters during <<analysis-index-search-time,index and search analysis>>.
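
The effect of mismatched analysis can be illustrated with a small Python
sketch. This is a toy model, not how {es} works internally: `toy_stem`,
`build_index`, and `search` are hypothetical helpers, and the suffix rule is
deliberately simplistic.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stemmer: strip a trailing 'ing' or 'ed' (illustration only)."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_index(docs, analyze):
    """Map each analyzed token to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(analyze(token), set()).add(doc_id)
    return index


def search(index, query, analyze):
    """Return IDs of documents that contain every analyzed query token."""
    hits = [index.get(analyze(t), set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()


docs = {1: "she walked home", 2: "walking is healthy"}
index = build_index(docs, toy_stem)

# Same stemmer at search time: the query token becomes `walk` and matches both.
same = search(index, "walking", toy_stem)

# No stemmer at search time: `walking` was never indexed as-is, so nothing matches.
mismatched = search(index, "walking", lambda t: t)
```

In this sketch, `same` contains both document IDs, while `mismatched` is empty
because the index only holds the stemmed token `walk`.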

[[algorithmic-stemmers]]
==== Algorithmic stemmers

Algorithmic stemmers apply a series of rules to each word to reduce it to its
root form. For example, an algorithmic stemmer for English may remove the `-s`
and `-es` suffixes from the end of plural words.

Algorithmic stemmers have a few advantages:

* They require little setup and usually work well out of the box.
* They use little memory.
* They are typically faster than <<dictionary-stemmers,dictionary stemmers>>.

However, most algorithmic stemmers only alter the existing text of a word. This
means they may not work well with irregular words that don't contain their root
form, such as:

* `be`, `are`, and `am`
* `mouse` and `mice`
* `foot` and `feet`
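
A minimal Python sketch of rule-based suffix stripping shows this limitation.
The `SUFFIXES` rule list and the `rule_stem` function are hypothetical and much
cruder than any real stemmer; they only illustrate that a rule operating on a
word's existing text cannot connect `mice` to `mouse`.

```python
# Hypothetical plural rules, checked in order (longest suffix first).
SUFFIXES = ("es", "s")


def rule_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Regular plurals collapse onto their singular form.
regular = [rule_stem(w) for w in ("boxes", "cats", "cat")]

# Irregular forms share no removable suffix, so they pass through unchanged.
irregular = [rule_stem(w) for w in ("mouse", "mice", "foot", "feet")]
```

Here `boxes` and `cats` are reduced to `box` and `cat`, but `mice` and `feet`
stay as they are and never meet `mouse` or `foot`.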

The following token filters use algorithmic stemming:

* <<analysis-stemmer-tokenfilter,`stemmer`>>, which provides algorithmic
stemming for several languages, some with additional variants.
* <<analysis-kstem-tokenfilter,`kstem`>>, a stemmer for English that combines
algorithmic stemming with a built-in dictionary.
* <<analysis-porterstem-tokenfilter,`porter_stem`>>, our recommended algorithmic
stemmer for English.
* <<analysis-snowball-tokenfilter,`snowball`>>, which uses
http://snowball.tartarus.org/[Snowball]-based stemming rules for several
languages.

[[dictionary-stemmers]]
==== Dictionary stemmers

Dictionary stemmers look up words in a provided dictionary, replacing unstemmed
word variants with stemmed words from the dictionary.

In theory, dictionary stemmers are well suited for:

* Stemming irregular words
* Discerning between words that are spelled similarly but not related
conceptually, such as:
** `organ` and `organization`
** `broker` and `broken`
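
The lookup approach can be sketched in a few lines of Python. The
`STEM_DICTIONARY` contents and the `dictionary_stem` function are hypothetical;
a real dictionary stemmer loads a far larger dictionary and typically also
applies prefix and suffix rules.

```python
# Hypothetical dictionary mapping word variants to their stems.
STEM_DICTIONARY = {
    "mice": "mouse",
    "feet": "foot",
    "am": "be",
    "are": "be",
    "organs": "organ",
    "organizations": "organization",
}


def dictionary_stem(word: str) -> str:
    """Return the dictionary stem, or the word itself if it has no entry."""
    return STEM_DICTIONARY.get(word, word)


# Irregular and look-alike words resolve to their own, distinct stems.
stems = [dictionary_stem(w) for w in ("mice", "feet", "are", "organizations")]

# A word missing from the dictionary is left unstemmed in this sketch.
unknown = dictionary_stem("selfies")
```

In this sketch, `mice` resolves to `mouse` and `organizations` to
`organization` rather than `organ`, while `selfies` is returned unchanged
because the toy dictionary has no entry for it.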

In practice, algorithmic stemmers typically outperform dictionary stemmers. This
is because dictionary stemmers have the following disadvantages:

* *Dictionary quality* +
A dictionary stemmer is only as good as its dictionary. To work well, these
dictionaries must include a significant number of words, be updated regularly,
and change with language trends. Often, by the time a dictionary has been made
available, it's incomplete and some of its entries are already outdated.

* *Size and performance* +
Dictionary stemmers must load all words, prefixes, and suffixes from their
dictionaries into memory. This can use a significant amount of RAM. Low-quality
dictionaries may also be less efficient with prefix and suffix removal, which
can slow the stemming process significantly.

You can use the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter to
perform dictionary stemming.

[TIP]
====
If available, we recommend trying an algorithmic stemmer for your language
before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
====

[[control-stemming]]
==== Control stemming

Sometimes stemming can produce shared root words that are spelled similarly but
not related conceptually. For example, a stemmer may reduce both `skies` and
`skiing` to the same root word: `ski`.

To prevent this and better control stemming, you can use the following token
filters:

* <<analysis-stemmer-override-tokenfilter,`stemmer_override`>>, which lets you
define rules for stemming specific tokens.
* <<analysis-keyword-marker-tokenfilter,`keyword_marker`>>, which marks
specified tokens as keywords. Keyword tokens are not stemmed by subsequent
stemmer token filters.
* <<analysis-condition-tokenfilter,`conditional`>>, which can be used to mark
tokens as keywords, similar to the `keyword_marker` filter.

For built-in <<analysis-lang-analyzer,language analyzers>>, you can also use the
<<_excluding_words_from_stemming,`stem_exclusion`>> parameter to specify a list
of words that won't be stemmed.
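
As a rough illustration of the idea behind override rules and keyword marking,
the Python sketch below checks each token against two hypothetical lookups
before stemming it. `toy_stem`, `OVERRIDES`, `KEYWORDS`, and `controlled_stem`
are assumptions made for this example; they only loosely mirror the behavior of
the `stemmer_override` and `keyword_marker` filters and are not their
implementation.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stemmer that maps a few variants to `ski` (illustration only)."""
    return {"skies": "ski", "skiing": "ski", "skis": "ski"}.get(token, token)


# Hypothetical controls, loosely mirroring the filters described above.
OVERRIDES = {"skies": "sky"}  # explicit token -> stem rules take priority
KEYWORDS = {"skiing"}         # protected tokens are never stemmed


def controlled_stem(token: str) -> str:
    """Apply overrides first, then keyword protection, then the stemmer."""
    if token in OVERRIDES:
        return OVERRIDES[token]
    if token in KEYWORDS:
        return token
    return toy_stem(token)


results = {t: controlled_stem(t) for t in ("skies", "skiing", "skis")}
```

With these controls, `skies` becomes `sky`, `skiing` is kept as-is, and only
`skis` is still reduced to `ski`.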