[[analysis-overview]]
== Text analysis overview
++++
<titleabbrev>Overview</titleabbrev>
++++

Text analysis enables {es} to perform full-text search, where the search returns
all _relevant_ results rather than just exact matches.

If you search for `Quick fox jumps`, you probably want the document that
contains `A quick brown fox jumps over the lazy dog`, and you might also want
documents that contain related words like `fast fox` or `foxes leap`.

[discrete]
[[tokenization]]
=== Tokenization

Analysis makes full-text search possible through _tokenization_: breaking a text
down into smaller chunks, called _tokens_. In most cases, these tokens are
individual words.

If you index the phrase `the quick brown fox jumps` as a single string and the
user searches for `quick fox`, it isn't considered a match. However, if you
tokenize the phrase and index each word separately, the terms in the query
string can be looked up individually. This means they can be matched by searches
for `quick fox`, `fox brown`, or other variations.
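
To see tokenization by itself, you can run a phrase through the `_analyze` API
with only a tokenizer and no normalization. This is a minimal sketch using the
`standard` tokenizer:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "text": "the quick brown fox jumps"
}
----

The response lists the individual tokens `the`, `quick`, `brown`, `fox`, and
`jumps`, each with its position and character offsets in the original string.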

[discrete]
[[normalization]]
=== Normalization

Tokenization enables matching on individual terms, but each token is still
matched literally. This means:

* A search for `Quick` would not match `quick`, even though you likely want
either term to match the other.
* Although `fox` and `foxes` share the same root word, a search for `foxes`
would not match `fox` or vice versa.
* A search for `jumps` would not match `leaps`. While they don't share a root
word, they are synonyms and have a similar meaning.

To solve these problems, text analysis can _normalize_ these tokens into a
standard format. This allows you to match tokens that are not exactly the same
as the search terms, but similar enough to still be relevant. For example:

* `Quick` can be lowercased: `quick`.
* `foxes` can be _stemmed_, or reduced to its root word: `fox`.
* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.

To ensure search terms match these words as intended, you can apply the same
tokenization and normalization rules to the query string. For example, a search
for `Foxes leap` can be normalized to a search for `fox jump`.
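
You can try normalization with the same `_analyze` API by adding token filters
after the tokenizer. This sketch chains the built-in `lowercase` and `stemmer`
token filters; synonym handling is omitted here because it requires a
separately configured synonym filter:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stemmer" ],
  "text": "Foxes leap"
}
----

This returns the tokens `fox` and `leap`. Because the same rules can be applied
to the query string, a search for `Foxes leap` would match text indexed as
`foxes` or `leaping`, which normalize to the same tokens.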

[discrete]
[[analysis-customization]]
=== Customize text analysis

Text analysis is performed by an <<analyzer-anatomy,_analyzer_>>, a set of rules
that govern the entire process.

{es} includes a default analyzer, called the
<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
cases right out of the box.

If you want to tailor your search experience, you can choose a different
<<analysis-analyzers,built-in analyzer>> or even
<<analysis-custom-analyzer,configure a custom one>>. A custom analyzer gives you
control over each step of the analysis process (see the sketch after this
list), including:

* Changes to the text _before_ tokenization
* How text is converted to tokens
* Normalization changes made to tokens before indexing or search
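
Each of these steps corresponds to a configurable component: character filters,
a tokenizer, and token filters. As a minimal sketch, assuming a hypothetical
index named `my-index` and analyzer named `my_custom_analyzer`, a custom
analyzer might be configured like this:

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stemmer" ]
        }
      }
    }
  }
}
----

Here the `html_strip` character filter changes the text before tokenization,
the `standard` tokenizer converts it to tokens, and the `lowercase` and
`stemmer` token filters normalize those tokens before indexing or search.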