[[analysis]]
= Analysis

[partintro]
--

_Analysis_ is the process of converting text, like the body of any email, into
_tokens_ or _terms_ which are added to the inverted index for searching.
Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
defined per index.

[float]
== Index time analysis

For instance, at index time, the built-in <<english-analyzer,`english`>> _analyzer_ would
convert this sentence:

[source,text]
------
"The QUICK brown foxes jumped over the lazy dog!"
------

into these terms, which would be added to the inverted index:

[source,text]
------
[ quick, brown, fox, jump, over, lazi, dog ]
------
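
The conversion above can be checked against a running cluster. As a minimal
sketch, assuming the analyze API (`_analyze`) is reachable and the built-in
`english` analyzer has not been overridden, a request like the following should
return the terms listed above:

[source,js]
-------------------------
POST _analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the lazy dog!"
}
-------------------------
// CONSOLE
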

[float]
=== Specifying an index time analyzer

Each <<text,`text`>> field in a mapping can specify its own
<<analyzer,`analyzer`>>:

[source,js]
-------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}
-------------------------
// CONSOLE

At index time, if no `analyzer` has been specified for a field, an analyzer
in the index settings called `default` is used. Failing that, the field is
analyzed with the <<analysis-standard-analyzer,`standard` analyzer>>.
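
As an illustrative sketch of this fallback, an index could register an analyzer
named `default` in its settings. The choice of `english` here is only an
assumption made for the example; any built-in or custom analyzer definition
could take its place:

[source,js]
-------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "english"
        }
      }
    }
  }
}
-------------------------
// CONSOLE

With settings like these, `text` fields in `my_index` that do not specify their
own `analyzer` would be analyzed with `english` rather than `standard`.
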

[float]
== Search time analysis

This same analysis process is applied to the query string at search time in
<<full-text-queries,full text queries>> like the
<<query-dsl-match-query,`match` query>>
to convert the text in the query string into terms of the same form as those
that are stored in the inverted index.

For instance, a user might search for:

[source,text]
------
"a quick fox"
------

which would be analyzed by the same `english` analyzer into the following terms:

[source,text]
------
[ quick, fox ]
------

Even though the exact words used in the query string don't appear in the
original text (`quick` vs `QUICK`, `fox` vs `foxes`), because we have applied
the same analyzer to both the text and the query string, the terms from the
query string exactly match the terms from the text in the inverted index,
which means that this query would match our example document.
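
The whole round trip can be sketched as three requests. This is a hedged
example rather than a prescribed setup: the index name `my_index`, the type
`my_type`, the document ID `1`, and the use of the `english` analyzer on the
`title` field are all assumptions chosen to mirror the example sentence.

[source,js]
-------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

PUT my_index/my_type/1?refresh=true
{
  "title": "The QUICK brown foxes jumped over the lazy dog!"
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": "a quick fox"
    }
  }
}
-------------------------
// CONSOLE

Because the `match` query looks up the analyzer from the mapping, the query
string is reduced to `quick` and `fox`, both of which are present in the
indexed terms for the document.
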

[float]
=== Specifying a search time analyzer

Usually the same analyzer should be used both at
index time and at search time, and <<full-text-queries,full text queries>>
like the <<query-dsl-match-query,`match` query>> will use the mapping to look
up the analyzer to use for each field.

The analyzer to use to search a particular field is determined by
looking for the following, in order:

* An `analyzer` specified in the query itself.
* The <<search-analyzer,`search_analyzer`>> mapping parameter.
* The <<analyzer,`analyzer`>> mapping parameter.
* An analyzer in the index settings called `default_search`.
* An analyzer in the index settings called `default`.
* The `standard` analyzer.
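
The sketch below shows where the first few of these settings live. The
specific analyzers (`simple`, `english`, `standard`) are arbitrary choices made
for illustration, not recommendations:

[source,js]
-------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_search": {
          "type": "simple"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "a quick fox",
        "analyzer": "english"
      }
    }
  }
}
-------------------------
// CONSOLE

In this query the `analyzer` given in the `match` clause is used. If it were
omitted, the `search_analyzer` from the `title` mapping would apply instead,
and a `text` field with neither `search_analyzer` nor `analyzer` in its mapping
would fall back to `default_search`.
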

--

include::analysis/anatomy.asciidoc[]

include::analysis/testing.asciidoc[]

include::analysis/analyzers.asciidoc[]

include::analysis/normalizers.asciidoc[]

include::analysis/tokenizers.asciidoc[]

include::analysis/tokenfilters.asciidoc[]

include::analysis/charfilters.asciidoc[]