[[token-graphs]]
=== Token graphs

When a <<analyzer-anatomy-tokenizer,tokenizer>> converts text into a stream of
tokens, it also records the following:

* The `position` of each token in the stream
* The `positionLength`, the number of positions that a token spans

Using these, you can create a
{wikipedia}/Directed_acyclic_graph[directed acyclic graph],
called a _token graph_, for a stream. In a token graph, each position represents
a node. Each token represents an edge, or arc, pointing from its own position to
the position where the token ends. For a token with the default
`positionLength` of `1`, this is the next position.

image::images/analysis/token-graph-qbf-ex.svg[align="center"]
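
As a rough illustration of this model, the following Python sketch turns a list
of tokens into graph edges. The `Token` class, the `to_edges` helper, and the
sample stream for `quick brown fox` are assumptions made for this example. They
are not part of Elasticsearch or Lucene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token in an analyzed stream (illustrative model, not a Lucene class)."""
    term: str
    position: int
    position_length: int = 1  # default: the token spans a single position


def to_edges(tokens):
    """Return (from_node, to_node, term) edges.

    A token starts at its position and ends position_length positions later.
    """
    return [(t.position, t.position + t.position_length, t.term) for t in tokens]


# Assumed stream for the text "quick brown fox"
stream = [Token("quick", 0), Token("brown", 1), Token("fox", 2)]
edges = to_edges(stream)
for edge in edges:
    print(edge)
```

Each printed tuple is one arc of the graph: the node where the token starts,
the node where it ends, and the term it carries.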

[[token-graphs-synonyms]]
==== Synonyms

Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like
synonyms, to an existing token stream. These synonyms often span the same
positions as existing tokens.

In the following graph, `quick` and its synonym `fast` both have a position of
`0`. They span the same positions.

image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"]

[[token-graphs-multi-position-tokens]]
==== Multi-position tokens

Some token filters can add tokens that span multiple positions. These can
include tokens for multi-word synonyms, such as using "atm" as a synonym for
"automatic teller machine."

However, only some token filters, known as _graph token filters_, accurately
record the `positionLength` for multi-position tokens. These filters include:

* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>>
* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>>

Some tokenizers, such as the
{plugins}/analysis-nori-tokenizer.html[`nori_tokenizer`], also accurately
decompose compound tokens into multi-position tokens.

In the following graph, `domain name system` and its synonym, `dns`, both have a
position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in
the graph have a default `positionLength` of `1`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]
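
The `dns` graph described above can be written out as plain data. This is a
minimal sketch, assuming tokens are represented as `(term, position,
positionLength)` tuples. The representation and the `arcs` mapping are
illustrative only and do not reflect how Elasticsearch stores tokens.

```python
from collections import defaultdict

# (term, position, positionLength) tuples; values follow the graph described above.
tokens = [
    ("dns", 0, 3),
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
]

# Map each start node to the (term, end node) arcs that leave it.
arcs = defaultdict(list)
for term, position, position_length in tokens:
    arcs[position].append((term, position + position_length))

for start in sorted(arcs):
    print(start, arcs[start])
```

Because `dns` records a `positionLength` of `3`, its arc ends at node `3`, the
same node that `system` reaches. Both routes through the graph therefore
rejoin at the same point.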

[[token-graphs-token-graphs-search]]
===== Using token graphs for search

<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute
and does not support token graphs containing multi-position tokens.

However, queries, such as the <<query-dsl-match-query,`match`>> or
<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to
generate multiple sub-queries from a single query string.

.*Example*
[%collapsible]
====

A user runs a search for the following phrase using the `match_phrase` query:

`domain name system is fragile`

During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for
`domain name system`, is added to the query string's token stream. The `dns`
token has a `positionLength` of `3`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]

The `match_phrase` query uses this graph to generate sub-queries for the
following phrases:

[source,text]
------
dns is fragile
domain name system is fragile
------

This means the query matches documents containing either `dns is fragile` _or_
`domain name system is fragile`.
====
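
One way to picture how a graph yields several phrases is to enumerate every
path from the first node to the last. The Python sketch below does this for the
example query. It is a simplified model, not the query parser that
Elasticsearch uses, and the `phrases` function and tuple layout are assumptions
made for illustration.

```python
def phrases(tokens, start=0):
    """Enumerate every path through a token graph as a phrase.

    tokens: list of (term, position, position_length) tuples.
    """
    # The final node is the furthest point any token reaches.
    end = max(pos + length for _, pos, length in tokens)

    def walk(node):
        if node == end:
            yield []
            return
        # Follow every token that starts at this node.
        for term, pos, length in tokens:
            if pos == node:
                for rest in walk(pos + length):
                    yield [term] + rest

    return [" ".join(path) for path in walk(start)]


query_tokens = [
    ("dns", 0, 3),
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]
result = phrases(query_tokens)
print(result)
```

In this model, the two paths correspond to the two phrases listed in the
example: one that follows the `dns` arc and one that follows `domain`, `name`,
and `system`.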

[[token-graphs-invalid-token-graphs]]
===== Invalid token graphs

The following token filters can add tokens that span multiple positions but
only record a default `positionLength` of `1`:

* <<analysis-synonym-tokenfilter,`synonym`>>
* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>>

This means these filters will produce invalid token graphs for streams
containing such tokens.

In the following graph, `dns` is a multi-position synonym for `domain name
system`. However, `dns` has the default `positionLength` value of `1`, resulting
in an invalid graph.

image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"]

Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
search results.
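
To see why a wrong `positionLength` matters, the following Python sketch lists
the phrases that a simple path enumeration produces for the query `domain name
system is fragile`. It runs once with `dns` recorded at a `positionLength` of
`3` and once at the default of `1`. The `phrases` helper and the tuple layout
are hypothetical. This is a simplified model and not the logic Elasticsearch
uses to build queries.

```python
def phrases(tokens):
    """List every path through a token graph as a phrase (simplified model).

    tokens: list of (term, position, position_length) tuples.
    """
    end = max(pos + length for _, pos, length in tokens)

    def walk(node):
        if node == end:
            return [[]]
        return [
            [term] + rest
            for term, pos, length in tokens
            if pos == node
            for rest in walk(pos + length)
        ]

    return [" ".join(path) for path in walk(0)]


rest_of_query = [
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]

# positionLength recorded as 3
valid = phrases([("dns", 0, 3)] + rest_of_query)
# positionLength left at the default of 1
invalid = phrases([("dns", 0, 1)] + rest_of_query)

print(valid)
print(invalid)
```

In this model, the invalid graph never produces `dns is fragile`. Instead, it
produces `dns name system is fragile`, a phrase the user did not intend, which
is one way an invalid graph can lead to unexpected results.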