[[mapper-annotated-text]]
=== Mapper annotated text plugin

experimental[]

The mapper-annotated-text plugin provides the ability to index text that is a
combination of free-text and special markup that is typically used to identify
items of interest such as people or organisations (see NER or Named Entity Recognition
tools).

The Elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
stream at the same position as the underlying text it annotates.

:plugin_name: mapper-annotated-text
include::install_remove.asciidoc[]

[[mapper-annotated-text-usage]]
==== Using the `annotated_text` field

The `annotated_text` field tokenizes text content in the same way as the more common {ref}/text.html[`text`] field (see
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:

[source,console]
--------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}
--------------------------

Such a mapping allows marked-up text, e.g. Wikipedia articles, to be indexed as both text
and structured tokens. The annotations use a markdown-like syntax using URL encoding of
one or more values separated by the `&` symbol.

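As an illustration of this syntax, the following Python sketch builds an annotation from plain values. The helper name `annotate` is hypothetical and not part of the plugin. It assumes that the encoding produced by `urllib.parse.quote_plus` is acceptable to the plugin, and it conservatively refuses values that contain `=`, since such values cause documents to be rejected.

```python
from urllib.parse import quote_plus


def annotate(text, values):
    """Wrap `text` in annotation markup carrying one or more values.

    Hypothetical helper for illustration only; not part of the plugin.
    """
    if not values:
        raise ValueError("at least one annotation value is required")
    for value in values:
        # Equals signs in annotation values lead to a parse failure, so
        # this sketch refuses them up front (a conservative assumption).
        if "=" in value:
            raise ValueError("annotation values must not contain '='")
    # URL-encode each value and join multiple values with '&'.
    return "[%s](%s)" % (text, "&".join(quote_plus(v) for v in values))


print(annotate("Apple", ["Apple Inc."]))
print(annotate("Jeff Beck", ["Jeff Beck", "Guitarist"]))
```

The anchor text inside the square brackets is passed through as-is, so this sketch assumes it contains no square brackets of its own.
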
We can use the `_analyze` API to test how an example annotation would be stored as tokens
in the search index:

[source,js]
--------------------------
GET my-index-000001/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}
--------------------------
// NOTCONSOLE

Response:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", <1>
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE

<1> Note the whole annotation token `Apple Inc.` is placed, unchanged, as a single token in
the token stream and at the same position (position 2) as the text token (`apple`) it annotates.

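To make the offsets in this response easier to follow, here is a rough Python sketch that separates markup into plain text and annotation values with their character offsets. The function name `strip_annotations` and the regular expression are assumptions made for illustration and may not match what the plugin does internally.

```python
import re
from urllib.parse import unquote_plus

# Assumed pattern: [anchor text](value1&value2), with no nested brackets.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")


def strip_annotations(markup):
    """Return (plain_text, annotations) for a marked-up string.

    Each annotation is a (start_offset, end_offset, value) tuple whose
    offsets refer to the plain text, i.e. the text with markup removed.
    """
    plain, annotations = [], []
    pos = 0      # read position in the marked-up input
    out_len = 0  # length of the plain text produced so far
    for match in ANNOTATION.finditer(markup):
        before = markup[pos:match.start()]
        plain.append(before)
        out_len += len(before)
        anchor = match.group(1)
        start = out_len
        plain.append(anchor)
        out_len += len(anchor)
        # Several values may share the same span, separated by '&'.
        for value in match.group(2).split("&"):
            if value:
                annotations.append((start, out_len, unquote_plus(value)))
        pos = match.end()
    plain.append(markup[pos:])
    return "".join(plain), annotations


text, found = strip_annotations("Investors in [Apple](Apple+Inc.) rejoiced.")
print(text)
print(found)
```

The single annotation found here spans offsets 13 to 18 of the plain text, which agrees with the `start_offset` and `end_offset` reported for both the `Apple Inc.` and `apple` tokens.
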
We can now perform searches for annotations using regular `term` queries that don't tokenize
the provided search values. Annotations are a more precise way of matching, as can be seen
in this example where a search for `Beck` will not match `Jeff Beck`:

[source,console]
--------------------------
# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"<1>
}

PUT my-index-000001/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"<2>
}

# Example search
GET my-index-000001/_search
{
  "query": {
    "term": {
        "my_field": "Beck" <3>
    }
  }
}
--------------------------

<1> As well as tokenising the plain text into single words e.g. `beck`, here we
inject the single token value `Beck` at the same position as `beck` in the token stream.
<2> Note annotations can inject multiple tokens at the same position - here we inject both
the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
<3> A benefit of searching with these carefully defined annotation tokens is that a query for
`Beck` will not match document 2 that contains the tokens `jeff`, `beck` and `Jeff Beck`.

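The matching behaviour described in these callouts can be imitated in a few lines of Python. This is only a sketch: `index_terms` and `term_query` are hypothetical names, and the lower-casing word split is a crude stand-in for the analyzer, not its real implementation.

```python
import re
from urllib.parse import unquote_plus

# Assumed pattern: [anchor text](value1&value2), with no nested brackets.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")

DOCS = {
    1: "[Beck](Beck) announced a new tour",
    2: "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat",
}


def index_terms(markup):
    """Approximate the set of terms one document contributes to the index."""
    terms = set()
    # Annotation values are indexed unchanged: case and spaces are kept.
    for match in ANNOTATION.finditer(markup):
        terms.update(unquote_plus(v) for v in match.group(2).split("&") if v)
    # Plain text is split into lower-cased words (a crude analyzer stand-in).
    plain = ANNOTATION.sub(lambda m: m.group(1), markup)
    terms.update(word.lower() for word in re.findall(r"\w+", plain))
    return terms


def term_query(term):
    """Return ids of documents containing exactly this untokenized term."""
    return sorted(doc_id for doc_id, markup in DOCS.items()
                  if term in index_terms(markup))


print(term_query("Beck"))       # annotation token, document 1 only
print(term_query("beck"))       # plain text token, both documents
print(term_query("Guitarist"))  # broader annotation, document 2 only
```
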
WARNING: Any use of `=` signs in annotation values, e.g. `[Prince](person=Prince)`, will
cause the document to be rejected with a parse failure. We hope to have a use for
equals signs in the future, so documents that contain them are actively rejected today.

[[annotated-text-synthetic-source]]
===== Synthetic `_source`

IMPORTANT: Synthetic `_source` is Generally Available only for TSDB indices
(indices that have `index.mode` set to `time_series`). For other indices
synthetic `_source` is in technical preview. Features in technical preview may
be changed or removed in a future release. Elastic will work to fix
any issues, but features in technical preview are not subject to the support SLA
of official GA features.

`annotated_text` fields support {ref}/mapping-source-field.html#synthetic-source[synthetic `_source`] if they have
a {ref}/keyword.html#keyword-synthetic-source[`keyword`] sub-field that supports synthetic
`_source` or if the `annotated_text` field sets `store` to `true`. Either way, it may
not have {ref}/copy-to.html[`copy_to`].

If using a sub-`keyword` field then the values are sorted in the same way as
a `keyword` field's values are sorted. By default, that means sorted with
duplicates removed. So:

[source,console,id=synthetic-source-text-example-default]
----
PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "annotated_text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
  "text": [
    "jumped over the lazy dog",
    "the quick brown fox"
  ]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

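The effect on the returned values can be approximated with a short Python sketch. The function name `synthetic_keyword_values` is hypothetical, and Python's default string ordering (by code point) is assumed to be close enough to the `keyword` sort order for this example.

```python
def synthetic_keyword_values(values):
    """Approximate what comes back from a default `keyword` sub-field."""
    # Remove duplicates, then sort; Python orders strings by code point.
    return sorted(set(values))


original = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]
print(synthetic_keyword_values(original))
```
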
NOTE: Reordering text fields can have an effect on {ref}/query-dsl-match-query-phrase.html[phrase]
and {ref}/span-queries.html[span] queries. See the discussion about {ref}/position-increment-gap.html[`position_increment_gap`] for more detail. You
can avoid this by making sure the `slop` parameter on the phrase queries
is lower than the `position_increment_gap`. This is the default.

If the `annotated_text` field sets `store` to `true` then order and duplicates
are preserved.

[source,console,id=synthetic-source-text-example-stored]
----
PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "annotated_text", "store": true }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

[[mapper-annotated-text-tips]]
==== Data modelling tips

===== Use structured and unstructured fields

Annotations are normally a way of weaving structured information into unstructured text for
higher-precision search.

`Entity resolution` is a form of document enrichment undertaken by specialist software or people
where references to entities in a document are disambiguated by attaching a canonical ID.
The ID is used to resolve any number of aliases or distinguish between people with the
same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
entity IDs woven into text.

These IDs can be embedded as annotations in an `annotated_text` field but it often makes
sense to include them in dedicated structured fields to support discovery via aggregations:

[source,console]
--------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_unstructured_text_field": {
        "type": "annotated_text"
      },
      "my_twitter_handles": {
        "type": "text",
        "fields": {
          "keyword" : {
            "type": "keyword"
          }
        }
      }
    }
  }
}
--------------------------

Applications would then typically provide content and discover it as follows:

[source,console]
--------------------------
# Example documents
PUT my-index-000001/_doc/1
{
  "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
  "my_twitter_handles": ["@kimchy"] <1>
}

GET my-index-000001/_search
{
  "query": {
    "query_string": {
        "query": "elasticsearch OR logstash OR kibana",<2>
        "default_field": "my_unstructured_text_field"
    }
  },
  "aggregations": {
    "top_people" :{
        "significant_terms" : { <3>
          "field" : "my_twitter_handles.keyword"
        }
    }
  }
}
--------------------------

<1> Note the `my_twitter_handles` field contains a list of the annotation values
also used in the unstructured text. (Note that the `annotated_text` syntax requires URL
escaping, so `@kimchy` is written as `%40kimchy` in the markup.)
By repeating the annotation values in a structured field this application has ensured that
the tokens discovered in the structured field can be used for search and highlighting
in the unstructured field.
<2> In this example we search for documents that talk about components of the Elastic Stack.
<3> We use the `my_twitter_handles` field here to discover people who are significantly
associated with the Elastic Stack.

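One way an application might keep the two fields in step is to build them from the same input. The Python sketch below is only an illustration: `build_document` is a hypothetical helper, and the field names are simply the ones used in the example request.

```python
from urllib.parse import quote_plus


def build_document(parts):
    """Build a document body from plain strings and (text, handle) pairs.

    Hypothetical helper: each (text, handle) pair becomes an annotation in
    the unstructured field and an entry in the structured field.
    """
    pieces, handles = [], []
    for part in parts:
        if isinstance(part, tuple):
            text, handle = part
            # The markup carries the URL-encoded value ...
            pieces.append("[%s](%s)" % (text, quote_plus(handle)))
            # ... while the structured field keeps the raw value, once.
            if handle not in handles:
                handles.append(handle)
        else:
            pieces.append(part)
    return {
        "my_unstructured_text_field": "".join(pieces),
        "my_twitter_handles": handles,
    }


doc = build_document([("Shay", "@kimchy"), " created elasticsearch"])
print(doc)
```

Because the structured field stores the unencoded value, a term picked from the aggregation results can be used directly in a `term` query against the annotated field.
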
===== Avoiding over-matching annotations

By design, the regular text tokens and the annotation tokens co-exist in the same indexed
field but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a _named entity_ (a person, place or company).
The tokens for these named entities are inserted untokenized, and differ from typical text
tokens because they are normally:

* Mixed case e.g. `Madonna`
* Multiple words e.g. `Jeff Beck`
* Punctuated or containing numbers e.g. `Apple Inc.` or `@kimchy`

This means, for the most part, a search for a named entity in the annotated text field will
not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
you can drill down to highlight uses in the text without "over matching" on any text tokens
like the word `apple` in this context:

    the apple was very juicy

However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
company `elastic`. In this case, a search on the annotated text field for the token `elastic`
may match a text document such as this:

    they fired an elastic band

To avoid such false matches users should consider prefixing annotation values to ensure
they don't clash with text tokens e.g.

    [elastic](Company_elastic) released version 7.0 of the elastic stack today

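A prefixing rule like this could be applied automatically when annotations are generated. The Python sketch below is one possible heuristic, not a rule defined by the plugin: the names `needs_prefix` and `safe_annotation_value` are hypothetical, and the test for "looks like an ordinary text token" (a single run of lower-case ASCII letters) is a simplifying assumption.

```python
import re


def needs_prefix(value):
    """True if the value could be mistaken for an ordinary text token.

    Assumed heuristic: a single run of lower-case ASCII letters.
    """
    return re.fullmatch(r"[a-z]+", value) is not None


def safe_annotation_value(value, entity_type):
    """Prefix risky values with their entity type, e.g. Company_elastic."""
    if needs_prefix(value):
        return "%s_%s" % (entity_type, value)
    return value


print(safe_annotation_value("elastic", "Company"))
print(safe_annotation_value("Apple Inc.", "Company"))
```

Values that already contain upper-case letters, spaces or punctuation are returned unchanged, since they are unlikely to collide with text tokens.
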
[[mapper-annotated-text-highlighter]]
==== Using the `annotated` highlighter

The `mapper-annotated-text` plugin includes a custom highlighter designed to mark up search hits
in a way which is respectful of the original markup:

[source,console]
--------------------------
# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "The cat sat on the [mat](sku3578)"
}

GET my-index-000001/_search
{
  "query": {
    "query_string": {
        "query": "cat"
    }
  },
  "highlight": {
    "fields": {
      "my_field": {
        "type": "annotated", <1>
        "require_field_match": false
      }
    }
  }
}
--------------------------

<1> The `annotated` highlighter type is designed for use with `annotated_text` fields.

The annotated highlighter is based on the `unified` highlighter and supports the same
settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
HTML-like markup such as `<em>cat</em>` the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
is the key and the matched search term is the value e.g.

    The [cat](_hit_term=cat) sat on the [mat](sku3578)

The annotated highlighter tries to be respectful of any existing markup in the original
text:

* If the search term matches exactly the location of an existing annotation then the
`_hit_term` key is merged into the URL-like syntax used in the `(...)` part of the
existing annotation.
* However, if the search term overlaps the span of an existing annotation it would break
the markup formatting so the original annotation is removed in favour of a new annotation
with just the search hit information in the results.
* Any non-overlapping annotations in the original text are preserved in highlighter
selections.

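The three rules above can be sketched in Python as follows. This is not the plugin's implementation: the function name `highlight`, the tuple layouts, and the order in which `_hit_term` and existing values are joined when merging are all assumptions made for illustration. The sketch also assumes that hits do not overlap one another and that the original annotations do not overlap one another.

```python
from urllib.parse import quote_plus


def highlight(text, annotations, hits):
    """Re-apply markup to `text` following the three rules above.

    text        -- plain text with the original markup already removed
    annotations -- list of (start, end, [values]) from the original markup
    hits        -- list of non-overlapping (start, end, term) search hits
    """
    hit_spans = {(start, end) for start, end, _ in hits}
    spans = {}
    for start, end, values in annotations:
        overlaps_hit = any(h_start < end and start < h_end
                           for h_start, h_end, _ in hits)
        if not overlaps_hit or (start, end) in hit_spans:
            # Kept: either untouched by any hit, or an exact match that
            # is merged with the hit below.
            spans[(start, end)] = [quote_plus(v) for v in values]
        # Otherwise the annotation partially overlaps a hit and is dropped.
    for start, end, term in hits:
        merged = ["_hit_term=" + quote_plus(term)] + spans.get((start, end), [])
        spans[(start, end)] = merged
    out, pos = [], 0
    for (start, end), values in sorted(spans.items()):
        out.append(text[pos:start])
        out.append("[%s](%s)" % (text[start:end], "&".join(values)))
        pos = end
    out.append(text[pos:])
    return "".join(out)


plain = "The cat sat on the mat"
original = [(19, 22, ["sku3578"])]
print(highlight(plain, original, [(4, 7, "cat")]))
print(highlight(plain, original, [(19, 22, "mat")]))
```

The first call leaves the existing `mat` annotation untouched and adds a new one for `cat`, while the second merges the hit into the existing annotation because the spans coincide exactly.
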
[[mapper-annotated-text-limitations]]
==== Limitations

The `annotated_text` field type supports the same mapping settings as the `text` field type
but with the following exceptions:

* No support for `fielddata` or `fielddata_frequency_filter`
* No support for `index_prefixes` or `index_phrases` indexing