[[mapper-annotated-text]]
=== Mapper Annotated Text Plugin

experimental[]

The mapper-annotated-text plugin provides the ability to index text that is a
combination of free-text and special markup that is typically used to identify
items of interest such as people or organisations (see NER or Named Entity Recognition
tools).

The Elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
stream at the same position as the underlying text it annotates.

:plugin_name: mapper-annotated-text
include::install_remove.asciidoc[]

[[mapper-annotated-text-usage]]
==== Using the `annotated_text` field

The `annotated_text` field tokenizes text content in the same way as the more common `text` field (see
"Limitations" below) but also injects any marked-up annotation tokens directly into
the search index:

[source,console]
--------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}
--------------------------

Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text
and structured tokens.
The annotations use a markdown-like syntax using URL encoding of
one or more values separated by the `&` symbol.

We can use the `_analyze` API to test how an example annotation would be stored as tokens
in the search index:

[source,js]
--------------------------
GET my-index-000001/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}
--------------------------
// NOTCONSOLE

Response:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", <1>
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE

<1> Note the whole annotation token `Apple Inc.` is placed, unchanged, as a single token in
the token stream and at the same position (position 2) as the text token (`apple`) it annotates.

We can now perform searches for annotations using regular `term` queries that don't tokenize
the provided search values.
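
When annotated text is generated by an application, the markup shown above can be
assembled with standard URL encoding. The following Python sketch is illustrative only:
the helper name `annotate` is hypothetical and not part of the plugin, and it assumes
form-style encoding (spaces written as `+`), which matches the `Apple+Inc.` example.

```python
from urllib.parse import quote_plus


def annotate(text, *values):
    """Wrap `text` in annotated_text markup carrying one or more values.

    Hypothetical helper, not part of the plugin. Each value is URL-encoded
    and the encoded values are joined with `&`.
    """
    if not values:
        raise ValueError("at least one annotation value is required")
    encoded = "&".join(quote_plus(value) for value in values)
    return f"[{text}]({encoded})"


print(annotate("Apple", "Apple Inc."))                  # [Apple](Apple+Inc.)
print(annotate("Jeff Beck", "Jeff Beck", "Guitarist"))  # [Jeff Beck](Jeff+Beck&Guitarist)
print(annotate("Shay", "@kimchy"))                      # [Shay](%40kimchy)
```

The sketch only encodes the annotation values; it does not attempt to escape the
bracketed text itself.
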
Annotations are a more precise way of matching, as can be seen in this example where a search for `Beck`
will not match `Jeff Beck`:

[source,console]
--------------------------
# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"<1>
}

PUT my-index-000001/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"<2>
}

# Example search
GET my-index-000001/_search
{
  "query": {
    "term": {
        "my_field": "Beck" <3>
    }
  }
}
--------------------------

<1> As well as tokenising the plain text into single words e.g. `beck`, here we
inject the single token value `Beck` at the same position as `beck` in the token stream.
<2> Note annotations can inject multiple tokens at the same position - here we inject both
the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
<3> A benefit of searching with these carefully defined annotation tokens is that a query for
`Beck` will not match document 2, which contains the tokens `jeff`, `beck` and `Jeff Beck`.

WARNING: Any use of `=` signs in annotation values e.g. `[Prince](person=Prince)` will
cause the document to be rejected with a parse failure. In future we hope to have a use for
the equals signs, so we actively reject documents that contain them today.

[[mapper-annotated-text-tips]]
==== Data modelling tips
===== Use structured and unstructured fields

Annotations are normally a way of weaving structured information into unstructured text for
higher-precision search.

`Entity resolution` is a form of document enrichment undertaken by specialist software or people
where references to entities in a document are disambiguated by attaching a canonical ID.
The ID is used to resolve any number of aliases or distinguish between people with the
same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
entity IDs woven into text.
These IDs can be embedded as annotations in an annotated_text field but it often makes
sense to include them in dedicated structured fields to support discovery via aggregations:

[source,console]
--------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_unstructured_text_field": {
        "type": "annotated_text"
      },
      "my_twitter_handles": {
        "type": "text",
        "fields": {
          "keyword" : {
            "type": "keyword"
          }
        }
      }
    }
  }
}
--------------------------

Applications would then typically provide content and discover it as follows:

[source,console]
--------------------------
# Example documents
PUT my-index-000001/_doc/1
{
  "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
  "my_twitter_handles": ["@kimchy"] <1>
}

GET my-index-000001/_search
{
  "query": {
    "query_string": {
        "query": "elasticsearch OR logstash OR kibana",<2>
        "default_field": "my_unstructured_text_field"
    }
  },
  "aggregations": {
    "top_people" :{
        "significant_terms" : { <3>
           "field" : "my_twitter_handles.keyword"
        }
    }
  }
}
--------------------------

<1> Note the `my_twitter_handles` field contains a list of the annotation values
also used in the unstructured text. (Note that the annotated_text syntax requires URL escaping,
so `@kimchy` is written as `%40kimchy`.) By repeating the annotation values in a structured
field, this application has ensured that the tokens discovered in the structured field can be
used for search and highlighting in the unstructured field.
<2> In this example we search for documents that talk about components of the Elastic Stack.
<3> We use the `my_twitter_handles` field here to discover people who are significantly
associated with the Elastic Stack.

===== Avoiding over-matching annotations
By design, the regular text tokens and the annotation tokens co-exist in the same indexed
field but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a _named entity_ (a person, place or company).
The tokens for these named entities are inserted untokenized, and differ from typical text
tokens because they are normally:

* Mixed case e.g. `Madonna`
* Made up of multiple words e.g. `Jeff Beck`
* Able to contain punctuation or numbers e.g. `Apple Inc.` or `@kimchy`

This means, for the most part, a search for a named entity in the annotated text field will
not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
you can drill down to highlight uses in the text without "over matching" on any text tokens
like the word `apple` in this context:

    the apple was very juicy

However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
company `elastic`. In this case, a search on the annotated text field for the token `elastic`
may match a text document such as this:

    they fired an elastic band

To avoid such false matches users should consider prefixing annotation values to ensure
they don't clash with text tokens e.g.


    [elastic](Company_elastic) released version 7.0 of the elastic stack today

[[mapper-annotated-text-highlighter]]
==== Using the `annotated` highlighter

The `mapper-annotated-text` plugin includes a custom highlighter designed to mark up search hits
in a way which is respectful of the original markup:

[source,console]
--------------------------
# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "The cat sat on the [mat](sku3578)"
}

GET my-index-000001/_search
{
  "query": {
    "query_string": {
        "query": "cat"
    }
  },
  "highlight": {
    "fields": {
      "my_field": {
        "type": "annotated", <1>
        "require_field_match": false
      }
    }
  }
}
--------------------------
<1> The `annotated` highlighter type is designed for use with annotated_text fields

The annotated highlighter is based on the `unified` highlighter and supports the same
settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
html-like markup such as `<em>cat</em>` the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
is the key and the matched search term is the value e.g.

    The [cat](_hit_term=cat) sat on the [mat](sku3578)

The annotated highlighter tries to be respectful of any existing markup in the original
text:

* If the search term matches exactly the location of an existing annotation then the
`_hit_term` key is merged into the url-like syntax used in the `(...)` part of the
existing annotation.
* However, if the search term overlaps the span of an existing annotation it would break
the markup formatting so the original annotation is removed in favour of a new annotation
with just the search hit information in the results.
* Any non-overlapping annotations in the original text are preserved in highlighter
selections

[[mapper-annotated-text-limitations]]
==== Limitations

The annotated_text field type supports the same mapping settings as the `text` field type
but with the following exceptions:

* No support for `fielddata` or `fielddata_frequency_filter`
* No support for `index_prefixes` or `index_phrases` indexing