- [[mapper-annotated-text]]
- === Mapper annotated text plugin
- experimental[]
- The mapper-annotated-text plugin provides the ability to index text that is a
- combination of free-text and special markup that is typically used to identify
- items of interest such as people or organisations (see NER or Named Entity Recognition
- tools).
- The Elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
- stream at the same position as the underlying text it annotates.
- :plugin_name: mapper-annotated-text
- include::install_remove.asciidoc[]
- [[mapper-annotated-text-usage]]
- ==== Using the `annotated_text` field
- The `annotated_text` field tokenizes text content in the same way as the more common {ref}/text.html[`text`] field (see
- "limitations" below) but also injects any marked-up annotation tokens directly into
- the search index:
- [source,console]
- --------------------------
- PUT my-index-000001
- {
- "mappings": {
- "properties": {
- "my_field": {
- "type": "annotated_text"
- }
- }
- }
- }
- --------------------------
- Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text
- and structured tokens. The annotations use a markdown-like syntax using URL encoding of
- one or more values separated by the `&` symbol.
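As a rough illustration of this syntax, the Python sketch below builds an annotation string by URL-encoding each value and joining the values with `&`. The helper name `annotate` is hypothetical and is not part of the plugin or of any Elasticsearch client. It also refuses values containing `=`, on the assumption that failing early is preferable to having the document rejected at index time.

```python
from urllib.parse import quote_plus


def annotate(text, *values):
    """Return `[text](v1&v2...)` with each annotation value URL-encoded.

    Hypothetical helper, for illustration only.
    """
    if not values:
        raise ValueError("at least one annotation value is required")
    for value in values:
        if "=" in value:
            # `=` in annotation values causes a parse failure, so fail early.
            raise ValueError("annotation values must not contain '=': %r" % value)
    # Spaces become `+` and reserved characters such as `@` are percent-encoded.
    return "[%s](%s)" % (text, "&".join(quote_plus(v) for v in values))


print(annotate("Apple", "Apple Inc."))                  # [Apple](Apple+Inc.)
print(annotate("Jeff Beck", "Jeff Beck", "Guitarist"))  # [Jeff Beck](Jeff+Beck&Guitarist)
print(annotate("Shay", "@kimchy"))                      # [Shay](%40kimchy)
```

The sketch does not escape `[`, `]`, `(` or `)` inside `text`, so it is only suitable for simple annotated spans.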
- We can use the `_analyze` API to test how an example annotation would be stored as tokens
- in the search index:
- [source,js]
- --------------------------
- GET my-index-000001/_analyze
- {
- "field": "my_field",
- "text":"Investors in [Apple](Apple+Inc.) rejoiced."
- }
- --------------------------
- // NOTCONSOLE
- Response:
- [source,js]
- --------------------------------------------------
- {
- "tokens": [
- {
- "token": "investors",
- "start_offset": 0,
- "end_offset": 9,
- "type": "<ALPHANUM>",
- "position": 0
- },
- {
- "token": "in",
- "start_offset": 10,
- "end_offset": 12,
- "type": "<ALPHANUM>",
- "position": 1
- },
- {
- "token": "Apple Inc.", <1>
- "start_offset": 13,
- "end_offset": 18,
- "type": "annotation",
- "position": 2
- },
- {
- "token": "apple",
- "start_offset": 13,
- "end_offset": 18,
- "type": "<ALPHANUM>",
- "position": 2
- },
- {
- "token": "rejoiced",
- "start_offset": 19,
- "end_offset": 27,
- "type": "<ALPHANUM>",
- "position": 3
- }
- ]
- }
- --------------------------------------------------
- // NOTCONSOLE
- <1> Note the whole annotation token `Apple Inc.` is placed, unchanged, as a single token in
- the token stream and at the same position (position 2) as the text token (`apple`) it annotates.
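To make the token layout concrete, the following Python sketch imitates how annotated text could be turned into a token list. It is a simplified model, not the plugin's implementation: the function name `analyze` is hypothetical, a plain `\w+` regular expression with lower-casing stands in for the real analyzer, and annotated spans are assumed to begin with a word character.

```python
import re
from urllib.parse import unquote_plus

# Matches `[text](value1&value2)`; a simplification of the real markup parser.
ANNOTATION = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


def analyze(markup):
    """Return token dicts loosely mirroring the shape of the `_analyze` response."""
    plain = ""
    annotations = []  # (start_offset, end_offset, decoded value)
    cursor = 0
    for match in ANNOTATION.finditer(markup):
        plain += markup[cursor:match.start()]
        start = len(plain)
        plain += match.group(1)  # keep the annotated text, drop the markup
        for value in match.group(2).split("&"):
            if value:
                annotations.append((start, len(plain), unquote_plus(value)))
        cursor = match.end()
    plain += markup[cursor:]

    tokens = []
    for position, word in enumerate(re.finditer(r"\w+", plain)):
        # Annotation values are emitted unchanged, at the position of the
        # first word of the text they annotate.
        for start, end, value in annotations:
            if start == word.start():
                tokens.append({"token": value, "start_offset": start,
                               "end_offset": end, "type": "annotation",
                               "position": position})
        tokens.append({"token": word.group().lower(),
                       "start_offset": word.start(), "end_offset": word.end(),
                       "type": "<ALPHANUM>", "position": position})
    return tokens


for token in analyze("Investors in [Apple](Apple+Inc.) rejoiced."):
    print(token["position"], token["type"], token["token"])
```

Offsets are computed against the text with the markup stripped out, which is why `Apple Inc.` and `apple` share the offsets 13 to 18.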
- We can now perform searches for annotations using regular `term` queries that don't tokenize
- the provided search values. Annotations are a more precise way of matching, as can be seen
- in this example where a search for `Beck` will not match `Jeff Beck`:
- [source,console]
- --------------------------
- # Example documents
- PUT my-index-000001/_doc/1
- {
- "my_field": "[Beck](Beck) announced a new tour"<1>
- }
- PUT my-index-000001/_doc/2
- {
- "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"<2>
- }
- # Example search
- GET my-index-000001/_search
- {
- "query": {
- "term": {
- "my_field": "Beck" <3>
- }
- }
- }
- --------------------------
- <1> As well as tokenising the plain text into single words e.g. `beck`, here we
- inject the single token value `Beck` at the same position as `beck` in the token stream.
- <2> Note annotations can inject multiple tokens at the same position - here we inject both
- the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
- broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
- <3> A benefit of searching with these carefully defined annotation tokens is that a query for
- `Beck` will not match document 2, which contains the tokens `jeff`, `beck` and `Jeff Beck`.
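The matching behaviour described in these callouts can be mimicked in a few lines of Python. This is only a model of exact-term lookup under simplifying assumptions: `indexed_terms` and `term_query` are hypothetical names, and a lower-casing `\w+` split stands in for the real analyzer.

```python
import re
from urllib.parse import unquote_plus

ANNOTATION = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


def indexed_terms(markup):
    """Return the set of terms a document would contribute: text tokens plus annotations."""
    terms = set()

    def collect(match):
        # Annotation values are stored unchanged (after URL decoding).
        terms.update(unquote_plus(v) for v in match.group(2).split("&") if v)
        return match.group(1)  # leave the annotated text in place

    plain = ANNOTATION.sub(collect, markup)
    terms.update(word.lower() for word in re.findall(r"\w+", plain))
    return terms


docs = {
    1: "[Beck](Beck) announced a new tour",
    2: "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat",
}


def term_query(term):
    """Exact, un-analyzed lookup, in the spirit of a `term` query."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if term in indexed_terms(text))


print(term_query("Beck"))       # [1]
print(term_query("beck"))       # [1, 2]
print(term_query("Guitarist"))  # [2]
```

The lower-case text token `beck` appears in both documents, while the annotation token `Beck` appears only in document 1, which is what makes the annotation search more precise.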
- WARNING: Any use of `=` signs in annotation values, e.g. `[Prince](person=Prince)`, will
- cause the document to be rejected with a parse failure. We hope to have a use for
- the equals sign in future, so documents that contain it are actively rejected today.
- [[annotated-text-synthetic-source]]
- ===== Synthetic `_source`
- IMPORTANT: Synthetic `_source` is Generally Available only for TSDB indices
- (indices that have `index.mode` set to `time_series`). For other indices
- synthetic `_source` is in technical preview. Features in technical preview may
- be changed or removed in a future release. Elastic will work to fix
- any issues, but features in technical preview are not subject to the support SLA
- of official GA features.
- `annotated_text` fields support {ref}/mapping-source-field.html#synthetic-source[synthetic `_source`] if they have
- a {ref}/keyword.html#keyword-synthetic-source[`keyword`] sub-field that supports synthetic
- `_source` or if the `annotated_text` field sets `store` to `true`. Either way, it may
- not have {ref}/copy-to.html[`copy_to`].
- If using a sub-`keyword` field then the values are sorted in the same way as
- a `keyword` field's values are sorted. By default, that means sorted with
- duplicates removed. So:
- [source,console,id=synthetic-source-text-example-default]
- ----
- PUT idx
- {
- "settings": {
- "index": {
- "mapping": {
- "source": {
- "mode": "synthetic"
- }
- }
- }
- },
- "mappings": {
- "properties": {
- "text": {
- "type": "annotated_text",
- "fields": {
- "raw": {
- "type": "keyword"
- }
- }
- }
- }
- }
- }
- PUT idx/_doc/1
- {
- "text": [
- "the quick brown fox",
- "the quick brown fox",
- "jumped over the lazy dog"
- ]
- }
- ----
- // TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
- Will become:
- [source,console-result]
- ----
- {
- "text": [
- "jumped over the lazy dog",
- "the quick brown fox"
- ]
- }
- ----
- // TEST[s/^/{"_source":/ s/\n$/}/]
- NOTE: Reordering text fields can have an effect on {ref}/query-dsl-match-query-phrase.html[phrase]
- and {ref}/span-queries.html[span] queries. See the discussion about {ref}/position-increment-gap.html[`position_increment_gap`] for more detail. You
- can avoid this by making sure the `slop` parameter on the phrase queries
- is lower than the `position_increment_gap`. This is the default.
- If the `annotated_text` field sets `store` to `true` then order and duplicates
- are preserved.
- [source,console,id=synthetic-source-text-example-stored]
- ----
- PUT idx
- {
- "settings": {
- "index": {
- "mapping": {
- "source": {
- "mode": "synthetic"
- }
- }
- }
- },
- "mappings": {
- "properties": {
- "text": { "type": "annotated_text", "store": true }
- }
- }
- }
- PUT idx/_doc/1
- {
- "text": [
- "the quick brown fox",
- "the quick brown fox",
- "jumped over the lazy dog"
- ]
- }
- ----
- // TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
- Will become:
- [source,console-result]
- ----
- {
- "text": [
- "the quick brown fox",
- "the quick brown fox",
- "jumped over the lazy dog"
- ]
- }
- ----
- // TEST[s/^/{"_source":/ s/\n$/}/]
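The difference between the two modes can be summarised with a small Python model. It is a sketch of the observable behaviour only, assuming a default `keyword` sub-field; the function name `synthetic_values` is hypothetical and nothing here reflects how Elasticsearch rebuilds `_source` internally.

```python
def synthetic_values(values, stored=False):
    """Model the values returned in a synthetic `_source` for a multi-valued field.

    stored=False: values come from the `keyword` sub-field, so they are
                  sorted and de-duplicated.
    stored=True:  values come from the stored field, so order and duplicates
                  are preserved.
    """
    if stored:
        return list(values)
    return sorted(set(values))


original = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]

print(synthetic_values(original))               # sorted, duplicates removed
print(synthetic_values(original, stored=True))  # identical to the input
```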
- [[mapper-annotated-text-tips]]
- ==== Data modelling tips
- ===== Use structured and unstructured fields
- Annotations are normally a way of weaving structured information into unstructured text for
- higher-precision search.
- `Entity resolution` is a form of document enrichment undertaken by specialist software or people
- where references to entities in a document are disambiguated by attaching a canonical ID.
- The ID is used to resolve any number of aliases or distinguish between people with the
- same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
- entity IDs woven into text.
- These IDs can be embedded as annotations in an `annotated_text` field, but it often makes
- sense to include them in dedicated structured fields to support discovery via aggregations:
- [source,console]
- --------------------------
- PUT my-index-000001
- {
- "mappings": {
- "properties": {
- "my_unstructured_text_field": {
- "type": "annotated_text"
- },
- "my_twitter_handles": {
- "type": "text",
- "fields": {
- "keyword" : {
- "type": "keyword"
- }
- }
- }
- }
- }
- }
- --------------------------
- Applications would then typically provide content and discover it as follows:
- [source,console]
- --------------------------
- # Example documents
- PUT my-index-000001/_doc/1
- {
- "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
- "my_twitter_handles": ["@kimchy"] <1>
- }
- GET my-index-000001/_search
- {
- "query": {
- "query_string": {
- "query": "elasticsearch OR logstash OR kibana",<2>
- "default_field": "my_unstructured_text_field"
- }
- },
- "aggregations": {
- "top_people" :{
- "significant_terms" : { <3>
- "field" : "my_twitter_handles.keyword"
- }
- }
- }
- }
- --------------------------
- <1> Note the `my_twitter_handles` field contains a list of the annotation values
- also used in the unstructured text. (Note that the `annotated_text` syntax requires the values
- to be URL-encoded, so `@kimchy` is written as `%40kimchy`.)
- By repeating the annotation values in a structured field this application has ensured that
- the tokens discovered in the structured field can be used for search and highlighting
- in the unstructured field.
- <2> In this example we search for documents that talk about components of the Elastic Stack.
- <3> We use the `my_twitter_handles` field here to discover people who are significantly
- associated with the Elastic Stack.
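One way an application might keep the two fields in sync is to derive the structured values from the annotated text itself. The Python sketch below illustrates the idea; `build_document` is a hypothetical helper, and the field names are taken from the example document above.

```python
import re
from urllib.parse import unquote_plus

# Captures the `(...)` part of each `[text](values)` annotation.
ANNOTATION = re.compile(r"\[[^\]]*\]\(([^)]*)\)")


def build_document(annotated_text):
    """Return a document body whose structured field repeats the annotation values."""
    handles = []
    for match in ANNOTATION.finditer(annotated_text):
        for value in match.group(1).split("&"):
            decoded = unquote_plus(value)  # `%40kimchy` -> `@kimchy`
            if decoded and decoded not in handles:
                handles.append(decoded)
    return {
        "my_unstructured_text_field": annotated_text,
        "my_twitter_handles": handles,
    }


print(build_document("[Shay](%40kimchy) created elasticsearch"))
```

Because the structured field holds the decoded value, a term picked from an aggregation on that field can be used directly in a `term` query against the annotated text field.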
- ===== Avoiding over-matching annotations
- By design, the regular text tokens and the annotation tokens co-exist in the same indexed
- field but in rare cases this can lead to some over-matching.
- The value of an annotation often denotes a _named entity_ (a person, place or company).
- The tokens for these named entities are inserted untokenized, and differ from typical text
- tokens because they are normally:
- * Mixed case e.g. `Madonna`
- * Multiple words e.g. `Jeff Beck`
- * Can have punctuation or numbers e.g. `Apple Inc.` or `@kimchy`
- This means, for the most part, a search for a named entity in the annotated text field will
- not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
- you can drill down to highlight uses in the text without "over matching" on any text tokens
- like the word `apple` in this context:
-     the apple was very juicy
- However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
- company `elastic`. In this case, a search on the annotated text field for the token `elastic`
- may match a text document such as this:
-     they fired an elastic band
- To avoid such false matches, users should consider prefixing annotation values to ensure
- they don't clash with text tokens, e.g.
-     [elastic](Company_elastic) released version 7.0 of the elastic stack today
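A prefixing convention like this is easy to apply when the markup is generated. The Python sketch below shows one possible approach; `prefixed_annotation` and the `Type_value` naming scheme are assumptions made for illustration, not something the plugin requires.

```python
import re
from urllib.parse import quote_plus


def prefixed_annotation(text, entity_type, value):
    """Annotate `text` with `<entity_type>_<value>` so it cannot collide with text tokens."""
    return "[%s](%s)" % (text, quote_plus("%s_%s" % (entity_type, value)))


sentence = (prefixed_annotation("elastic", "Company", "elastic")
            + " released version 7.0 of the elastic stack today")
print(sentence)

# Plain-text tokens of an unrelated document (simplified lower-case word split).
text_tokens = set(re.findall(r"\w+", "they fired an elastic band".lower()))
print("elastic" in text_tokens)          # True: the bare value would over-match
print("Company_elastic" in text_tokens)  # False: the prefixed value cannot
```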
- [[mapper-annotated-text-highlighter]]
- ==== Using the `annotated` highlighter
- The `annotated-text` plugin includes a custom highlighter designed to mark up search hits
- in a way which is respectful of the original markup:
- [source,console]
- --------------------------
- # Example documents
- PUT my-index-000001/_doc/1
- {
- "my_field": "The cat sat on the [mat](sku3578)"
- }
- GET my-index-000001/_search
- {
- "query": {
- "query_string": {
- "query": "cat"
- }
- },
- "highlight": {
- "fields": {
- "my_field": {
- "type": "annotated", <1>
- "require_field_match": false
- }
- }
- }
- }
- --------------------------
- <1> The `annotated` highlighter type is designed for use with `annotated_text` fields.
- The annotated highlighter is based on the `unified` highlighter and supports the same
- settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
- HTML-like markup such as `<em>cat</em>`, the annotated highlighter uses the same
- markdown-like syntax used for annotations and injects a `key=value` annotation where `_hit_term`
- is the key and the matched search term is the value, e.g.
-     The [cat](_hit_term=cat) sat on the [mat](sku3578)
- The annotated highlighter tries to be respectful of any existing markup in the original
- text:
- * If the search term matches exactly the location of an existing annotation then the
- `_hit_term` key is merged into the URL-like syntax used in the `(...)` part of the
- existing annotation.
- * However, if the search term overlaps the span of an existing annotation it would break
- the markup formatting, so the original annotation is removed in favour of a new annotation
- with just the search hit information in the results.
- * Any non-overlapping annotations in the original text are preserved in highlighter
- selections.
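The three rules above can be sketched in Python as follows. This is a toy model rather than the highlighter's actual code: `render_highlight` is a hypothetical function, search hits are assumed not to overlap each other, and the order in which `_hit_term` and the existing values are merged is an assumption, since the text above does not specify it.

```python
from urllib.parse import quote_plus


def render_highlight(plain, annotations, hits):
    """Re-apply markup to `plain`, merging search hits with existing annotations.

    annotations: list of (start, end, [encoded values]) from the original text
    hits:        list of (start, end, matched term), assumed non-overlapping
    """
    spans = {}       # (start, end) -> list of values for the `(...)` part
    dropped = set()  # annotation spans broken by a partially overlapping hit
    for hit_start, hit_end, term in hits:
        for ann_start, ann_end, _ in annotations:
            overlaps = ann_start < hit_end and hit_start < ann_end
            if overlaps and (ann_start, ann_end) != (hit_start, hit_end):
                dropped.add((ann_start, ann_end))
        spans.setdefault((hit_start, hit_end), []).append(
            "_hit_term=" + quote_plus(term))
    for ann_start, ann_end, values in annotations:
        if (ann_start, ann_end) not in dropped:
            # Exact-span matches merge here; untouched annotations are preserved.
            spans.setdefault((ann_start, ann_end), []).extend(values)

    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(plain[cursor:start])
        out.append("[%s](%s)" % (plain[start:end], "&".join(spans[(start, end)])))
        cursor = end
    out.append(plain[cursor:])
    return "".join(out)


plain = "The cat sat on the mat"
existing = [(19, 22, ["sku3578"])]  # `[mat](sku3578)`

print(render_highlight(plain, existing, [(4, 7, "cat")]))
print(render_highlight(plain, existing, [(19, 22, "mat")]))
```

The first call leaves the `sku3578` annotation untouched and adds a separate `_hit_term` annotation for `cat`. The second merges `_hit_term=mat` into the existing annotation because the spans coincide.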
- [[mapper-annotated-text-limitations]]
- ==== Limitations
- The `annotated_text` field type supports the same mapping settings as the `text` field type,
- but with the following exceptions:
- * No support for `fielddata` or `fielddata_frequency_filter`
- * No support for `index_prefixes` or `index_phrases` indexing