123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420 |
- [[mapper-attachments]]
- === Mapper Attachments Plugin
- deprecated[5.0.0,The `mapper-attachments` plugin has been replaced by the <<ingest-attachment, `ingest-attachment`>> plugin]
- The mapper attachments plugin lets Elasticsearch index file attachments in common formats (such as PPT, XLS, PDF)
- using the Apache text extraction library http://lucene.apache.org/tika/[Tika].
- In practice, the plugin adds the `attachment` type when mapping properties so that documents can be populated with
- file attachment contents (encoded as `base64`).
- [[mapper-attachments-install]]
- [float]
- ==== Installation
- This plugin can be installed using the plugin manager:
- [source,sh]
- ----------------------------------------------------------------
- sudo bin/elasticsearch-plugin install mapper-attachments
- ----------------------------------------------------------------
- The plugin must be installed on every node in the cluster, and each node must
- be restarted after installation.
- [[mapper-attachments-remove]]
- [float]
- ==== Removal
- The plugin can be removed with the following command:
- [source,sh]
- ----------------------------------------------------------------
- sudo bin/elasticsearch-plugin remove mapper-attachments
- ----------------------------------------------------------------
- The node must be stopped before removing the plugin.
- [[mapper-attachments-helloworld]]
- ==== Hello, world
- Create a property mapping using the new type `attachment`:
- [source,js]
- --------------------------
- PUT /trying-out-mapper-attachments
- {
- "mappings": {
- "person": {
- "properties": {
- "cv": { "type": "attachment" }
- }}}}
- --------------------------
- // CONSOLE
- Index a new document populated with a `base64`-encoded attachment:
- [source,js]
- --------------------------
- POST /trying-out-mapper-attachments/person/1?refresh
- {
- "cv": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
- }
- --------------------------
- // CONSOLE
- // TEST[continued]
- Search for the document using words in the attachment:
- [source,js]
- --------------------------
- POST /trying-out-mapper-attachments/person/_search
- {
- "query": {
- "query_string": {
- "query": "ipsum"
- }}}
- --------------------------
- // CONSOLE
- // TEST[continued]
- If you get a hit for your indexed document, the plugin should be installed and working. It'll look like:
- [source,js]
- --------------------------
- {
- "timed_out": false,
- "took": 53,
- "hits": {
- "total": 1,
- "max_score": 0.25811607,
- "hits": [
- {
- "_score": 0.25811607,
- "_index": "trying-out-mapper-attachments",
- "_type": "person",
- "_id": "1",
- "_source": {
- "cv": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
- }
- }
- ]
- },
- "_shards": ...
- }
- --------------------------
- // TESTRESPONSE[s/"took": 53/"took": "$body.took"/]
- // TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards"/]
- [[mapper-attachments-usage]]
- ==== Usage
- Using the attachment type is simple, in your mapping JSON, simply set a certain JSON element as attachment, for example:
- [source,js]
- --------------------------
- PUT /test
- {
- "mappings": {
- "person" : {
- "properties" : {
- "my_attachment" : { "type" : "attachment" }
- }
- }
- }
- }
- --------------------------
- // CONSOLE
- In this case, the JSON to index can be:
- [source,js]
- --------------------------
- PUT /test/person/1
- {
- "my_attachment" : "... base64 encoded attachment ..."
- }
- --------------------------
- // CONSOLE
- Or it is possible to use more elaborated JSON if content type, resource name or language need to be set explicitly:
- [source,js]
- --------------------------
- PUT /test/person/1
- {
- "my_attachment" : {
- "_content_type" : "application/pdf",
- "_name" : "resource/name/of/my.pdf",
- "_language" : "en",
- "_content" : "... base64 encoded attachment ..."
- }
- }
- --------------------------
- // CONSOLE
- The `attachment` type not only indexes the content of the doc in `content` sub field, but also automatically adds meta
- data on the attachment as well (when available).
- The metadata supported are:
- * `date`
- * `title`
- * `name` only available if you set `_name` see above
- * `author`
- * `keywords`
- * `content_type`
- * `content_length` is the original content_length before text extraction (aka file size)
- * `language`
- They can be queried using the "dot notation", for example: `my_attachment.author`.
- Both the meta data and the actual content are simple core type mappers (text, date, …), thus, they can be controlled
- in the mappings. For example:
- [source,js]
- --------------------------
- PUT /test
- {
- "settings": {
- "index": {
- "analysis": {
- "analyzer": {
- "my_analyzer": {
- "type": "custom",
- "tokenizer": "standard",
- "filter": ["standard"]
- }
- }
- }
- }
- },
- "mappings": {
- "person" : {
- "properties" : {
- "file" : {
- "type" : "attachment",
- "fields" : {
- "content" : {"index" : true},
- "title" : {"store" : true},
- "date" : {"store" : true},
- "author" : {"analyzer" : "my_analyzer"},
- "keywords" : {"store" : true},
- "content_type" : {"store" : true},
- "content_length" : {"store" : true},
- "language" : {"store" : true}
- }
- }
- }
- }
- }
- }
- --------------------------
- // CONSOLE
- In the above example, the actual content indexed is mapped under `fields` name `content`, and we decide not to index it, so
- it will only be available in the `_all` field. The other fields map to their respective metadata names, but there is no
- need to specify the `type` (like `text` or `date`) since it is already known.
- ==== Querying or accessing metadata
- If you need to query on metadata fields, use the attachment field name dot the metadata field. For example:
- [source,js]
- --------------------------
- PUT /test
- PUT /test/person/_mapping
- {
- "person": {
- "properties": {
- "file": {
- "type": "attachment",
- "fields": {
- "content_type": {
- "type": "text",
- "store": true
- }
- }
- }
- }
- }
- }
- PUT /test/person/1?refresh=true
- {
- "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
- }
- GET /test/person/_search
- {
- "stored_fields": [ "file.content_type" ],
- "query": {
- "match": {
- "file.content_type": "text plain"
- }
- }
- }
- --------------------------
- // CONSOLE
- Will give you:
- [source,js]
- --------------------------
- {
- "took": 2,
- "timed_out": false,
- "_shards": {
- "total": 5,
- "successful": 5,
- "failed": 0
- },
- "hits": {
- "total": 1,
- "max_score": 0.53484553,
- "hits": [
- {
- "_index": "test",
- "_type": "person",
- "_id": "1",
- "_score": 0.53484553,
- "fields": {
- "file.content_type": [
- "text/plain; charset=ISO-8859-1"
- ]
- }
- }
- ]
- }
- }
- --------------------------
- // TESTRESPONSE[s/"took": 2,/"took": $body.took,/]
- [[mapper-attachments-indexed-characters]]
- ==== Indexed Characters
- By default, `100000` characters are extracted when indexing the content. This default value can be changed by setting
- the `index.mapping.attachment.indexed_chars` setting. It can also be provided on a per document indexed using the
- `_indexed_chars` parameter. `-1` can be set to extract all text, but note that all the text needs to be allowed to be
- represented in memory:
- [source,js]
- --------------------------
- PUT /test/person/1
- {
- "my_attachment" : {
- "_indexed_chars" : -1,
- "_content" : "... base64 encoded attachment ..."
- }
- }
- --------------------------
- // CONSOLE
- [[mapper-attachments-error-handling]]
- ==== Metadata parsing error handling
- While extracting metadata content, errors could happen for example when parsing dates.
- Parsing errors are ignored so your document is indexed.
- You can disable this feature by setting the `index.mapping.attachment.ignore_errors` setting to `false`.
- [[mapper-attachments-language-detection]]
- ==== Language Detection
- By default, language detection is disabled (`false`) as it could come with a cost.
- This default value can be changed by setting the `index.mapping.attachment.detect_language` setting.
- It can also be provided on a per document indexed using the `_detect_language` parameter.
- Note that you can force language using `_language` field when sending your actual document:
- [source,js]
- --------------------------
- PUT /test/person/1
- {
- "my_attachment" : {
- "_language" : "en",
- "_content" : "... base64 encoded attachment ..."
- }
- }
- --------------------------
- // CONSOLE
- [[mapper-attachments-highlighting]]
- ==== Highlighting attachments
- If you want to highlight your attachment content, you will need to set `"store": true` and
- `"term_vector":"with_positions_offsets"` for your attachment field. Here is a full script which does it:
- [source,js]
- --------------------------
- PUT /test
- PUT /test/person/_mapping
- {
- "person": {
- "properties": {
- "file": {
- "type": "attachment",
- "fields": {
- "content": {
- "type": "text",
- "term_vector":"with_positions_offsets",
- "store": true
- }
- }
- }
- }
- }
- }
- PUT /test/person/1?refresh=true
- {
- "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
- }
- GET /test/person/_search
- {
- "stored_fields": [],
- "query": {
- "match": {
- "file.content": "king queen"
- }
- },
- "highlight": {
- "fields": {
- "file.content": {
- }
- }
- }
- }
- --------------------------
- // CONSOLE
- It gives back:
- [source,js]
- --------------------------
- {
- "took": 9,
- "timed_out": false,
- "_shards": {
- "total": 5,
- "successful": 5,
- "failed": 0
- },
- "hits": {
- "total": 1,
- "max_score": 0.5446649,
- "hits": [
- {
- "_index": "test",
- "_type": "person",
- "_id": "1",
- "_score": 0.5446649,
- "highlight": {
- "file.content": [
- "\"God Save the <em>Queen</em>\" (alternatively \"God Save the <em>King</em>\"\n"
- ]
- }
- }
- ]
- }
- }
- --------------------------
- // TESTRESPONSE[s/"took": 9,/"took": $body.took,/]
|