[[ingest-attachment]]
=== Ingest Attachment Processor Plugin

The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by
using the Apache text extraction library http://lucene.apache.org/tika/[Tika].

You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.

The source field must be a base64 encoded binary. If you do not want to incur
the overhead of converting back and forth from base64, you can use the CBOR
format instead of JSON and specify the field as a bytes array instead of a string
representation. The processor will then skip the base64 decoding.
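In the pipeline example later on this page, the base64 string is passed inline. For real files you can produce the value on the command line first; this is a minimal sketch assuming a local file named `lorem.rtf` and GNU coreutils `base64` (the flag for disabling line wrapping varies between platforms):

[source,sh]
----------------------------------------------------------------
# -w 0 disables line wrapping so the output is a single base64 string,
# suitable for pasting into the document's source field
base64 -w 0 lorem.rtf
----------------------------------------------------------------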
[[ingest-attachment-install]]
[float]
==== Installation

This plugin can be installed using the plugin manager:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin install ingest-attachment
----------------------------------------------------------------

The plugin must be installed on every node in the cluster, and each node must
be restarted after installation.
[[ingest-attachment-remove]]
[float]
==== Removal

The plugin can be removed with the following command:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin remove ingest-attachment
----------------------------------------------------------------

The node must be stopped before removing the plugin.
[[using-ingest-attachment]]
==== Using the Attachment Processor in a Pipeline

[[ingest-attachment-options]]
.Attachment options
[options="header"]
|======
| Name              | Required | Default    | Description
| `field`           | yes      | -          | The field to get the base64 encoded data from
| `target_field`    | no       | attachment | The field that will hold the attachment information
| `indexed_chars`   | no       | 100000     | The number of characters used for extraction, to prevent huge fields. Use `-1` for no limit.
| `properties`      | no       | all        | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
|======
For example, this:

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/my_type/my_id
--------------------------------------------------
// CONSOLE
Returns this:

[source,js]
--------------------------------------------------
{
  "found": true,
  "_index": "my_index",
  "_type": "my_type",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}
--------------------------------------------------
// TESTRESPONSE
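
If you only need some of the extracted information, you can limit both the stored properties and the amount of text that is extracted. The following is a minimal sketch combining the `properties` and `indexed_chars` options from the table above; the pipeline id `attachment_subset` is just an illustrative name:

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/attachment_subset
{
  "description" : "Extract only the body text and content type",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties" : [ "content", "content_type" ],
        "indexed_chars" : 50000
      }
    }
  ]
}
--------------------------------------------------
// CONSOLE

With this pipeline, the resulting `attachment` object contains only the `content` and `content_type` keys, and extraction stops after 50000 characters.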
NOTE: Extracting contents from binary data is a resource-intensive operation.
It is highly recommended to run pipelines that use this processor on a
dedicated ingest node.
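
A node can be dedicated to ingest work by disabling its other roles. As a minimal sketch, an `elasticsearch.yml` along these lines (the exact role settings depend on your Elasticsearch version) keeps the node out of master election and data storage while leaving the ingest role enabled:

[source,yaml]
--------------------------------------------------
# Dedicated ingest node: no master or data duties
node.master: false
node.data: false
node.ingest: true
--------------------------------------------------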