- [[ingest-attachment]]
- === Ingest Attachment Processor Plugin
- The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by
- using the Apache text extraction library https://tika.apache.org/[Tika].
- You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.
- The source field must contain base64-encoded binary data. To avoid the
- overhead of encoding to and decoding from base64, you can send the document
- in the CBOR format instead of JSON and specify the field as a byte array
- rather than a string. The processor then skips the base64 decoding step.
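As a minimal sketch of preparing the source field on the client side, the following Python snippet base64-encodes raw file bytes with the standard library and wraps them in a document body. The helper name `build_attachment_doc` and the default field name `data` are illustrative assumptions, not part of the plugin. The sample bytes are the small RTF document that the `data` value in this section's examples decodes to.

```python
import base64
import json


def build_attachment_doc(raw: bytes, field: str = "data") -> dict:
    """Return a JSON-serializable document whose `field` holds base64 text.

    Hypothetical helper for illustration: only the base64 requirement
    comes from the plugin documentation.
    """
    encoded = base64.b64encode(raw).decode("ascii")
    return {field: encoded}


# A tiny RTF file: the bytes that the sample `data` value decodes to.
RTF_BYTES = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"

doc = build_attachment_doc(RTF_BYTES)
body = json.dumps(doc)  # request body for an index request that uses the pipeline
```

The resulting `body` string can be sent with any HTTP client to an index request that names the pipeline in its `pipeline` query parameter.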
- :plugin_name: ingest-attachment
- include::install_remove.asciidoc[]
- [[using-ingest-attachment]]
- ==== Using the Attachment Processor in a Pipeline
- [[ingest-attachment-options]]
- .Attachment options
- [options="header"]
- |======
- | Name | Required | Default | Description
- | `field` | yes | - | The field that holds the base64-encoded binary
- | `target_field` | no | attachment | The field that will hold the attachment information
- | `indexed_chars` | no | 100000 | The maximum number of characters used for extraction, to prevent huge fields. Use `-1` for no limit.
- | `indexed_chars_field` | no | `null` | Name of a document field whose value overrides the number of characters used for extraction. See `indexed_chars`.
- | `properties` | no | all properties | Array of properties to store. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
- | `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
- |======
- For example, this:
- [source,console]
- --------------------------------------------------
- PUT _ingest/pipeline/attachment
- {
- "description" : "Extract attachment information",
- "processors" : [
- {
- "attachment" : {
- "field" : "data"
- }
- }
- ]
- }
- PUT my-index-000001/_doc/my_id?pipeline=attachment
- {
- "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
- }
- GET my-index-000001/_doc/my_id
- --------------------------------------------------
- Returns this:
- [source,console-result]
- --------------------------------------------------
- {
- "found": true,
- "_index": "my-index-000001",
- "_id": "my_id",
- "_version": 1,
- "_seq_no": 22,
- "_primary_term": 1,
- "_source": {
- "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
- "attachment": {
- "content_type": "application/rtf",
- "language": "ro",
- "content": "Lorem ipsum dolor sit amet",
- "content_length": 28
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term": 1/"_primary_term" : $body._primary_term/]
- To specify only some fields to be extracted:
- [source,console]
- --------------------------------------------------
- PUT _ingest/pipeline/attachment
- {
- "description" : "Extract attachment information",
- "processors" : [
- {
- "attachment" : {
- "field" : "data",
- "properties": [ "content", "title" ]
- }
- }
- ]
- }
- --------------------------------------------------
- NOTE: Extracting contents from binary data is a resource-intensive operation.
- It is highly recommended to run pipelines that use this processor on a
- dedicated ingest node.
- [[ingest-attachment-extracted-chars]]
- ==== Limit the number of extracted characters
- To prevent extracting too many characters and overloading the node's memory, the number of characters used for extraction
- is limited by default to `100000`. You can change this value by setting `indexed_chars`. Use `-1` for no limit, but
- make sure that your node has enough heap to extract the content of very large documents.
- You can also set this limit per document by reading it from a field of the document. If the document
- has that field, its value overrides the `indexed_chars` setting. To name this field, define the `indexed_chars_field`
- setting.
- For example:
- [source,console]
- --------------------------------------------------
- PUT _ingest/pipeline/attachment
- {
- "description" : "Extract attachment information",
- "processors" : [
- {
- "attachment" : {
- "field" : "data",
- "indexed_chars" : 11,
- "indexed_chars_field" : "max_size"
- }
- }
- ]
- }
- PUT my-index-000001/_doc/my_id?pipeline=attachment
- {
- "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
- }
- GET my-index-000001/_doc/my_id
- --------------------------------------------------
- Returns this:
- [source,console-result]
- --------------------------------------------------
- {
- "found": true,
- "_index": "my-index-000001",
- "_id": "my_id",
- "_version": 1,
- "_seq_no": 35,
- "_primary_term": 1,
- "_source": {
- "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
- "attachment": {
- "content_type": "application/rtf",
- "language": "sl",
- "content": "Lorem ipsum",
- "content_length": 11
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term": 1/"_primary_term" : $body._primary_term/]
- If the document provides a `max_size` value, that value is used instead:
- [source,console]
- --------------------------------------------------
- PUT _ingest/pipeline/attachment
- {
- "description" : "Extract attachment information",
- "processors" : [
- {
- "attachment" : {
- "field" : "data",
- "indexed_chars" : 11,
- "indexed_chars_field" : "max_size"
- }
- }
- ]
- }
- PUT my-index-000001/_doc/my_id_2?pipeline=attachment
- {
- "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
- "max_size": 5
- }
- GET my-index-000001/_doc/my_id_2
- --------------------------------------------------
- Returns this:
- [source,console-result]
- --------------------------------------------------
- {
- "found": true,
- "_index": "my-index-000001",
- "_id": "my_id_2",
- "_version": 1,
- "_seq_no": 40,
- "_primary_term": 1,
- "_source": {
- "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
- "max_size": 5,
- "attachment": {
- "content_type": "application/rtf",
- "language": "ro",
- "content": "Lorem",
- "content_length": 5
- }
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term": 1/"_primary_term" : $body._primary_term/]
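The two responses above are consistent with a simple precedence rule: a per-document value, when present, wins over `indexed_chars`, and `-1` disables the limit. The Python sketch below models that rule for illustration only. The names `resolve_limit` and `truncate` are hypothetical, and the plugin's actual implementation may differ, for example in how the limit is applied while the document is parsed.

```python
from typing import Optional

DEFAULT_INDEXED_CHARS = 100_000  # default taken from the options table


def resolve_limit(doc: dict, indexed_chars: int = DEFAULT_INDEXED_CHARS,
                  indexed_chars_field: Optional[str] = None) -> int:
    """Pick the character limit for one document (illustrative model)."""
    if indexed_chars_field is not None and indexed_chars_field in doc:
        # The per-document field overrides the pipeline-level setting.
        return int(doc[indexed_chars_field])
    return indexed_chars


def truncate(text: str, limit: int) -> str:
    """Apply the limit; -1 means no limit."""
    return text if limit == -1 else text[:limit]


text = "Lorem ipsum dolor sit amet"

# Pipeline settings from the examples: indexed_chars=11, indexed_chars_field="max_size".
without_field = truncate(text, resolve_limit({}, 11, "max_size"))
with_field = truncate(text, resolve_limit({"max_size": 5}, 11, "max_size"))
unlimited = truncate(text, resolve_limit({}, -1))
```

Under this model, the document without `max_size` keeps its first 11 characters and the document with `max_size` set to 5 keeps its first 5, matching the `content` values in the sample responses.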
- [[ingest-attachment-with-arrays]]
- ==== Using the Attachment Processor with arrays
- To use the attachment processor on an array of attachments, the
- {ref}/foreach-processor.html[foreach processor] is required. It
- runs the attachment processor on each individual element
- of the array.
- For example, given the following source:
- [source,js]
- --------------------------------------------------
- {
- "attachments" : [
- {
- "filename" : "ipsum.txt",
- "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
- },
- {
- "filename" : "test.txt",
- "data" : "VGhpcyBpcyBhIHRlc3QK"
- }
- ]
- }
- --------------------------------------------------
- // NOTCONSOLE
- In this case, we want to process the `data` field in each element
- of the `attachments` field and insert
- the extracted properties into the document, so the following `foreach`
- processor is used:
- [source,console]
- --------------------------------------------------
- PUT _ingest/pipeline/attachment
- {
- "description" : "Extract attachment information from arrays",
- "processors" : [
- {
- "foreach": {
- "field": "attachments",
- "processor": {
- "attachment": {
- "target_field": "_ingest._value.attachment",
- "field": "_ingest._value.data"
- }
- }
- }
- }
- ]
- }
- PUT my-index-000001/_doc/my_id?pipeline=attachment
- {
- "attachments" : [
- {
- "filename" : "ipsum.txt",
- "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
- },
- {
- "filename" : "test.txt",
- "data" : "VGhpcyBpcyBhIHRlc3QK"
- }
- ]
- }
- GET my-index-000001/_doc/my_id
- --------------------------------------------------
- Returns this:
- [source,console-result]
- --------------------------------------------------
- {
- "_index" : "my-index-000001",
- "_id" : "my_id",
- "_version" : 1,
- "_seq_no" : 50,
- "_primary_term" : 1,
- "found" : true,
- "_source" : {
- "attachments" : [
- {
- "filename" : "ipsum.txt",
- "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
- "attachment" : {
- "content_type" : "text/plain; charset=ISO-8859-1",
- "language" : "en",
- "content" : "this is\njust some text",
- "content_length" : 24
- }
- },
- {
- "filename" : "test.txt",
- "data" : "VGhpcyBpcyBhIHRlc3QK",
- "attachment" : {
- "content_type" : "text/plain; charset=ISO-8859-1",
- "language" : "en",
- "content" : "This is a test",
- "content_length" : 16
- }
- }
- ]
- }
- }
- --------------------------------------------------
- // TESTRESPONSE[s/"_seq_no" : \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]
- Note that the `target_field` needs to be set; otherwise the
- default value is used, which is the top-level field `attachment`. The
- properties on this top-level field will contain the value of the
- first attachment only. Setting `target_field` to a field under
- `_ingest._value` associates the extracted properties with the
- correct attachment.
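Once each array element carries its own `attachment` object, client code can pair every file name with its extracted text. The following Python sketch does this for the `_source` shown in the response above. The helper `extracted_by_filename` is hypothetical, and the input is assumed to be the already-parsed JSON response.

```python
def extracted_by_filename(source: dict) -> dict:
    """Map each attachment's filename to its extracted content.

    Hypothetical helper: assumes the pipeline wrote its results to
    `_ingest._value.attachment`, so every array element has an
    `attachment` object of its own.
    """
    return {
        item["filename"]: item["attachment"]["content"]
        for item in source.get("attachments", [])
        if "attachment" in item  # skip elements the processor did not enrich
    }


# `_source` trimmed from the sample response (the `data` fields are omitted).
source = {
    "attachments": [
        {"filename": "ipsum.txt",
         "attachment": {"content": "this is\njust some text", "content_length": 24}},
        {"filename": "test.txt",
         "attachment": {"content": "This is a test", "content_length": 16}},
    ]
}

contents = extracted_by_filename(source)
```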