- [[search-percolate]]
- == Percolator
- Traditionally you design documents based on your data, store them in an index, and then define queries via the search API
- in order to retrieve these documents. The percolator works in the opposite direction. First you store queries in an
- index and then, via the percolate API, you define documents in order to retrieve these queries.
- Queries can be stored because in Elasticsearch both documents and queries are defined in
- JSON. This allows you to embed queries into documents via the index API. Elasticsearch can extract the query from a
- document and make it available to the percolate API. Since documents are also defined as JSON, you can define a document
- in a request to the percolate API.
- The percolator and most of its features work in realtime, so once a percolate query is indexed it can immediately be used
- in the percolate API.
- [IMPORTANT]
- =====================================
- Fields referred to in a percolator query must *already* exist in the mapping
- associated with the index used for percolation.
- There are two ways to make sure that a field mapping exists:
- * Add or update a mapping via the <<indices-create-index,create index>> or
- <<indices-put-mapping,put mapping>> APIs.
- * Percolate a document before registering a query. Percolating a document can
- add field mappings dynamically, in the same way as happens when indexing a
- document.
- =====================================
- [float]
- === Sample Usage
- Create an index with a mapping for the field `message`:
- [source,js]
- --------------------------------------------------
- curl -XPUT 'localhost:9200/my-index' -d '{
- "mappings": {
- "my-type": {
- "properties": {
- "message": {
- "type": "string"
- }
- }
- }
- }
- }'
- --------------------------------------------------
- Register a query in the percolator:
- [source,js]
- --------------------------------------------------
- curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
- "query" : {
- "match" : {
- "message" : "bonsai tree"
- }
- }
- }'
- --------------------------------------------------
- Match a document to the registered percolator queries:
- [source,js]
- --------------------------------------------------
- curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
- "doc" : {
- "message" : "A new bonsai tree in the office"
- }
- }'
- --------------------------------------------------
- The above request will yield the following response:
- [source,js]
- --------------------------------------------------
- {
- "took" : 19,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "failed" : 0
- },
- "total" : 1,
- "matches" : [ <1>
- {
- "_index" : "my-index",
- "_id" : "1"
- }
- ]
- }
- --------------------------------------------------
- <1> The percolate query with id `1` matches our document.
- [float]
- === Indexing Percolator Queries
- Percolate queries are stored as documents in a specific format and in an arbitrary index under a reserved type with the
- name `.percolator`. The query itself is placed as is in a JSON object under the top level field `query`.
- [source,js]
- --------------------------------------------------
- {
- "query" : {
- "match" : {
- "field" : "value"
- }
- }
- }
- --------------------------------------------------
- Since this is just an ordinary document, any field can be added to this document. This can be useful later on to
- percolate documents against only specific queries.
- [source,js]
- --------------------------------------------------
- {
- "query" : {
- "match" : {
- "field" : "value"
- }
- },
- "priority" : "high"
- }
- --------------------------------------------------
- In addition, a mapping type can be associated with this query. This lets you control how queries and filters that
- rely on mapping settings, such as range queries and shape filters, get constructed. This is
- important because percolate queries are indexed into the `.percolator` type, where queries and filters that rely on
- mapping settings could otherwise yield unexpected behavior. Note: By default, field names are resolved in a smart manner,
- but in certain cases with multiple types this can lead to unexpected behavior, so being explicit about it helps.
- [source,js]
- --------------------------------------------------
- {
- "query" : {
- "range" : {
- "created_at" : {
- "gte" : "2010-01-01T00:00:00",
- "lte" : "2011-01-01T00:00:00"
- }
- }
- },
- "type" : "tweet",
- "priority" : "high"
- }
- --------------------------------------------------
- In the above example the range query gets parsed into a Lucene numeric range query, based on the settings for
- the field `created_at` in the type `tweet`.
- Just as with any other type, the `.percolator` type has a mapping, which you can configure via the mappings APIs.
- The default percolate mapping doesn't index the query field, only stores it. By default the following mapping is active:
- [source,js]
- --------------------------------------------------
- {
- ".percolator" : {
- "properties" : {
- "query" : {
- "type" : "object",
- "enabled" : false
- }
- }
- }
- }
- --------------------------------------------------
- If needed, this mapping can be modified with the update mapping API.
- To unregister a percolate query, use the delete API. For example, to delete the previously added query,
- execute the following delete request:
- [source,js]
- --------------------------------------------------
- curl -XDELETE localhost:9200/my-index/.percolator/1
- --------------------------------------------------
- [float]
- === Percolate API
- The percolate API executes in a distributed manner, meaning it executes on all shards an index points to.
- .Required options
- * `index` - The index that contains the `.percolator` type. This can also be an alias.
- * `type` - The type of the document to be percolated. The mapping of that type is used to parse the document.
- * `doc` - The actual document to percolate. Unlike the other two options this needs to be specified in the request body. Note: This isn't required when percolating an existing document.
- [source,js]
- --------------------------------------------------
- curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
- "doc" : {
- "created_at" : "2010-10-10T00:00:00",
- "message" : "some text"
- }
- }'
- --------------------------------------------------
- .Additional supported query string options
- * `routing` - If the percolate queries are partitioned by a custom routing value, the routing option makes sure
- that the percolate request only gets executed on the shard the routing value maps to. This means that
- the percolate request only gets executed on one shard instead of all shards. Multiple values can be specified as a
- comma-separated string, in which case the request can be executed on more than one shard.
- * `preference` - Controls which shard replicas are preferred to execute the request on. Works the same as in the search API.
- * `ignore_unavailable` - Controls whether missing concrete indices should be silently ignored. Works the same as in the search API.
- * `percolate_format` - If `ids` is specified then the matches array in the percolate response will contain a string
- array of the matching ids instead of an array of objects. This can be useful to reduce the amount of data being sent
- back to the client. Note that if two percolator queries from different indices have the same id, there is no way
- to find out which percolator query belongs to which index. Any other value for `percolate_format` will be ignored.
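- The trade-off of the `ids` format can be illustrated with a small sketch. The helper `matches_as_ids` below is hypothetical and only models the response shape described above; it is not part of any Elasticsearch client.

```python
def matches_as_ids(matches):
    """Reduce full match objects to bare ids, modelling the `ids` format."""
    return [match["_id"] for match in matches]


# Two percolator queries with the same id, registered in different indices.
# The index names here are made up for illustration.
full_matches = [
    {"_index": "index-a", "_id": "1"},
    {"_index": "index-b", "_id": "1"},
]

# Once reduced to ids, the two matches can no longer be told apart.
print(matches_as_ids(full_matches))  # ['1', '1']
```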
- .Additional request body options
- * `filter` - Reduces the number of queries to execute during percolating. Only the percolator queries that match the
- filter will be included in the percolate execution. The filter option works in near realtime, so a refresh needs to have
- occurred for the filter to include the latest percolate queries.
- * `query` - Same as the `filter` option, but the score is also computed. The computed scores can then be used by the
- `track_scores` and `sort` options.
- * `size` - Defines the maximum number of matches (percolate queries) to be returned. Defaults to unlimited.
- * `track_scores` - Whether the `_score` is included for each match. The `_score` is based on the query and represents
- how the query matched the *percolate query's metadata*, *not* how the document (that is being percolated) matched
- the query. The `query` option is required for this option. Defaults to `false`.
- * `sort` - Defines a sort specification like in the search API. Currently only sorting by `_score` in reverse order (default relevancy)
- is supported. Other sort fields will throw an exception. The `size` and `query` options are required for this setting. Like
- `track_scores`, the score is based on the query and represents how the query matched the percolate query's metadata
- and *not* how the document being percolated matched the query.
- * `aggs` - Allows aggregation definitions to be included. The aggregations are based on the matching percolator queries.
- See the aggregation documentation on how to define aggregations.
- * `highlight` - Allows highlight definitions to be included. The document being percolated is highlighted for each
- matching query. This allows you to see how each match highlights the document being percolated. See the highlight
- documentation on how to define highlights. The `size` option is required for highlighting. The performance of highlighting
- in the percolate API depends on how many matches are being highlighted.
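- The dependencies between the request body options listed above can be checked on the client before a request is sent. The following is a minimal sketch, assuming a hypothetical helper `build_percolate_body`; the `"sort": "_score"` spelling is assumed from the search API sort syntax.

```python
def build_percolate_body(doc, query=None, filter_=None, size=None,
                         track_scores=False, sort_by_score=False,
                         highlight=None):
    """Assemble a percolate request body and enforce the option dependencies."""
    if track_scores and query is None:
        raise ValueError("track_scores requires the query option")
    if sort_by_score and (query is None or size is None):
        raise ValueError("sort requires the size and query options")
    if highlight is not None and size is None:
        raise ValueError("highlight requires the size option")

    body = {"doc": doc}
    if query is not None:
        body["query"] = query
    if filter_ is not None:
        body["filter"] = filter_
    if size is not None:
        body["size"] = size
    if track_scores:
        body["track_scores"] = True
    if sort_by_score:
        # Assumption: written like a search API sort on the score.
        body["sort"] = "_score"
    if highlight is not None:
        body["highlight"] = highlight
    return body


# Only run the percolator queries whose metadata marks them as high priority.
filtered = build_percolate_body(
    {"field": "value"},
    filter_={"term": {"priority": "high"}},
)
```

- Serializing `filtered` with `json.dumps` gives a body that can be passed to the percolate API with `-d`.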
- [float]
- === Dedicated Percolator Index
- Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in,
- these queries can also be added to a dedicated index. The advantage of this is that this dedicated percolator index
- can have its own index settings (for example, the number of primary and replica shards). If you choose to have a dedicated
- percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index.
- Otherwise percolate queries can be parsed incorrectly.
- [float]
- === Filtering Executed Queries
- Filtering reduces the number of queries that are executed. Any filter that the search API supports (except the ones mentioned in the important notes)
- can also be used in the percolate API. The filter only works on the metadata fields, because the `query` field isn't indexed by
- default. Based on the query we indexed before, the following filter can be defined:
- [source,js]
- --------------------------------------------------
- curl -XGET localhost:9200/test/type1/_percolate -d '{
- "doc" : {
- "field" : "value"
- },
- "filter" : {
- "term" : {
- "priority" : "high"
- }
- }
- }'
- --------------------------------------------------
- [float]
- === Percolator Count API
- The count percolate API only keeps track of the number of matches and doesn't keep track of the actual matches.
- Example:
- [source,js]
- --------------------------------------------------
- curl -XGET 'localhost:9200/my-index/my-type/_percolate/count' -d '{
- "doc" : {
- "message" : "some message"
- }
- }'
- --------------------------------------------------
- Response:
- [source,js]
- --------------------------------------------------
- {
- ... // header
- "total" : 3
- }
- --------------------------------------------------
- [float]
- === Percolating an Existing Document
- To percolate a newly indexed document, the percolate existing document API can be used. Based on the response
- from an index request, the `_id` and other meta information can be used to immediately percolate the newly added
- document.
- .Supported options for percolating an existing document on top of existing percolator options:
- * `id` - The id of the document to retrieve the source for.
- * `percolate_index` - The index containing the percolate queries. Defaults to the `index` defined in the url.
- * `percolate_type` - The percolate type (used for parsing the document). Defaults to the `type` defined in the url.
- * `routing` - The routing value to use when retrieving the document to percolate.
- * `preference` - Which shard to prefer when retrieving the existing document.
- * `percolate_routing` - The routing value to use when percolating the existing document.
- * `percolate_preference` - Which shard to prefer when executing the percolate request.
- * `version` - Enables a version check. If the fetched document's version isn't equal to the specified version then the request fails with a version conflict and the percolation request is aborted.
- Internally the percolate API will issue a GET request for fetching the `_source` of the document to percolate.
- For this feature to work, the `_source` for documents to be percolated needs to be stored.
- [float]
- ==== Example
- Index response:
- [source,js]
- --------------------------------------------------
- {
- "_index" : "my-index",
- "_type" : "message",
- "_id" : "1",
- "_version" : 1,
- "created" : true
- }
- --------------------------------------------------
- Percolating an Existing Document:
- [source,js]
- --------------------------------------------------
- curl -XGET 'localhost:9200/my-index/message/1/_percolate'
- --------------------------------------------------
- The response is the same as with the regular percolate API.
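- The step from an index response to a percolate request can be sketched as follows. The helper `percolate_existing_path` is hypothetical; it only assembles the request path from the `_index`, `_type` and `_id` fields of the index response and appends any of the options listed above as query string parameters.

```python
from urllib.parse import quote, urlencode


def percolate_existing_path(index_response, **options):
    """Build the percolate-existing-document path from an index response."""
    segments = [
        index_response["_index"],
        index_response["_type"],
        index_response["_id"],
    ]
    path = "/" + "/".join(quote(str(s), safe="") for s in segments) + "/_percolate"
    if options:
        # Sorted so the resulting path is deterministic.
        path += "?" + urlencode(sorted(options.items()))
    return path


index_response = {
    "_index": "my-index",
    "_type": "message",
    "_id": "1",
    "_version": 1,
    "created": True,
}

# Pass the returned version along to enable the version check.
path = percolate_existing_path(index_response, version=index_response["_version"])
print(path)  # /my-index/message/1/_percolate?version=1
```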
- [float]
- === Multi Percolate API
- The multi percolate API allows you to bundle multiple percolate requests into a single request, similar to what the multi
- search API does for search requests. The request body format is line based. Each percolate request item takes two lines:
- the first line is the header and the second line is the body.
- The header can contain any parameter that normally would be set via the request path or query string parameters.
- There are several percolate actions, because there are multiple types of percolate requests.
- .Supported actions:
- * `percolate` - Action for defining a regular percolate request.
- * `count` - Action for defining a count percolate request.
- Depending on the percolate action different parameters can be specified. For example the percolate and percolate existing
- document actions support different parameters.
- .The following endpoints are supported
- * `GET|POST /[index]/[type]/_mpercolate`
- * `GET|POST /[index]/_mpercolate`
- * `GET|POST /_mpercolate`
- The `index` and `type` defined in the url path are the default index and type.
- [float]
- ==== Example
- Request:
- [source,js]
- --------------------------------------------------
- curl -XGET 'localhost:9200/twitter/tweet/_mpercolate' --data-binary @requests.txt; echo
- --------------------------------------------------
- The index `twitter` is the default index, and the type `tweet` is the default type and will be used in the case a header
- doesn't specify an index or type.
- requests.txt:
- [source,js]
- --------------------------------------------------
- {"percolate" : {"index" : "twitter", "type" : "tweet"}}
- {"doc" : {"message" : "some text"}}
- {"percolate" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
- {}
- {"percolate" : {"index" : "users", "type" : "user", "id" : "3", "percolate_index" : "users_2012" }}
- {"size" : 10}
- {"count" : {"index" : "twitter", "type" : "tweet"}}
- {"doc" : {"message" : "some other text"}}
- {"count" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
- {}
- --------------------------------------------------
- For a percolate existing document item (headers with the `id` field), the body can be an empty JSON object.
- All the required options are set in the header.
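- A file such as `requests.txt` can also be generated programmatically. The sketch below assumes a hypothetical helper `build_mpercolate_body` that writes one header line and one body line per item. The trailing newline is an assumption, added because line-based request formats commonly expect one.

```python
import json

ACTIONS = ("percolate", "count")


def build_mpercolate_body(items):
    """Serialize (action, header, body) tuples into the line-based format."""
    lines = []
    for action, header, body in items:
        if action not in ACTIONS:
            raise ValueError("unsupported action: %r" % (action,))
        lines.append(json.dumps({action: header}))  # first line: header
        lines.append(json.dumps(body))              # second line: body
    return "\n".join(lines) + "\n"


payload = build_mpercolate_body([
    ("percolate", {"index": "twitter", "type": "tweet"},
     {"doc": {"message": "some text"}}),
    # Existing document item: everything is in the header, the body is empty.
    ("percolate", {"index": "twitter", "type": "tweet", "id": "1"}, {}),
    ("count", {"index": "twitter", "type": "tweet"},
     {"doc": {"message": "some other text"}}),
])
```

- Writing `payload` to a file and sending it with `--data-binary` keeps the line breaks intact.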
- Response:
- [source,js]
- --------------------------------------------------
- {
- "responses" : [
- {
- "took" : 24,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "failed" : 0
- },
- "total" : 3,
- "matches" : [
- {
- "_index": "twitter",
- "_id": "1"
- },
- {
- "_index": "twitter",
- "_id": "2"
- },
- {
- "_index": "twitter",
- "_id": "3"
- }
- ]
- },
- {
- "took" : 12,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "failed" : 0
- },
- "total" : 3,
- "matches" : [
- {
- "_index": "twitter",
- "_id": "4"
- },
- {
- "_index": "twitter",
- "_id": "5"
- },
- {
- "_index": "twitter",
- "_id": "6"
- }
- ]
- },
- {
- "error" : "DocumentMissingException[[_na][_na] [user][3]: document missing]"
- },
- {
- "took" : 12,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "failed" : 0
- },
- "total" : 3
- },
- {
- "took" : 14,
- "_shards" : {
- "total" : 5,
- "successful" : 5,
- "failed" : 0
- },
- "total" : 3
- }
- ]
- }
- --------------------------------------------------
- Each item represents a percolate response. The order of the items maps to the order in which the percolate requests
- were specified. If a percolate request fails, the item response is replaced with an error message.
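- Because failed items are reported inline, a client has to inspect each item individually. A minimal sketch, assuming a hypothetical helper `summarize_mpercolate` and a response that has already been parsed from JSON:

```python
def summarize_mpercolate(response):
    """Return one tuple per item, in request order."""
    summary = []
    for item in response["responses"]:
        if "error" in item:
            summary.append(("error", item["error"]))
        else:
            # Count items carry a total but no matches array.
            ids = [match["_id"] for match in item.get("matches", [])]
            summary.append(("ok", item["total"], ids))
    return summary


# Shortened sample in the shape of the response shown above.
response = {
    "responses": [
        {"took": 24, "total": 1, "matches": [{"_index": "twitter", "_id": "1"}]},
        {"error": "DocumentMissingException[[_na][_na] [user][3]: document missing]"},
        {"took": 12, "total": 3},
    ]
}

summary = summarize_mpercolate(response)
```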
- [float]
- === How it Works Under the Hood
- When a document that contains a query is indexed under the `.percolator` type, the query part of the document gets
- parsed into a Lucene query and is kept in memory until that percolator document is removed or the index containing the
- `.percolator` type gets removed. So, all the active percolator queries are kept in memory.
- At percolate time, the document specified in the request gets parsed into a Lucene document and is stored in an in-memory
- Lucene index. This in-memory index holds just this one document and is optimized for that. Then all the queries
- that are registered to the index targeted by the percolate request are executed on this single-document
- in-memory index. This happens on each shard the percolate request needs to execute on.
- By using the `routing`, `filter` or `query` features, the number of queries that need to be executed can be reduced and thus
- the time the percolate API needs to run can be decreased.
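- The flow described above can be modelled in a few lines. This is a deliberately simplified toy, not how Elasticsearch or Lucene implement it: the class `ToyPercolator` is hypothetical, queries are reduced to "any of these terms occurs in the field", and tokenization is a plain whitespace split.

```python
class ToyPercolator:
    """Toy model: registered queries live in memory and run against one document."""

    def __init__(self):
        self._queries = {}  # query id -> (field, terms, metadata)

    def register(self, query_id, field, text, **metadata):
        # The query is parsed once at registration time and kept in memory.
        self._queries[query_id] = (field, set(text.lower().split()), metadata)

    def unregister(self, query_id):
        self._queries.pop(query_id, None)

    def percolate(self, doc, metadata_filter=None):
        matches = []
        for query_id, (field, terms, metadata) in self._queries.items():
            if metadata_filter and any(
                metadata.get(key) != value
                for key, value in metadata_filter.items()
            ):
                continue  # filtered out, so this query is never executed
            tokens = set(str(doc.get(field, "")).lower().split())
            if terms & tokens:
                matches.append(query_id)
        return sorted(matches)


percolator = ToyPercolator()
percolator.register("1", "message", "bonsai tree", priority="high")
percolator.register("2", "message", "office chair", priority="low")

doc = {"message": "A new bonsai tree in the office"}
print(percolator.percolate(doc))                        # ['1', '2']
print(percolator.percolate(doc, {"priority": "high"}))  # ['1']
```

- The second call shows why filtering on metadata shortens the run: the query with `priority` set to `low` is skipped before it is evaluated against the document.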
- [float]
- === Important Notes
- Because the percolator API processes one document at a time, it doesn't support queries and filters that run
- against child documents such as `has_child` and `has_parent`.
- The `inner_hits` feature on the `nested` query isn't supported in the percolate API.
- The `wildcard` and `regexp` queries natively use a lot of memory, and because the percolator keeps the queries in memory
- this can easily take up the available memory in the heap space. If possible, try to use a `prefix` query or ngramming to
- achieve the same result (with far less memory being used).
- The `delete-by-query` plugin doesn't work to unregister a query; it only deletes the percolate documents from disk. In order
- to update the registered queries in memory, the index needs to be closed and reopened.
- [float]
- === Forcing Unmapped Fields to be Handled as Strings
- In certain cases it is unknown what kind of percolator queries will be registered, and if no field mapping exists for fields
- that are referred to by percolator queries then adding a percolator query fails. This means the mapping needs to be updated
- to have the field with the appropriate settings, and then the percolator query can be added. But sometimes it is sufficient
- if all unmapped fields are handled as if they were default string fields. In those cases one can set the
- `index.percolator.map_unmapped_fields_as_string` setting to `true` (defaults to `false`), and then if a field referred to in
- a percolator query does not exist, it will be handled as a default string field so that adding the percolator query doesn't
- fail.