---
navigation_title: "Pattern replace"
mapped_pages:
---
The `pattern_replace` character filter uses a regular expression to match characters which should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.
::::{admonition} Beware of Pathological Regular Expressions
:class: warning

The pattern replace character filter uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a `StackOverflowError` and cause the node it is running on to exit suddenly.

Read more about pathological regular expressions and how to avoid them.
::::
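As an illustration of what "pathological" means here, the classic nested-quantifier pattern below backtracks exponentially on non-matching input. This is a sketch using Python's `re` for brevity; Java's backtracking engine degrades in the same way on such patterns:

```python
import re
import time

# (a+)+$ is a classic catastrophic-backtracking pattern: on a string of
# N 'a's followed by a non-matching character, the engine explores on the
# order of 2^N ways to split the run of 'a's before giving up.
pattern = re.compile(r"(a+)+$")

start = time.perf_counter()
result = pattern.match("a" * 18 + "b")  # a small N keeps this demo fast
elapsed = time.perf_counter() - start

print(result)   # None: no match, but only after extensive backtracking
print(elapsed)  # grows roughly 2x for each extra 'a' in the input
```

Each additional `a` roughly doubles the running time, which is why such patterns can stall a node on user-supplied input.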
The `pattern_replace` character filter accepts the following parameters:

`pattern`
:   A Java regular expression. Required.

`replacement`
:   The replacement string, which can reference capture groups using the `$1`..`$9` syntax, as explained here.

`flags`
:   Java regular expression flags. Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.
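The `$1`..`$9` replacement references and the pipe-separated flags follow Java regex conventions. As a rough sketch of the same semantics in Python (where backreferences are written `\1`..`\9` and flags are combined as constants rather than a string):

```python
import re

# Capture-group references: Java's "$1" corresponds to r"\1" in Python.
# Swap the two dash-separated fields: "foo-bar" -> "bar-foo".
swapped = re.sub(r"(\w+)-(\w+)", r"\2-\1", "foo-bar")
print(swapped)  # bar-foo

# Flags: the pipe-separated "CASE_INSENSITIVE|COMMENTS" string corresponds
# to OR-ing flag constants, e.g. re.IGNORECASE | re.VERBOSE in Python.
masked = re.sub(r"secret", "***", "My SECRET value", flags=re.IGNORECASE)
print(masked)  # My *** value
```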
In this example, we configure the `pattern_replace` character filter to replace any embedded dashes in numbers with underscores, i.e. `123-456-789` → `123_456_789`:
```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
```
The above example produces the following terms:

```text
[ My, credit, card, is, 123_456_789 ]
```
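Outside Elasticsearch, the pattern itself can be sanity-checked with any regex engine. A quick sketch in Python (with the backreference written `\1` instead of `$1`):

```python
import re

# (\d+)-(?=\d) captures a run of digits followed by a dash, with a
# lookahead requiring another digit, so only dashes *inside* numbers match.
pattern = r"(\d+)-(?=\d)"

cleaned = re.sub(pattern, r"\1_", "My credit card is 123-456-789")
print(cleaned)  # My credit card is 123_456_789
```

Because the lookahead consumes nothing, a dash followed by a non-digit (e.g. in `123-abc`) is left untouched.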
::::{warning}
Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting, as can be seen in the following example.
::::
This example inserts a space whenever it encounters a lower-case letter followed by an upper-case letter (i.e. `fooBarBaz` → `foo Bar Baz`), allowing camelCase words to be queried individually:
```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}
```
The above returns the following terms:

```text
[ the, foo, bar, baz, method ]
```
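The same zero-width replacement can be sketched in Python. Since Python's `re` module lacks Java's `\p{Lower}`/`\p{Upper}` classes, the ASCII classes `[a-z]`/`[A-Z]` are used here as an approximation:

```python
import re

# Zero-width pattern: a lookbehind for a lower-case letter and a lookahead
# for an upper-case letter, so the replacement inserts a space between them
# without consuming any characters.
spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", "The fooBarBaz method")
print(spaced)  # The foo Bar Baz method

# Lower-casing and splitting on whitespace mirrors what the standard
# tokenizer plus the lowercase token filter produce in the example above.
print(spaced.lower().split())  # ['the', 'foo', 'bar', 'baz', 'method']
```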
Querying for `bar` will find the document correctly, but highlighting on the result will produce incorrect highlights, because our character filter changed the length of the original text:
```console
PUT my-index-000001/_doc/1?refresh
{
  "text": "The fooBarBaz method"
}

GET my-index-000001/_search
{
  "query": {
    "match": {
      "text": "bar"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}
```
The output from the above is:
```console-result
{
  "timed_out": false,
  "took": $body.took,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my-index-000001",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "text": "The fooBarBaz method"
        },
        "highlight": {
          "text": [
            "The foo<em>Ba</em>rBaz method" <1>
          ]
        }
      }
    ]
  }
}
```

1. Note the incorrect highlight: the `<em>` tags cover only part of the original token, because the character filter changed the length of the text.