|
@@ -1,286 +1,224 @@
|
|
|
[[regexp-syntax]]
|
|
|
-==== Regular expression syntax
|
|
|
+== Regular expression syntax
|
|
|
|
|
|
-Regular expression queries are supported by the `regexp` and the `query_string`
|
|
|
-queries. The Lucene regular expression engine
|
|
|
-is not Perl-compatible but supports a smaller range of operators.
|
|
|
+A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
|
|
|
+match patterns in data using placeholder characters, called operators.
|
|
|
|
|
|
-[NOTE]
|
|
|
-=====
|
|
|
-We will not attempt to explain regular expressions, but
|
|
|
-just explain the supported operators.
|
|
|
-=====
|
|
|
+{es} supports regular expressions in the following queries:
|
|
|
|
|
|
-===== Standard operators
|
|
|
+* <<query-dsl-regexp-query, `regexp`>>
|
|
|
+* <<query-dsl-query-string-query, `query_string`>>
|
|
|
|
|
|
-Anchoring::
|
|
|
-+
|
|
|
---
|
|
|
-
|
|
|
-Most regular expression engines allow you to match any part of a string.
|
|
|
-If you want the regexp pattern to start at the beginning of the string or
|
|
|
-finish at the end of the string, then you have to _anchor_ it specifically,
|
|
|
-using `^` to indicate the beginning or `$` to indicate the end.
|
|
|
-
|
|
|
-Lucene's patterns are always anchored. The pattern provided must match
|
|
|
-the entire string. For string `"abcde"`:
|
|
|
-
|
|
|
- ab.* # match
|
|
|
- abcd # no match
|
|
|
-
|
|
|
---
|
|
|
-
|
|
|
-Allowed characters::
|
|
|
-+
|
|
|
---
|
|
|
+{es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
|
|
|
+engine to parse these queries.
|
|
|
|
|
|
-Any Unicode characters may be used in the pattern, but certain characters
|
|
|
-are reserved and must be escaped. The standard reserved characters are:
|
|
|
+[float]
|
|
|
+[[regexp-reserved-characters]]
|
|
|
+=== Reserved characters
|
|
|
+Lucene's regular expression engine supports all Unicode characters. However, the
|
|
|
+following characters are reserved as operators:
|
|
|
|
|
|
....
|
|
|
. ? + * | { } [ ] ( ) " \
|
|
|
....
|
|
|
|
|
|
-If you enable optional features (see below) then these characters may
|
|
|
-also be reserved:
|
|
|
+Depending on the <<regexp-optional-operators, optional operators>> enabled, the
|
|
|
+following characters may also be reserved:
|
|
|
|
|
|
- # @ & < > ~
|
|
|
-
|
|
|
-Any reserved character can be escaped with a backslash `"\*"` including
|
|
|
-a literal backslash character: `"\\"`
|
|
|
+....
|
|
|
+# @ & < > ~
|
|
|
+....
|
|
|
|
|
|
-Additionally, any characters (except double quotes) are interpreted literally
|
|
|
-when surrounded by double quotes:
|
|
|
+To use one of these characters literally, escape it with a preceding
|
|
|
+backslash or surround it with double quotes. For example:
|
|
|
|
|
|
- john"@smith.com"
|
|
|
+....
|
|
|
+\@ # renders as a literal '@'
|
|
|
+\\ # renders as a literal '\'
|
|
|
+"john@smith.com" # renders as 'john@smith.com'
|
|
|
+....
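As a minimal sketch of the backslash-escaping rule above, the following Python helper prefixes every reserved character with a backslash. The function name `escape_regexp_literal` and the `RESERVED` set are hypothetical names introduced here for illustration. The character list is copied from this section, and the optional-operator characters are always escaped on the assumption that all optional operators may be enabled.

```python
# Reserved characters as listed in this section: the standard set plus the
# characters reserved by the optional operators (assumed to be enabled).
RESERVED = set('.?+*|{}[]()"\\' + '#@&<>~')


def escape_regexp_literal(text: str) -> str:
    """Return text with each reserved character preceded by a backslash.

    Hypothetical helper for illustration; not part of any client library.
    """
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


# Both '@' and '.' are reserved, so each gets a backslash.
print(escape_regexp_literal("john@smith.com"))
```

Escaping character by character keeps the helper independent of which `flags` are set, at the cost of a few unnecessary backslashes when an optional operator is disabled.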
|
|
|
+
|
|
|
|
|
|
+[float]
|
|
|
+[[regexp-standard-operators]]
|
|
|
+=== Standard operators
|
|
|
|
|
|
---
|
|
|
+Lucene's regular expression engine does not use the
|
|
|
+https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
|
|
|
+Compatible Regular Expressions (PCRE)] library, but it does support the
|
|
|
+following standard operators.
|
|
|
|
|
|
-Match any character::
|
|
|
+`.`::
|
|
|
+
|
|
|
--
|
|
|
+Matches any character. For example:
|
|
|
|
|
|
-The period `"."` can be used to represent any character. For string `"abcde"`:
|
|
|
-
|
|
|
- ab... # match
|
|
|
- a.c.e # match
|
|
|
-
|
|
|
+....
|
|
|
+ab. # matches 'aba', 'abb', 'abz', etc.
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-One-or-more::
|
|
|
+`?`::
|
|
|
+
|
|
|
--
|
|
|
+Repeat the preceding character zero or one times. Often used to make the
|
|
|
+preceding character optional. For example:
|
|
|
|
|
|
-The plus sign `"+"` can be used to repeat the preceding shortest pattern
|
|
|
-once or more times. For string `"aaabbb"`:
|
|
|
-
|
|
|
- a+b+ # match
|
|
|
- aa+bb+ # match
|
|
|
- a+.+ # match
|
|
|
- aa+bbb+ # match
|
|
|
-
|
|
|
+....
|
|
|
+abc? # matches 'ab' and 'abc'
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Zero-or-more::
|
|
|
+`+`::
|
|
|
+
|
|
|
--
|
|
|
+Repeat the preceding character one or more times. For example:
|
|
|
|
|
|
-The asterisk `"*"` can be used to match the preceding shortest pattern
|
|
|
-zero-or-more times. For string `"aaabbb`":
|
|
|
-
|
|
|
- a*b* # match
|
|
|
- a*b*c* # match
|
|
|
- .*bbb.* # match
|
|
|
- aaa*bbb* # match
|
|
|
-
|
|
|
+....
|
|
|
+ab+ # matches 'ab', 'abb', 'abbb', etc.
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Zero-or-one::
|
|
|
+`*`::
|
|
|
+
|
|
|
--
|
|
|
+Repeat the preceding character zero or more times. For example:
|
|
|
|
|
|
-The question mark `"?"` makes the preceding shortest pattern optional. It
|
|
|
-matches zero or one times. For string `"aaabbb"`:
|
|
|
-
|
|
|
- aaa?bbb? # match
|
|
|
- aaaa?bbbb? # match
|
|
|
- .....?.? # match
|
|
|
- aa?bb? # no match
|
|
|
-
|
|
|
+....
|
|
|
+ab* # matches 'a', 'ab', 'abb', 'abbb', etc.
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Min-to-max::
|
|
|
+`{}`::
|
|
|
+
|
|
|
--
|
|
|
+Minimum and maximum number of times the preceding character can repeat. For
|
|
|
+example:
|
|
|
|
|
|
-Curly brackets `"{}"` can be used to specify a minimum and (optionally)
|
|
|
-a maximum number of times the preceding shortest pattern can repeat. The
|
|
|
-allowed forms are:
|
|
|
-
|
|
|
- {5} # repeat exactly 5 times
|
|
|
- {2,5} # repeat at least twice and at most 5 times
|
|
|
- {2,} # repeat at least twice
|
|
|
-
|
|
|
-For string `"aaabbb"`:
|
|
|
-
|
|
|
- a{3}b{3} # match
|
|
|
- a{2,4}b{2,4} # match
|
|
|
- a{2,}b{2,} # match
|
|
|
- .{3}.{3} # match
|
|
|
- a{4}b{4} # no match
|
|
|
- a{4,6}b{4,6} # no match
|
|
|
- a{4,}b{4,} # no match
|
|
|
-
|
|
|
+....
|
|
|
+a{2} # matches 'aa'
|
|
|
+a{2,4} # matches 'aa', 'aaa', and 'aaaa'
|
|
|
+a{2,} # matches 'a' repeated two or more times
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Grouping::
|
|
|
+`|`::
|
|
|
+
|
|
|
--
|
|
|
-
|
|
|
-Parentheses `"()"` can be used to form sub-patterns. The quantity operators
|
|
|
-listed above operate on the shortest previous pattern, which can be a group.
|
|
|
-For string `"ababab"`:
|
|
|
-
|
|
|
- (ab)+ # match
|
|
|
- ab(ab)+ # match
|
|
|
- (..)+ # match
|
|
|
- (...)+ # no match
|
|
|
- (ab)* # match
|
|
|
- abab(ab)? # match
|
|
|
- ab(ab)? # no match
|
|
|
- (ab){3} # match
|
|
|
- (ab){1,2} # no match
|
|
|
-
|
|
|
+OR operator. The match will succeed if the longest pattern on either the left
|
|
|
+side OR the right side matches. For example:
|
|
|
+....
|
|
|
+abc|xyz # matches 'abc' and 'xyz'
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Alternation::
|
|
|
+`( … )`::
|
|
|
+
|
|
|
--
|
|
|
+Forms a group. You can use a group to treat part of the expression as a single
|
|
|
+character. For example:
|
|
|
|
|
|
-The pipe symbol `"|"` acts as an OR operator. The match will succeed if
|
|
|
-the pattern on either the left-hand side OR the right-hand side matches.
|
|
|
-The alternation applies to the _longest pattern_, not the shortest.
|
|
|
-For string `"aabb"`:
|
|
|
-
|
|
|
- aabb|bbaa # match
|
|
|
- aacc|bb # no match
|
|
|
- aa(cc|bb) # match
|
|
|
- a+|b+ # no match
|
|
|
- a+b+|b+a+ # match
|
|
|
- a+(b|c)+ # match
|
|
|
-
|
|
|
+....
|
|
|
+abc(def)? # matches 'abc' and 'abcdef' but not 'abcd'
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Character classes::
|
|
|
+`[ … ]`::
|
|
|
+
|
|
|
--
|
|
|
+Match one of the characters in the brackets. For example:
|
|
|
|
|
|
-Ranges of potential characters may be represented as character classes
|
|
|
-by enclosing them in square brackets `"[]"`. A leading `^`
|
|
|
-negates the character class. The allowed forms are:
|
|
|
-
|
|
|
- [abc] # 'a' or 'b' or 'c'
|
|
|
- [a-c] # 'a' or 'b' or 'c'
|
|
|
- [-abc] # '-' or 'a' or 'b' or 'c'
|
|
|
- [abc\-] # '-' or 'a' or 'b' or 'c'
|
|
|
- [^abc] # any character except 'a' or 'b' or 'c'
|
|
|
- [^a-c] # any character except 'a' or 'b' or 'c'
|
|
|
- [^-abc] # any character except '-' or 'a' or 'b' or 'c'
|
|
|
- [^abc\-] # any character except '-' or 'a' or 'b' or 'c'
|
|
|
+....
|
|
|
+[abc] # matches 'a', 'b', 'c'
|
|
|
+....
|
|
|
|
|
|
-Note that the dash `"-"` indicates a range of characters, unless it is
|
|
|
-the first character or if it is escaped with a backslash.
|
|
|
+Inside the brackets, `-` indicates a range unless `-` is the first character or
|
|
|
+escaped. For example:
|
|
|
|
|
|
-For string `"abcd"`:
|
|
|
+....
|
|
|
+[a-c] # matches 'a', 'b', or 'c'
|
|
|
+[-abc] # '-' is first character. Matches '-', 'a', 'b', or 'c'
|
|
|
+[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
|
|
|
+....
|
|
|
|
|
|
- ab[cd]+ # match
|
|
|
- [a-d]+ # match
|
|
|
- [^a-d]+ # no match
|
|
|
+A `^` as the first character in the brackets negates the character or range. For
|
|
|
+example:
|
|
|
|
|
|
+....
|
|
|
+[^abc] # matches any character except 'a', 'b', or 'c'
|
|
|
+[^a-c] # matches any character except 'a', 'b', or 'c'
|
|
|
+[^-abc] # matches any character except '-', 'a', 'b', or 'c'
|
|
|
+[^abc\-] # matches any character except 'a', 'b', 'c', or '-'
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-===== Optional operators
|
|
|
-
|
|
|
-These operators are available by default as the `flags` parameter defaults to `ALL`.
|
|
|
-Different flag combinations (concatenated with `"|"`) can be used to enable/disable
|
|
|
-specific operators:
|
|
|
+[float]
|
|
|
+[[regexp-optional-operators]]
|
|
|
+=== Optional operators
|
|
|
|
|
|
- {
|
|
|
- "regexp": {
|
|
|
- "username": {
|
|
|
- "value": "john~athon<1-5>",
|
|
|
- "flags": "COMPLEMENT|INTERVAL"
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
+You can use the `flags` parameter to enable more optional operators for
|
|
|
+Lucene's regular expression engine.
|
|
|
|
|
|
-Complement::
|
|
|
-+
|
|
|
---
|
|
|
-
|
|
|
-The complement is probably the most useful option. The shortest pattern that
|
|
|
-follows a tilde `"~"` is negated. For instance, `"ab~cd" means:
|
|
|
+To enable multiple operators, use a `|` separator. For example, a `flags` value
|
|
|
+of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.
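The following Python sketch shows one way a `regexp` query body with a combined `flags` value might be assembled. The field name `username` and the pattern are illustrative placeholders, and the surrounding `query` wrapper assumes the body is sent as a search request.

```python
import json

# Placeholder field name and pattern, for illustration only.
query = {
    "query": {
        "regexp": {
            "username": {
                "value": "john~athon<1-5>",
                # Enable only the COMPLEMENT (~) and INTERVAL (<>) operators.
                "flags": "COMPLEMENT|INTERVAL",
            }
        }
    }
}

# Serialize the body as JSON, ready to send with an HTTP client of your choice.
print(json.dumps(query, indent=2))
```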
|
|
|
|
|
|
-* Starts with `a`
|
|
|
-* Followed by `b`
|
|
|
-* Followed by a string of any length that is anything but `c`
|
|
|
-* Ends with `d`
|
|
|
+[float]
|
|
|
+==== Valid values
|
|
|
|
|
|
-For the string `"abcdef"`:
|
|
|
+`ALL` (Default)::
|
|
|
+Enables all optional operators.
|
|
|
|
|
|
- ab~df # match
|
|
|
- ab~cf # match
|
|
|
- ab~cdef # no match
|
|
|
- a~(cb)def # match
|
|
|
- a~(bc)def # no match
|
|
|
-
|
|
|
-Enabled with the `COMPLEMENT` or `ALL` flags.
|
|
|
+`COMPLEMENT`::
|
|
|
++
|
|
|
+--
|
|
|
+Enables the `~` operator. You can use `~` to negate the shortest following
|
|
|
+pattern. For example:
|
|
|
|
|
|
+....
|
|
|
+a~bc # matches 'adc' and 'aec' but not 'abc'
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Interval::
|
|
|
+`INTERVAL`::
|
|
|
+
|
|
|
--
|
|
|
+Enables the `<>` operators. You can use `<>` to match a numeric range. For
|
|
|
+example:
|
|
|
|
|
|
-The interval option enables the use of numeric ranges, enclosed by angle
|
|
|
-brackets `"<>"`. For string: `"foo80"`:
|
|
|
-
|
|
|
- foo<1-100> # match
|
|
|
- foo<01-100> # match
|
|
|
- foo<001-100> # no match
|
|
|
-
|
|
|
-Enabled with the `INTERVAL` or `ALL` flags.
|
|
|
-
|
|
|
-
|
|
|
+....
|
|
|
+foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
|
|
|
+foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Intersection::
|
|
|
+`INTERSECTION`::
|
|
|
+
|
|
|
--
|
|
|
+Enables the `&` operator, which acts as an AND operator. The match will succeed
|
|
|
+if patterns on both the left side AND the right side match. For example:
|
|
|
|
|
|
-The ampersand `"&"` joins two patterns in a way that both of them have to
|
|
|
-match. For string `"aaabbb"`:
|
|
|
-
|
|
|
- aaa.+&.+bbb # match
|
|
|
- aaa&bbb # no match
|
|
|
-
|
|
|
-Using this feature usually means that you should rewrite your regular
|
|
|
-expression.
|
|
|
-
|
|
|
-Enabled with the `INTERSECTION` or `ALL` flags.
|
|
|
-
|
|
|
+....
|
|
|
+aaa.+&.+bbb # matches 'aaabbb'
|
|
|
+....
|
|
|
--
|
|
|
|
|
|
-Any string::
|
|
|
+`ANYSTRING`::
|
|
|
+
|
|
|
--
|
|
|
+Enables the `@` operator. You can use `@` to match any entire
|
|
|
+string.
|
|
|
|
|
|
-The at sign `"@"` matches any string in its entirety. This could be combined
|
|
|
-with the intersection and complement above to express ``everything except''.
|
|
|
-For instance:
|
|
|
+You can combine the `@` operator with `&` and `~` operators to create an
|
|
|
+"everything except" logic. For example:
|
|
|
|
|
|
- @&~(foo.+) # anything except string beginning with "foo"
|
|
|
-
|
|
|
-Enabled with the `ANYSTRING` or `ALL` flags.
|
|
|
+....
|
|
|
+@&~(abc.+) # matches everything except terms that begin with 'abc' followed by one or more characters
|
|
|
+....
|
|
|
--
|
|
|
+
|
|
|
+[float]
|
|
|
+[[regexp-unsupported-operators]]
|
|
|
+=== Unsupported operators
|
|
|
+Lucene's regular expression engine does not support anchor operators, such as
|
|
|
+`^` (beginning of line) or `$` (end of line). To match a term, the regular
|
|
|
+expression must match the entire string.
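The whole-string requirement can be illustrated by analogy in Python. This is only a sketch: Python's `re` module is a different engine from Lucene's, and it is used here solely because `re.fullmatch` has the same always-anchored behavior, while `re.search` shows the unanchored behavior that many other engines default to.

```python
import re

term = "abcde"

# re.fullmatch succeeds only if the pattern covers the entire string,
# which mirrors the always-anchored matching described above.
print(bool(re.fullmatch(r"ab.*", term)))  # pattern spans the whole term
print(bool(re.fullmatch(r"abcd", term)))  # trailing 'e' is left unmatched

# re.search accepts a match anywhere in the string, so the same
# pattern that fails as a full match is found here.
print(bool(re.search(r"abcd", term)))
```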
|