[DOCS] Rewrite `regexp` query (#42711)

James Rodewig 6 years ago
parent
current commit
8b2493ca9f

+ 1 - 0
docs/reference/index-modules.asciidoc

@@ -205,6 +205,7 @@ specific index module:
     The maximum number of terms that can be used in Terms Query.
     Defaults to `65536`.
 
+[[index-max-regex-length]]
  `index.max_regex_length`::
 
     The maximum length of regex that can be used in Regexp Query.

+ 3 - 1
docs/reference/query-dsl.asciidoc

@@ -47,4 +47,6 @@ include::query-dsl/term-level-queries.asciidoc[]
 
 include::query-dsl/minimum-should-match.asciidoc[]
 
-include::query-dsl/multi-term-rewrite.asciidoc[]
+include::query-dsl/multi-term-rewrite.asciidoc[]
+
+include::query-dsl/regexp-syntax.asciidoc[]

+ 63 - 75
docs/reference/query-dsl/regexp-query.asciidoc

@@ -4,98 +4,86 @@
 <titleabbrev>Regexp</titleabbrev>
 ++++
 
-The `regexp` query allows you to use regular expression term queries.
-See <<regexp-syntax>> for details of the supported regular expression language.
-The "term queries" in that first sentence means that Elasticsearch will apply
-the regexp to the terms produced by the tokenizer for that field, and not
-to the original text of the field.
+Returns documents that contain terms matching a
+https://en.wikipedia.org/wiki/Regular_expression[regular expression].
 
-*Note*: The performance of a `regexp` query heavily depends on the
-regular expression chosen. Matching everything like `.*` is very slow as
-well as using lookaround regular expressions. If possible, you should
-try to use a long prefix before your regular expression starts. Wildcard
-matchers like `.*?+` will mostly lower performance.
+A regular expression is a way to match patterns in data using placeholder
+characters, called operators. For a list of operators supported by the
+`regexp` query, see <<regexp-syntax, Regular expression syntax>>.
 
-[source,js]
---------------------------------------------------
-GET /_search
-{
-    "query": {
-        "regexp":{
-            "name.first": "s.*y"
-        }
-    }
-}
---------------------------------------------------
-// CONSOLE
+[[regexp-query-ex-request]]
+==== Example request
 
-Boosting is also supported
+The following search returns documents where the `user` field contains any term
+that begins with `k` and ends with `y`. The `.*` operators match any
+characters of any length, including no characters. Matching
+terms can include `ky`, `kay`, and `kimchy`.
 
 [source,js]
---------------------------------------------------
+----
 GET /_search
 {
     "query": {
-        "regexp":{
-            "name.first":{
-                "value":"s.*y",
-                "boost":1.2
+        "regexp": {
+            "user": {
+                "value": "k.*y",
+                "flags" : "ALL",
+                "max_determinized_states": 10000,
+                "rewrite": "constant_score"
             }
         }
     }
 }
---------------------------------------------------
+----
 // CONSOLE
 
-You can also use special flags
 
-[source,js]
---------------------------------------------------
-GET /_search
-{
-    "query": {
-        "regexp":{
-            "name.first": {
-                "value": "s.*y",
-                "flags" : "INTERSECTION|COMPLEMENT|EMPTY"
-            }
-        }
-    }
-}
---------------------------------------------------
-// CONSOLE
+[[regexp-top-level-params]]
+==== Top-level parameters for `regexp`
+`<field>`::
+(Required, object) Field you wish to search.
 
-Possible flags are `ALL` (default), `ANYSTRING`, `COMPLEMENT`,
-`EMPTY`, `INTERSECTION`, `INTERVAL`, or `NONE`. Please check the
-http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/util/automaton/RegExp.html[Lucene
-documentation] for their meaning
+[[regexp-query-field-params]]
+==== Parameters for `<field>`
+`value`::
+(Required, string) Regular expression for terms you wish to find in the provided
+`<field>`. For a list of supported operators, see <<regexp-syntax, Regular
+expression syntax>>.
++
+--
+By default, regular expressions are limited to 1,000 characters. You can change
+this limit using the <<index-max-regex-length, `index.max_regex_length`>>
+setting.
 
-Regular expressions are dangerous because it's easy to accidentally
-create an innocuous looking one that requires an exponential number of
-internal determinized automaton states (and corresponding RAM and CPU)
-for Lucene to execute.  Lucene prevents these using the
-`max_determinized_states` setting (defaults to 10000).  You can raise
-this limit to allow more complex regular expressions to execute.
+[WARNING]
+=====
+The performance of the `regexp` query can vary based on the regular expression
+provided. To improve performance, avoid using wildcard patterns, such as `.*` or
+`.*?+`, without a prefix or suffix.
+=====
+--
 
-[source,js]
---------------------------------------------------
-GET /_search
-{
-    "query": {
-        "regexp":{
-            "name.first": {
-                "value": "s.*y",
-                "flags" : "INTERSECTION|COMPLEMENT|EMPTY",
-                "max_determinized_states": 20000
-            }
-        }
-    }
-}
---------------------------------------------------
-// CONSOLE
+`flags`::
+(Optional, string) Enables optional operators for the regular expression. For
+valid values and more information, see <<regexp-optional-operators, Regular
+expression syntax>>.
+
+`max_determinized_states`::
++
+--
+(Optional, integer) Maximum number of
+https://en.wikipedia.org/wiki/Deterministic_finite_automaton[automaton states]
+required for the query. Default is `10000`.
+
+{es} uses https://lucene.apache.org/core/[Apache Lucene] internally to parse
+regular expressions. Lucene converts each regular expression to a finite
+automaton containing a number of determinized states.
 
-NOTE: By default the maximum length of regex string allowed in a Regexp Query 
-is limited to 1000. You can update the `index.max_regex_length` index setting 
-to bypass this limit.
+You can use this parameter to prevent that conversion from unintentionally
+consuming too many resources. You may need to increase this limit to run complex
+regular expressions.
+--
 
-include::regexp-syntax.asciidoc[]
+`rewrite`::
+(Optional, string) Method used to rewrite the query. For valid values and more
+information, see the <<query-dsl-multi-term-rewrite, `rewrite` parameter>>.
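
The parameters documented in this file can be assembled into a request body programmatically. The following is a minimal Python sketch, not part of the commit: `build_regexp_query` is a hypothetical helper, the client-side length check only mirrors the documented 1,000-character default of `index.max_regex_length`, and nothing is sent to a cluster.

```python
import json

# Mirrors the documented default of `index.max_regex_length` (assumption:
# the index uses the default; the server enforces the real limit).
DEFAULT_MAX_REGEX_LENGTH = 1000


def build_regexp_query(field, value, flags="ALL",
                       max_determinized_states=10000,
                       rewrite="constant_score",
                       max_regex_length=DEFAULT_MAX_REGEX_LENGTH):
    """Return a search body for a `regexp` query (hypothetical helper)."""
    if len(value) > max_regex_length:
        raise ValueError(
            f"regular expression is {len(value)} characters long; "
            f"the limit is {max_regex_length}"
        )
    return {
        "query": {
            "regexp": {
                field: {
                    "value": value,
                    "flags": flags,
                    "max_determinized_states": max_determinized_states,
                    "rewrite": rewrite,
                }
            }
        }
    }


if __name__ == "__main__":
    # Same request as the example in the rewritten page.
    print(json.dumps(build_regexp_query("user", "k.*y"), indent=2))
```

The printed JSON can be pasted after `GET /_search` in the console. Failing early on an over-long pattern is a design choice of this sketch, not something the documentation requires.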

+ 141 - 203
docs/reference/query-dsl/regexp-syntax.asciidoc

@@ -1,286 +1,224 @@
 [[regexp-syntax]]
-==== Regular expression syntax
+== Regular expression syntax
 
-Regular expression queries are supported by the `regexp` and the `query_string`
-queries.  The Lucene regular expression engine
-is not Perl-compatible but supports a smaller range of operators.
+A https://en.wikipedia.org/wiki/Regular_expression[regular expression] is a way to
+match patterns in data using placeholder characters, called operators.
 
-[NOTE]
-=====
-We will not attempt to explain regular expressions, but
-just explain the supported operators.
-=====
+{es} supports regular expressions in the following queries:
 
-===== Standard operators
+* <<query-dsl-regexp-query, `regexp`>>
+* <<query-dsl-query-string-query, `query_string`>>
 
-Anchoring::
-+
---
-
-Most regular expression engines allow you to match any part of a string.
-If you want the regexp pattern to start at the beginning of the string or
-finish at the end of the string, then you have to _anchor_ it specifically,
-using `^` to indicate the beginning or `$` to indicate the end.
-
-Lucene's patterns are always anchored.  The pattern provided must match
-the entire string. For string `"abcde"`:
-
-    ab.*     # match
-    abcd     # no match
-
---
-
-Allowed characters::
-+
---
+{es} uses https://lucene.apache.org/core/[Apache Lucene]'s regular expression
+engine to parse these queries.
 
-Any Unicode characters may be used in the pattern, but certain characters
-are reserved and must be escaped.  The standard reserved characters are:
+[float]
+[[regexp-reserved-characters]]
+=== Reserved characters
+Lucene's regular expression engine supports all Unicode characters. However, the
+following characters are reserved as operators:
 
 ....
 . ? + * | { } [ ] ( ) " \
 ....
 
-If you enable optional features (see below) then these characters may
-also be reserved:
+Depending on the <<regexp-optional-operators, optional operators>> enabled, the
+following characters may also be reserved:
 
-    # @ & < >  ~
-
-Any reserved character can be escaped with a backslash `"\*"` including
-a literal backslash character: `"\\"`
+....
+# @ & < >  ~
+....
 
-Additionally, any characters (except double quotes) are interpreted literally
-when surrounded by double quotes:
+To use one of these characters literally, escape it with a preceding
+backslash or surround it with double quotes. For example:
 
-    john"@smith.com"
+....
+\@                  # renders as a literal '@'
+\\                  # renders as a literal '\'
+"john@smith.com"    # renders as 'john@smith.com'
+....
+    
 
+[float]
+[[regexp-standard-operators]]
+=== Standard operators
 
---
+Lucene's regular expression engine does not use the
+https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions[Perl
+Compatible Regular Expressions (PCRE)] library, but it does support the
+following standard operators.
 
-Match any character::
+`.`::
 +
 --
+Matches any character. For example:
 
-The period `"."` can be used to represent any character.  For string `"abcde"`:
-
-    ab...   # match
-    a.c.e   # match
-
+....
+ab.     # matches 'aba', 'abb', 'abz', etc.
+....
 --
 
-One-or-more::
+`?`::
 +
 --
+Repeat the preceding character zero or one times. Often used to make the
+preceding character optional. For example:
 
-The plus sign `"+"` can be used to repeat the preceding shortest pattern
-once or more times. For string `"aaabbb"`:
-
-    a+b+        # match
-    aa+bb+      # match
-    a+.+        # match
-    aa+bbb+     # match
-
+....
+abc?     # matches 'ab' and 'abc'
+....
 --
 
-Zero-or-more::
+`+`::
 +
 --
+Repeat the preceding character one or more times. For example:
 
-The asterisk `"*"` can be used to match the preceding shortest pattern
-zero-or-more times.  For string `"aaabbb`":
-
-    a*b*        # match
-    a*b*c*      # match
-    .*bbb.*     # match
-    aaa*bbb*    # match
-
+....
+ab+     # matches 'ab', 'abb', 'abbb', etc.
+....
 --
 
-Zero-or-one::
+`*`::
 +
 --
+Repeat the preceding character zero or more times. For example:
 
-The question mark `"?"` makes the preceding shortest pattern optional. It
-matches zero or one times.  For string `"aaabbb"`:
-
-    aaa?bbb?    # match
-    aaaa?bbbb?  # match
-    .....?.?    # match
-    aa?bb?      # no match
-
+....
+ab*     # matches 'a', 'ab', 'abb', 'abbb', etc.
+....
 --
 
-Min-to-max::
+`{}`::
 +
 --
+Minimum and maximum number of times the preceding character can repeat. For
+example:
 
-Curly brackets `"{}"` can be used to specify a minimum and (optionally)
-a maximum number of times the preceding shortest pattern can repeat.  The
-allowed forms are:
-
-    {5}     # repeat exactly 5 times
-    {2,5}   # repeat at least twice and at most 5 times
-    {2,}    # repeat at least twice
-
-For string `"aaabbb"`:
-
-    a{3}b{3}        # match
-    a{2,4}b{2,4}    # match
-    a{2,}b{2,}      # match
-    .{3}.{3}        # match
-    a{4}b{4}        # no match
-    a{4,6}b{4,6}    # no match
-    a{4,}b{4,}      # no match
-
+....
+a{2}    # matches 'aa'
+a{2,4}  # matches 'aa', 'aaa', and 'aaaa'
+a{2,}   # matches 'a' repeated two or more times
+....
 --
 
-Grouping::
+`|`::
 +
 --
-
-Parentheses `"()"` can be used to form sub-patterns. The quantity operators
-listed above operate on the shortest previous pattern, which can be a group.
-For string `"ababab"`:
-
-    (ab)+       # match
-    ab(ab)+     # match
-    (..)+       # match
-    (...)+      # no match
-    (ab)*       # match
-    abab(ab)?   # match
-    ab(ab)?     # no match
-    (ab){3}     # match
-    (ab){1,2}   # no match
-
+OR operator. The match will succeed if the longest pattern on either the left
+side OR the right side matches. For example:
+....
+abc|xyz  # matches 'abc' and 'xyz'
+....
 --
 
-Alternation::
+`( … )`::
 +
 --
+Forms a group. You can use a group to treat part of the expression as a single
+character. For example:
 
-The pipe symbol `"|"` acts as an OR operator. The match will succeed if
-the pattern on either the left-hand side OR the right-hand side matches.
-The alternation applies to the _longest pattern_, not the shortest.
-For string `"aabb"`:
-
-    aabb|bbaa   # match
-    aacc|bb     # no match
-    aa(cc|bb)   # match
-    a+|b+       # no match
-    a+b+|b+a+   # match
-    a+(b|c)+    # match
-
+....
+abc(def)?  # matches 'abc' and 'abcdef' but not 'abcd'
+....
 --
 
-Character classes::
+`[ … ]`::
 +
 --
+Match one of the characters in the brackets. For example:
 
-Ranges of potential characters may be represented as character classes
-by enclosing them in square brackets `"[]"`. A leading `^`
-negates the character class. The allowed forms are:
-
-    [abc]   # 'a' or 'b' or 'c'
-    [a-c]   # 'a' or 'b' or 'c'
-    [-abc]  # '-' or 'a' or 'b' or 'c'
-    [abc\-] # '-' or 'a' or 'b' or 'c'
-    [^abc]  # any character except 'a' or 'b' or 'c'
-    [^a-c]  # any character except 'a' or 'b' or 'c'
-    [^-abc]  # any character except '-' or 'a' or 'b' or 'c'
-    [^abc\-] # any character except '-' or 'a' or 'b' or 'c'
+....
+[abc]   # matches 'a', 'b', 'c'
+....
 
-Note that the dash `"-"` indicates a range of characters, unless it is
-the first character or if it is escaped with a backslash.
+Inside the brackets, `-` indicates a range unless `-` is the first character or
+escaped. For example:
 
-For string `"abcd"`:
+....
+[a-c]   # matches 'a', 'b', or 'c'
+[-abc]  # '-' is first character. Matches '-', 'a', 'b', or 'c'
+[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
+....
 
-    ab[cd]+     # match
-    [a-d]+      # match
-    [^a-d]+     # no match
+A `^` before a character in the brackets negates the character or range. For
+example:
 
+....
+[^abc]      # matches any character except 'a', 'b', or 'c'
+[^a-c]      # matches any character except 'a', 'b', or 'c'
+[^-abc]     # matches any character except '-', 'a', 'b', or 'c'
+[^abc\-]    # matches any character except 'a', 'b', 'c', or '-'
+....
 --
 
-===== Optional operators
-
-These operators are available by default as the `flags` parameter defaults to `ALL`.
-Different flag combinations (concatenated with `"|"`) can be used to enable/disable
-specific operators:
+[float]
+[[regexp-optional-operators]]
+=== Optional operators
 
-    {
-        "regexp": {
-            "username": {
-                "value": "john~athon<1-5>",
-                "flags": "COMPLEMENT|INTERVAL"
-            }
-        }
-    }
+You can use the `flags` parameter to enable more optional operators for
+Lucene's regular expression engine.
 
-Complement::
-+
---
-
-The complement is probably the most useful option. The shortest pattern that
-follows a tilde `"~"` is negated.  For instance, `"ab~cd" means:
+To enable multiple operators, use a `|` separator. For example, a `flags` value
+of `COMPLEMENT|INTERVAL` enables the `COMPLEMENT` and `INTERVAL` operators.
 
-* Starts with `a`
-* Followed by `b`
-* Followed by a string of any length that is anything but `c`
-* Ends with `d`
+[float]
+==== Valid values 
 
-For the string `"abcdef"`:
+`ALL` (Default)::
+Enables all optional operators.
 
-    ab~df     # match
-    ab~cf     # match
-    ab~cdef   # no match
-    a~(cb)def # match
-    a~(bc)def # no match
-
-Enabled with the `COMPLEMENT` or `ALL` flags.
+`COMPLEMENT`::
++
+--
+Enables the `~` operator. You can use `~` to negate the shortest following
+pattern. For example:
 
+....
+a~bc   # matches 'adc' and 'aec' but not 'abc'
+....
 --
 
-Interval::
+`INTERVAL`::
 +
 --
+Enables the `<>` operators. You can use `<>` to match a numeric range. For
+example:
 
-The interval option enables the use of numeric ranges, enclosed by angle
-brackets `"<>"`. For string: `"foo80"`:
-
-    foo<1-100>     # match
-    foo<01-100>    # match
-    foo<001-100>   # no match
-
-Enabled with the `INTERVAL` or `ALL` flags.
-
-
+....
+foo<1-100>      # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
+foo<01-100>     # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
+....
 --
 
-Intersection::
+`INTERSECTION`::
 +
 --
+Enables the `&` operator, which acts as an AND operator. The match will succeed
+if the patterns on both the left side AND the right side match. For example:
 
-The ampersand `"&"` joins two patterns in a way that both of them have to
-match. For string `"aaabbb"`:
-
-    aaa.+&.+bbb     # match
-    aaa&bbb         # no match
-
-Using this feature usually means that you should rewrite your regular
-expression.
-
-Enabled with the `INTERSECTION` or `ALL` flags.
-
+....
+aaa.+&.+bbb  # matches 'aaabbb'
+....
 --
 
-Any string::
+`ANYSTRING`::
 +
 --
+Enables the `@` operator. You can use `@` to match any entire
+string.
 
-The at sign `"@"` matches any string in its entirety.  This could be combined
-with the intersection and complement above to express ``everything except''.
-For instance:
+You can combine the `@` operator with `&` and `~` operators to create an
+"everything except" logic. For example:
 
-    @&~(foo.+)      # anything except string beginning with "foo"
-
-Enabled with the `ANYSTRING` or `ALL` flags.
+....
+@&~(abc.+)  # matches everything except terms beginning with 'abc'
+....
 --
+
+[float]
+[[regexp-unsupported-operators]]
+=== Unsupported operators
+Lucene's regular expression engine does not support anchor operators, such as
+`^` (beginning of line) or `$` (end of line). To match a term, the regular
+expression must match the entire string.
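
The whole-term matching rule can be tried out locally. The sketch below is an analogy only: it uses Python's `re.fullmatch`, which is a different engine from Lucene's, so it covers just the standard operators the two share. The optional operators (`~`, `<>`, `&`, `@`) have no direct `re` equivalent, and `matches_term` is a hypothetical helper name.

```python
import re


def matches_term(pattern, term):
    """Return True only if `pattern` matches the entire `term`.

    re.fullmatch stands in for Lucene's implicit anchoring; it is an
    approximation, not Lucene's regular expression engine.
    """
    return re.fullmatch(pattern, term) is not None


# (pattern, term, expected result under whole-term matching)
EXAMPLES = [
    ("ab.*", "abcde", True),        # pattern covers the whole term
    ("abcd", "abcde", False),       # a prefix match is not enough
    ("ab+", "ab", True),            # one or more 'b'
    ("ab*", "a", True),             # zero or more 'b'
    ("a{2,4}", "aaaaa", False),     # five repetitions exceed the maximum
    ("abc(def)?", "abcdef", True),  # optional group present
    ("abc(def)?", "abcd", False),   # group must match entirely or not at all
]

if __name__ == "__main__":
    for pattern, term, expected in EXAMPLES:
        result = matches_term(pattern, term)
        print(f"{pattern!r} against {term!r}: {result}")
        assert result == expected
```

Because every pattern is implicitly anchored, matching a substring takes explicit padding with `.*` on one or both sides.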

+ 1 - 1
x-pack/docs/en/rest-api/security/role-mapping-resources.asciidoc

@@ -49,7 +49,7 @@ The value specified in the field rule can be one of the following types:
 | Simple String      | Exactly matches the provided value.                             | "esadmin"
 | Wildcard String    | Matches the provided value using a wildcard.                    | "*,dc=example,dc=com"
 | Regular Expression | Matches the provided value using a
-                       {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp]. | "/.\*-admin[0-9]*/"
+                       {ref}/regexp-syntax.html[Lucene regexp]. | "/.\*-admin[0-9]*/"
 | Number             | Matches an equivalent numerical value.                          | 7
 | Null               | Matches a null or missing value.                                | null
 | Array              | Tests each element in the array in

+ 1 - 1
x-pack/docs/en/security/auditing/output-logfile.asciidoc

@@ -132,7 +132,7 @@ Please take time to review these policies whenever your system architecture chan
 
 A policy is a named set of filter rules. Each filter rule applies to a single event attribute,
 one of the `users`, `realms`, `roles` or `indices` attributes. The filter rule defines
-a list of {ref}/query-dsl-regexp-query.html#regexp-syntax[Lucene regexp], *any* of which has to match the value of the audit
+a list of {ref}/regexp-syntax.html[Lucene regexp], *any* of which has to match the value of the audit
 event attribute for the rule to match.
 A policy matches an event if *all* the rules comprising it match the event.
 An audit event is ignored, therefore not printed, if it matches *any* policy. All other