  1. [[regexp-syntax]]
  2. ==== Regular expression syntax
  3. Regular expression queries are supported by the `regexp` and the `query_string`
  4. queries. The Lucene regular expression engine is not Perl-compatible: it
  5. supports a smaller range of operators.
  6. [NOTE]
  7. =====
  8. This section does not attempt to teach regular expressions; it only
  9. describes the supported operators.
  10. =====
  11. ===== Standard operators
  12. Anchoring::
  13. +
  14. --
  15. Most regular expression engines allow you to match any part of a string.
  16. If you want the regexp pattern to start at the beginning of the string or
  17. finish at the end of the string, then you have to _anchor_ it specifically,
  18. using `^` to indicate the beginning or `$` to indicate the end.
  19. Lucene's patterns are always anchored. The pattern provided must match
  20. the entire string. For string `"abcde"`:
  21. ab.* # match
  22. abcd # no match
  23. --
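As a rough illustration of anchoring, the sketch below uses Python's `re` module. It is not the Lucene engine and serves only as a stand-in: `re.search` behaves like the unanchored engines described above, while `re.fullmatch` approximates Lucene's always-anchored matching.

```python
import re

text = "abcde"

# Unanchored search: "abcd" is found inside "abcde".
unanchored = re.search(r"abcd", text) is not None

# Whole-string matching, approximating Lucene's implicit anchoring.
anchored_abcd = re.fullmatch(r"abcd", text) is not None
anchored_prefix = re.fullmatch(r"ab.*", text) is not None

print(unanchored, anchored_abcd, anchored_prefix)
```

Under this approximation, `abcd` only succeeds when the engine is allowed to match part of the string.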
  24. Allowed characters::
  25. +
  26. --
  27. Any Unicode characters may be used in the pattern, but certain characters
  28. are reserved and must be escaped. The standard reserved characters are:
  29. ....
  30. . ? + * | { } [ ] ( ) " \
  31. ....
  32. If you enable optional features (see below) then these characters may
  33. also be reserved:
  34. # @ & < > ~
  35. Any reserved character can be escaped with a backslash, for example `"\*"`,
  36. including a literal backslash character: `"\\"`
  37. Additionally, any characters (except double quotes) are interpreted literally
  38. when surrounded by double quotes:
  39. john"@smith.com"
  40. --
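When a pattern is built from user input, the reserved characters listed above need escaping. The following is a minimal sketch of one way to do that in client code; `escape_regexp` is a hypothetical helper written for this example, not part of any library.

```python
# Reserved characters as listed above. The second set is only reserved
# when the optional operators are enabled.
STANDARD_RESERVED = set('.?+*|{}[]()"\\')
OPTIONAL_RESERVED = set('#@&<>~')


def escape_regexp(literal: str, optional_operators: bool = True) -> str:
    """Backslash-escape every reserved character in `literal`."""
    reserved = set(STANDARD_RESERVED)
    if optional_operators:
        reserved |= OPTIONAL_RESERVED
    return "".join("\\" + ch if ch in reserved else ch for ch in literal)


print(escape_regexp("john@smith.com"))
```

If the escaped pattern is then placed in a JSON request body, each backslash needs escaping once more for JSON; a JSON serializer normally takes care of that.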
  41. Match any character::
  42. +
  43. --
  44. The period `"."` can be used to represent any character. For string `"abcde"`:
  45. ab... # match
  46. a.c.e # match
  47. --
  48. One-or-more::
  49. +
  50. --
  51. The plus sign `"+"` can be used to repeat the preceding shortest pattern
  52. one or more times. For string `"aaabbb"`:
  53. a+b+ # match
  54. aa+bb+ # match
  55. a+.+ # match
  56. aa+bbb+ # match
  57. --
  58. Zero-or-more::
  59. +
  60. --
  61. The asterisk `"*"` can be used to match the preceding shortest pattern
  62. zero or more times. For string `"aaabbb"`:
  63. a*b* # match
  64. a*b*c* # match
  65. .*bbb.* # match
  66. aaa*bbb* # match
  67. --
  68. Zero-or-one::
  69. +
  70. --
  71. The question mark `"?"` makes the preceding shortest pattern optional. It
  72. matches zero or one times. For string `"aaabbb"`:
  73. aaa?bbb? # match
  74. aaaa?bbbb? # match
  75. .....?.? # match
  76. aa?bb? # no match
  77. --
  78. Min-to-max::
  79. +
  80. --
  81. Curly brackets `"{}"` can be used to specify a minimum and (optionally)
  82. a maximum number of times the preceding shortest pattern can repeat. The
  83. allowed forms are:
  84. {5} # repeat exactly 5 times
  85. {2,5} # repeat at least twice and at most 5 times
  86. {2,} # repeat at least twice
  87. For string `"aaabbb"`:
  88. a{3}b{3} # match
  89. a{2,4}b{2,4} # match
  90. a{2,}b{2,} # match
  91. .{3}.{3} # match
  92. a{4}b{4} # no match
  93. a{4,6}b{4,6} # no match
  94. a{4,}b{4,} # no match
  95. --
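The repetition operators (`+`, `*`, `?` and `{}`) can be checked quickly outside the cluster. The sketch below replays a few of the examples above with Python's `re.fullmatch`. This assumes Python's `re` module is an acceptable stand-in for these particular operators; it is not the Lucene engine.

```python
import re

text = "aaabbb"

# (pattern, expected result) pairs taken from the examples above.
cases = [
    ("a+b+", True),
    ("a*b*c*", True),
    ("aaaa?bbbb?", True),
    ("aa?bb?", False),
    ("a{2,4}b{2,4}", True),
    ("a{4,}b{4,}", False),
]

# Evaluate every pattern against the whole string.
actual = {pattern: re.fullmatch(pattern, text) is not None for pattern, _ in cases}

for pattern, expected in cases:
    print(f"{pattern:15} expected={expected} actual={actual[pattern]}")
```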
  96. Grouping::
  97. +
  98. --
  99. Parentheses `"()"` can be used to form sub-patterns. The quantity operators
  100. listed above operate on the shortest previous pattern, which can be a group.
  101. For string `"ababab"`:
  102. (ab)+ # match
  103. ab(ab)+ # match
  104. (..)+ # match
  105. (...)+ # match
  106. (ab)* # match
  107. abab(ab)? # match
  108. ab(ab)? # no match
  109. (ab){3} # match
  110. (ab){1,2} # no match
  111. --
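To make the "shortest previous pattern" rule concrete, the sketch below compares a quantifier applied to a single character with the same quantifier applied to a group. Python's `re` module is used as a stand-in for illustration only.

```python
import re

# Without a group, "+" repeats only the single preceding character "b".
ungrouped = re.compile(r"ab+")
# With a group, "+" repeats the whole sub-pattern "ab".
grouped = re.compile(r"(ab)+")

for text in ("abbb", "ababab"):
    print(
        text,
        ungrouped.fullmatch(text) is not None,
        grouped.fullmatch(text) is not None,
    )
```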
  112. Alternation::
  113. +
  114. --
  115. The pipe symbol `"|"` acts as an OR operator. The match will succeed if
  116. the pattern on either the left-hand side OR the right-hand side matches.
  117. The alternation applies to the _longest pattern_, not the shortest.
  118. For string `"aabb"`:
  119. aabb|bbaa # match
  120. aacc|bb # no match
  121. aa(cc|bb) # match
  122. a+|b+ # no match
  123. a+b+|b+a+ # match
  124. a+(b|c)+ # match
  125. --
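The effect of alternation applying to the longest pattern can be sketched as follows, again with Python's `re.fullmatch` as an assumed stand-in for whole-string matching. `whole` is a hypothetical helper defined only for this example.

```python
import re

text = "aabb"


def whole(pattern: str) -> bool:
    """Return True if `pattern` matches all of `text`."""
    return re.fullmatch(pattern, text) is not None


# "|" splits the entire pattern: either "aacc" or "bb" must match the whole string.
split_whole_pattern = whole(r"aacc|bb")
# Parentheses limit the alternatives to the part after "aa".
split_inside_group = whole(r"aa(cc|bb)")
# Either only a's or only b's: neither covers "aabb".
only_one_letter = whole(r"a+|b+")

print(split_whole_pattern, split_inside_group, only_one_letter)
```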
  126. Character classes::
  127. +
  128. --
  129. Ranges of potential characters may be represented as character classes
  130. by enclosing them in square brackets `"[]"`. A leading `^`
  131. negates the character class. The allowed forms are:
  132. [abc] # 'a' or 'b' or 'c'
  133. [a-c] # 'a' or 'b' or 'c'
  134. [-abc] # '-' or 'a' or 'b' or 'c'
  135. [abc\-] # '-' or 'a' or 'b' or 'c'
  136. [^abc] # any character except 'a' or 'b' or 'c'
  137. [^a-c] # any character except 'a' or 'b' or 'c'
  138. [^-abc] # any character except '-' or 'a' or 'b' or 'c'
  139. [^abc\-] # any character except '-' or 'a' or 'b' or 'c'
  140. Note that the dash `"-"` indicates a range of characters, unless it is
  141. the first character or if it is escaped with a backslash.
  142. For string `"abcd"`:
  143. ab[cd]+ # match
  144. [a-d]+ # match
  145. [^a-d]+ # no match
  146. --
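The handling of the dash inside a character class can be illustrated with single-character tests. This is a sketch using Python's `re` module, which happens to treat these forms the same way. `one_char` is a hypothetical helper.

```python
import re


def one_char(char_class: str, ch: str) -> bool:
    """Return True if the character class matches exactly the character `ch`."""
    return re.fullmatch(char_class, ch) is not None


leading_dash = one_char(r"[-abc]", "-")   # dash first: literal dash
escaped_dash = one_char(r"[abc\-]", "-")  # dash escaped: literal dash
range_dash = one_char(r"[a-c]", "-")      # dash between characters: range only
negated = one_char(r"[^a-c]", "d")        # leading ^ negates the class

print(leading_dash, escaped_dash, range_dash, negated)
```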
  147. ===== Optional operators
  148. These operators are available by default as the `flags` parameter defaults to `ALL`.
  149. Different flag combinations (concatenated with `"|"`) can be used to enable/disable
  150. specific operators:
  151. {
  152.   "regexp": {
  153.     "username": {
  154.       "value": "john~athon<1-5>",
  155.       "flags": "COMPLEMENT|INTERVAL"
  156.     }
  157.   }
  158. }
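A client might assemble the same request body programmatically. The sketch below builds it as a Python dictionary and joins the flag names with a pipe; the field name `username` and the pattern are taken from the example above, and how the body is sent to the cluster is left out.

```python
import json

# Flags to enable; the names come from the operator descriptions in this section.
flags = ["COMPLEMENT", "INTERVAL"]

query = {
    "regexp": {
        "username": {
            "value": "john~athon<1-5>",
            "flags": "|".join(flags),
        }
    }
}

# Serialize to the JSON request body.
body = json.dumps(query, indent=2)
print(body)
```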
  159. Complement::
  160. +
  161. --
  162. The complement is probably the most useful option. The shortest pattern that
  163. follows a tilde `"~"` is negated. For the string `"abcdef"`:
  164. ab~df # match
  165. ab~cf # no match
  166. a~(cd)f # match
  167. a~(bc)f # no match
  168. Enabled with the `COMPLEMENT` or `ALL` flags.
  169. --
  170. Interval::
  171. +
  172. --
  173. The interval option enables the use of numeric ranges, enclosed by angle
  174. brackets `"<>"`. For string `"foo80"`:
  175. foo<1-100> # match
  176. foo<01-100> # match
  177. foo<001-100> # no match
  178. Enabled with the `INTERVAL` or `ALL` flags.
  179. --
  180. Intersection::
  181. +
  182. --
  183. The ampersand `"&"` joins two patterns in a way that both of them have to
  184. match. For string `"aaabbb"`:
  185. aaa.+&.+bbb # match
  186. aaa&bbb # no match
  187. Needing this feature usually means that the regular expression should be
  188. rewritten.
  189. Enabled with the `INTERSECTION` or `ALL` flags.
  190. --
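Python's `re` module has no intersection operator, but the behaviour described above can be emulated by requiring two whole-string matches. This is an illustrative sketch only; `matches_both` is a hypothetical helper.

```python
import re


def matches_both(left: str, right: str, text: str) -> bool:
    """Emulate `left&right`: both patterns must match the whole string."""
    return (
        re.fullmatch(left, text) is not None
        and re.fullmatch(right, text) is not None
    )


print(matches_both("aaa.+", ".+bbb", "aaabbb"))
print(matches_both("aaa", "bbb", "aaabbb"))
```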
  191. Any string::
  192. +
  193. --
  194. The at sign `"@"` matches any string in its entirety. This could be combined
  195. with the intersection and complement above to express ``everything except''.
  196. For instance:
  197. @&~(foo.+) # anything except string beginning with "foo"
  198. Enabled with the `ANYSTRING` or `ALL` flags.
  199. --
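The "everything except" idiom can be approximated outside Lucene by negating a whole-string match. The sketch below assumes that the complement accepts every string the negated pattern does not match in its entirety. It uses Python's `re` module as a stand-in, and `anything_except_foo_prefix` is a hypothetical helper.

```python
import re


def anything_except_foo_prefix(text: str) -> bool:
    """Emulate `@&~(foo.+)`: accept any string that `foo.+` does not fully match."""
    # re.DOTALL lets "." match newlines too, closer to "any character".
    return re.fullmatch(r"foo.+", text, flags=re.DOTALL) is None


for candidate in ("foobar", "barfoo", "foo", ""):
    print(repr(candidate), anything_except_foo_prefix(candidate))
```

Under this reading, the bare string `foo` is still accepted, because `foo.+` requires at least one character after `foo`.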