Regular Expressions (REs) are commonly used to determine whether a character string of interest is matched somewhere in a set of character strings. They allow complex matching criteria to be specified and are thus much more flexible than plain search strings.

&xwp; combines several standards for RE syntax. The level of functionality is most closely modelled on extended regular expressions (EREs) as supported by the UNIX egrep and awk programs. Some enhancements introduced by POSIX, those typically found in UNIX ex and vi, and most of those found in GNU software are supported too.

The following rules apply (see the BNF definition for a formal specification):

Basic characters

Character sets

Character sets are specified in square brackets and define a set of acceptable characters, or possibly a set of unacceptable characters.

  [abc]   matches any one character which can be a, b, or c
  [^abc]  matches any one character which is neither a, b, nor c
  [a-z]   is the set of characters in the range 'a' to 'z'
  [a-j-t] is the set of characters in the range 'a' to 't'
To match a -, ^, or ] character within a character set, escape it with a backslash. For example, [\])}] matches ], or ), or }. You can also specify a complete POSIX character class by enclosing its name in an additional pair of brackets with colons. In that context,
  [:alnum:]   means all alphanumeric characters
              (with ASCII, same as A-Za-z0-9)
  [:lower:]   means all lowercase characters
  ... etc.
For example, specifying [[:alnum:] \t] would allow all alphanumeric characters plus space plus tab.
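
The character set rules above can be tried out with a short sketch. It uses Python's re module purely as a stand-in, which is an assumption: Python is not the engine described here, and it does not understand POSIX class names such as [:alnum:], so the ASCII ranges are spelled out instead. The helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"[abc]", "b"))    # True: b is in the set
print(matches_whole(r"[^abc]", "b"))   # False: b is excluded
print(matches_whole(r"[a-z]", "q"))    # True: q lies in the range a to z
print(matches_whole(r"[\])}]", "]"))   # True: the escaped ] is a set member

# Python's re has no [:alnum:]; with ASCII this is the same as A-Za-z0-9.
ALNUM_SPACE_TAB = r"[A-Za-z0-9 \t]"
print(matches_whole(ALNUM_SPACE_TAB + "+", "abc 123\tx"))  # True
```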

Any character

The dot (.) character matches any single character, without caring what it actually is. For example, t.e matches the, but does not match the whole of tree (it matches only the tre at its start).
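
A minimal sketch of the dot, again using Python's re module as a stand-in (an assumption, not the engine described here). It shows the difference between matching a whole string and searching within one. Note that Python's dot does not match a newline unless re.DOTALL is given.

```python
import re

# Whole-string matching: the dot stands for exactly one character.
whole_the = re.fullmatch(r"t.e", "the") is not None
whole_tree = re.fullmatch(r"t.e", "tree") is not None
print(whole_the)    # True
print(whole_tree)   # False: tree has four characters

# Searching finds the three-character piece at the start of tree.
found = re.search(r"t.e", "tree").group()
print(found)        # tre
```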

Word constituent characters

GNU EREs define a "word constituent character" as any alphanumeric character or the underscore _ character. Shorthands exist for this set to save some typing.

\w is a shorthand for any word constituent character and is equivalent to [[:alnum:]_] or [A-Za-z0-9_]. \W is a shorthand for any non-word constituent character and is equivalent to [^[:alnum:]_] or [^A-Za-z0-9_].

These shorthands are invalid within square bracket character sets.
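
The equivalence between \w and the explicit character set can be checked with a small sketch. Python's re module is assumed here as a stand-in; its \w is Unicode-aware by default, so the re.ASCII flag is used to get the A-Za-z0-9_ meaning given above. The sample string is made up for illustration.

```python
import re

sample = "foo_bar-42 baz"

# Runs of word constituent characters, using the shorthand.
words = re.findall(r"\w+", sample, re.ASCII)
print(words)        # ['foo_bar', '42', 'baz']

# The same runs, using the explicit character set.
explicit = re.findall(r"[A-Za-z0-9_]+", sample)
print(explicit)     # ['foo_bar', '42', 'baz']

# Non-word constituent characters.
separators = re.findall(r"\W", sample, re.ASCII)
print(separators)   # ['-', ' ']
```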

Not a character

A Microsoft extension to EREs allows you to say "any character except this character" by preceding the character with a tilde ~. This is simply shorter to type than the equivalent square bracket character set (~a is shorter than [^a]). For example, t~he matches tie, but not the.
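
Python's re module has no tilde operator, so the sketch below (an illustration under that assumption, not the engine described here) uses the equivalent square bracket form: t~he corresponds to t[^h]e.

```python
import re

# Square bracket equivalent of the tilde form t~he.
NOT_H = r"t[^h]e"

tie_matches = re.fullmatch(NOT_H, "tie") is not None
the_matches = re.fullmatch(NOT_H, "the") is not None
print(tie_matches)   # True: i is not h
print(the_matches)   # False: h is the excluded character
```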

Anchors

Anchors specify the positions at which an expression may match:

  ^       matches if we are at the start of the string
  $       matches if we are at the end of the string
  \`      GNU alternative way of writing ^
  \'      GNU alternative way of writing $
  \<      matches if we are at the start of a word (or the whole string)
  \>      matches if we are at the end of a word (or the whole string)
  \B      matches if we are within a word
  \y      matches if we are at the start or end of a word (or whole string)
Here "word" is according to the GNU definition as given above.

For example, ^xyz matches xyz only if it is at the start of the string. \<fred matches fred and freddy, but not alfred.
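
The anchor examples can be approximated in Python's re module, which is assumed here only as a stand-in. Python does not support \< and \>, so the sketch builds approximations from \b plus a lookahead or lookbehind; the names START_OF_WORD and END_OF_WORD are hypothetical.

```python
import re

# Approximations of \< and \> (Python's re lacks both).
START_OF_WORD = r"\b(?=\w)"
END_OF_WORD = r"\b(?<=\w)"

print(bool(re.search(r"^xyz", "xyzzy")))   # True: xyz is at the start
print(bool(re.search(r"^xyz", "wxyz")))    # False: xyz is not at the start

fred_at_word_start = re.compile(START_OF_WORD + "fred")
results = {s: bool(fred_at_word_start.search(s))
           for s in ["fred", "freddy", "alfred"]}
print(results)   # {'fred': True, 'freddy': True, 'alfred': False}

# End-of-word anchor: fred must be followed by a word boundary.
fred_at_word_end = re.compile("fred" + END_OF_WORD)
print(bool(fred_at_word_end.search("alfred")))   # True
print(bool(fred_at_word_end.search("freddy")))   # False
```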

Repetitions

By using ?, +, *, {M}, {M,}, or {M,N} you can search for repetitions:

  ab?c      matches a, then zero or one occurrence of b, then c
  ab+c      matches a, then one or more occurrences of b, then c
  ab*c      matches a, then zero or more occurrences of b, then c
  ab{M}c    matches a, then exactly M occurrences of b, then c
  ab{M,}c   matches a, then M or more occurrences of b, then c
  ab{M,N}c  matches a, then between M and N occurrences of b, then c

In the above, M and N are numbers given in decimal, where if both are given, M must be <= N.

For example, [A-Za-z_][A-Za-z0-9_]* matches any legal C or C++ identifier. \w{10,} matches at least 10 consecutive word constituent characters. \<[0-9]{5}\> matches a 5-digit number.
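
The repetition operators behave the same way in Python's re module, which is assumed here as a stand-in for the engine described. The helper name whole is hypothetical, and \b replaces \< and \> because Python does not support those anchors.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"ab?c", "ac"), whole(r"ab?c", "abbc"))          # True False
print(whole(r"ab+c", "ac"), whole(r"ab+c", "abbc"))          # False True
print(whole(r"ab*c", "ac"), whole(r"ab*c", "abbbc"))         # True True
print(whole(r"ab{2}c", "abbc"), whole(r"ab{2}c", "abc"))     # True False
print(whole(r"ab{2,}c", "abbbbc"), whole(r"ab{2,}c", "abc")) # True False
print(whole(r"ab{1,2}c", "abc"), whole(r"ab{1,2}c", "abbbc"))  # True False

# C or C++ identifier.
IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*"
print(whole(IDENTIFIER, "_tmp1"), whole(IDENTIFIER, "9lives"))  # True False

# A 5-digit number standing on its own as a word.
five_digits = re.compile(r"\b[0-9]{5}\b")
print(five_digits.search("zip 90210 ok").group())   # 90210
print(five_digits.search("1234567"))                # None
```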

Alternation

You can search for one thing or another using the | symbol. For example, fred|bill matches fred or bill. fred|bill|rob matches any one of these 3 names.
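
A short sketch of alternation, with Python's re module assumed as a stand-in. The sample sentence is made up; note that rob is also found inside robert because nothing anchors it to a word end.

```python
import re

names = re.compile(r"fred|bill|rob")

found = names.findall("bill met fred and robert")
print(found)   # ['bill', 'fred', 'rob']

# Whole-string check: only the three names themselves are accepted.
print(names.fullmatch("rob") is not None)      # True
print(names.fullmatch("robert") is not None)   # False
```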

Nested EREs

Round brackets may be used to group EREs into sub-EREs, so that operators such as the repetition or alternation operators may be applied to them. For example, (frog|toad)+ matches one or more occurrences of frog or toad, such as frogtoadfrog.

Nesting may be performed several levels deep.
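
Grouping and multi-level nesting can be sketched as follows, with Python's re module assumed as a stand-in. The second pattern, ((ab|cd)+e)+, is a made-up example of two nesting levels and does not come from the text above.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Repetition applied to a grouped alternation.
print(whole(r"(frog|toad)+", "frogtoadfrog"))   # True
print(whole(r"(frog|toad)+", ""))               # False: at least one is needed
print(whole(r"(frog|toad)+", "frogs"))          # False: trailing s

# Two levels of nesting: one or more of (one or more of ab or cd, then e).
NESTED = r"((ab|cd)+e)+"
print(whole(NESTED, "abcdeabe"))   # True: abcd + e, then ab + e
print(whole(NESTED, "abce"))       # False: c alone is neither ab nor cd
```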

Backreferences

Every time a nested ERE is matched, what it matched is recorded. You can backreference it later in the main ERE (which means that whatever was matched before must be matched again) by writing a backslash followed by a single decimal digit. Up to 9 nested EREs may thus be backreferenced.

For example, \2 says the second nested ERE must be matched again. As a result, (a|b)(c|d)\2\1 will match acca, or adda, or bccb, or bddb.

Note that the above example is not a shorthand for writing (a|b)(c|d)(c|d)(a|b). Backreferences do not refer to the ERE, but to what was matched by it.

Backreferences can be done within nested EREs to backreference nested EREs. For example, frog|((a|bc)d\1) matches frog, or ada, or bcdbc, but not adbc.
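
The backreference behaviour can be checked by brute force. The sketch below assumes Python's re module as a stand-in and tries every four-letter string over a, b, c, d. For the nested case, Python numbers groups by their opening bracket, so the inner (a|bc) is group 2 there and the pattern is written with \2; whether the engine described here numbers groups differently is not something this sketch can confirm.

```python
import re
from itertools import product

backref = re.compile(r"(a|b)(c|d)\2\1")
repeated = re.compile(r"(a|b)(c|d)(c|d)(a|b)")

candidates = ["".join(chars) for chars in product("abcd", repeat=4)]

# The backreferences must repeat what was actually matched.
matched = [s for s in candidates if backref.fullmatch(s)]
print(matched)   # ['acca', 'adda', 'bccb', 'bddb']

# Repeating the sub-EREs instead accepts every combination.
loose = [s for s in candidates if repeated.fullmatch(s)]
print(len(loose))   # 16

# Nested case, using Python's group numbering (inner group is \2).
nested = re.compile(r"frog|((a|bc)d\2)")
nested_results = {s: nested.fullmatch(s) is not None
                  for s in ["frog", "ada", "bcdbc", "adbc"]}
print(nested_results)
# {'frog': True, 'ada': True, 'bcdbc': True, 'adbc': False}
```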