&xwp; combines several standards in its regular expression (RE) syntax.
The level of functionality is most closely modelled on extended
regular expressions (EREs) as supported by the UNIX egrep and awk
programs.
Some enhancements introduced by POSIX, those typically found in UNIX
ex and vi, and most of those found in GNU software are supported too.
The following rules apply (see the BNF definition for a formal description):
Basic characters
abc matches a, then b, then c.
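The following characters have a special meaning and must be preceded
by a backslash to be matched literally:
~ . ^ $ ( ) { } [ ] ? + * |
For example, 1\*2 matches 1, then *, then 2.
c:\\config\.sys matches c:\config.sys (the backslash itself is escaped
in the same way).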
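The two escaping examples can be checked with a short sketch. It assumes Python's re module as a stand-in: that is not the engine described here, but it treats a backslash before *, \ and . in the same way.

```python
import re

# Assumption: Python's re stands in for the ERE engine described in this text.
star = re.fullmatch(r"1\*2", "1*2")                        # escaped * is a literal asterisk
path = re.fullmatch(r"c:\\config\.sys", "c:\\config.sys")  # \\ is one backslash, \. a full stop
unescaped = re.fullmatch(r"1*2", "1*2")                    # bare * means "zero or more 1s"

print(bool(star), bool(path), bool(unescaped))  # prints: True True False
```

Without the backslash, the * is read as a repetition operator, which is why the third pattern fails on the same text.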
The following escape sequences denote control characters:
\n     newline           ASCII 10
\t     tab               ASCII 9
\r     carriage return   ASCII 13
\b     backspace         ASCII 8
\f     formfeed          ASCII 12
\e     escape            ASCII 27
\x07   bel               ASCII 7
C and C++ programmers will find this notation familiar. Strictly
speaking, according to POSIX, if \x2e were put in an ERE it should mean
"any character" (ASCII character 0x2e is the dot, which means any
character, see later), but this implementation treats it as
"match a full stop".
Some other backslash sequences have a special meaning of their own.
For example, \< means 'start of word' (see later).
The character after the backslash is generally chosen to be something
that you would not need to escape, so this syntax should not cause any
problems.
Character sets are specified in square brackets and define a set of acceptable characters, or possibly a set of unacceptable characters.
[abc]     matches any one character which can be a, b, or c
[^abc]    matches any one character which is neither a, b, nor c
[a-z]     is the set of characters in the range 'a' to 'z'
[a-j-t]   is the set of characters in the range 'a' to 't'
To match a -, ^, or ] character in a bracket element, escape it with
a backslash. For example, [\])}] matches ], or ), or }.
You can also specify a complete POSIX character class in an additional
pair of brackets with colons. In that context,
[:alnum:]   means all alphanumeric characters (with ASCII, same as A-Za-z0-9)
[:lower:]   means all lowercase characters
... etc.
For example, specifying [[:alnum:] \t] would allow all alphanumeric
characters plus space plus tab.
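A minimal sketch of the character set rules follows, again assuming Python's re module purely for illustration. Python has no [:alnum:]-style POSIX classes, so the last pattern spells the class out as A-Za-z0-9. The helper name matches_one is hypothetical.

```python
import re

# Assumption: Python's re is a stand-in engine; it lacks POSIX classes such as
# [:alnum:], so the equivalent ASCII ranges are written out by hand.
def matches_one(pattern: str, ch: str) -> bool:
    """Return True if the single character ch is accepted by the set."""
    return re.fullmatch(pattern, ch) is not None

in_set = [matches_one(r"[abc]", c) for c in "abcd"]       # d is outside the set
negated = [matches_one(r"[^abc]", c) for c in "ad"]       # only d is accepted
closers = [matches_one(r"[\])}]", c) for c in "])}("]     # escaped ] inside the set
alnum_ws = [matches_one(r"[A-Za-z0-9 \t]", c) for c in "a7 \t-"]

print(in_set, negated, closers, alnum_ws)
```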
Any character
The dot (.) character matches any single character, without
caring what it actually is. For example, t.e matches the, but not tree.
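The dot example holds when the whole text must match. The sketch below, which assumes Python's re module as a stand-in engine, also shows that a substring search behaves differently.

```python
import re

# Assumption: Python's re.fullmatch models "the whole text must match".
whole_the = re.fullmatch(r"t.e", "the")     # h fills the "any character" slot
whole_tree = re.fullmatch(r"t.e", "tree")   # four characters cannot fit t.e
inside_tree = re.search(r"t.e", "tree")     # a substring search still finds "tre"

print(bool(whole_the), bool(whole_tree), inside_tree.group(0))  # prints: True False tre
```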
Word constituent characters
GNU EREs define the term "word constituent character" to
include any alphanumeric character or the underscore (_) character.
To save some typing, \w is a shorthand for any word constituent
character and is equivalent to [[:alnum:]_] or [A-Za-z0-9_].
\W is a shorthand for any non-word constituent character and is
equivalent to [^[:alnum:]_] or [^A-Za-z0-9_].
These shorthands are invalid within square bracket character sets.
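As an illustration, the sketch below splits a line of C-like text into word and non-word runs. It assumes Python's re module with the re.ASCII flag, which restricts Python's \w to [A-Za-z0-9_] so that it agrees with the definition given here. The sample text is made up.

```python
import re

# Assumption: re.ASCII makes Python's \w equal to [A-Za-z0-9_]; without the
# flag Python would also accept non-ASCII letters and digits.
text = "max_len = buf[10];"
words = re.findall(r"\w+", text, flags=re.ASCII)
gaps = re.findall(r"\W+", text, flags=re.ASCII)
spelled_out = re.findall(r"[A-Za-z0-9_]+", text)  # the long-hand equivalent of \w+

print(words)  # prints: ['max_len', 'buf', '10']
print(gaps)   # prints: [' = ', '[', '];']
```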
Not a character
A Microsoft extension to EREs allows you to say
"any character except this character" by preceding it with a tilde (~).
This is simply shorter to type than the equivalent
square bracket character set (~a is shorter to type than [^a]).
For example, t~he matches tie, but not the.
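Engines without the tilde extension can express the same thing with a negated set. The sketch below assumes Python's re module, which has no ~ operator, and uses a hypothetical helper, expand_tilde, that rewrites ~x as [^x]. It is deliberately naive: it only handles a tilde followed by one word character and does not recognise an escaped tilde.

```python
import re

def expand_tilde(pattern: str) -> str:
    """Hypothetical helper: rewrite ~x as [^x] for a single word character x."""
    return re.sub(r"~(\w)", r"[^\1]", pattern)

# Assumption: Python's re is a stand-in engine for the rewritten pattern.
rewritten = expand_tilde("t~he")
matches_tie = re.fullmatch(rewritten, "tie") is not None
matches_the = re.fullmatch(rewritten, "the") is not None

print(rewritten, matches_tie, matches_the)  # prints: t[^h]e True False
```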
Anchors
Anchors specify conditions where an expression may occur:
^    matches if we are at the start of the string
$    matches if we are at the end of the string
\`   GNU alternative way of writing ^
\'   GNU alternative way of writing $
\<   matches if we are at the start of a word (or the whole string)
\>   matches if we are at the end of a word (or the whole string)
\B   matches if we are within a word
\y   matches if we are at the start or end of a word (or whole string)
Here "word" is according to the GNU definition as given above.
For example, ^xyz matches xyz only if it is at the start of the
string. \<fred matches fred and freddy, but not alfred.
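The word anchors can be approximated in engines that lack them. The sketch below assumes Python's re module, which has no \< or \>. Its \b behaves like \y, so a lookahead or lookbehind for a word character is added to tell start from end. The names START_OF_WORD, END_OF_WORD and found are hypothetical.

```python
import re

# Assumption: \b in Python matches at either edge of a word (like \y here);
# the lookaround narrows it to the start or the end of a word.
START_OF_WORD = r"\b(?=\w)"    # approximates \<
END_OF_WORD = r"\b(?<=\w)"     # approximates \>

def found(pattern: str, text: str) -> bool:
    """Return True if pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None

samples = ("fred", "freddy", "alfred")
start_hits = [found(START_OF_WORD + "fred", s) for s in samples]
end_hits = [found("fred" + END_OF_WORD, s) for s in samples]
anchored = [found(r"^xyz", s) for s in ("xyzzy", "wxyz")]

print(start_hits, end_hits, anchored)
```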
Repetitions
By using ?, +, *, {M}, {M,}, or {M,N} you can search for repetitions:
ab?c       matches a, then zero or one occurrence of b, then c
ab+c       matches a, then one or more occurrences of b, then c
ab*c       matches a, then zero or more occurrences of b, then c
ab{M}c     matches a, then M occurrences of b, then c
ab{M,}c    matches a, then M or more occurrences of b, then c
ab{M,N}c   matches a, then between M and N occurrences of b, then c
In the above, M and N are numbers given in decimal; if both are given, M must be <= N.
For example, [A-Za-z_][A-Za-z0-9_]* matches any legal C or C++
identifier. \w{10,} matches at least 10 word constituent characters.
\<[0-9]{5}\> matches a 5 digit number.
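The repetition operators can be exercised as follows. This is a sketch assuming Python's re module as a stand-in engine; because Python lacks \< and \>, the five digit example uses \b on both sides instead. The helper name whole and the sample strings are made up.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

optional = [whole(r"ab?c", s) for s in ("ac", "abc", "abbc")]
one_plus = [whole(r"ab+c", s) for s in ("ac", "abc", "abbc")]
bounded = [whole(r"ab{2,3}c", s) for s in ("abc", "abbc", "abbbc", "abbbbc")]
identifiers = [whole(r"[A-Za-z_][A-Za-z0-9_]*", s) for s in ("_tmp1", "x", "9lives")]

# Assumption: \b replaces \< and \>, which Python does not support.
five_digits = re.findall(r"\b[0-9]{5}\b", "zip 90210, id 1234567")

print(optional, one_plus, bounded, identifiers, five_digits)
```

The seven digit run is rejected because no word boundary falls after its fifth digit.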
Alternation
You can search for one thing or another using the | symbol.
For example, fred|bill matches fred or bill.
fred|bill|rob matches any one of these 3 names.
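A brief sketch of alternation, assuming Python's re module as a stand-in engine and a made-up sample sentence. It also shows that | separates whole alternatives: fred|bill does not mean fre, then d or b, then ill.

```python
import re

# Assumption: Python's re stands in for the ERE engine described in this text.
names = re.findall(r"fred|bill|rob", "bill rang fred, then rob rang wilma")
whole_alternatives = re.fullmatch(r"fred|bill", "frebill")  # not fre(d|b)ill

print(names)                     # prints: ['bill', 'fred', 'rob']
print(whole_alternatives)        # prints: None
```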
Nested EREs
Round brackets may be used to group EREs into sub-EREs, so that
operators such as the repetition or alternation operators may be
applied to them.
For example, (frog|toad)+ matches one or more occurrences of
frog or toad.
Nesting may be performed several levels deep.
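The effect of grouping is easiest to see by comparing a pattern with and without its brackets. The sketch assumes Python's re module as a stand-in engine; the two-level pattern ((ab)+c)+ is an invented example of deeper nesting.

```python
import re

# Assumption: Python's re stands in for the ERE engine described in this text.
grouped = re.fullmatch(r"(frog|toad)+", "frogtoadfrog")   # + repeats the whole group
ungrouped = re.fullmatch(r"frog|toad+", "frogtoadfrog")   # + applies only to the final d
nested = re.fullmatch(r"((ab)+c)+", "ababcabc")           # groups inside groups

print(bool(grouped), bool(ungrouped), bool(nested))  # prints: True False True
```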
Backreferences
Every time a nested ERE is matched, what it matched is recorded. You can make a backreference later in the main ERE (which means that whatever was matched before must be matched again) by writing a backslash followed by a single decimal digit. Up to 9 nested EREs may thus be backreferenced.
For example, \2 says the second nested ERE must
be matched again. As a result, (a|b)(c|d)\2\1 will match acca,
or adda, or bccb, or bddb.
Note that the above example is not a shorthand for
writing (a|b)(c|d)(c|d)(a|b). Backreferences do
not refer to the ERE, but to what was matched by it.
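The difference between a backreference and a repeated sub-ERE can be checked by brute force. The sketch below assumes Python's re module as a stand-in engine and tries every four letter string over a, b, c and d against both patterns.

```python
import re
from itertools import product

# Assumption: Python's re stands in for the ERE engine described in this text.
candidates = ["".join(p) for p in product("abcd", repeat=4)]

# \2 and \1 must repeat the text the groups actually matched.
with_backrefs = [s for s in candidates if re.fullmatch(r"(a|b)(c|d)\2\1", s)]

# Repeating the sub-EREs lets each position choose independently.
spelled_out = [s for s in candidates if re.fullmatch(r"(a|b)(c|d)(c|d)(a|b)", s)]

print(with_backrefs)     # prints: ['acca', 'adda', 'bccb', 'bddb']
print(len(spelled_out))  # prints: 16
```

A string such as acdb is accepted by the spelled-out pattern but not by the one with backreferences.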
Backreferences can be used within nested EREs to backreference
nested EREs. For example, frog|((a|bc)d\1) matches
frog, or ada, or bcdbc, but not adbc.