Metacharacters
Some characters have a special meaning to the searcher. These
characters are called metacharacters. Although they
may seem confusing at first, they add a great deal of
flexibility and convenience to the searcher.
The period (.) is a commonly used
metacharacter. It matches exactly one character, regardless of
what the character is. For example, the regular expression:
2,.-Dimethylbutane
will match "2,2-Dimethylbutane" and
"2,3-Dimethylbutane". Note that the period matches exactly
one character; it will not match a string of characters,
nor will it match the null string. Thus,
"2,200-Dimethylbutane" and
"2,-Dimethylbutane" will not be matched by
the above regular expression.
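The behavior of the period can be checked with a short script. This is only a sketch in Python, whose re module is assumed here to treat the period the same way the searcher does. The names pattern, candidates and matched are invented for this illustration, and fullmatch is used so that the whole string has to fit the expression.

```python
import re

# Pattern from the text: the period stands for exactly one character.
pattern = re.compile(r"2,.-Dimethylbutane")

candidates = [
    "2,2-Dimethylbutane",
    "2,3-Dimethylbutane",
    "2,200-Dimethylbutane",  # three characters where one is expected
    "2,-Dimethylbutane",     # no character where one is expected
]

# fullmatch() succeeds only if the entire string fits the pattern.
matched = [s for s in candidates if pattern.fullmatch(s)]
print(matched)
```

Only the first two candidates should be printed.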
But what if you wanted to search for a string containing a
period? For example, suppose we wished to search for
references to pi. The following regular expression would not
work:
3.14 (THIS IS WRONG!)
This would indeed match "3.14", but it would also
match "3514", "3f14", or even
"3+14". In short, any string of the form
"3x14", where x is any character, would be matched
by the regular expression above.
To get around this, we introduce a second metacharacter,
the backslash (\). The backslash can
be used to indicate that the character immediately to its
right is to be taken literally. Thus, to search for the string
"3.14", we would use:
3\.14 (This will work.)
This is called "quoting". We would say that the
period in the regular expression above has been quoted. In
general, whenever the backslash is placed before a
metacharacter, the searcher treats the metacharacter literally
rather than invoking its special meaning.
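The difference between the quoted and unquoted period can be seen side by side. The following is a minimal sketch in Python, assuming its re module quotes metacharacters the same way; the variable names are hypothetical.

```python
import re

samples = ["3.14", "3514", "3f14", "3+14"]

# Unquoted: the period matches any single character.
unquoted = [s for s in samples if re.fullmatch(r"3.14", s)]

# Quoted: the backslash makes the period a literal period.
quoted = [s for s in samples if re.fullmatch(r"3\.14", s)]

print(unquoted)
print(quoted)
```

The unquoted expression accepts all four samples, while the quoted one accepts only "3.14".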
(Unfortunately, the backslash is used for other things
besides quoting metacharacters. Many "normal"
characters take on special meanings when preceded by a
backslash. The rule of thumb is, quoting a metacharacter turns
it into a normal character, and quoting a normal character may
turn it into a metacharacter.)
Let's look at some more common metacharacters. We consider
first the question mark (?). The
question mark indicates that the character immediately
preceding it may appear either zero times or one time. Thus
m?ethane
would match either "ethane" or "methane".
Similarly,
comm?a
would match either "coma" or "comma".
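As a quick check of the question mark, here is a small Python sketch. It assumes that Python's re module handles "?" as described above; the test strings "mmethane" and "commma" are additions made for this illustration.

```python
import re

# "?" makes the character just before it optional (zero or one time).
methane = re.compile(r"m?ethane")
comma = re.compile(r"comm?a")

methane_hits = [s for s in ["ethane", "methane", "mmethane"]
                if methane.fullmatch(s)]
comma_hits = [s for s in ["coma", "comma", "commma"]
              if comma.fullmatch(s)]

print(methane_hits)
print(comma_hits)
```

The strings with a doubled optional letter are rejected, since the question mark allows at most one occurrence.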
Another metacharacter is the star (*).
This indicates that the character immediately to its left may
be repeated any number of times, including zero. Thus
ab*c
would match "ac", "abc", "abbc",
"abbbc", "abbbbbbbbc", and any string that
starts with an "a", is followed by a sequence of
"b"'s, and ends with a "c".
The plus (+) metacharacter
indicates that the character immediately preceding it may be
repeated one or more times. It is just like the star
metacharacter, except it doesn't match the null string. Thus
ab+c
would not match "ac", but it would
match "abc", "abbc", "abbbc",
"abbbbbbbbc" and so on.
Metacharacters may be combined. A common combination
includes the period and star metacharacters, with the star
immediately following the period. This is used to match an
arbitrary string of any length, including the null string. For
example:
cyclo.*ane
would match "cyclodecane", "cyclohexane"
and even "cyclones drive me insane." Any string that
starts with "cyclo", is followed by an arbitrary
string, and ends with "ane" will be matched. Note
that the null string will be matched by the period-star pair;
thus, "cycloane" would be matched by the above
expression.
If you wanted to search for articles on cyclodecane and
cyclohexane, but didn't want to match articles about how
cyclones drive one insane, you could string together three
periods, as follows:
cyclo...ane
This would match "cyclodecane" and "cyclohexane",
but would not match "cyclones drive me insane." Only
strings eleven characters long which start with "cyclo"
and end with "ane" will be matched. (Note that
"cyclopentane" would not be matched, however, since
cyclopentane has twelve characters, not eleven.)
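The star, the plus, and the two "cyclo" expressions can be compared with a short script. This is a sketch in Python under the assumption that its re module behaves like the searcher for these metacharacters. The helper hits and the test lists are hypothetical, and search is used because it looks for the pattern anywhere in the text.

```python
import re

def hits(pattern, strings):
    """Return the strings in which the pattern is found somewhere."""
    return [s for s in strings if re.search(pattern, s)]

short = ["ac", "abc", "abbc", "abbbbbbbbc"]
star = hits(r"ab*c", short)  # zero or more "b"
plus = hits(r"ab+c", short)  # at least one "b"

texts = [
    "cyclodecane",
    "cyclohexane",
    "cyclopentane",
    "cycloane",
    "cyclones drive me insane.",
]
any_gap = hits(r"cyclo.*ane", texts)     # any string between the two parts
three_gap = hits(r"cyclo...ane", texts)  # exactly three characters between

print(star, plus)
print(any_gap)
print(three_gap)
```

The period-star version accepts all five texts, while the three-period version keeps only "cyclodecane" and "cyclohexane".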
Here are some more examples. These involve the backslash.
Note that the placement of the backslash is important.
a\.*z
- Matches any string starting with "a", followed
by a series of periods (including the "series"
of length zero), and terminated by "z". Thus,
"az", "a.z", "a..z",
"a...z" and so forth are all matched.
a.\*z
- (Note that the backslash and period are reversed in this
regular expression.)
Matches any string starting with an "a",
followed by one arbitrary character, and terminated with
"*z". Thus, "ag*z", "a5*z"
and "a@*z" are all matched. Only strings of
length four, where the first character is "a",
the third "*", and the fourth "z", are
matched.
a\++z
- Matches any string starting with "a", followed
by a series of plus signs, and terminated by
"z". There must be at least one plus sign
between the "a" and the "z". Thus,
"az" is not matched, but "a+z",
"a++z", "a+++z", etc. will be matched.
a\+\+z
- Matches only the string "a++z".
a+\+z
- Matches any string starting with a series of "a"'s,
followed by a single plus sign and ending with a
"z". There must be at least one "a" at
the start of the string. Thus "a+z", "aa+z",
"aaa+z" and so on will match, but "+z"
will not.
a.?e
- Matches "ace", "ale",
"axe" and any other three-character string
beginning with "a" and ending with
"e"; will also match "ae".
a\.?e
- Matches "ae" and "a.e". No other
string is matched.
a.\?e
- Matches any four-character string starting with
"a" and ending with "?e". Thus, "ad?e",
"a1?e" and "a%?e" will all be matched.
a\.\?e
- Matches only "a.?e" and nothing else.
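Several of the expressions in the list above can be tried out together. The sketch below uses Python and assumes its re module follows the same quoting rules; the dictionary cases and the strings in it are chosen only for illustration.

```python
import re

# Each pattern is paired with a few strings to test against it.
cases = {
    r"a\.*z": ["az", "a.z", "a...z", "abz"],
    r"a.\*z": ["ag*z", "a5*z", "az", "agz"],
    r"a\++z": ["az", "a+z", "a+++z"],
    r"a+\+z": ["a+z", "aaa+z", "+z"],
    r"a\.?e": ["ae", "a.e", "ace"],
}

# Keep only the strings that fit the whole pattern.
results = {
    pattern: [s for s in strings if re.fullmatch(pattern, s)]
    for pattern, strings in cases.items()
}

for pattern, accepted in results.items():
    print(pattern, accepted)
```

Moving the backslash by one position changes which character is taken literally, and so changes the set of accepted strings.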
Earlier it was mentioned that the backslash can turn ordinary
characters into metacharacters, as well as the other way
around. One such use of this is the digit
metacharacter, which is invoked by following a backslash with
a lower-case "d", like this: "\d".
The "d" must be lower case, for reasons
explained later. The digit metacharacter matches exactly one
digit; that is, exactly one occurrence of "0",
"1", "2", "3", "4",
"5", "6", "7", "8" or
"9". For example, the regular expression:
2,\d-Dimethylbutane
would match "2,2-Dimethylbutane",
"2,3-Dimethylbutane" and so forth. Similarly,
1\.\d\d\d\d\d
would match any six-digit floating-point number from 1.00000
to 1.99999 inclusive. We could combine the digit metacharacter
with other metacharacters; for instance,
a\d+z
matches any string starting with "a", followed by a
string of digits, followed by a "z". (Note that the
plus is used, and thus "az" is not matched.)
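The digit metacharacter can be exercised with a few lines of Python. This is a sketch, not a description of the searcher itself: in Python, "\d" also accepts digits from other scripts unless the re.ASCII flag is given, so the flag is used here to stay with "0" through "9". All variable names are hypothetical.

```python
import re

# re.ASCII restricts \d to the ten digits "0" through "9".
isomer = re.compile(r"2,\d-Dimethylbutane", re.ASCII)
decimal = re.compile(r"1\.\d\d\d\d\d", re.ASCII)
run = re.compile(r"a\d+z", re.ASCII)

isomer_ok = bool(isomer.fullmatch("2,3-Dimethylbutane"))
decimal_ok = bool(decimal.fullmatch("1.41421"))
decimal_short = bool(decimal.fullmatch("1.5"))  # too few digits

run_hits = [s for s in ["az", "a7z", "a2024z", "abz"] if run.fullmatch(s)]

print(isomer_ok, decimal_ok, decimal_short)
print(run_hits)
```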
The letter "d" in the string "\d"
must be lower-case. This is because there is another
metacharacter, the non-digit metacharacter, which
uses the uppercase "D". The non-digit metacharacter
looks like "\D" and matches any
character except a digit. Thus,
a\Dz
would match "abz", "aTz" or "a%z",
but would not match "a2z", "a5z"
or "a9z". Similarly,
\D+
matches any non-null string which contains no numeric
characters.
Notice that in changing the "d" from lower-case
to upper-case, we have reversed the meaning of the digit
metacharacter. This holds true for most other metacharacters
of the format backslash-letter.
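The non-digit metacharacter can be checked the same way. The following Python sketch assumes that the re module's "\D" matches the description above; the list names are invented for the example.

```python
import re

# \D accepts any single character that is not a digit.
middle = [s for s in ["abz", "aTz", "a%z", "a2z", "a5z", "a9z"]
          if re.fullmatch(r"a\Dz", s)]

# \D+ needs at least one character, none of which may be a digit.
no_digits = [s for s in ["butane", "room 101", ""]
             if re.fullmatch(r"\D+", s)]

print(middle)
print(no_digits)
```

The empty string is rejected by the second expression because the plus requires at least one character.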
There are three other metacharacters in the
backslash-letter format. The first is the word
metacharacter, which matches exactly one letter, one number,
or the underscore character (_). It is written as
"\w". Its opposite, "\W",
matches any one character except a letter, a number
or the underscore. Thus,
a\wz
would match "abz", "aTz", "a5z",
"a_z", or any three-character string starting with
"a", ending with "z", and whose second
character was either a letter (upper- or lower-case), a
number, or the underscore. Similarly,
a\Wz
would not match "abz", "aTz",
"a5z", or "a_z". It would match
"a%z", "a{z", "a?z" or any
three-character string starting with "a" and ending
with "z" and whose second character was not a
letter, number, or underscore. (This means the second
character must either be a symbol or a whitespace character.)
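Since "\w" and "\W" are opposites, every three-character string of the form "a?z" should be accepted by exactly one of the two expressions. The Python sketch below illustrates this; it assumes Python's re module, where the re.ASCII flag limits "\w" to unaccented letters, digits and the underscore. The names samples, word and nonword are hypothetical.

```python
import re

samples = ["abz", "aTz", "a5z", "a_z", "a%z", "a{z", "a?z", "a z"]

# re.ASCII keeps \w to A-Z, a-z, 0-9 and the underscore.
word = [s for s in samples if re.fullmatch(r"a\wz", s, re.ASCII)]
nonword = [s for s in samples if re.fullmatch(r"a\Wz", s, re.ASCII)]

print(word)
print(nonword)
```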
The whitespace metacharacter matches exactly one
character of whitespace. (Whitespace is defined as spaces,
tabs, newlines, or any character which would not use ink if
printed on a printer.) The whitespace metacharacter looks like
this: "\s". Its opposite, which
matches any character that is not whitespace, looks
like this: "\S". Thus,
a\sz
would match any three-character string starting with
"a" and ending with "z" and whose second
character was a space, tab, or newline. Likewise,
a\Sz
would match any three-character string starting with
"a" and ending with "z" whose second
character was not a space, tab or newline. (Thus, the
second character could be a letter, number or symbol.)
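A short test of the whitespace metacharacter and its opposite follows. It is a sketch in Python, assuming the re module's "\s" and "\S" behave as described; the sample strings are chosen for illustration.

```python
import re

samples = ["a z", "a\tz", "a\nz", "abz", "a5z", "a%z"]

# \s accepts a space, tab or newline in the middle position.
blank = [s for s in samples if re.fullmatch(r"a\sz", s)]

# \S accepts anything else in the middle position.
inked = [s for s in samples if re.fullmatch(r"a\Sz", s)]

print([repr(s) for s in blank])
print(inked)
```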
The word boundary metacharacter matches the
boundaries of words; that is, it matches the position between a
word character and whitespace, punctuation, or the very beginning
or end of the text, without using up a character itself. It
looks like "\b". Its opposite, "\B",
matches any position that is not a word boundary.
Thus:
\bcomput
will match "computer" or "computing", but
not "supercomputer" since there are no spaces or
punctuation between "super" and
"computer". Similarly,
\Bcomput
will not match "computer" or
"computing", unless it is part of a bigger word such
as "supercomputer" or "recomputing".
Note that the underscore (_) is considered a
"word" character. Thus,
\bcomputer
will not match "super_computer", since there is no word
boundary between the underscore and the "c".
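The word boundary rules, including the treatment of the underscore, can be checked with a short Python sketch. It assumes the re module's "\b" and "\B" follow the rules given above; the list subjects, and in particular the string "super-computer", are additions made for this example.

```python
import re

subjects = [
    "computer",
    "computing",
    "supercomputer",
    "recomputing",
    "super_computer",  # the underscore counts as a word character
    "super-computer",  # the hyphen does not
]

# \b requires a word boundary just before "comput".
at_boundary = [s for s in subjects if re.search(r"\bcomput", s)]

# \B requires that there is no word boundary just before "comput".
inside_word = [s for s in subjects if re.search(r"\Bcomput", s)]

print(at_boundary)
print(inside_word)
```

Each subject lands in exactly one of the two lists, and "super_computer" is grouped with "supercomputer" rather than with "super-computer".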
There is one other metacharacter starting with a backslash,
the octal metacharacter. The octal metacharacter
looks like this: "\nnn", where each
"n" is a digit from zero to seven. This is used for
specifying control characters that have no typed equivalent.
For example,
\007
would find all subjects with an embedded ASCII
"bell" character. (The bell is specified by an ASCII
value of 7.) You will rarely need to use the octal
metacharacter.
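For completeness, here is a minimal Python sketch of the octal metacharacter. It assumes Python's re module, which reads a backslash followed by three octal digits as the character with that value; the sample subjects are invented.

```python
import re

# \007 is the character with octal value 7, the ASCII bell.
bell = re.compile(r"\007")

subjects = ["plain subject", "alarm\x07subject"]
with_bell = [s for s in subjects if bell.search(s)]

print([repr(s) for s in with_bell])
```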
There are three other metacharacters that may be of use.
The first is the braces metacharacter. This
metacharacter follows a normal character and contains two
numbers separated by a comma (,) and
surrounded by braces ({}). It is like the
star metacharacter, except the length of the string it matches
must be within the minimum and maximum length specified by the
two numbers in braces. Thus,
ab{3,5}c
will match "abbbc", "abbbbc" or "abbbbbc".
No other string is matched. Likewise,
.{3,5}pentane
will match "cyclopentane", "isopentane" or
"neopentane", but not "n-pentane", since
"n-" is only two characters long.
The alternative metacharacter is represented by a vertical
bar (|). It indicates an either/or behavior
by separating two or more possible choices. For example:
isopentane|cyclopentane
will match any subject containing the strings "isopentane"
or "cyclopentane" or both. However, it will not
match "pentane" or "n-pentane" or "neopentane."
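The braces and the vertical bar can both be tried with a few lines of Python. This is a sketch that assumes the re module treats these metacharacters as described above. The subject strings are made up for the example, and search is used for the alternative so that the names may appear anywhere in a subject.

```python
import re

# Braces: between three and five repetitions of the preceding character.
b_runs = [s for s in ["abbc", "abbbc", "abbbbc", "abbbbbc", "abbbbbbc"]
          if re.fullmatch(r"ab{3,5}c", s)]

prefixed = [s for s in ["cyclopentane", "isopentane", "neopentane", "n-pentane"]
            if re.fullmatch(r".{3,5}pentane", s)]

# Vertical bar: either of the two alternatives is acceptable.
subjects = [
    "isopentane synthesis",
    "cyclopentane ring strain",
    "pentane",
    "n-pentane",
    "neopentane",
]
either = [s for s in subjects if re.search(r"isopentane|cyclopentane", s)]

print(b_runs)
print(prefixed)
print(either)
```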
The last metacharacter is the brackets metacharacter.
The bracket metacharacter matches one occurrence of any
character inside the brackets ([]). For
example,
\s[cmt]an\s
will match "can", "man" and
"tan", but not "ban", "fan" or
"pan". Similarly,
2,[23]-dimethylbutane
will match "2,2-dimethylbutane" or
"2,3-dimethylbutane", but not
"2,4-dimethylbutane",
"2,23-dimethylbutane" or "2,-dimethylbutane".
Ranges of characters can be used by using the dash (-)
within the brackets. For example,
a[a-d]z
will match "aaz", "abz", "acz"
or "adz", and nothing else. Likewise,
textfile0[3-5]
will match "textfile03", "textfile04", or
"textfile05" and nothing else.
If you wish to include a dash within brackets as one of the
characters to match, instead of to denote a range, put the
dash immediately before the right bracket. Thus:
a[1234-]z
and
a[1-4-]z
both do the same thing. They both match "a1z",
"a2z", "a3z", "a4z" or
"a-z", and nothing else.
The bracket metacharacter can also be inverted by placing a
caret (^) immediately after the left bracket.
Thus,
textfile0[^02468]
matches any ten-character string starting with
"textfile0" and ending with any character except an even
digit. Inversion and ranges can be combined, so that
\W[^f-h]ood\W
matches any four-letter word ending in "ood" except
for "food", "good" or "hood".
(Thus "mood" and "wood" would both be
matched.)
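The bracket examples above (ranges, a trailing dash, and inversion) can be verified together. The following is a sketch in Python, assuming its re module follows the same bracket rules. The helper full and the test lists are hypothetical, and each word is padded with spaces so that "\W" has something to match on both sides.

```python
import re

def full(pattern, strings):
    """Return the strings that fit the whole pattern."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Ranges inside brackets.
letters = full(r"a[a-d]z", ["aaz", "abz", "acz", "adz", "aez"])
files = full(r"textfile0[3-5]",
             ["textfile02", "textfile03", "textfile04", "textfile05", "textfile06"])

# A dash placed just before the right bracket is an ordinary character.
probes = ["a1z", "a2z", "a3z", "a4z", "a-z", "a5z"]
dash_listed = full(r"a[1234-]z", probes)
dash_ranged = full(r"a[1-4-]z", probes)

# A caret right after the left bracket inverts the set.
odd = full(r"textfile0[^02468]",
           ["textfile01", "textfile02", "textfile07", "textfile08"])
words = [w for w in ["mood", "wood", "food", "good", "hood"]
         if re.search(r"\W[^f-h]ood\W", " " + w + " ")]

print(letters, files)
print(dash_listed, dash_ranged)
print(odd, words)
```

The two dash expressions should produce identical lists, which is the point made in the text.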
Note that within brackets, ordinary quoting rules do not
apply and other metacharacters are not available. The only
characters that can be quoted in brackets are "[", "]", and "\".
Thus,
[\[\\\]]abc
matches any four-character string ending with "abc" and
starting with "[", "]", or "\".