Perl Tutorial

Pattern Matching

Pattern matching is the process of searching for a sequence of characters within a character string. When the pattern is found, a match is said to have occurred.

Perl has three main operators that are used for pattern matching (although patterns can also be used in functions such as split()). They are m//, s///, and tr///.

The m// operator is the match operator. It simply tells us whether a match was found. The syntax for this operator is:

m/PATTERN/OPTIONS;

PATTERN is the pattern that we are searching for, and OPTIONS are optional modifiers. When using the match operator, you can omit the m if you use forward slashes as the delimiters. If you do not wish to use forward slashes, you can substitute another character for them, but then you must include the m so that Perl knows you want the match operator. If you choose a delimiter that is normally a special pattern character, you will not be able to use that character with its special meaning in your pattern.

m!PATTERN!OPTIONS;

The match operator accepts the following options:

OPTION	DESCRIPTION
g	Match all possible patterns
i	Ignore case
m	Test string as multiple lines
o	Only evaluate once
s	Treat string as single line
x	Ignore white space in pattern
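
As a small illustration of the options above, here is a minimal sketch that applies i, g, and x to a sample string. The sample text and variable names are invented for this example.

```perl
use strict;
use warnings;

# Sample text invented for this example.
my $text = "Perl is fun. perl is powerful. PERL is old.";

# i: ignore case, so "Perl", "perl", and "PERL" all count as matches.
my $found = ($text =~ m/perl/i) ? 1 : 0;

# g in list context: return every match rather than only the first.
my @all   = ($text =~ m/perl/gi);
my $count = scalar @all;

# x: white space and comments inside the pattern are ignored.
my $verbose = ($text =~ m/ fun   # the literal word "fun"
                           \.    # followed by a literal period
                         /x) ? 1 : 0;

# Alternate delimiter: with ! instead of a slash, the leading m is required.
my $alt = ($text =~ m!powerful!) ? 1 : 0;

print "found=$found count=$count verbose=$verbose alt=$alt\n";
# prints: found=1 count=3 verbose=1 alt=1
```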

The s/// operator is known as the substitution operator. The syntax for this operator is:

s/pattern/replacement/options;

pattern holds the pattern that we want to search for, and replacement holds the value that replaces the pattern when it is found. For example:

s/abc/xyz/;

In the above example we are saying that we want to search for "abc", in that order, and replace it with "xyz". You can also use pattern-sequence variables in substitutions; these will be discussed later. The substitution operator accepts the following options:

OPTION	DESCRIPTION
g	Change all occurrences of the pattern
i	Ignore case in pattern
e	Evaluate replacement strings as expression
m	Treat string to be matched as multiple lines
o	Evaluate only once
s	Treat string to be matched as single line
x	Ignore white space in pattern
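
To make the effect of the options concrete, here is a minimal sketch that runs the same substitution with and without g and i, and then uses e. The sample strings and variable names are invented for illustration, and the last substitution uses a parenthesized group and $1, which holds the text matched by that group.

```perl
use strict;
use warnings;

my $line = "abc ABC abc";   # sample data for illustration

# No options: only the first "abc" is replaced.
(my $first = $line) =~ s/abc/xyz/;

# g: replace every occurrence.
(my $every = $line) =~ s/abc/xyz/g;

# g and i together: replace every occurrence regardless of case.
(my $nocase = $line) =~ s/abc/xyz/gi;

# e: the replacement is evaluated as a Perl expression.
# $1 holds the digits captured by the parentheses.
my $fruit = "3 apples and 4 pears";
(my $doubled = $fruit) =~ s/(\d+)/$1 * 2/ge;

print "$first\n";     # xyz ABC abc
print "$every\n";     # xyz ABC xyz
print "$nocase\n";    # xyz xyz xyz
print "$doubled\n";   # 6 apples and 8 pears
```

The (my $copy = $original) =~ s/.../.../ form copies the string first, so the original variable is left unchanged.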

The last operator that I will cover is the translation operator. It provides another way to substitute one group of characters for another. The syntax of the translation operator is:

tr/string1/string2/;

Here string1 contains a list of characters to be replaced, and string2 contains the characters that replace them. The first character in string1 is replaced by the first character in string2, the second character by the second, and so on.

tr/abc/def/;

In the above example, abc is string1 and def is string2: a is replaced by d, b is replaced by e, and c is replaced by f. If you wanted to convert all the characters from uppercase to lowercase you would write:

tr/A-Z/a-z/;

As you can see, character ranges such as A-Z are supported. Once again, the translation operator also has options that can be used:

OPTION	DESCRIPTION
c	Translate all characters not specified
d	Delete all specified characters
s	Replace multiple identical output characters with a single character
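
The following sketch shows each translation option on a small sample. The strings and variable names are invented for illustration. It also uses the fact that tr returns the number of characters it matched.

```perl
use strict;
use warnings;

my $word = "Hello World";   # sample data for illustration

# Range translation: uppercase to lowercase.
(my $lower = $word) =~ tr/A-Z/a-z/;

# d: delete the listed characters, since they have no replacement.
(my $no_vowels = $word) =~ tr/aeiou//d;

# s: squeeze runs of the same translated character into one.
# With an empty second list, each character is translated to itself.
(my $squeezed = "aaabbbccc") =~ tr/a-c//s;

# c: complement the first list, so every character that is NOT a letter
# is translated to "_".
(my $letters_only = $word) =~ tr/a-zA-Z/_/c;

# tr returns how many characters it matched, which is handy for counting.
my $l_count = ($word =~ tr/l//);

print "$lower\n";          # hello world
print "$no_vowels\n";      # Hll Wrld
print "$squeezed\n";       # abc
print "$letters_only\n";   # Hello_World
print "$l_count\n";        # 3
```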

Please remember that if you are using forward slashes as delimiters and your pattern also contains a forward slash, then you must escape it using the escape character "\".
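
As a short sketch of the two ways to handle a slash inside a pattern, the example below first escapes it and then switches to a different delimiter. The path and variable names are made up for this example.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin";   # sample value for illustration

# With / as the delimiter, every / inside the pattern must be escaped.
my $escaped = ($path =~ m/\/local\//) ? 1 : 0;

# Choosing another delimiter (here a pair of braces) avoids the escaping.
my $braces = ($path =~ m{/local/}) ? 1 : 0;

# The same choice is available for the substitution operator.
(my $moved = $path) =~ s{/usr/local}{/opt};

print "$escaped $braces $moved\n";   # 1 1 /opt/bin
```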

Now let's see how we build patterns:

When doing pattern matching, the pattern is by default matched against the contents of the default variable ($_). If we want to use a different variable, we have to use a binding operator (=~ or !~) along with one of the three operators I mentioned before.

The =~ operator binds the pattern to the string on the left-hand side of the operator. This says that the pattern should be searched for in that scalar variable:

$string =~ m/hello/;

As the above example demonstrates, we are searching for the string "hello" in the $string variable. If the pattern is found then true (1) is returned; otherwise false (an empty string, which behaves as 0 in numeric context) is returned.

The !~ operator also binds the pattern to the string on the left-hand side of the operator, but it returns true when the pattern is not found.
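
Here is a minimal sketch of both binding operators next to a match against the default variable. The sample strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $string = "hello world";   # sample value for illustration

# =~ is true when the pattern is found in $string.
my $has_hello = ($string =~ m/hello/) ? "yes" : "no";

# !~ is true when the pattern is NOT found in $string.
my $lacks_bye = ($string !~ m/goodbye/) ? "yes" : "no";

# Without a binding operator the match is applied to $_,
# which the for loop sets to each list element in turn.
my $default_hit = "no";
for ("hello again") {
    $default_hit = "yes" if m/hello/;
}

print "$has_hello $lacks_bye $default_hit\n";   # yes yes yes
```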

Now let's discuss special characters in patterns:

The + character means "one or more of the preceding character". That means that if we have the pattern:

m/abc+/;

We get true if a match is found on abc, abcc, abccc, abcccc, and so on.

The [ ] characters allow you to define patterns that match one character from a group. Whatever is in the brackets is treated as a set from which exactly one character is picked:

m/a[bc]d/;

The above pattern says that we will find a match if we find either "abd" or "acd".

Another special character is the * character, which means "zero or more of the preceding character". This means that if we have the pattern:

m/ab*/;

We will find a match for "a", "ab", "abb", "abbb", and so on.

The ? character means "zero or one occurrence of the preceding character". It works the same way as the * character, except that the preceding character may appear at most once.
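
The quantifiers and bracket groups described above can be compared side by side with a small sketch. The helper full_matches is a hypothetical name introduced only for this example. It wraps each pattern in the ^ and $ anchors so that the whole candidate string has to match, which makes the effect of each special character easy to see.

```perl
use strict;
use warnings;

# Hypothetical helper: return the candidates that a pattern matches in full.
sub full_matches {
    my ($pattern, @candidates) = @_;
    return grep { /^$pattern$/ } @candidates;
}

my @plus  = full_matches('abc+',   qw(ab abc abcc abccc));
my @class = full_matches('a[bc]d', qw(abd acd aed ad));
my @star  = full_matches('ab*',    qw(a ab abb b));
my @opt   = full_matches('ab?',    qw(a ab abb));

print "plus:  @plus\n";    # plus:  abc abcc abccc
print "class: @class\n";   # class: abd acd
print "star:  @star\n";    # star:  a ab abb
print "opt:   @opt\n";     # opt:   a ab
```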

You can also anchor patterns using the ^ and $ characters.

The ^ character anchors a pattern to the beginning of a line:

m/^The/;

In the above example we will find a match if the line starts with "The". Be careful: when the ^ character is the first character inside brackets, it means "any character not in this group", so its meaning changes when it is used in brackets.

The $ character anchors a pattern at the end of the line:

m/end$/;

For the above example we will find a match only if the line ends with "end".
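
The two anchors, and the different meaning of ^ inside brackets, can be sketched as follows. The sample lines and variable names are invented for illustration.

```perl
use strict;
use warnings;

my @lines = ("The end", "At the end", "The start", "endless");   # sample data

# ^ anchors the pattern at the beginning of the string.
my @starts_with_the = grep { /^The/ } @lines;

# $ anchors the pattern at the end of the string.
my @ends_with_end = grep { /end$/ } @lines;

# Inside brackets a leading ^ negates the group: these are the lines
# whose first character is anything other than an uppercase T.
my @not_t = grep { /^[^T]/ } @lines;

print join(" | ", @starts_with_the), "\n";   # The end | The start
print join(" | ", @ends_with_end), "\n";     # The end | At the end
print join(" | ", @not_t), "\n";             # At the end | endless
```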

Word-boundary pattern anchors specify whether a matched pattern must be on a word boundary or inside a word. The \b pattern anchor matches only if the specified pattern is at the beginning or end of a word, while the \B pattern anchor matches only if the specified pattern is inside a word.
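
As a rough illustration of the difference between \b and \B, the sketch below looks for "cat" in a few sample strings. The strings and variable names are invented for this example.

```perl
use strict;
use warnings;

my @samples = ("the cat sat", "concatenate", "catalog", "bobcat");   # sample data

# \bcat\b: "cat" with a word boundary on both sides, that is, a whole word.
my @whole = grep { /\bcat\b/ } @samples;

# \bcat: "cat" at the beginning of a word.
my @begin = grep { /\bcat/ } @samples;

# \Bcat\B: "cat" with word characters on both sides, that is, inside a word.
my @inside = grep { /\Bcat\B/ } @samples;

print join(" | ", @whole), "\n";    # the cat sat
print join(" | ", @begin), "\n";    # the cat sat | catalog
print join(" | ", @inside), "\n";   # concatenate
```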

When using character ranges within brackets, you can shorten up the process by using special character-range escape sequences:

Escape Sequence	Description	Range
\d	Any digit	[0-9]
\D	Anything other than a digit	[^0-9]
\w	Any word character	[_0-9a-zA-Z]
\W	Anything not a word character	[^_0-9a-zA-Z]
\s	White space	[ \r\t\n\f]
\S	Anything other than white space	[^ \r\t\n\f]
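
To show the escape sequences from the table in use, here is a minimal sketch that pulls pieces out of a sample string. The string and variable names are invented for illustration, and the + quantifier is used to match runs of characters.

```perl
use strict;
use warnings;

my $entry = "Room 101, floor 3";   # sample value for illustration

# \d+ : runs of digits.
my @digits = ($entry =~ /\d+/g);

# \w+ : runs of word characters (letters, digits, underscore).
my @words = ($entry =~ /\w+/g);

# \S+ : runs of anything that is not white space, so the comma stays attached.
my @chunks = ($entry =~ /\S+/g);

# \s and \D used in substitutions to strip white space and non-digits.
(my $no_space    = $entry) =~ s/\s//g;
(my $only_digits = $entry) =~ s/\D//g;

print join("|", @digits), "\n";   # 101|3
print join("|", @words), "\n";    # Room|101|floor|3
print join("|", @chunks), "\n";   # Room|101,|floor|3
print "$no_space\n";              # Room101,floor3
print "$only_digits\n";           # 1013
```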

If you wanted to match a single character you would use the . (period) character. This will match any character except the newline.

If you want to match a specific number of occurrences, put {x,y} after the character whose occurrences you want to limit, where x is the minimum and y is the maximum number of occurrences.
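
A short sketch of the period and the {x,y} count follows. The candidate strings and variable names are invented for illustration, and the ^ and $ anchors are used so that the whole string has to match.

```perl
use strict;
use warnings;

# a{2,3}: at least two and at most three "a" characters between b and d.
my @counted = grep { /^ba{2,3}d$/ } qw(bad baad baaad baaaad);

# . matches any single character...
my $dot_any = ("cat" =~ /^c.t$/) ? 1 : 0;

# ...except a newline.
my $dot_newline = ("c\nt" =~ /^c.t$/) ? 1 : 0;

print "@counted\n";                # baad baaad
print "$dot_any $dot_newline\n";   # 1 0
```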

If you want to specify alternatives, as with an "or" statement, then you would use the | character.

You can group portions of a pattern using the ( ) characters. This also allows you to reuse the matched text later on. To reuse it, you would use \number, where number is the position of the ( ) group, counted in the order in which the groups were entered.
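
Alternation, grouping, and reuse of a group can be sketched as follows. The sample words and variable names are invented for illustration. The last part relies on the fact that, after a successful match, Perl also makes each group available outside the pattern as $1, $2, and so on.

```perl
use strict;
use warnings;

# | chooses between alternatives; the ( ) limit how far the alternation reaches.
my @pets = grep { /^(cat|dog)$/ } qw(cat dog cow);

# \1 refers back to whatever the first ( ) group matched,
# so (\w)\1 finds a doubled character.
my @doubled = grep { /(\w)\1/ } qw(book cat letter);

# After a successful match, $1 and $2 hold the text of the first two groups.
my ($key, $value) = ("", "");
if ("color=blue" =~ /^(\w+)=(\w+)$/) {
    ($key, $value) = ($1, $2);
}

print "@pets\n";          # cat dog
print "@doubled\n";       # book letter
print "$key $value\n";    # color blue
```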

This section is not meant to be a complete tutorial on pattern matching, but an introduction to give you an idea of how it works. For details on pattern matching I suggest the book "Mastering Regular Expressions" by J. Friedl, published by O'Reilly.