The Deco Group Partnership of California

Using Perl - Page 3

Anatomy of a Perl Screen-Scraper

We are going to build a simple screen-scraper program. The first step is to declare the external libraries we are using. The use statement tells the Perl interpreter to search its library directories (the @INC list, which on a typical Windows install includes C:\Perl\lib) for the file WWW\Mechanize.pm and load it.

__________________________________

# External libraries

use WWW::Mechanize;

# Create local instances

my $agent = WWW::Mechanize->new();

__________________________________

Next, we create an agent object by calling the new constructor defined in Mechanize.pm. The my statement declares the local variable $agent that holds it.

Once we have the agent set up, the next step is to open the file we intend to write our data to. This is accomplished by the open statement. Note that if the open fails for any reason (file in use, disk full, and so on), the die clause terminates the program with an error message.

_________________________________

open(EX, "> Temperature.TXT")

|| die "ERROR! Problem opening TXT file: $!"; # $! holds the system's reason for the failure

_________________________________
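
As a rough sketch of the same open-or-die idea in a slightly more defensive form, the block below uses the three-argument open with a lexical filehandle and puts $! in the message so the reason for the failure is reported. The helper name write_report and the file name Temperature-sketch.txt are made up for illustration; they are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program):
# write one string to a file, or die with the reason.
sub write_report {
    my ($path, $text) = @_;

    # Three-argument open: the mode and the file name are separate
    # arguments, so odd characters in the name cannot change the mode.
    open(my $out, '>', $path)
        or die "ERROR! Problem opening $path: $!";

    print {$out} $text;

    # Buffered write errors (a full disk, for example) can surface at close.
    close($out)
        or die "ERROR! Problem closing $path: $!";

    return 1;
}

# Example use; the file name is illustrative only.
write_report('Temperature-sketch.txt', "at 9:54 am Pacific Time\n");
```

The original program uses the bareword filehandle EX instead; either style works, but a lexical handle such as $out is closed automatically when it goes out of scope.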

In three lines of code, we now have our screen-scraper initialized and our destination file open. Now we need only provide a URL and the program will go retrieve the HTML text.

___________________________________________

# Go grab the web page into content

# Change the location code in the URL (USCA0195 here) to match the desired area

$agent->get("http://weather.yahoo.com/forecast/USCA0195_f.html");

___________________________________________

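The $agent->get method takes the URL as an argument, fetches the page, and stores the HTML in the agent object, where this program reads it from the element named content. (WWW::Mechanize also provides a content() method that returns the same text.)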

Regular Expressions

Regular expressions ("regexes" for short) are sets of symbols and syntactic elements used to match patterns of text.

Regular expressions figure into all kinds of text-manipulation tasks. Searching and search-and-replace are among the more common uses, but regular expressions can also be used to test for certain conditions in a text file or data stream.
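
To make the three uses just mentioned concrete, here is a minimal sketch of testing, extracting, and replacing with a regular expression. The sample string and the variable names are invented for illustration.

```perl
use strict;
use warnings;

my $line = "Updated at: 9:54 am PST";   # made-up sample text

# Match (m) as a test: does the text contain a time such as 9:54?
my $has_time = ($line =~ m/\d{1,2}:\d\d/) ? 1 : 0;

# Match with capture groups: pull the hour and minute out of the text.
my ($hour, $minute) = $line =~ m/(\d{1,2}):(\d\d)/;

# Substitute (s): copy the line, then replace the first PST in the copy.
(my $expanded = $line) =~ s/PST/Pacific Time/;

print "$has_time $hour $minute\n";   # prints "1 9 54"
print "$expanded\n";                 # prints "Updated at: 9:54 am Pacific Time"
```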

The remaining lines of the program are regex statements and print statements that write the results to the output file. We use only two regex operators, match (m) and substitute (s). The match operator searches the content element for text that fits a pattern, and the matched text is then available in the special variable $&. First we look for text beginning with 'at:' followed by a space, one or two digits, a colon, two more digits, another space, two non-whitespace characters (am or pm), another space, and finally three non-whitespace characters (the time zone). This is the timestamp.

___________________________________________

# Find the word AT: and grab the timestamp following it

# match the string 'at: nn:nn a|pm PS|DT-->'

$agent->{content} =~ m/at:\s\d{1,2}:\d\d\s\S\S\s\S\S\S/;

$t = $&; # place matched data into a string

$t =~ s/://; # remove the first colon only (the one after 'at')

$t =~ s/PST/Pacific Time/; # expand abbreviation

$t =~ s/PDT/Pacific Daylight Time/; # expand abbreviation

print (EX $t); # e.g. 'at 9:54 am Pacific Time'

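# Skip ahead past the '22' (the g flag records where this match ended)

$agent->{content} =~ m/22/g;

# From that position, find the end of the line and grab the temperature

# (assumes a '<' follows the two digits; the chop below removes it)

$agent->{content} =~ m/-->\n\d\d</g;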

$t = $&; # place matched data into a string

$t =~ s/\n//; # Remove the newline

$t =~ s/-->/ the temperature was /; # replace the arrow

chop($t); # Chop off the last character (the back arrow '<')

print (EX $t); print (EX " degrees, according to yahoo");

close(EX);

___________________________________________

We then look for the PST or PDT field and replace it with the words Pacific Time or Pacific Daylight Time. Lastly, we grab the actual temperature reading, pad a few words around it, and we’re done.
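
The extraction steps can also be sketched as a single function that checks whether each pattern actually matched. This is an alternative take, not the original program: it uses capture groups in place of $& and chop, leaves out the skip past '22', and assumes the temperature digits are followed by a '<'. The function name describe_weather and the sample text are invented; the sample is shaped only to exercise the patterns and is not the real Yahoo page.

```perl
use strict;
use warnings;

# Hypothetical helper: build the output sentence from page text.
# Returns undef when either pattern is missing.
sub describe_weather {
    my ($content) = @_;

    # Timestamp: 'at:', a time, am/pm, and a three-character zone.
    my ($time, $ampm, $zone) =
        $content =~ m/at:\s(\d{1,2}:\d\d)\s(\S\S)\s(\S\S\S)/
        or return;

    # Expand the zone abbreviations the original program handles.
    my %zone_name = (PST => 'Pacific Time', PDT => 'Pacific Daylight Time');
    my $zone_text = exists $zone_name{$zone} ? $zone_name{$zone} : $zone;

    # Temperature: two digits on the line after '-->' and before '<'
    # (the '<' is an assumption about the page layout).
    my ($temp) = $content =~ m/-->\n(\d\d)</
        or return;

    return "at $time $ampm $zone_text the temperature was $temp degrees";
}

# Invented sample text, not the real page.
my $sample = "<!-- at: 9:54 am PST-->\n<b>Now</b>\n<!-- temp -->\n57<br>\n";
print describe_weather($sample), "\n";
# prints "at 9:54 am Pacific Time the temperature was 57 degrees"
```

Checking the result of each match matters because $& and the capture variables keep their old values when a match fails, so an unchecked program can quietly write stale data to the file.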