Anatomy of a Perl Screen-Scraper
We are going to build a simple screen-scraper program. The first step is to declare the external libraries
we are using. The use statement tells the Perl interpreter to search its library directories (for example
C:\Perl\lib on a Windows installation) for the file WWW\Mechanize.pm and load it.
__________________________________
#!
# External libraries
use WWW::Mechanize;
# Create local instances
my $agent = WWW::Mechanize->new();
__________________________________
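As a rough illustration of that lookup, the sketch below converts a module name into the relative file name Perl searches for, then lists the directories in @INC, the interpreter's library search path. The helper name module_to_path is made up for this example, and the directories printed depend on the local installation.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a module name such as WWW::Mechanize
# into the relative file name that "use" searches for.
sub module_to_path {
    my ($module) = @_;
    (my $path = $module) =~ s{::}{/}g;   # WWW::Mechanize -> WWW/Mechanize
    return "$path.pm";
}

my $relative = module_to_path('WWW::Mechanize');
print "use WWW::Mechanize looks for: $relative\n";

# @INC holds the directories that are searched, in order.
print "Search path:\n";
print "  $_\n" for @INC;
```

If the module is installed, the first directory in the list that contains WWW/Mechanize.pm is the one that gets used.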
Next, we create a WWW::Mechanize object by calling the module's new() constructor, and keep it in the local variable $agent, which is declared with the my statement.
Once the object is set up, the next step is to open the file we intend to write our data to. This is accomplished by the open statement. Note that if the open fails for any reason (file in use, disk full, and so on), the die clause terminates the program with a message.
_________________________________
open(EX, "> Temperature.TXT")
|| die "ERROR! Problem opening TXT file: $!";
_________________________________
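The same open-or-die pattern can also be written with a lexical filehandle and the three-argument form of open. The sketch below is one possible variant, not the program's own code: the helper name write_line is hypothetical, and it writes into a temporary directory so it can be run anywhere without touching existing files.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: write one line of text to the named file,
# terminating with the system error message ($!) if anything fails.
sub write_line {
    my ($file, $text) = @_;
    open(my $out, '>', $file)
        or die "ERROR! Problem opening $file: $!";
    print {$out} "$text\n";
    close($out)
        or die "ERROR! Problem closing $file: $!";
    return 1;
}

# A scratch directory that is removed when the program exits.
my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'Temperature.TXT');

write_line($file, 'at 9:54 am Pacific Time');
print "Wrote $file\n";
```

Including $! in the message reports why the open failed, which a fixed error string cannot do.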
In three lines of code, we now have our screen-scraper initialized and our destination file open. Now we need only provide a URL and the program will go retrieve the HTML text.
___________________________________________
# Go grab the web page into content
$agent->get("http://weather.yahoo.com/forecast/USCA0195_f.html");
# Change URL to match desired zip code
___________________________________________
The $agent->get method takes the URL as an argument and fetches the page. The text of the page is then held in the $agent object, in an element named content.
Regular Expressions
Regular expressions ("regexes" for short) are sets of symbols and syntactic elements used to match patterns of text.
Regular expressions figure into all kinds of text-manipulation tasks. Searching and search-and-replace are among the more common uses, but regular expressions can also be used to test for certain conditions in a text file or data stream.
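The three uses mentioned above (testing for a condition, searching, and search-and-replace) can be sketched in a few lines. The sample string is made up for illustration and is not taken from any real page.

```perl
use strict;
use warnings;

my $line = 'Updated at: 9:54 am PST';   # made-up sample text

# Test for a condition: does the line contain a time of day?
my $has_time = ($line =~ m/\d{1,2}:\d\d/) ? 1 : 0;

# Search: parentheses capture the hour and the minute.
my ($hour, $minute) = $line =~ m/(\d{1,2}):(\d\d)/;

# Search-and-replace: work on a copy so the original is untouched.
(my $expanded = $line) =~ s/PST/Pacific Time/;

print "has time: $has_time\n";
print "hour=$hour minute=$minute\n";
print "$expanded\n";
```

Here \d matches a single digit and {1,2} means "one or two of the preceding item", so \d{1,2}:\d\d matches both 9:54 and 10:30.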
The remaining lines of the program are regex statements and print statements that write the results to the output file. We use only two regex operators, match (m) and substitute (s). A match searches the content for text that fits the pattern, and Perl leaves the matched text in the special variable $&. First we look for text beginning with 'at:' followed by a space, one or two digits, a colon, two more digits, another space, two non-space characters (am or pm), another space and finally three non-space characters (the time zone). This is the timestamp.
___________________________________________
# Find the word AT: and grab the timestamp following it
# match the string 'at: nn:nn a|pm PS|DT-->'
$agent->{content} =~ m/at:\s\d{1,2}:\d\d\s\S\S\s\S\S\S/;
$t = $&; # place matched data into a string
$t =~ s/://; # remove the first colon (the one after 'at')
$t =~ s/PST/Pacific Time/; # expand abbreviation
$t =~ s/PDT/Pacific Daylight Time/; # expand abbreviation
print (EX $t); # e.g. 'at 9:54 am Pacific Time'
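# Skip ahead past the '22'; the g modifier makes the next g match
# resume from where this one ended
$agent->{content} =~ m/22/g;
# Find the end of the line and grab the temperature,
# including the '<' that follows it
$agent->{content} =~ m/-->\n\d\d</g;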
$t = $&; # place matched data into a string
$t =~ s/\n//; # Remove the newline
$t =~ s/-->/ the temperature was /; # replace the arrow
chop($t); # Chop off the back arrow.
print (EX $t); print (EX " degrees, according to yahoo");
close(EX);
___________________________________________
We then look for the PST or PDT field and replace it with the words Pacific Time or Pacific Daylight Time, respectively.
Lastly, we grab the actual temperature reading and pad a few words around it and we’re done.
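To try the regex steps without a network connection, the sketch below runs them against a made-up fragment of text. The fragment's layout is an assumption, not a copy of the real page, and the helper name extract_report is hypothetical. It also uses capturing parentheses rather than $& to pull out the matched text, which is a different technique from the program above.

```perl
use strict;
use warnings;

# Made-up fragment standing in for the downloaded page.
my $content = "Currently at: 9:54 am PST-->\n68<\n";

# Hypothetical helper: build the report line, or return nothing
# if either pattern fails to match.
sub extract_report {
    my ($html) = @_;

    # Timestamp: the parentheses capture the whole matched text.
    my ($when) = $html =~ m/(at:\s\d{1,2}:\d\d\s\S\S\s\S\S\S)/
        or return;
    $when =~ s/://;                          # removes only the first colon
    $when =~ s/PST/Pacific Time/;            # expand abbreviation
    $when =~ s/PDT/Pacific Daylight Time/;   # expand abbreviation

    # Temperature: the digits on the line after the arrow.
    my ($temp) = $html =~ m/-->\n(\d+)/
        or return;

    return "$when the temperature was $temp degrees, according to yahoo";
}

my $report = extract_report($content);
print defined $report ? "$report\n" : "No match\n";
```

Because only the digits are captured, there is no newline, arrow, or trailing character to strip afterwards.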