What you need to know
You need to be able to differentiate between a PC and a toaster.
No programming experience is necessary. You do need to understand
the basics of PC operation. If you don't understand what directories
and files are then you'll find this difficult. You might find it
difficult even if you do :-)
You do need to exercise the brain cells, and you need time.
What you need to have
- A PC which can run a Win32 operating system. That's Windows NT
3.5, 3.51, 4.0 or later, or Windows 95 or Windows 98. Not
Windows 3.1. Sorry. Now, you finally have a reason to upgrade.
- You need to get hold of a copy of Perl, so for that you might
need an Internet connection. But if you can get it some other
way, you don't.
Note: You don't even need a Win32 PC if you are
comfortable installing Perl under other operating systems like
Linux, but not all the information here will be relevant.
You don't need a compiler. Perl is an interpreted language, which
means you run code directly, rather than compiling it and then running it.
How to use this tutorial...
Just work through from start to finish.
Generally, the explanation follows the code sample. Before you
read the explanation, try and work out what the code does. Then
check if you're right. In this way, you'll derive maximum value from
the tutorial and exercise the old grey cells a little.
When you finish, please send me a critique. In fact, send one
even if you don't finish. I appreciate all feedback! Please
note -- I am not a source of free technical support. Do not
email me your general Perl problems. If you want support, ask on
Usenet or the ActiveState mailing lists. That said, I welcome
problems related to the tutorial itself.
Conventions used in this Tutorial
The humour is non-conventional. I think. Of more importance, the
text is coloured strangely in places. My intention is to aid your
comprehension, not attempt beautification. The meaning of the
colours:
- Sometimes you'll need to type something in on the command
line. These commands will be in green, for example:
perl changeworld.pl parm1 datafile.txt
- Code that you should load into your editor and run is in blue
(don't run this now, it's just an example):
while (<DATFILE>) {
printf "%2s : $_",$.;
}
- When functions are referred to in the text, their names are
highlighted in red. For example, later we discover an
interesting function called split.
All the code examples have been tested, and you can just
cut'n'paste (brave statement). I haven't listed the output of each
example. You need to run it and see for yourself. Consider this
course interactive. Consider it any which way you like.
Use of this document
Personal Printouts
Fine by me, feel free to print a copy for your own use.
Intranet usage
Just email me and let me know.
Mirroring
Again, all I ask is an email.
Translations
Every so often someone offers to translate the tutorial. Nobody
has actually done so. If you want to, the conditions are:
- You don't change the text other than what can be reasonably
expected during a translation;
- The content, format and authorship notices remain the same;
- You can add a 'translated by' notice in the intro and at the
end, plus your own message;
- Version numbers are respected but the ISO code for your
country is added, eg 3.3.2.ES;
- and you need to email me to discuss.
Remember this document is copyrighted and all
associated rights are strictly reserved.
--
Robert Pepper
mailto:Robert@netcat.co.uk
A Short Introduction To Perl
If you already understand what Perl is designed to do, know
its features and limitations then you can skip this very small but
highly informative section, over which I laboured long and hard for
those that didn't know. If you are really sure, jump to the Setup
Section.
What is Perl?
Perl is a programming language. Perl stands for Practical Extraction
and Report Language. You'll notice people refer to 'perl' and
"Perl". "Perl" is the programming language as a
whole whereas 'perl' is the name of the core executable. There is no
language called "Perl5" -- that just means "Perl
version 5". Versions of Perl prior to 5 are very old and very
unsupported.
Some of Perl's many strengths are:
- Speed of development. You edit a text file, and just
run it. You can develop programs very quickly like this. No
separate compiler needed. I find Perl runs a program quicker
than Java, and that is before you count Java's complete
modify-compile-run-oh-no-forgot-that-semicolon sequence.
- Power. Perl's regular expressions are some of the best
available. You can work with objects, sockets...everything a
systems administrator could want. And that's just the standard
distribution. Add the wealth of modules available on CPAN and
you have it all. Don't equate scripting languages with toy
languages.
- Usability. All that power and capability can be learnt
in easy stages. If you can write a batch file you can program
Perl. You don't have to learn object oriented programming, but
you can write OO programs in Perl. If autoincrementing
non-existent variables scares you, make perl refuse to let you.
There is always more than one way to do it in Perl. You decide
your style of programming, and Perl will accommodate you.
- Portability. On the Superhighway to the Portability
Panacea, Perl's Porsche powers past Java's jaded jalopy. Many
people develop Perl scripts on NT, or Win95, then just FTP them
to a Unix server where they run. No modification necessary.
- Editing tools. You don't need the latest Integrated
Development Environment for Perl. You can develop Perl scripts
with any text editor. Notepad, vi, MS Word 97, or even direct
off the console. Of course, you can make things easy and use one
of the many freeware or shareware programmer's file editors.
- Price. Yes, 0 guilders, pounds, dmarks, dollars or
whatever. And the peer to peer support is also free, and often
far better than you'd ever get by paying some company to answer
the phone and tell you to do what you just tried several times
already, then look up the same reference books you already own.
What is ActivePerl? Are the other Perls inactive?
A company named ActiveState exists to provide Perl tools for the
Win32 environment. ActiveState used to be ActiveWare, and before
that it was sort of a part of Hip Communications. It now appears to
be happy with its current name, having not changed it for over a
year. Win32 means, at the time of writing, Windows 95, Windows 98
and Windows NT. It does not mean Windows 3.11, even with
Win32s installed.
Prior to Perl version 5.005, there was one version of Perl for
Win32, and another for all the other systems. The other version was
known as the "native version".
The Win32 version was developed by ActiveState, called "Perl
for Win32" and typically lagged slightly behind the native
version. As of the 5.005 release, Perl for Win32 and the native
version have merged -- the native version now supports Win32
directly and doesn't need any tweaking by ActiveState.
ActiveState have dropped "Perl for Win32" and renamed
their distribution, which comes with an InstallShield installer,
"ActivePerl".
Incidentally, a few months before the 5.005 merge the native Perl
version was changed so it would run on Win32 directly. This version
was best known by the creator's name, "Gurusamy Sarathy".
However, there were still quite a few differences between it and
Perl for Win32, so many people ran both. The merge brought the best
of both worlds together.
Can I run Perl on my computer?
Probably. Perl runs on everything from Amigas to Macintoshes to
Unix boxen. Perl also runs on Microsoft operating systems, namely
Windows 95, Windows 98 and Windows NT 3.51 and later. There are
versions of Perl that run on earlier versions of these operating
systems but they are no longer developed or supported. See http://www.perl.com/
for full details.
What can I do with Perl ?
Just two popular examples :
The Internet
Go surf. Notice how many websites have dynamic pages with .pl
or similar as the filename extension? That's Perl. It is the most
popular language for CGI programming for many reasons, most of which
are mentioned above. In fact, there are a great many more dynamic
pages written with perl that may not have a .pl
extension. If you code in Active Server Pages, then you should try
using ActiveState's PerlScript. Quite frankly, coding in PerlScript
rather than VBScript or JScript is like driving a car as opposed to
riding a bicycle. Perl powers a good deal of the Internet.
Systems Administration
If you are a Unix sysadmin you'll know about sed, awk and shell
scripts. Perl can do everything they can do and far more besides.
Furthermore, Perl does it much more efficiently and portably. Don't
take my word for it, ask around.
If you are an NT sysadmin, chances are you aren't used to
programming. In which case, the advantages of Perl may not be clear.
Do you need it? Is it worth it?
After you read this tutorial you will know more than enough to
start using Perl productively. You really need very little knowledge
to save time. Imagine driving a car for years, then realising it has
five gears, not four. That's the sort of improvement learning Perl
means to your daily sysadminery. When you are proficient, you find
the difference like realising the same car has a reverse gear and
you don't have to push it backwards. Perl means you can be lazier.
Lazy sysadmins are good sysadmins, as I keep telling my boss.
A few examples of how I use Perl to ease NT sysadmin life:
- User account creation. If you have a text file with the
user's names in it, that is all you need. Create usernames
automatically, generate a unique password for each one and
create the account, plus create and share the home directory,
and set the permissions.
- Event log munging. NT has great Event Logging. Not so
great Event Reading. You can use Perl to create reports on the
event logs from multiple NT servers.
- Anything else that you would have used a batch file
for, or wished that you could automate somehow. Now you can.
What can't I do with Perl ?
The question is, "what shouldn't I do with Perl". Writing
office suites is one answer. Perl, like most scripting languages, is
a glue language designed for short and relatively simple tasks. Just
don't equate this philosophy with a lack of power or
"serious" features.
Support
See the FAQs at www.perl.com. Of course there are Usenet groups,
but also many mailing lists. Microsoft Windows users will be
interested in those hosted by http://www.activestate.com/
which discuss all things Perl and Windows.
Please, before you ask any question, anywhere:
- Make sure you read the group charter. Many people put
time and effort into the creation of those charters in the
interests of efficient discussion, so don't degrade the
discussion quality and insult us by ignoring the guidelines.
- Read the FAQs at least twice. Try and find related FAQs.
Try hard. You won't be popular if you post a question starting
"I've looked at all the FAQs..." and then ask
something that actually is in the FAQs. Or the manual for
that matter. Believe me, it will be patently obvious to all on
the list if you haven't done your homework.
- Carefully phrase the questions and provide source code
because if you do that, you may well end up solving the problem
yourself because you have thought it through a little more.
Think to yourself -- honestly -- if I was a busy Perl
Professional, would I want to answer my own question?
Does it clearly state what I want an answer to? Preferably just
one question at a time. Am I being unreasonable, for example asking
for someone to code it for me? Have I shown evidence that I have
tried to help myself? Have I made any mistakes in grammar? Is it
polite? Is there enough information in there for the answer to be
given?
Why should you care? Well, if you ask poorly-formed questions or
those already answered in the FAQ...let's just say you won't get the
answers you want. If you care about your online reputation and
wasting other people's time -- two more reasons.
Setup
There are four stages:
- Get the software.
- Install it.
- Run a test Script.
- Celebrate or troubleshoot.
1. Getting the Software
An old version of Perl for Win32 is included with the Windows
NT Resource Kit. It is sadly out of date. Follow the steps below to
get a newer version. Having said that, you can complete the tutorial
with the Resource Kit version but you should upgrade as soon as you
can.
Go to http://www.activestate.com/
and follow the links to download ActivePerl. It will be a single
file, and the name will be something like api508e.exe.
The i stands for Intel. If you have an Alpha, download apaXXXe.exe.
If you're not sure, download the Intel version.
The 508e is the version number, so expect this to
change quite rapidly. The file size will be just over 5MB, so it
will take a while to download via modem. If you know how to use FTP,
try ftp.activestate.com/activeperl/.
When you find ActivePerl, save the file into any directory you
please. I like to organise my downloads into c:\downloads
but that is just personal preference. As long as ActivePerl ends up
on your hard disk somewhere it doesn't matter.
2. Installation
So you now have apixxxx.exe. If you forget where you
saved it, don't panic, just run Windows Explorer and search for
api*e.exe.
- Double-click apixxxx.exe. You'll see
the fantastic ActivePerl graphic and be advised to close all
open applications before proceeding. The lizard thing is a
gecko, which adorns the famous O'Reilly book "Learning Perl
on Win32 Systems". This tutorial is aimed at a more basic
level than that book, in terms of the author's knowledge,
intended audience and quality of humour.
- Agree to the license agreement or cancel the install,
stop this tutorial and deny yourself any hope of hackership.
- Destination directory is whatever you want. I usually
install Perl in c:\progs\perl rather than
c:\program files\perl because many Win32 programs don't properly
handle long filenames, let alone those with spaces in. Or you
could accept the default. Your choice.
- Select Components. All you'll need for this tutorial is
"Perl for Win32 Core", but installing the "Online
Help and Documentation" and "Example Files" is
highly recommended. If you run Internet Information Server (IIS)
3 or later, or Personal Web Server (PWS), then install
"Perl for ISAPI" and "PerlScript" too,
although don't try either of these until you are proficient with
the basics. The phrase running before walking comes to mind.
- Select Options.
- "Associate '.pl' with Perl.exe". If you
select this option then you can just type in the name of a
script at the command line, or double-click it and the
script will run. If you don't, then in order to execute
myscript.pl you'll need to type:
perl myscript.pl
Personally, I prefer
double-clicking to allow me to edit the file so I do not
select this option. Also, perl has a plethora of command
line arguments which are difficult to pass to a script if
you run it by association. For the purposes of this tutorial
I'm assuming that you haven't associated .pl with perl.
- "Add the Perl bin directory to your path".
Do this, otherwise you'll have to specify the full path to
perl.exe every time you use it. Not fun.
- "Standard I/O redirection for IIS". If
you run IIS or PWS, select this. It is a Good Thing.
Understand it later.
- IIS Options. If you use IIS or PWS you'll have this
screen -- just accept both options.
- Program Folder. Whatever your preference is. This is
just a link to the documentation, to the perl.exe itself.
- Confirmation. Make sure that what is displayed is what
you have selected...
- The install program will now copy files. At the end it will
run a few perl scripts itself, which briefly appear as DOS
boxes. Don't worry, it is all quite normal.
- Release notes. Well worth a read.
- Reboot! Just so the path statement takes effect. In any
case, it is always good practice to reboot after a new install.
3. Testing - Your First Perl Script
So you know what this tutorial is designed to do. You know what
Perl is designed to do, and you have even installed it. It is now
time to start the tutorial proper, and actually hack some code.
The Tutorial: The Journey Begins
Your First Time
Assuming all has gone to plan, you can now create your first Perl
script. Follow these instructions, but before you start read them
through once, then begin. That's a good idea with any form of
computer-related procedure. So, to begin:
- Create a new directory for your perl scripts, separate to your
data files and the perl installation. For example, c:\scripts\,
which is what I'll assume you are using in this tutorial.
- Start up whatever text editor you're going to hack Perl with.
Notepad.Exe is just fine. If you can't find Notepad on your
Start menu, press the Start button, then select Run, type in
'notepad' and click OK.
- Type the following in Notepad
print "My first Perl script\n";
- Save the file to c:\scripts\myfirst.pl. Be careful!
Notepad may save files with a .txt extension,
so you will end up with myfirst.pl.txt by default.
Perl won't mind, it'll still execute the file as long as you
give it that full name. If your version
of Notepad does this, select "All files" before saving
or rename the file then load it again. Better yet, use a decent
text editor!
- You don't need to exit Notepad -- keep it open, as we'll be
making changes very soon.
- Switch to your command prompt. If you don't know how to start
a command prompt, click 'Start' and then 'Run'. If using Windows
9x, type in 'command' and press enter. If using NT, type in 'cmd'
and press Enter.
- Change to your perl scripts directory, for example cd \scripts.
- Hold your breath, and execute the script:
perl myfirst.pl
and you'll see the output. Welcome to the world of Perl ! See
what I mean about it being easy to start ? However, it is difficult
to finish with Perl once you begin :-)
What if it doesn't...?
So you typed in perl myfirst.pl and you didn't see
My first Perl script on the screen.
If you saw "bad command or filename" then either you
haven't installed Perl or perl.exe is not in your path. Probably the
latter. Reboot, then try again.
If you saw Can't open perl script "xxxx.pl": No such file or
directory then perl is definitely installed, but
you have either got the name of the script wrong or the script is
not in the directory you are trying to run it from.
For example, maybe you saved the script in c:\windows
and you are in c:\scripts so of course Perl complains
it can't find the script. Could you? Well, don't expect Perl to
then. You don't have to run the script from the directory in which
it resides, but it is easier.
Assuming it's now all right...
We need to analyse what's going on here a little. First note that
the line ends with a semicolon (;).
Almost all lines of code in Perl have to end with semicolons, and
those that don't have to will accept semicolons anyway. The moral is
-- use semicolons. Sorry; the moral is; use semicolons.
Oh, one more thing -- if you haven't already done so, continue
breathing.
Also note the \n. This
is the code to tell Perl to output a newline. What's a newline?
Delete the \n from the program and run it again:
print "My first Perl script";
and all should become clear. You have now
written your first Perl script.
Shebang
Almost every Perl book is written for UN*X, which is a problem
for Win32. This leads to scripts like:
#!c:/perl/perl.exe
print "I'm a cool Perl hacker\n";
The function of the 'shebang' line is to tell the shell how to
execute the file. Under UNIX, this makes sense. Under Win32, the
system must already know how to execute the file before it is loaded
so the line is not needed.
However, the line is not completely ignored, as it is searched
for any switches you may have given Perl (for example -w
to turn on warnings).
You may also choose to add the line so your scripts run directly
on UNIX without modification, as UNIX boxes probably do need
it. Win32 systems do not. We shall continue with the lesson.
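As a rough illustration of the switch scanning described above, the
sketch below puts -w on the shebang line. The path /usr/bin/perl is
only an assumption about where perl lives on a UNIX box. On Win32 the
path part does nothing, but the -w should still be picked up so long as
the line is the very first line of the script and mentions perl.

```perl
#!/usr/bin/perl -w
# The path above is an assumed UNIX location, not a requirement.
# Run this as: perl shebang.pl   (the file name is just an example)
$greeting = "I'm a cool Perl hacker";
print "$greeting\n";
```

The same file can then be copied to a UNIX box and run there, provided
perl really does live at the path you gave.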
Variables
Scalars
So Perl is working, and you are working with Perl. Now for
something more interesting than simple printing. Variables. Let's
take simple scalar variables first. A scalar variable is a single
value. Like $var=10 which sets the variable $var to
the value of 10. Later, we'll look at lists like arrays and hashes,
where @var refers to more
than one value. For the moment, remember that Scalar is Singular.
If weird metaphors help, think of lots of scaly snakes at a singles
bar. If that didn't help, I apologise for putting the thought into
your mind.
$ % @ are Good Things
If you have any experience with other programming languages you
might be surprised by the code $var=10.
With most languages, if you want to assign the value 10
to a variable called var you'd write var=10.
Not so in Perl. This is a Feature. All variables are prefixed
with a symbol such as $ @ %.
This has certain advantages, like making programs easier to read.
Honestly, I'm serious! It just takes some getting used to. The
prefixes mean that you can see where the variables are quite
easily. And not only that, what sort of variable it is. The
human language German has a similar principle (except nouns are
capitalised, not prefixed with $, and
Perl is easier to pronounce). You'll agree later, I think.
So, ever onwards. Time to try some more variables:
$string="perl";
$num1=20;
$num2=10.75;
print "The string is $string, number 1 is $num1 and number 2 is $num2\n";
Typing
A closer look...notice you don't have to say what type of
variable you are declaring. In other languages you need to say
if the variable is a string, array, what sort of number it is and so
on. You might even have to declare what type of number it is. As an
example, in Java you'd be saying things like int var=10
which defines the variable var as an integer, with the value 10.
So, why do these other programming languages force you to declare
exactly what your variables are? Wouldn't it be easier if we could
just not bother?
For short programs, yes. For really big projects with many
programmers working on the same application, no. That's because
forcing variable type declaration also forces a certain discipline
and rigour which is what you need on big projects.
As you know, Perl is not designed for gigantic software
engineering efforts. It is all about small, quick programs. For
these purposes you don't need the rigour of variable controls as
much, so Perl doesn't bother.
This idea of forcing a programmer to declare what sort of
variable is being created is called typing. As Perl doesn't
by default enforce any rules on typing, it is said to be a loosely
typed language, as opposed to something like C++ which is strongly
typed.
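A small sketch of what loose typing means in practice: the same scalar
can hold a string and then a number, and Perl converts between the two
as each operator demands. The variable names are invented for the
example, and the dot used to join strings only turns up properly later
in the tutorial.

```perl
$value = "10";               # starts life as a string
$sum   = $value + 5;         # + wants numbers, so "10" is treated as 10
$text  = $sum . " apples";   # . wants strings, so 15 is treated as "15"
print "sum is $sum, text is $text\n";

$value = 3.5;                # the same variable can now hold a number
print "value is now $value\n";
```

Nowhere did we have to tell Perl what sort of thing $value was going to
hold.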
Variable Interpolation
We still haven't finished learning from that humble bit of code.
To refresh your memory, here it is again:
$string="perl";
$num1=20;
$num2=10.75;
print "The string is $string, number 1 is $num1 and number 2 is $num2\n";
Notice the way the variables are used in the string. Sticking
variables inside of strings has a technical term:
"variable interpolation". Now,
if we didn't have the handy $ prefix
for variables we'd have to do something like the example
below, which is pseudocode. Pseudocode is code to demonstrate a
concept, not designed to be run. Like certain Microsoft software.
print "The string is ".string." and the number is ".num."\n";
which is much more work. Convinced about those prefixes yet ?
Try running the following code:
$string="perl";
$num=20;
print "Doubles: The string is $string and the number is $num\n";
print 'Singles: The string is $string and the number is $num\n';
Double quotes allow the aforementioned variable interpolation.
Single quotes do not. Both have their uses as you will see later,
depending on whether you wish to interpolate anything.
Changing Variables
Auto(de|in)crements
If you want to add 1 to a variable you can, logically, do this:
$num=$num+1. There is a shorter way to do this, which is $num++.
This is an autoincrement. Guess what this is: $num--.
Yes, an autodecrement.
This example illustrates the above:
$num=10;
print "\$num is $num\n";
$num++;
print "\$num is $num\n";
$num--;
print "\$num is $num\n";
$num+=3;
print "\$num is $num\n";
The last example demonstrates that it doesn't have to be just 1
you can add or subtract.
Escaping
There's something else new in the code above. The \.
You can see what this does -- it 'escapes'
the special meaning of $.
Escaping means that just the $ symbol
is printed instead of it referring to a variable.
Actually \ has a deeper
meaning -- it escapes all of Perl's special characters, not
just $. Also, it turns
some non-special characters into something special. Like what ? Like
n. Add the magic \
and the humble 'n' becomes the mighty NewLine ! The \
character can also escape itself. So if you want to
print a single \ try:
print "the MS-DOS path is c:\\scripts\\";
Oh, '\' is also used for other things like references. But that's
not even covered here.
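Here are a few more escapes in one place, as a sketch. The strings are
made up for illustration, and \t (tab), \@ and \" are standard Perl
escapes that the text above hasn't mentioned by name.

```perl
$price = 5;
$line1 = "The variable \$price holds $price";   # \$ gives a plain $
$line2 = "Mail me at someone\@example.com";     # \@ gives a plain @
$line3 = "He said \"hello\"";                   # \" gives a plain "
$line4 = "c:\\scripts\\myfirst.pl";             # \\ gives a single \
print "$line1\n$line2\n$line3\n$line4\n";
print "name\tvalue\n";                          # \t becomes a tab
```

Try removing one backslash at a time and see what perl makes of it.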
There is a technical term for these 'special characters' such as
@ $ %. They are called metacharacters. Perl uses
plenty of metacharacters. In fact, you'll wear your keyboard pretty
evenly during a night's perl hacking. I think it is safe to say that
Perl uses every possible keystroke and shifted keystroke on a
standard US PC keyboard.
You'll be working with all sorts of obscure characters in your
Perl hacking career, and I also mean those on your keyboard. This
has earned perl a reputation for being difficult to understand.
That's entirely true. Perl does have such a reputation, no
doubt about it.
Is the reputation justified? In my opinion, Perl does have a
short but steep learning curve to begin with simply because it is so
different. However, once you learn the character meanings reading
perl code becomes much easier precisely because of all these
strange characters.
Context: About Perl and @^$%&~`/?
Perl uses so many weird characters that there aren't enough to go
round. So sometimes the same character has two or more meanings,
depending on its context. As an example, the humble dot .
can join two variables together, act as a wildcard or
become a range operator if there are two of them together. The caret
^ has different effects in [^abc] as opposed to [a^bc].
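To make that concrete, here is a sketch of the dot and the caret in
their different roles. It borrows lists and regular expressions, which
are covered later, so treat it as a preview. The variable names are
invented for the example.

```perl
$first  = "per";
$last   = "l";
$joined = $first . $last;      # one dot: joins two strings

@range = (1 .. 5);             # two dots: the range operator

$wild = "pxrl" =~ /p.rl/;      # dot inside a regex: any one character

$not_abc = "d" =~ /[^abc]/;    # ^ first in brackets: anything EXCEPT a, b or c
$caret   = "^" =~ /[a^bc]/;    # ^ elsewhere in brackets: a literal ^

print "joined: $joined\n";
print "range: @range\n";
print "wild: $wild, not_abc: $not_abc, caret: $caret\n";
```

Same characters, different jobs, and perl works out which is which from
what surrounds them.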
If this sounds crazy, think about the English language. What do
the following mean to you ?
Mean is, in one context, a word used to describe the purpose
of something. It is also another word for average. Furthermore, it
describes a nasty person, or a person who doesn't like spending
money, and is used in slang to refer to something impressive and
good.
That's five different uses for 'mean', and you don't have any
trouble understanding which one I mean due to context.
Polish, when capitalised, can either mean pertaining to the
country Poland, or the act of making something shiny. And 'like' can
mean similar to, or affection for.
So, when you speak or write English (think of two, to and too)
you know what these words mean by their context. It is exactly the
same way with Perl. Just don't assume a given metacharacter always
means what you first thought it did.
To finish off this section, try the following:
Strings and Increments
$string="perl";
$num=20;
$mx=3;
print "The string is $string and the number is $num\n";
$num*=$mx;
$string++;
print "The string is $string and the number is $num\n";
Note the easy shortcut *= meaning
'multiply $num by $mx' or, $num=$num*$mx.
Of course Perl supports the usual + - * / ** %
operators. The last two are
exponentiation (to the power of) and modulus (remainder of x divided
by y).
Also note the way you can increment a string ! Is this language
flexible or what ?
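A sketch of how the string increment behaves at the edges. The magic
belongs to ++ only (there is no matching string decrement), and only
to strings made of letters optionally followed by digits. The sample
values are chosen purely for illustration.

```perl
$word = "perl";  $word++;    # last letter moves on one: "perm"
$wrap = "az";    $wrap++;    # z wraps round to a and carries: "ba"
$grow = "zz";    $grow++;    # carrying off the front adds a letter: "aaa"
$mix  = "a9";    $mix++;     # digits carry into the letters: "b0"
print "$word $wrap $grow $mix\n";
```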
Print: A List Operator
The print function is a
list operator. That means it accepts a list of things to
print, separated by commas. As an example:
print "a doublequoted string ", $var, 'that was a variable called var', $num," and a newline \n";
Of course, you could just put all the above inside a single
doublequoted string:
print "a doublequoted string $var that was a variable called var $num and a newline \n";
to achieve the same effect. The advantage of using the print
function in list context
is that expressions are evaluated before being printed. For example,
try this:
$var="Perl";
$num=10;
print "Two \$nums are $num * 2 and adding one to \$var makes $var++\n";
print "Two \$nums are ", $num * 2," and adding one to \$var makes ", $var++,"\n";
You might have been slightly surprised by the result of that last
experiment. In particular, what happened to our variable $var?
It should have been incremented by one, resulting in
Perm. The reason being that 'm' is the next letter
after 'l' :-)
Actually, it was incremented by 1. We are postincrementing
the variable with $var++,
rather than preincrementing it.
The difference is that with postincrements, the value of the
variable is returned, then the operation is performed on it. So in
the example above, the current value of $var
was returned to the print function,
then 1 was added. You can prove this to yourself by adding the line
print "\$var is now $var\n";
to the end of the example above.
If we want the operation to be performed on $var
before the value is returned to the print function, then
preincrement is the way to go. ++$var will do the trick.
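The difference is easiest to see with a plain number. A minimal
sketch, with $count, $post and $pre as made-up names:

```perl
$count = 5;
$post  = $count++;    # $post gets the old value, then $count goes up
print "post: $post, count: $count\n";

$count = 5;
$pre   = ++$count;    # $count goes up first, then $pre gets the new value
print "pre: $pre, count: $count\n";
```

In both halves $count ends up as 6. Only the value handed back at the
moment of the increment differs.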
Subroutines -- A First Look
Let's take another look at the example we used to show how the
autoincrement system works. Messy, isn't it ? This is Batch File
Writing Mentality. Notice how we use exactly the same code four
times. Why not just put it in a subroutine?
$num=10; # sets $num to 10
&print_results; # prints variable $num
$num++;
&print_results;
$num*=3;
&print_results;
$num/=3;
&print_results;
sub print_results {
    print "\$num is $num\n";
}
Easier and neater. The subroutine can go anywhere in your script,
at the beginning, end, middle...makes no difference. Personally I
put all mine at the bottom and reserve the top part for setting
variables and main program flow.
A subroutine is just some code you want to use more than once in
the same script. In Perl, a subroutine is a user-defined function.
There is no difference. For the purposes of clarity I'll refer to
them as subroutines.
A subroutine is defined by starting with sub
then the name. After that you need a curly left
bracket {, then all the
code for your subroutine. Finish it off with a closing brace }.
The area between the two braces is called a
block. Remember this. There are such things as anonymous
subroutines but not here. Everything here has a name.
Subroutines are usually called by prefixing their name with an
ampersand, that is one of these -- &, like so: &print_results;.
It used to be cool to omit the &
prefix but all perl hackers are now encouraged to use
it to avoid ambiguity. Ambiguity can hurt you if you don't avoid it.
If you are worrying about variable visibility, don't. All the
variables we are using so far are visible everywhere. You can
restrict visibility quite easily, but that's not important right
now. If you weren't worrying about variable visibility, please don't
start. I'd tell you it's not important but that'll only make you
worried. (paranoid ?) We'll cover it later.
Comments
Did you see that a # crept
in there? That's a comment. Everything after a #
on that line is ignored. You can't continue it onto a newline
however, so if your comment won't fit on one line start a new one
with #. There are ways to
create Plain Old Documentation (POD) and more ways to comment but
they are not detailed here.
Comparisons
An iffy start
An if statement is
simple. if the day is Sunday, then lie in bed. A simple
test, with two outcomes. Perl conversion (don't run this):
if ($day eq "sunday") {
&lie_in_bed;
}
You already know that &lie_in_bed
is a call to a subroutine. We assume $day
is set earlier in the program. If $day
is not equal to 'sunday' &lie_in_bed
is not executed (pity). You don't need to say anything
else. Try this:
$day="sunday";
if ($day eq "sunday") {
print "Zzzzz....\n";
}
Note the syntax. The if
statement
requires something to test for Truth. This expression must
be in (parens), then you have the braces to form a block.
The Truth According to Perl
There are many Perl constructs which test for Truth. Some are if,
while and unless . So it is important you know what truth is, as
defined by Perl, not your tax forms. There are three main rules:
- Any string is true except for "" and "0" .
- Any number is true except for 0 . This includes negative numbers.
- Any undefined variable is false. An undefined variable is one
which doesn't have a value, ie has not been assigned to.
Some example code to illustrate the point:
&isit;
# $test1 is at this moment undefined
$test1="hello";
# a string, not equal to "" or "0"
&isit;
$test1=0.0;
# $test1 is now a number, effectively 0
&isit;
$test1="0.0";
# $test1 is a string, but NOT effectively 0 !
&isit;
sub isit {
if ($test1) {
# tests $test1 for truth or not
print "$test1 is true\n";
} else {
# else statement if it is not true
print "$test1 is false\n";
}
}
The first test fails because $test1 is undefined. This means it has
not been created by assigning a value to it. So according to Rule 3
it is false. The last two tests are interesting. Of course, 0.0 is
the same as 0 in a numeric context, so the number 0.0 is false. But
the string "0.0" is not the same as the string "0" , and only "" and
"0" are false strings, so in that case it is true.
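If you want to check the three rules for yourself, a small sketch along these lines might help. The subroutine truth_of is a made-up helper, and it is handed its value through @_, which this tutorial has not explained yet; use strict, use warnings and my are likewise additions.

```perl
use strict;
use warnings;

# Report whether a value is true or false according to Perl.
sub truth_of {
    my ($value) = @_;
    if ($value) {
        return "true";
    } else {
        return "false";
    }
}

print 'empty string "" : ', truth_of(""),    "\n";   # false (rule 1)
print 'string "0"      : ', truth_of("0"),   "\n";   # false (rule 1)
print 'string "0.0"    : ', truth_of("0.0"), "\n";   # true, it is not "0"
print 'string "00"     : ', truth_of("00"),  "\n";   # true, it is not "0"
print 'number 0.0      : ', truth_of(0.0),   "\n";   # false (rule 2)
print 'number -1       : ', truth_of(-1),    "\n";   # true, negatives count
print 'undefined       : ', truth_of(undef), "\n";   # false (rule 3)
```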
So here we are testing single variables. What's more useful is
testing the result of an expression. For example, this is an
expression; $x * 2 and so is this; $var1 + $var2 . It is the end
result of these expressions that is evaluated for truth.
An example demonstrates the point:
$x=5;
$y=5;
if ($x - $y) {
    print '$x - $y is ',$x-$y," which is true\n";
} else {
    print '$x - $y is ',$x-$y," which is false\n";
}
The test fails because 5-5 of course is 0, which is false. The print
statement might look a little strange. Remember that print is a list
operator? So we hand it a list. First item, a single-quoted string.
It is single quoted because we do not want to perform variable
interpolation on it. Next item is an expression which is evaluated,
and the result printed. Finally, a double-quoted string is used
because we want to print a newline, and without the doublequotes the
\n won't be interpolated.
What is probably more useful than testing a specific variable for
truth is equality testing. For example, has your lucky number been
drawn?
$lucky=15;
$drawnum=15;
if ($lucky == $drawnum) {
    print "Congratulations!\n";
} else {
    print "Guess who hasn't won!\n";
}
The important point about the above code is the equality operator,
== .
Equality and Perl
Now pay close attention, otherwise you'll end up posting an
annoying question somewhere. This is a FAQ, as in a Frequently Asked
Question.
The symbol = is an assignment operator, not a comparison operator.
Therefore:
if ($x = 10)
is always true, because $x is assigned the value 10 and the test
then sees 10, which is true.
if ($x == 10)
compares the two values, which might not be equal.
So far we have been testing numbers, but there is more to life
than numbers. There are strings too, and these need testing too.
$name = 'Mark';
$goodguy = 'Tony';
if ($name == $goodguy) {
    print "Hello, Sir.\n";
} else {
    print "Begone, evil peon!\n";
}
Something seems to have gone wrong here. Obviously Mark is
different to Tony, so why does perl consider them equal?
Mark and Tony are equal -- numerically. Neither string looks like a
number, so in a numeric context both count as 0. We should be
testing them as strings, not as numbers. To do this, simply replace
== with eq and everything will work as expected.
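Here is a small sketch that puts the two operators side by side on the same pair of names. The two verdict variables are invented for the demonstration, and the no warnings 'numeric' line is an addition that stops Perl grumbling about 'Mark' not being a number (use strict, use warnings and my are additions too).

```perl
use strict;
use warnings;
no warnings 'numeric';   # keep the demo quiet when strings are used as numbers

my $name    = 'Mark';
my $goodguy = 'Tony';

# In a numeric context a string with no leading digits counts as 0.
print "Numeric value of $name is ",    $name + 0,    "\n";
print "Numeric value of $goodguy is ", $goodguy + 0, "\n";

my $numeric_verdict = "different";
if ($name == $goodguy) {            # really 0 == 0
    $numeric_verdict = "equal";
}

my $string_verdict = "different";
if ($name eq $goodguy) {            # 'Mark' against 'Tony', letter by letter
    $string_verdict = "equal";
}

print "Compared with ==, they are $numeric_verdict\n";
print "Compared with eq, they are $string_verdict\n";
```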
All Equality is Not Equal: Numeric versus String
There are two types of comparison operator; numeric and string.
You've already seen two, == and eq . Run this:
$foo=291;
$bar=30;
if ($foo < $bar) {
    print "$foo is less than $bar (numeric)\n";
}
if ($foo lt $bar) {
    print "$foo is less than $bar (string)\n";
}
The lt operator compares in a string context, and of course <
compares in a numeric context.
Alphabetically, that is in a string context, 291 comes before 30.
It is actually decided by the ASCII value, but alphabetically is
close enough. Change the numbers around a little. Notice how Perl
doesn't care whether it uses a string comparison operator on a
numeric value, or vice versa. This is typical of Perl's
flexibility.
Bondage and discipline are pretty much alien concepts to Perl
(and the author). This flexibility does have a drawback. If you're
on a programming precipice, threatening suicide by jumping off, Perl
won't talk you out of your decision but will provide several ways of
jumping, stepping or falling to your doom while silently watching
your early conclusion. So be careful.
An interlude -- The Perl Motto
The Perl Motto is "There's More Than One Way To Do It", or
TMTOWTDI (you will also see it written TIMTOWTDI). Pronounced
'Tim-Toady'. This tutorial doesn't try to mention all possible ways
of doing everything, mainly because the author is far too lazy.
Write your Perl programs the way you want to.
The Comparison Operators Listed
The rest of the operators are:
Comparison               | Numeric | String
Equal                    | ==      | eq
Not equal                | !=      | ne
Greater than             | >       | gt
Less than                | <       | lt
Greater than or equal to | >=      | ge
Less than or equal to    | <=      | le
The Golden Rule of Comparisons
They may be odious, but remember the following:
- if you are testing a value as a string there should be
only letters in your comparison operator.
- if you are testing a value as a number there should
only be non-alpha characters in your comparison operator.
- note 'as a' above. You can test numbers as strings and vice
versa. Perl never complains, unless you have switched warnings on.
More About If: Multiples
More about if statements. Run this:
$age=25;
$max=30;
if ($age > $max) {
    print "Too old !\n";
} else {
    print "Young person !\n";
}
It is easy to see what else does. If the expression is false then
whatever is in the else block is evaluated (or carried out,
executed, whatever term you choose to use).
Simple. But what if you want another test ? Perl can do that too.
elsif
$age=25;
$max=30;
$min=18;
if ($age > $max) {
    print "Too old !\n";
} elsif ($age < $min) {
    print "Too young !\n";
} else {
    print "Just right !\n";
}
If the first test fails, the second is evaluated. This carries on
until there are no more elsif statements, or an else statement is
reached. An else statement is optional, and no elsif statements
should come after it. Logical, really.
There is a big difference between the above example and the one
below:
if ($age > $max) {
    print "Too old !\n";
}
if ($age < $min) {
    print "Too young !\n";
}
If you run it with an age that is too old or too young, it will
return the same result. (With our age of 25 it prints nothing at
all, as there is no else to catch what is left over.) However, it is
Bad Programming Practice. In this case we are testing a number, but
suppose we were testing a string to see if it contained R or S. It
is possible that a string could contain both R and S. So it would
pass both 'if' tests. Using an elsif avoids this. As soon as the
first statement is true, no more elsif statements (and no else
statement) are executed.
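The R and S case can be sketched like this. The test string "ROSES" and the two counters are made up for illustration, and the built-in index function (which returns -1 when the letter is not found) stands in for a proper search, since regular expressions come later. As before, use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my $string = "ROSES";   # hypothetical value containing both an R and an S

# Two separate ifs: both tests are made, so both messages appear.
my $separate = 0;
if (index($string, "R") >= 0) {
    print "Separate ifs: found an R\n";
    $separate++;
}
if (index($string, "S") >= 0) {
    print "Separate ifs: found an S\n";
    $separate++;
}

# if/elsif: testing stops at the first condition that is true.
my $chained = 0;
if (index($string, "R") >= 0) {
    print "if/elsif: found an R\n";
    $chained++;
} elsif (index($string, "S") >= 0) {
    print "if/elsif: found an S\n";
    $chained++;
}

print "Separate ifs fired $separate times, if/elsif fired $chained time\n";
```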
You don't need to take up a whole three lines:
print "Too old\n" if     $age > $max;
print "Too old\n" unless $age < $max;
I added some whitespace there for aesthetic beauty. There are other
operators that you can use instead of if and unless , but that's for
later on.
Incidentally, the two lines of code above do not do exactly the
same thing. Consider a maximum age of 50 and input age of 50. The
first line prints nothing, because 50 is not greater than 50, but
the second prints "Too old", because 50 is not less than 50.
Therefore, you should be very careful about your logic when writing
code (nice obvious statement there).
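A quick sketch of the age-50 case, for the sceptical. The two flag variables are invented so the result can be inspected afterwards; use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my $age = 50;
my $max = 50;

my $if_fired     = 0;
my $unless_fired = 0;

# Fires only when $age is strictly greater than $max; 50 > 50 is false.
$if_fired = 1 if $age > $max;

# Fires whenever $age is NOT less than $max, which includes being equal.
$unless_fired = 1 unless $age < $max;

print "The if version says too old\n"     if $if_fired;
print "The unless version says too old\n" if $unless_fired;
```

Only the unless version speaks up, because the boundary value falls on different sides of the two tests.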
For those that were wondering, Perl has no case statement. This
is all explained in the FAQ, which is located at http://www.perl.com/.
User Input
STDIN and other filehandles
Sometimes you have to interact with the user. It is a pain, but
sometimes necessary, especially for the live ones. To ask for input
and do something with it try this:
print "Please tell me your name: ";
$name=<STDIN>;
print "Thanks for making me happy, $name !\n";
New things to learn here. Firstly, <STDIN> . STDIN is a filehandle.
Filehandles are what you use to interact with things such as files,
console input, socket connections and more.
You could say STDIN is the standard source for input. Guess what
STDIN stands for. In this case the STDIN filehandle is reading from
the console.
The angle brackets <> read data from a filehandle. Exactly how much
is dependent on what you do, but in this case it is whatever was
input at the prompt.
So we are reading from the STDIN filehandle. The value is
assigned to $name
and
printed. Any idea why the ! ends up on a new line ? on a new line
on a newline ????
As you pressed Enter, you of course included a newline with your
name. The easy way to get rid of it is to chop
it off:
Chop
print "Please tell me your name: ";
$name=<STDIN>;
chop $name
print "Thanks for making me happy, $name !\n"
and that fails with a syntax error. Can you spot why? Look at the
error message, look at the line number and see where the syntax is
wrong. The answer is a missing semicolon ( ; ) on the end of the
last two lines. If you add a ; to the end of line 3, but not to the
last line, then the program works as it should. This is because Perl
doesn't need a semicolon to end the last statement of a block.
However, I'd advise ending all your statements with semicolons
because you may well be adding more code to them and it is only one
little keystroke.
When you add the semicolon(s), the program runs correctly. The chop
function removes the last character of whatever it is given to chop,
in this case removing the newline for us. In fact, that can be
shortened:
print "Please tell me your name: ";
chop ($name=<STDIN>);
print "Thanks for making me happy, $name !";
The parentheses ( ) force chop to act on the result of what is
inside them. So $name=<STDIN> is evaluated first, then the result
from that, which is $name , is chopped. Try it without.
You can read from STDIN as much as you like. For your
entertainment I have created a sophisticated multinational greeting
machine:
print "Please tell me your name: ";
chop ($name=<STDIN>);
print "Please tell me your nationality: ";
chop ($nation=<STDIN>);
if ($nation eq "British" or $nation eq "New Zealand") {
    print "Hallo $name, pleased to meet you!\n";
} elsif ($nation eq "Dutch" or $nation eq "Flemish") {
    print "Hoi $name, hoe gaat het met u vandaag?!\n";
} else {
    print "HELLO!!! SPEAKEEE ENGLIEESH???\n";
}
Aside from demonstrating the native English speaker's linguistic
talents, this script also introduces the or logical operator. We'll
cover or and its associates in more detail later on. First, a word
of warning.
Chopping is dangerous, as my friend One Hand Harold will tell
you. Everyone is concerned about various forms of safety these days,
and your perl code should be no exception.
Safe Chopping with Chomp
Rather than just wantonly remove the last character regardless of
whatever it is, without a care in the world, just simply consigning
the poor little thing to the Great Bit Bucket in the Sky, you can
remove the last character only if it is a newline with chomp :
chomp ($name=<STDIN>);
At this point the perl gurus are screaming "I found an error !".
Well, chomp doesn't always remove the last character if it is a
newline, but if it doesn't, you have set a special variable, namely
$/ , to something different. I presume that if you do set $/ you
know what it does.
It is explained later in this very document. Of course, being a good
pupil, you wouldn't experiment with the unknown, blindly changing
things just for the hell of it to see what happens.
If you don't, you'll never learn anything useful.
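To see One Hand Harold's point without typing anything at a prompt, here is a sketch that feeds chop and chomp the same two strings, one with a trailing newline and one without. The variable names are invented; use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my $with_newline    = "Harold\n";
my $without_newline = "Harold";

# chop removes the last character, whatever it is.
my $chopped_a = $with_newline;
my $chopped_b = $without_newline;
chop $chopped_a;    # the newline goes
chop $chopped_b;    # the d goes -- ouch

# chomp removes the last character only if it is a newline
# (strictly, only a trailing copy of whatever $/ holds).
my $chomped_a = $with_newline;
my $chomped_b = $without_newline;
chomp $chomped_a;   # the newline goes
chomp $chomped_b;   # nothing to remove, so nothing happens

print "chop : [$chopped_a] [$chopped_b]\n";
print "chomp: [$chomped_a] [$chomped_b]\n";
```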
Arrays
Lists, herds -- what are arrays?
Perl has two types of array, associative arrays (hashes) and
arrays. Both types are lists. A list is just a collection of
variables referred to as the collection, not as individual elements.
You can think of Perl's lists as a herd of animals. List context
refers to the entire herd, scalar context refers to a single
element. A list is a herd of variables. The variables don't have to
be all of the same type -- you might have a herd of ten sheep, three
lions and two wolves. It would probably be just three lions and one
wolf before long, but bear with me. In the same way, you might have
a Perl list of three scalar variables, two array elements and ten
hash elements.
Certain types of lists are known by certain names. Just as a herd
of sheep is called a flock, a herd of lions is called a pride, a
herd of wolves is called a pack and a herd of managers a confusion,
some types of Perl list have special names.
Basic Array Work
For example, an array is an ordered list of scalar variables.
This list can be referred to as a whole, or you can refer to
individual elements in the list. The program below defines an
array, called @names . It puts five values into the array.
@names=("Muriel","Gavin","Susanne","Sarah","Anna");
print "The elements of \@names are @names\n";
print "The first element is $names[0] \n";
print "The third element is $names[2] \n";
print 'There are ',scalar(@names)," elements in the array\n";
Firstly, notice how we define @names . As it is in a list context,
we are using parens. Each value is comma separated, which is Perl's
default list delimiter. The double quotes are not necessary, but as
these are string values it makes it easier to read and change later
on.
Next, notice how we print it. Simply refer to it as a whole, that
is in list context. List context means referring to more than one
element of a list at a time. The code print @names; will work
perfectly well too. But....
I usually learn something about Perl every time I work with it.
When running a course, a student taught me this trick which he had
discovered:
@names=("Muriel","Gavin","Susanne","Sarah","Anna","Paul","Trish","Simon");
print @names;
print "\n";
print "@names";
When a list is placed inside doublequotes, it is space delimited
when interpolated. Useful.
If we want to do anything with the array as a list, that is doing
something with more than one value, then refer to the array as
@array . That's important. The @ prefix is used when you want to
refer to more than one element of a list.
When you refer to more than one, but not all elements of an array
that is known as a slice. Cake analogies are appropriate. Pie
analogies are probably healthier but equally accurate.
Elements of Arrays
Arrays are not much use unless we can get to individual elements.
Firstly, we are dealing with a single element of the list, so we
cannot use @ which refers to multiple elements of the array. It is a
single, scalar variable, so $ is used.
Secondly, we must specify which element we want. That's easy -
$array[0] for the first, $array[1] for the second and so forth.
Array indexes start at 0, unless you do something which is so highly
deprecated ('deprecated' means allowed, usually for backwards
compatibility, but disapproved of because there are better ways) I'm
not even going to mention it.
Finally, in the last line of the program above, we force what is
normally list context (more than one element) into scalar context
(single element) to give us the number of elements in the array.
Without the scalar , print would be handed the whole array and would
simply print every element, one straight after the other.
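A minimal sketch of the difference that scalar makes. The variables $count and $same are invented for illustration, and assigning an array straight to a scalar is a second way of getting scalar context that the text above does not mention. use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my @names = ("Muriel", "Gavin", "Susanne", "Sarah", "Anna");

my $count = scalar(@names);   # scalar context: the number of elements
my $same  = @names;           # assigning to a scalar gives scalar context too

print "Counted $count elements, and $same the other way\n";

# With scalar, print receives one number.
print 'With scalar   : ', scalar(@names), "\n";

# Without it, print receives the five names and runs them together.
print 'Without scalar: ', @names, "\n";
```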
How to refer to elements of an array
Please understand this:
$myvar="scalar variable";
@myvar=("one","element","of","an","array","called","myvar");
print $myvar;     # refers to the contents of a scalar variable called myvar
print $myvar[1];  # refers to the second element of the array myvar
print @myvar;     # refers to all the elements of array myvar
The two variables $myvar and @myvar are not, in any way, related.
Not even distantly. Technically, they are in different namespaces.
Going back to the animal analogy, it is like having a dog named 'Myvar'
and a goldfish called 'Myvar'. You'll never get the two mixed up
because when you call 'Myvar !!!!' or open a can of dog food the 'Myvar'
dog will come running and goldfish won't. Now, you couldn't have two
dogs called 'Myvar' and in the same way you can't have two Perl
variables in the same namespace called 'Myvar'.
More ways to access arrays
The element number can be a variable.
print "Enter a number :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah","Anna");
print "You requested element $x who is $names[$x]\n";
print "The index number of the last element is $#names \n";
This is useful. Notice the last line of the example. It returns the
index number of the last element. Of course you could always just do
this: $last=scalar(@names)-1; but this is more efficient. It is an
easy way to get the last element, as follows:
print "Enter the number of the element you wish to view :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah","Anna","Paul","Trish","Simon");
print "The first two elements are @names[0,1]\n";
print "The first three elements are @names[0..2]\n";
print "You requested element $x who is $names[$x-1]\n"; # starts at 0
print "The elements before and after are : @names[$x-2,$x]\n";
print "The first, second, third and fifth elements are @names[0..2,4]\n";
print "a) The last element is $names[$#names]\n"; # one way
print "b) The last element is @names[-1]\n";      # different way
It looks complex, but it is not. Really. Notice you can have
multiple values separated by a comma. As many as you like, in
whatever order. The range operator .. gives you everything between
and including the values. And finally look at how we print the last
element - remember $#names gives us a number ? Simply enclose it
inside square brackets and you have the last element.
Do also note that because element accesses such as [0,1] are more
than one variable, we cannot use the scalar prefix, namely the $
symbol. We are accessing the array in list context, so we use the @
symbol. Doesn't matter that it is not the entire array. Remember,
accessing more than one element of an array but not the entire array
is called a slice. I won't go over the food analogies again.
For Loops
A for Loop demonstrated
All well and good, but what if we want to load each element of
the array in turn ? Well, we could build a for loop like this:
@names=("Muriel","Gavin","Susanne","Sarah","Anna","Paul","Trish","Simon");
for ($x=0; $x <= $#names; $x++) {
print "$names[$x]\n";
}
which sets $x to 0, runs the loop once, then adds one to $x , checks
it is less than or equal to $#names , and if so carries on. By the
way, that was your introduction to for loops. Just to go into a
little detail there, the for loop has three parts to it:
- Initialisation
- Test Condition
- Modification
In this case, the variable $x is initialised to 0. It is immediately
tested to see if it is smaller than, or equal to $#names . If that
is true, then the block is executed once. Critically, if it is not
true the block is not executed at all.
Once the block has been executed, the modification expression is
evaluated. That's $x++ . Then, the test condition is checked to see
if the block should be executed or not.
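One way to see the three parts at work is to write the same loop out longhand with while, which simply repeats its block for as long as the test is true. This is only a sketch: the shortened list of names and the $passes counter are invented, and use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my @names  = ("Muriel", "Gavin", "Susanne");
my $passes = 0;                 # counts how often the block runs

my $x = 0;                      # 1. initialisation, done once
while ($x <= $#names) {         # 2. test condition, checked before every pass
    print "$names[$x]\n";
    $passes++;
    $x++;                       # 3. modification, done after every pass
}

print "The block ran $passes times and \$x finished at $x\n";
```

Note that $x ends up one past the last index, which is exactly what makes the test fail and the loop stop.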
For loops with .. , the range operator
There is another version:
for $x (0 .. $#names) {
    print "$names[$x]\n";
}
which takes advantage of the range operator .. (two dots together).
This simply gives $x the value of 0, then increments $x by 1 until
it is equal to $#names .
foreach
For true beauty we must use foreach .
foreach $person (@names) {
    print "$person";
}
This goes through each element ('iterates', another good technical
word to use) of @names , and assigns each element in turn to the
variable $person . Then you can do what you like with the variable.
Much easier. You can use
for $person (@names) {
    print "$person";
}
if you want. Makes no difference at all, aside from a little
clarity.
The infamous $_
In fact, that gets shorter. And now I need to introduce you to $_ ,
which is the Default Input and Pattern Searching Variable.
foreach (@names) {
    print "$_";
}
If you don't specify a variable to put each element into, $_ is used
instead as it is the default for this operation, and many, many
others in Perl. Including the print function :
foreach (@names) {
    print ;
}
As we haven't supplied any arguments to print , $_ is printed as
default. You'll be seeing a lot of $_ in Perl. Actually, that
statement is not exactly true. You will be seeing a lot of places
where $_ is used, but quite often when it is used, it is not
actually written. In the above example, you don't actually see $_
but you know it is there.
A Premature End to your loop
A loop, by its nature, continues. If that didn't make sense,
start reading this sentence again.
The old jokes are the best, aren't they?
The joke above is a loop. You continue re-reading the sentence
until you realise I'm trying to be funny. Then you exit the loop. Or
maybe somebody doesn't exit it. Whatever, loops always run until the
expression they are testing returns false. In the case of the
examples above, a false value is returned when all the elements of
the array have been cycled through, and the loop ends.
If you want an everlasting loop, just test a condition you know
will always be true:
while (1) {
    $x++;
    print "$x: Did you know you can press CTRL-C to interrupt a perl program?\n";
}
Another way to exit a loop is a simple foreach over the elements, as
we have seen. But what if we don't know when we want to exit a loop?
For example, suppose we want to print out a list of names but stop
when we find one with a particular title? You are throwing a huge
party, someone is allergic to vodka, and this person has drunk from
the punch bowl despite being assured by someone holding two empty
bottles of Absolut that he was just using the bottles to convey yet
more orange juice into said punch bowl. So you need a doctor, and so
you write a Perl script to find one from the list of attendees,
wanting the doctor's name to be the last item printed:
@names=('Mrs Smith','Mr Jones','Ms Samuel','Dr Jansen','Sir Philip');
foreach $person (@names) {
    print "$person\n";
    last if $person=~/Dr /;
}
The last operator is our friend. Don't worry about the /Dr /
business -- that is a regular expression which we cover next. All
you need to know is that it returns true if the name contains 'Dr '.
When it does return true, last is executed and the loop ends early.
A little more control over the premature ending: Labels
So that's easy enough. But wait! We need a medical, human-fixer
type doctor, not just anyone with a PhD. So, the same principle
applies in this example here:
@names =('Mrs Smith','Mr Jones','Ms Samuel','Dr Jansen','Sir Philip');
@medics =('Dr Black','Dr Waymour','Dr Jansen','Dr Pettle');
foreach $person (@names) {
    print "$person\n";
    if ($person=~/Dr /) {
        foreach $doc (@medics) {
            print "\t$doc\n";
            last if $doc eq $person;
        }
    }
}
Aside from showing one way to indent your code, this also
demonstrates a nested loop. A nested loop is a loop within a loop.
What happens is that the @names array is searched for a 'Dr ', and
if it is found then the @medics array is searched to make sure the
doctor is a human-fixing doctor not a professor of physics or
something. The regular expression has been shifted into an if
statement, where it works nicely as it only returns true or false.
The problem with the code is that after we find our medical
doctor we want it to stop. But it doesn't. It only stops the loop it
is in, so Dr Pettle never gets printed. However, the code just
carries on with Sir Philip who is terribly sorry old chap, but can't
be of any bally use at all, what ho! What we need is a way to break
out of the entire loop from within a nest. Like so:
@names =('Mrs Smith','Mr Jones','Ms Samuel','Dr Jansen','Sir Philip');
@medics =('Dr Black','Dr Waymour','Dr Jansen','Dr Pettle');
LBL: foreach $person (@names) {
    print "$person\n";
    if ($person=~/Dr /) {
        foreach $doc (@medics) {
            print "\t$doc\n";
            last LBL if $doc eq $person;
        }
    }
}
Only two changes here. We have defined a label, namely LBL .
Instead of breaking out from the current loop, which is the default,
we specify a label to break out to, which is in the outer loop. This
works with as many nested loops as your brain can handle. You don't
have to use uppercase names but for namespace reasons it is
recommended, and you can call your labels whatever you please. I was
just being unimaginative with the name of LBL, feel free to invent
labels called DORIS or MATILDA if that's what floats your personal
boat.
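As a variation, here is a sketch that remembers which doctor was found instead of just printing as it goes. The label GUEST and the variables $found and $checked are invented for illustration; use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my @names  = ('Mrs Smith', 'Mr Jones', 'Ms Samuel', 'Dr Jansen', 'Sir Philip');
my @medics = ('Dr Black', 'Dr Waymour', 'Dr Jansen', 'Dr Pettle');

my $found   = "";   # stays empty if no medic is on the guest list
my $checked = 0;    # how many guests we looked at

GUEST: foreach my $person (@names) {
    $checked++;
    foreach my $doc (@medics) {
        if ($doc eq $person) {
            $found = $person;
            last GUEST;     # leave BOTH loops, not just the inner one
        }
    }
}

if ($found ne "") {
    print "Fetch $found! (found after checking $checked guests)\n";
} else {
    print "No medic here. Hide the punch bowl.\n";
}
```

Because last GUEST abandons the outer loop as well, Sir Philip is never even looked at.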
Changing the Elements of an Array
So we have @names
. We
want to change it. Run this:
print "Enter a name :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah");
print "@names\n";
push (@names, $x);
print "@names\n";
Fairly self explanatory. The push function just adds a value on to
the end of the array. Of course, Perl being Perl, it doesn't have to
be just the one value:
print "Enter a name :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah");
@cities=("Brussels","Hamburg","London","Breda");
print "@names\n";
push (@names, $x, 10, @cities[2..4]);
print "@names\n";
This is worth looking at in more detail. There is no fifth element
of @cities , as referred to by @cities[2..4] , yet Perl doesn't
complain. Add this to the end of the example :
print "There are ",scalar(@names)," elements in \@names\n";
There appear to be 8 elements in @names . However, we have just
proved there are in fact 9. The reason there are 9 is that we
referred to a non-existent element of @cities , and Perl has quite
happily added an undefined value to @names to suit. The array
@cities remains unchanged. Try pop ing the array if you don't
believe me.
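If you would rather not take that on trust, this sketch does the counting and the popping for you. The name "Paul" stands in for whatever would have been typed at the prompt, and the defined function used here is only explained later in the tutorial. use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my @names  = ("Muriel", "Gavin", "Susanne", "Sarah");
my @cities = ("Brussels", "Hamburg", "London", "Breda");

# Indexes 2 and 3 exist; index 4 does not, so the slice supplies
# an undefined value in its place.
push (@names, "Paul", 10, @cities[2..4]);

print "Elements in \@names : ", scalar(@names),  "\n";
print "Elements in \@cities: ", scalar(@cities), "\n";

# Pop the mystery element and see what it was.
my $last = pop(@names);
if (defined $last) {
    print "The last element was '$last'\n";
} else {
    print "The last element was undefined\n";
}
print "Elements in \@names after the pop: ", scalar(@names), "\n";
```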
So that's push . Now for some...
Jiggerypokery with Arrays
@names=("Muriel","Gavin","Susanne","Sarah");
@cities=("Brussels","Hamburg","London","Breda");
&look;
$last=pop(@names);
unshift (@cities, $last);
&look;
sub look {
    print "Names : @names\n";
    print "Cities: @cities\n";
}
Now we have two arrays. The pop function removes the last element of
an array and returns it, which means you can do something like
assign the returned value to a variable.
The unshift function adds a value to the beginning of the array.
Hope you didn't forget that &subroutinename calls a subroutine.
Presented below are the functions you can use to work with arrays:
A table of array hacking functions
push    | Adds value to the end of the array
pop     | Removes and returns value from end of array
shift   | Removes and returns value from beginning of array
unshift | Adds value to the beginning of array
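All four in one go, as a rough sketch. The starting list and the variables $from_end and $from_start are invented for illustration; use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my @names = ("Gavin", "Susanne");

push    (@names, "Sarah");     # added at the end
unshift (@names, "Muriel");    # added at the beginning
print "After push and unshift: @names\n";

my $from_end   = pop   (@names);   # removes and returns the last element
my $from_start = shift (@names);   # removes and returns the first element
print "pop gave $from_end, shift gave $from_start\n";
print "Left over: @names\n";
```

After all that the array is back where it started, with just Gavin and Susanne in it.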
Now, accessing other elements of arrays. May I present the splice
function ?
Splice
@names=("Muriel","Sarah","Susanne","Gavin");
&look;
@middle=splice (@names, 1, 2);
&look;
sub look {
    print "Names : @names\n";
    print "The Splice Girls are: @middle\n";
}
The first argument for splice is an array. The second is the offset.
The offset is the index number of the list element to begin splicing
at. In this case it is 1. Then comes the number of elements to
remove, which is sensibly 1 or more; in this case it is 2. You can
set it to 0 and perl, in true perl style, won't complain.
Setting to 0 is handy because splice can add elements to the middle
of an array, and if you don't want any deleted 0 is the number to
use. Like so:
@names=("Muriel","Gavin","Susanne","Sarah");
@cities=("Brussels","Hamburg","London","Breda");
&look;
splice (@names, 1, 0, @cities[1..3]);
&look;
sub look {
    print "Names : @names\n";
    print "Cities: @cities\n";
}
Notice how the assignment to @middle has gone -- it is no longer
relevant.
If you assign the result of a splice to a scalar then:
@names=("Muriel","Sarah","Susanne","Gavin");
&look;
$middle=splice (@names, 1, 2);
&look;
sub look {
    print "Names : @names\n";
    print "The Splice Girls are: $middle\n";
}
then the scalar is assigned the last element removed, or undef if
no elements were removed at all.
The splice
function is
also a way to delete elements from an array. In fact, a discussion
of :
Deleting Variables
is in order. Suppose we want to delete Hamburg from the following
array. How do we do it ? Perhaps:
@cities=("Brussels","Hamburg","London","Breda");
&look;
$cities[1]="";
&look;
sub look {
print "Cities: ",scalar(@cities), ": @cities\n";
}
would be appropriate. Certainly Hamburg is removed. Shame, such a
great lake. But note, the array element still exists. There are
still four elements in @cities . So what we need is the appropriate
splice function, which removes the element entirely.
splice (@cities, 1, 1);
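Dropped into a complete script, that might look like the sketch below. Catching the removed value in $removed is an extra, there only to show what splice hands back; use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

my @cities = ("Brussels", "Hamburg", "London", "Breda");
print "Before: ", scalar(@cities), ": @cities\n";

# Offset 1, length 1: remove exactly one element, starting at index 1.
my $removed = splice (@cities, 1, 1);

print "Removed: $removed\n";
print "After : ", scalar(@cities), ": @cities\n";
```

This time the count drops from four to three, because the element itself has gone and not just its contents.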
Now that's all well and good for arrays. What about ordinary
variables, such as these:
$car ="Porsche 911";
$aircraft="G-BBNX";
&look;
$car="";
&look;
sub look {
print "Car :$car: Aircraft:$aircraft:\n";
print "Aircraft exists !\n" if $aircraft;
print "Car exists !\n" if $car;
}
It looks like we have deleted the $car variable. Pity. But think
about it. It is not deleted, it is just set to the null string "".
As you recall (hopefully) from previous ramblings, the null string
evaluates to false so the if test fails.
False values versus Existence: It is, therefore...
Just because something is false doesn't mean to say it doesn't
exist. A wig is false hair, but a wig exists. Your variable is still
there. Perl does have a function to test if something exists.
Existence, in Perl terms, means defined. So:
print "Car is defined !\n" if defined $car;
will evaluate to true, as the $car variable does in fact exist.
This begs the question of how to really wipe variables from the
face of the earth, or at least your Perl script. Simple.
$car ="Porsche 911";
$aircraft="G-BBNX";
&look;
undef $car; # this undefines $car
&look;
sub look {
print "Car :$car: Aircraft:$aircraft:\n";
print "Aircraft exists !\n" if $aircraft;
    print "Car exists !\n" if defined $car;
}
This variable $car is eradicated, deleted, killed, destroyed.
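To pull false and undefined apart in one place, here is a rough sketch. The describe subroutine is a made-up helper which is handed its value through @_ (not yet covered in this tutorial), and it deliberately avoids printing the value so that an undefined one causes no warnings. use strict, use warnings and my are additions.

```perl
use strict;
use warnings;

# Say which of the three states a value is in.
sub describe {
    my ($value) = @_;
    if (!defined $value) {
        return "undefined (and therefore false)";
    } elsif ($value) {
        return "defined and true";
    } else {
        return "defined but false";
    }
}

my $car = "Porsche 911";
print "With a value     : ", describe($car), "\n";

$car = "";                  # emptied, but the variable still has a value
print "Set to the null string: ", describe($car), "\n";

undef $car;                 # now it really has no value
print "After undef      : ", describe($car), "\n";
```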
And now for something completely different....
Basic Regular Expressions
An introduction
Or regex for short. These can be a little intimidating. But I'll bet
you have already used some regex in your computing life so far. Have
you ever said "I'll have any Dutch beer ?" That's a regex which will
match a Grolsch or Heineken, but not a Budweiser, orange juice or
cheese toastie. What about dir *.txt ? That's a pattern match too,
listing any files ending in .txt. (Strictly speaking it is a
wildcard rather than a regular expression, but the idea is the
same.)
Perl's regex often look like this:
$name=~/piper/
That is saying "If 'piper' is inside $name , then True."
The regular expression itself is between / / slashes, and the =~
operator specifies the target for the search.
An example is called for. Run this, and answer it with 'the faq'.
Then try 'my tealeaves' and see what happens.
print "What do you read before joining any Perl discussion ? ";
chomp ($_=<STDIN>);
print "Your answer was : $_\n";
if ($_=~/the faq/) {
    print "Right ! Join up !\n";
} else {
    print "Begone, vile creature !\n";
}
So here $_ is searched for 'the faq'. Guess what we don't need ! The
=~ . This works just as well:
if (/the faq/) {
because if you don't specify a variable, then perl searches $_ by
default.
In this particular case, it would be better to use
if ($_ eq "the faq") {
as we are testing for exact matches.
Sensitivity -- regexes in touch with their inner child
But what if someone enters 'The FAQ' ? It fails, because the
regex is case sensitive. We can easily fix that:
if (/the faq/i) {
with the /i switch, which specifies case-insensitivity. Now it works
for all variations, such as "the Faq" and "the FAQ".
Now you can appreciate why a regular expression is better in this
situation than a simple test using eq . As the regex searches one
string for another string, a response of "I would read the FAQ first
!" will also work, because "the FAQ" will match the regex.
Study this example just to clarify the above. Tabs and spaces
have been added for aesthetic beauty:
$_="perl for Win32";                           # sets the string to be searched
if ($_=~/perl/) { print "Found perl\n" };      # is 'perl' inside $_ ? $_ is "perl for Win32".
if (/perl/)     { print "Found perl\n" };      # same as the regex above. Don't need the =~ as we are testing $_
if (/PeRl/)     { print "Found PeRl\n" };      # this will fail because of case sensitivity
if (/er/)       { print "Found er\n" };        # this will work, because there is an 'er' in 'perl'
if (/n3/)       { print "Found n3\n" };        # this will work, because there is an 'n3' in 'Win32'
if (/win32/)    { print "Found win32\n" };     # this will fail because of case sensitivity
if (/win32/i)   { print "Found win32 (i)\n" }; # this will *work* because of case insensitivity (note the /i)
print "Found!\n"  if / /;                      # another way of doing it, this time looking for a space
print "Found!!\n" unless $_!~/ /;              # both these are the same, but reversing the logic with unless and !
print "Found!!\n" unless !/ /;                 # don't do this, it will always never not confuse nobody :-)
                                               # the ~ stays the same, but = is changed to ! (negation)
$find=32;                                      # Create some variables to search for
$find2=" for ";                                # some spaces in the variable too
if (/$find/)  { print "Found '$find'\n" };     # you can search for variables like numbers
if (/$find2/) { print "Found '$find2'\n" };    # and of course strings !
print "Found $find2\n" if /$find2/;            # different way to do the above
As you can see from the last example, you can embed a variable in
the regex too. Regular expressions could fill entire books (and they
have done, see the book critiques at http://www.perl.com/) but here
are some useful tricks:
Character Classes
@names=qw(Karlson Carleon Karla Carla Karin Carina
          Needanotherword);
foreach (@names) {          # sets $_ to each element of @names in turn
    if (/[KC]arl/) {        # this line will be changed a few times in the examples below
        print "Match ! $_\n";
    } else {
        print "Sorry. $_\n";
    }
}
This time @names is initialised using whitespace as a delimiter instead of a comma. qw refers to 'quote words', which means split the list by words. A word ends with whitespace (like tabs, spaces, newlines etc).
The square brackets enclose single characters to be matched. Here either Karl or Carl must be in each element. It doesn't have to be two characters, and you can use more than one set. Change Line 4 in the above program to:
if (/[KCZ]arl[sa]/) {
matches if something contains K, C, or Z, then arl, then either s or a. It does not match KCZarl. Negation is possible too, so try this :
if (/[KCZ]arl[^sa]/) {
which returns things containing K, C or Z, then arl, and then anything EXCEPT s or a. The caret ^ has to be the first character, otherwise it doesn't work as the negation. Having said [ ] defines single characters only, I should mention that these two are the same :
/[abcdeZ]arl/;
/[a-eZ]arl/;
if you use a hyphen then you get the list of characters including the start and finish characters. And if you want to match a special character (metacharacter), you must escape it:
/[\-K]arl/;
matches Karl or -arl. Although the - character is represented by two characters, it is just the one character to match.
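The three flavours of character class above can be compared in one go. This is only a sketch: the extra list entry -arl and the array names are invented for the demonstration.

```perl
use strict;
use warnings;

# The names from the example, plus a made-up '-arl' to exercise the escaped hyphen.
my @names = qw(Karlson Carleon Karla Carla Karin Carina -arl);

# [KC]   : exactly one character, either K or C
my @kc     = grep { /[KC]arl/ } @names;
# [^sa]  : exactly one character that is neither s nor a
my @not_sa = grep { /[KC]arl[^sa]/ } @names;
# [\-K]  : the escaped hyphen is a literal hyphen, not a range
my @hyphen = grep { /[\-K]arl/ } @names;

print "[KC]arl      : @kc\n";
print "[KC]arl[^sa] : @not_sa\n";
print "[\\-K]arl     : @hyphen\n";
```

Only Carleon survives the negated class, because a character must follow arl and it must not be s or a.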
Matching at specific points
If you want to match at the end of the line, make sure a $ is the last character in the regex. This one pulls out all those names ending in a. Slot it into the example above :
if (/a$/) {
And there is a corresponding character, the caret ^ , which in this context matches at the beginning of the string. Yes, the caret also negates a character class like this [^KCZ]arl but in this case it anchors the match to the beginning of the string.
if (/n/i) {
if (/^n/i) {
The first one is true if the word contains an 'n' anywhere in it. The second specifies that the 'n' must be at the beginning of the string to be matched. Use this anchor where you can, because it makes the whole regex faster, and safer if you know what the first character must be.
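A quick sketch of both anchors against the same list of names. The array names are made up, and grep is used here simply as a compact way of running the regex over every element.

```perl
use strict;
use warnings;

my @names = qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);

my @ends_in_a = grep { /a$/ }  @names;   # $ anchors to the end
my @has_n     = grep { /n/i }  @names;   # an 'n' anywhere, either case
my @starts_n  = grep { /^n/i } @names;   # ^ anchors to the beginning

print "Ends in a    : @ends_in_a\n";
print "Contains n   : @has_n\n";
print "Starts with n: @starts_n\n";
```

Five names contain an 'n', but only Needanotherword starts with one.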
Negating the regex
If you want to negate the entire regex change =~ to !~ (Remember ! means 'not'.)
if ($_ !~/[KC]arl/) {
Of course, as we are testing $_ this works too:
if (!/[KC]arl/) {
Returning the Match
Now things get interesting. What if we want to pull something out of
a string ? So far all we have done is test for truth, that is say
yea or nay if a string matches, but not return what we found. Run
this:
$_='My email address is <Robert@NetCat.co.uk>.';
/(<robert\@netcat.co.uk>)/i;
print "Found it ! $1\n";
Firstly, note the single quotes when $_ is assigned. If there were double quotes, we'd need \@ instead of @. Remember, double quotes "" allow variable interpolation, so Perl looks for an array called @NetCat which does not exist.
Secondly, look at the parens around the entire regex. If you use parens, a side effect is that the first match is put into a variable called $1. We'll get to the main effect later. The second match goes into $2 and so on. Also note that the \@ has been escaped, so perl doesn't think it is an array. Remember \ either escapes a special character, or gives a special meaning. Think of it as Superman's telephone box. Imagine Clark Kent walking around with his magic partner Back Slash.
Notice how we specify in the regex case-insensitivity with /i and the regex returns the case-sensitive string - that is, exactly what it found.
Try the regex without parens. Then try this one:
/<(robert)\@netcat.co.uk>/i;
You can put the parens anywhere. More or less. Now, run this :
$_='My email address is <Robert@NetCat.co.uk>.';
/<(robert)\@(netcat.co.uk)>/i;
print "Found it ! $1 at $2\n";
See, you can have more than one ! Look at the above regex. Looks
easy now, don't you think ? What about five minutes ago ? It would
have looked
like a typing mistake ! Well, there are some hairier regex to come,
but you'll
have a good barber.
* + -- regexes become line noise
What if we didn't know what the email address was going to be ?
$_='My email address is <webslave@work.com>.';
print "Found it ! :$1:" if /(<.*>)/i;
When you see an if statement like this, read it right to left. The print statement is only executed if the code on the right of the expression is true.
We'll discuss this. Firstly, we have the opening parens ( . So everything from ( to ) will be put into $1 if the match is successful. Then the first character of what we are searching for, < . Then we have a dot, or period. For this regex, we can assume . matches any character at all. So we are now matching < followed by any character. The * means 0 or more of the previous character. The regex finishes by requiring > .
This is important. Get the basics right and all regex are easy (I
read somewhere once). An example best illustrates the point. Slot
this regex in instead:
$_='My email address is <webslave@work.com>.';
print "Found it ! :$1:" if /(<*>)/i;
What's happening here ?
The regex starts, logically, at the start of the string. This doesn't mean it starts at 'M', it starts just before M. There is a 'nothing' between the string start and 'M'.
The regex is searching for <* , which is 0 or more < . The first thing it finds is not < , but the nothing in between the start of the string and the 'M' from 'My email...'. Does this match ?
As the regex is looking for "0 or more" < , we can certainly say that there are 0 < at the start of the string. So the match is, so far, successful. We have dealt with <* .
However, the next item to match is > . Unfortunately, the next item in the string is 'M', from 'My email...'. The match fails at this point. Sure, it matched <* without any problem, but the complete match has to work.
The only two characters that can match successfully at this point are < or > . The 'point' being that <* has been matched successfully, and we need either > to complete the match or more of < to continue the '0 or more' match denoted by * .
'M' is neither of them, so the match fails at this point.
Quick clarification - the regex cannot successfully match < , then skip on ahead through the string until it matches > . The characters in the string between < > also need to match the regex, and they don't in this case.
All is not lost. Regexes are hardy little beasts and don't give
up easily. An attempt is made to match the regex wherever possible.
The regex system keeps trying the match at every possible place in
the string, working towards the end.
Let's look at the match when it reaches the 'm' in 'work.com'. Again, we have here 0 < . So the match works as before. After success on <* the next character is analysed - it is a > , so the match is successful.
But, be warned. The match may be successful but your job is not done. Assuming the objective was to return the email address within the angle brackets then that regex is a miserable failure. Watch for traps of this nature when regexing.
That's * explained.
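The trap is easiest to see by capturing both versions and printing them next to each other. A minimal sketch, with made-up variable names:

```perl
use strict;
use warnings;

my $text = 'My email address is <webslave@work.com>.';

# <.*> : a <, then any characters, then a >
my ($with_dot) = $text =~ /(<.*>)/;
# <*>  : zero or more <, then a >  (no dot, so nothing may sit in between)
my ($no_dot)   = $text =~ /(<*>)/;

print "With the dot   :$with_dot:\n";
print "Without the dot:$no_dot:\n";
```

Both matches succeed, but the second captures only the closing > because the match could not begin any earlier in the string.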
Just to consolidate, a quick look at:
$_='My email address is <webslave@work.com>.';
print "Match 1 worked :$1:" if /(<*)/i;
$_='<My email address is <webslave@work.com>.';
print "Match 2 worked :$1:" if /(<*)/i;
$_='My email address is <webslave@work.com<<<<>.';
print "Match 3 worked :$1:" if /(<*>)/i;
Match 1 is true. It doesn't return anything, but it is true because there are 0 < at the very start of the string.
Match 2 works. After the 0 < at the start of the string, there is 1 < so the regex can match that too.
Match 3 works. After failing on the first < , it jumps to the second. After that, there are plenty more to match right up until the required ending.
Glad you followed that. Now, pay even closer attention !
Concentrate fully on the task at hand ! This should be
straightforward now:
$_='HTML <I>munging</I> time !.';
/<I>(.*)<\/I>/i;
print "Found it ! $1\n";
Pretty much the same as the above, except the parens are moved so we return only what's inside the tags, not including the tags themselves. Also note how / is escaped like so; \/ otherwise Perl thinks that's the end of the regex.
Now, suppose we change $_ to :
$_='HTML <I>munging</I> time is here <I>again</I> !.';
and run it again. Interesting effect, eh ? This is known as Greedy Matching. What happens is that when Perl finds the initial match, that is <I> it jumps right to the end of the string and works back from there to find a match, so the longest string matches. This is fine unless you want the shortest string. And there is a solution:
/<I>(.*?)<\/I>/i;
Just add a question mark and Perl does stingy matching. No
nationalistic jokes. I have Dutch and Scottish friends I don't want
to offend.
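Here are greedy and stingy side by side, as a small sketch with invented variable names. Note the string is kept on one line: the dot does not match a newline by default, so a line break inside the string would stop the greedy match early.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> !.';

my ($greedy) = $html =~ /<I>(.*)<\/I>/i;    # as much as possible
my ($stingy) = $html =~ /<I>(.*?)<\/I>/i;   # as little as possible

print "Greedy: $greedy\n";
print "Stingy: $stingy\n";
```

The greedy version swallows everything up to the last closing tag, the stingy one stops at the first.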
The Difference Between + and *
You know what * means, namely match 0 or more. If you want to match 1 or more, then use + . The difference is important.
$_='The number is 2200 and the day is Monday';
($star)=/([0-9]*)/;
($plus)=/([0-9]+)/;
print "Star is '$star' and Plus is '$plus'\n";
You'll note that $star is empty. The match was successful though. It managed to match 0 or more characters from 0 to 9 at the very start of the string.
The second regex with $plus worked a little better, because we are matching one or more characters from 0 to 9. Therefore, unless one 0 to 9 is found the match will fail. Once a 0-9 is found, the match continues as long as the next character is 0-9, then it stops.
Now we know this, there is another way to remove an email address
from within angle brackets:
$_='My email address is <robert@netcat.co.uk> !.';
/<([^>]+)/i;
print "Found it ! $1\n";
This regex matches < . Then the capturing parens start. They have no effect on this regex other than to capture the match. After that, there is a character class, containing one character. As ^ is the first character in the class, it negates the class. That's why we are using a character class with only one character in it, because it can be negated.
So far we have matched < and anything that is not > . The + ensures we match as many characters that are not > as we can. This has the same effect as .*? but is more efficient. It may also suit your purposes, as .*? relies on you knowing what you want to match up to, whereas [^>]+ simply continues matching until it finds something that fails its criteria. Just make sure you understand the difference because it is a crucial part of regexery.
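The difference shows up when the closing bracket is missing. The following sketch uses a deliberately broken second string, invented for the purpose, to compare the two approaches:

```perl
use strict;
use warnings;

my $good   = 'My email address is <robert@netcat.co.uk> !.';
my $broken = 'My email address is <robert@netcat.co.uk';   # no closing >

my ($negated_good)   = $good   =~ /<([^>]+)/;
my ($stingy_good)    = $good   =~ /<(.*?)>/;
my ($negated_broken) = $broken =~ /<([^>]+)/;   # runs on to the end of the string
my ($stingy_broken)  = $broken =~ /<(.*?)>/;    # fails: there is no > to stop at

print "Negated class, good string  : $negated_good\n";
print "Stingy match, good string   : $stingy_good\n";
print "Negated class, broken string: $negated_broken\n";
print "Stingy match, broken string : ",
      (defined $stingy_broken ? $stingy_broken : 'no match'), "\n";
```

Which behaviour is right depends on whether a missing > should count as a failure in your data.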
Re-using the match -- \1, $1...
Suppose we didn't know what HTML tag we had to match ? It could
be B, I, EM or whatever, and we want everything that is in between.
Well, HTML container tags like B and EM have end tags which are the
same as the start tag, except for the / . So what we could do is:
- find out what is inside < >
- search for exactly the same tag, but with the closing /
- return whatever is in between.
Can this be done ? Of course. This is perl, all things are possible. Now, remember the side effect of parens. I promise I'll explain the primary effect at some point. If whatever is in (parens) matches, the result is stored in a variable called $1. So we can use <(.*?)> which will find us < then as many anythings (the . and * ) up to the next, not last > (the ? forces stingy matching).
The result is stored in $1 because we used parens. Next, we need everything up to the closing tag. That's easy : (.*?) matches everything up until the next character or set of characters. And how exactly do we define where to stop ?
We can use $1 even in the same regex it was found in. However, it is not referred to within a regex as $1, but \1.
So we want to match </$1> which in perl code is <\/\1> . The / must be escaped because otherwise it would end the regex, and 1 is escaped so it refers to $1 instead of matching the number 1.
Still here ? This is what it looks like:
$_='HTML <I>munging</I> time is here
<I>again</I> !.';
/<(.*?)>(.*?)<\/\1>/i;
print "Found it ! $2\n";
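The backreference also protects against mismatched tags. In this sketch the two test strings and the variable names are invented; the regex is the one from the example.

```perl
use strict;
use warnings;

my $good = 'HTML <EM>munging</EM> time';
my $bad  = 'HTML <EM>munging</I> time';   # closing tag does not match the opening one

my ($tag, $inside) = $good =~ /<(.*?)>(.*?)<\/\1>/;
my $bad_matched    = ($bad =~ /<(.*?)>(.*?)<\/\1>/) ? 'yes' : 'no';

print "Good string: tag $tag contains $inside\n";
print "Bad string matched ? $bad_matched\n";
```

Because \1 must repeat whatever the first parens captured, the second string cannot match at all.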
If you want to know how to return all the matches above, read on.
But before that:
How to Avoid Making Mountains while Escaping Special Characters
You want to match this; http://language.perl.com/faq/ . That's a real (useful) URL by the way. Hint. To match it, you need to do this:
/http:\/\/language\.perl\.com\/faq\//;
which should make the awful metaphor above clearer, if not funnier. The slash, / , is not normally a metacharacter but as it is being used for the regular expression delimiters, it needs to be escaped. We already know that . is special.
Fortunately for our eyes, Perl allows you to pick your delimiter if you prefix it with 'm' as this example shows. We'll use a # :
m#http://language\.perl\.com/faq/#;
Which is a huge improvement, as we change / to # . We can go further with readability by quoting everything:
m#\Qhttp://language.perl.com/faq/\E#;
The \Q escapes everything up until \E or the regex delimiter (so we don't really need the \E above). In this case # will not be escaped, as it delimits the regex.
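\Q earns its keep when the text to match lives in a variable. The sketch below is an addition to the tutorial's example: the variable $url and the deliberately wrong address are made up to show what an unescaped dot lets through.

```perl
use strict;
use warnings;

my $url   = 'http://language.perl.com/faq/';
my $wrong = 'http://languageXperl.com/faq/';   # made-up near miss

# Without \Q each dot in $url means 'any character', so the X slips through
my $loose_hit  = ($wrong =~ m#$url#)       ? 'yes' : 'no';
# With \Q the dots are literal dots and the near miss is rejected
my $strict_hit = ($wrong =~ m#\Q$url\E#)   ? 'yes' : 'no';
my $real_hit   = ($url   =~ m#\Q$url\E#)   ? 'yes' : 'no';

print "Near miss without \\Q: $loose_hit\n";
print "Near miss with \\Q   : $strict_hit\n";
print "Real URL with \\Q    : $real_hit\n";
```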
Someone once posted a question about this to the Perl-Win32-Users
mailing list and I was so intrigued about this apparently
undocumented trick I spent the next twenty minutes figuring it out
by trial and error, and posted a reply. Next day I found lots of
messages telling the poster to read the manual because it was
clearly documented. <face colour='red' intensity='high'> My
excuse was I didn't have the docs to hand....moral of the story -
RTFM and RTF FAQs !
Substitution and Yet More Regex Power
Basic changes
Suppose you want to replace bits of a string. For example, 'us'
with 'them'.
$_='Us ? The bus usually waits for us, unless the driver forgets
us.';
print "$_\n";
s/Us/them/; # operates on $_, otherwise you need $foo=~s/Us/them/;
print "$_\n";
What happens here is that the string 'Us' is searched for, and when a match is found it is replaced with the right side of the expression, in this case 'them'. Simple.
You'll notice that only one substitution was made. To match globally use /g which runs through the entire string, changing wherever it can. Try:
s/Us/them/g;
which fails. This is because regexes are, by default, case-sensitive. So:
s/us/them/ig;
would be a better bet. Now, everything is changed. A little too much, but one problem at a time. Everything you have learned about regex so far can be used with s/// , like parens, character classes [ ] , greedy and stingy matching and much more. Deleting things is easy too. Just specify nothing as the replacement, like so: s/Us//;
So we can use some of that knowledge to fix this problem. We need
to make sure that a space precedes the 'us'. What about:
s/ us/them/g;
A small improvement. The first 'Us' is now no longer changed, but one problem at a time ! We'll first consider the problem of the regex changing 'usually' and other words with 'us' in them.
What we are looking for is a space, then 'us', then a comma, period or space. We know how to specify one of a number of options - the character class.
s/ us[. ,]/them/g;
Another tiny step. Unfortunately, that step wasn't really in the right direction, more on the slippery slope to Poor Programming Practice. Why? Because we are limiting ourselves. Suppose someone wrote ' send it to us; when we get it'.
You can't think of all the possible permutations. It is often easier, and safer, to simply state what must not follow the match. In this case, it can be anything except a letter. We can define that as a-z. So we can add that to the regex.
s/ us[^a-z]/ them/g;
the caret ^ negates the character class, and a-z represents every letter from a to z inclusive. A space has been added to the substitution part - as the original space was matched, it should be replaced to maintain readability.
\w
What would be more useful is to use a-zA-Z instead. Without /i we'd need that to cover capital letters. As a-zA-Z is such a common construct, Perl provides an easy shorthand:
s/ us[^\w]/ them/g;
The \w construct actually means 'word' - equivalent to a-zA-Z_0-9 . So we'll use that instead.
To negate any construct, simply capitalise it:
s/ us[\W]/ them/g;
and of course we don't need the negating caret now. In fact, we don't even need the character class !
s/ us\W/ them/g;
So far, so good. Matching the first 'us' is going to be difficult though. Fortunately, there is an easy solution. We've seen Perl's definition of a word - \w . Between each word is a boundary. You can match this with \b .
s/\bus\W/ them/ig;
that's \b followed by 'us', not 'bus' :-) The /i is there so that the capitalised 'Us' matches too.
Now, we require a word boundary before 'us'. As there is a 'nothing' at the start of the string, we have a match. There is a space after the first 'Us', so the match is successful. You might notice an extra space has crept in - that's the space we added earlier. The match doesn't include the space any more - it matches on the word boundary, that is just before the word begins. The space doesn't count.
Replacing with what was found
Did you notice the final period and the comma are replaced ? They are part of the match - it is the \W that matches them. We can't avoid that. We can however put back that part of the match.
s/\bus(\W)/them\1/ig;
We start with capturing whatever the \W matches, using parens. Then, we add it to the replacement string. The capture is of course in $1, but here we refer to it as \1. (Perl accepts \1 in the replacement part, although $1 is the preferred form there.)
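For comparison, here is a sketch that works on copies of the string so the original survives. The second substitution is an alternative not covered above: a word boundary at both ends, which consumes no characters and so leaves nothing to put back. Variable names are invented.

```perl
use strict;
use warnings;

my $text = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# Capture the non-word character and put it back with $1
(my $put_back = $text) =~ s/\bus(\W)/them$1/ig;

# Alternative: \b on both sides matches positions, not characters
(my $both_ends = $text) =~ s/\bus\b/them/ig;

print "$text\n";
print "$put_back\n";
print "$both_ends\n";
```

Both give the same result on this sentence. The two-boundary form would also catch an 'us' at the very end of a string, where there is no following character for \W to match.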
The final problem is of course capitalising the replacement
string when appropriate. Which in old versions of the tutorial I
left as an exercise to the reader, having run out of motivation. A
reader by the name of Paul Trafford duly solved the problem, and I
have just inserted his excellent explanation for the elucidation of
all concerned:
# Solution to the us/them problem...
#
# The program works through the text assigning the
# variable $1 to 'U' or 'u' for any words where this
# letter is followed by 's' and then by non 'word'
# characters. The latter is assigned to variable $2.
#
# For each such matching occurrence, $1 is replaced by
# the letter that precedes it in the alphabet using
# operations 'ord' and 'chr' that return the ASCII value
# of a character and the character corresponding to a
# given natural number. After this 'hem' is tacked on
# followed by $2, to retain the shape of the original
# sentence. The '/e' switch is used for evaluation.
#
# NOTES
# 1. This solution will not replace US (short for
#    United States) with Them or them.
#
# 2. If a 'magical' decrement operator '--' existed for
#    strings then the solution could be simplified for we
#    wouldn't need to use the 'chr' and 'ord' operators.
$_='Us ? The bus usually waits for us, unless the driver forgets us.';
print "$_\n";
s/\b([Uu])s(\W)/chr(ord($1)-1).'hem'.$2/eg;
print "$_\n";
An excellent solution, thanks Paul.
There are several more constructs. We'll take a quick look at \d which means anything that is a digit, that is 0-9 . First we'll use the negated form, \D , which is anything except 0-9 :
print "Enter a number :";
chop ($input=<STDIN>);
if ($input=~/\D/) {
    print "Not a number !!!!\n";
} else {
    print 'Your answer is ',$input x 3,"\n";
}
this checks that there are no non-number characters in $input . It's not perfect because it'll choke on decimal points, but it's just an example. Writing your own number-checker is actually quite difficult, but it is an interesting exercise. Try it, and see how accurate yours is.
x
I hope you trusted me and typed the above in exactly as it is shown (or pasted it), because the x is not a mistake, it is a feature. If you were too smart and changed it to a * or something change it back and see what it does.
Of course, there is another way to do it :
unless ($input=~/\D/) {
    print 'Your answer is ',$input x 3,"\n";
} else {
    print "Not a number !!!!\n";
}
which reverses the logic with an unless statement.
More Matching
Assume we have:
$_='HTML <I>munging</I> time is here
<I>again</I> !.';
and we want to find all the italic words. We know that /g
will match globally, so
surely this will work :
$_='HTML <I>munging</I> time is here
<I>again</I> ! What <EM>fun</EM> !';
$match=/<i>(.*?)<\/i>/ig;
print "$match\n";
except it returns 1, and there were definitely two matches. The match operator returns true or false, not the number of matches. So you can test it for truth with functions like if, while, unless. Incidentally, the s/// operator does return the number of substitutions.
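A short sketch of the two return values, using a copy of the string so the substitution doesn't disturb the original. The replacement of I tags with B tags is invented purely so that s/// has something to do.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

# m// in scalar context: true or false, however many matches there are
my $matched = ($html =~ /<i>(.*?)<\/i>/i) ? 1 : 0;

# s///g returns how many substitutions were made
my $copy  = $html;
my $count = ($copy =~ s/<i>(.*?)<\/i>/<B>$1<\/B>/ig);

print "Match returned $matched\n";
print "Substitution returned $count\n";
print "$copy\n";
```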
To return what is matched, you need to supply a list.
($match) = /<i>(.*?)<\/i>/i;
which handily puts the first match into $match . Note that an = is used (for assignment), as opposed to =~ (to point the regex at a variable other than $_ ).
The parens force a list context in this case. There is just the
one element in the list, but it is still a list. The entire match
will be assigned to the list, or whatever is in the parens. Try
adding some parens:
$_='HTML <I>munging</I> time is here
<I>again</I> ! What <EM>fun</EM> !';
($word1, $word2) = /<i>(.*?)<\/i>/ig;
print "Word 1 is $word1 and Word 2 is $word2\n";
In the example above notice /g has been added so a global match is done - this means perl carries on matching even after it finds the first match. Of course, you might not know how many matches there will be, so you can just use an array, or any other type of list:
$_='HTML <I>munging</I> time is here
<I>again</I> ! What <EM>fun</EM> !';
@words = /<i>(.*?)<\/i>/ig;
foreach $word (@words) {
print "Found
$word\n";
}
and @words
will be grown
to the appropriate size for the matches. You really can supply what
you like
to be assigned to:
($word1, @words[2..3], $last) = /<i>(.*?)<\/i>/ig;
you'll need more italics for that last one to work. It was
only a demonstration.
There is another trick worth knowing. Because a regex returns true each time it matches, we can test that and do something every time it returns true. The ideal function is while which means 'do something as long as the condition I'm testing is true'. In this case, we'll print out the match every time it is true.
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';
while (/<(.*?)>(.*?)<\/\1>/g) {
    print "Found the HTML tag $1 which has $2 inside\n";
}
So the while operator runs the regex, and if it is true, carries
out the statements inside the block.
Try running the program above without the /g . Notice how it loops forever ? That's because the expression always evaluates to true. By using the /g we force the match to move on until it eventually fails.
Now we know this, an easy way to find the number of matches is:
$_='HTML <I>munging</I> time is here
<I>again</I> ! What <EM>fun</EM> !';
$found++ while /<i>.*?<\/i>/ig;
print "Found $found matches\n";
You don't need braces in this case as nothing apart from the
expression to be evaluated follows the while
function.
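Another way to count, offered as a sketch rather than the tutorial's own method: assign the global match to an empty list, then assign that to a scalar. The list assignment in scalar context yields the number of items the match returned.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

# The while-loop counter from the text above
my $looped = 0;
$looped++ while $html =~ /<i>.*?<\/i>/ig;

# The empty-list idiom: count what the match returns in list context
my $listed = () = $html =~ /<i>.*?<\/i>/ig;

print "Loop counted $looped, list counted $listed\n";
```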
Parentheses Again: OR
The real use for them. Precedence. Try this, and yes you can try
it at home:
$_='One word sentences ? Eliminate. Avoid clichés like the plague. They are old hat.';
while (/o(rd|ne|ld)/gi) {
    print "Matched $1\n";
}
Firstly, notice the subtle introduction of the or operator, in this case | , the pipe. What I really want to explain however, is that this regex matches o followed by rd, ne or ld. Without the parens it would be /ord|ne|ld/ which is definitely not what we want. That matches just plain ord, or ne or ld.
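The effect of the parens is clearer with words that contain 'ne' or 'ld' but no 'o'. The word list below is made up for this sketch.

```perl
use strict;
use warnings;

my @words = qw(One word old line held);

# With parens: an o, then one of rd, ne or ld
my @grouped   = grep { /o(rd|ne|ld)/i } @words;
# Without parens: ord, or ne, or ld, each on its own
my @ungrouped = grep { /ord|ne|ld/i }   @words;

print "Grouped  : @grouped\n";
print "Ungrouped: @ungrouped\n";
```

'line' and 'held' only get through the ungrouped version, because there the o belongs to the first alternative alone.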
(?: OR Efficiency)
In the interests of efficiency, consider this:
print "Give me a name :";
chop($_=<STDIN>);
print "Good name\n" if /Pe(tra|ter|nny)/;
The code above functions correctly. If you were wondering what a
good name is, Petra, Peter and Penny qualify. The regex is not as
efficient as
it could be though. Think about what Perl is doing with the regex,
that you
are just ignoring. Simply throwing away casually. Without
consideration as to
the effort that has gone into creating it for you. The resources
squandered.
The little bytes of memory whose sole function in life is to store
this
information, which will never be used.
What's happening is that because parens are used, perl is creating $1 for your usage and abusage. While this may not seem important, a fair amount of resources go into creating $1, $2 and so on. Not so much the memory used to store them, more the CPU effort involved. So, if you aren't going to use the parens for capturing purposes, why bother capturing the match?
print "Give me a name :";
chop($_=<STDIN>);
print "Good name\n" if /Pe(?:tra|ter|nny)/;
print "The match is :$1:\n";
The second print statement demonstrates that nothing is captured this time. You get the benefits of the parens' precedence-changing capabilities, but without the overhead of the capturing. This benefit is especially worthwhile if you are writing CGI programs which use parens in regex -- with CGI, every little bit of efficiency counts.
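A side benefit, shown here as a sketch with a made-up name: non-capturing parens don't use up a number, so the first capturing pair after them is still $1.

```perl
use strict;
use warnings;

my $name = 'Peter Piper';

# Capturing alternation: the list holds the piece chosen from the alternatives
my ($ending)  = $name =~ /Pe(tra|ter|nny)/;
# Non-capturing alternation: the only capture is the word that follows
my ($surname) = $name =~ /Pe(?:tra|ter|nny) (\w+)/;

print "Captured ending : $ending\n";
print "Captured surname: $surname\n";
```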
Matching specific amounts of...
Finally, take a look at this :
$_='I am sleepy....zzzz....DING ! Wake Up!';
if (/(z{5})/) {
    print "Matched $1\n";
} else {
    print "Match failed\n";
}
The braces { } specify how many of the preceding character to match. So z{2} matches exactly two 'z's and so on. Change z{5} to z{4} and see how it works. And there's more...
/z{3}/      3 z only
/z{3,}/     At least 3 z
/z{1,3}/    1 to 3 z
/z{4,8}/    4 to 8 z
To any of the above you may suffix a question mark, the effect of which is demonstrated in the following program. Run it a couple of times, inputting 2, 3 and 4:
print "How many letters do you want to match ? ";
chomp($num=<STDIN>);
# we assign and print in one smooth move
print $_="The lowest form of wit is indeed sarcasm, I don't think.\n";
print "Matched \\w{$num,}  : $1 \n" if /(\w{$num,})/;
print "Matched \\w{$num,}? : $1 \n" if /(\w{$num,}?)/;
The first match is 'match any word (that's a-zA-Z0-9_ ) equal to or longer than $num characters, and return it.' So if you enter 4, then 'lowest' is returned. The word 'The' doesn't match.
The second match is exactly the same, but the ? forces a minimal match, so only the part actually matched is returned. With 4, that is 'lowe'.
Just to clear this up, amend the program thus:
print "\nMatched \\w{$num,} :";
print "$1 " while /(\w{$num,})/g;
print "\nMatched \\w{$num,}? :";
print "$1 " while /(\w{$num,}?)/g;
Note the addition of /g . Try it without - notice how the match never moves on ?
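To round off the braces, this sketch runs a handful of the quantifiers against the sleepy string and records what each one captures. The hash name and the choice of patterns are made up for the demonstration.

```perl
use strict;
use warnings;

my $text = 'I am sleepy....zzzz....DING ! Wake Up!';

my %found;
foreach my $pattern ('z{3}', 'z{3,}', 'z{1,3}', 'z{1,3}?', 'z{5}') {
    # The pattern string is interpolated into the regex inside capturing parens
    $found{$pattern} = ($text =~ /($pattern)/) ? $1 : 'no match';
    print "$pattern captured: $found{$pattern}\n";
}
```

With only four z's available, z{3,} takes all four, z{1,3} stops at three, its stingy cousin z{1,3}? settles for one, and z{5} fails.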
Pre, Post, and Match
And now on the Regex Programme Today, we have guest stars
Prematch, Postmatch and Match. All of whom are going to slow our
entire programme down, but are useful anyway :
$_='I am sleepy....snore....DING ! Wake Up!';
/snore/; # look, no parens !
print "Postmatch: $'\n";
print "Prematch: $`\n";
print "Match: $&\n";
If you are wondering what the difference between match and using parens is you should remember that you can move the parens around, but you can't vary what $& and its ilk return. Also, using any of the above three variables does slow your entire program, whereas using parens will just slow the particular regex you use them for. However, once you've used one of the three you might as well use them all over the place as you've paid the speed penalty. Use parens where possible.
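If you want the same three pieces without the three special variables, one possible approach (a sketch, not the only way) is to capture them with parens. The variable names are invented, and the regex assumes the string is a single line.

```perl
use strict;
use warnings;

my $text = 'I am sleepy....snore....DING ! Wake Up!';

# (.*?) before the match, the match itself, and (.*) for everything after
my ($pre, $match, $post) = $text =~ /^(.*?)(snore)(.*)$/;

print "Prematch : $pre\n";
print "Match    : $match\n";
print "Postmatch: $post\n";
```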
RHS Expressions
/e
RHS means Right Hand Side. Suppose we have an HTML file, which
contains:
<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>
and we wish to double the size of each font so 2 becomes 4 and 4
becomes 8 etc. What about :
$data="<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>";
print "$data\n";
$data=~s/(size=)(\d)/\1\2 * 2/ig;
print "$data\n";
which doesn't really work out. What this does is match size=x , where x is any digit. The first match, size= , goes into $1 and the second match, whatever the digit is, goes into $2 . The second part of the regex simply prints $1 and $2 (referred to as \1 and \2 ), and attempts to multiply $2 by 2. Remember /i means case insensitive matching.
What we need to do is evaluate the right hand side of the regex
as an expression - that is not just print out what it says, but
actually evaluate it. That means work it through, not blindly treat
it as string. Perl can do this:
$data=~s/(size=)(\d)/$1.($2 * 2)/eig;
A little explanation....the LHS is the same as before. We add /e so Perl evaluates the RHS as an expression. So we need to change \1 into $1 and so on. The parens are there to ensure that $2 * 2 is evaluated, then joined to $1 . And that's it !
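The two attempts can be compared on copies of the same string. A small sketch with made-up variable names:

```perl
use strict;
use warnings;

my $data = '<FONT SIZE=2> <FONT SIZE=4> <FONT SIZE=6>';

# Without /e the right hand side is just a string with variables in it
(my $plain = $data) =~ s/(size=)(\d)/$1$2 * 2/ig;

# With /e the right hand side is Perl code, and its result is the replacement
(my $evaluated = $data) =~ s/(size=)(\d)/$1 . ($2 * 2)/eig;

print "$data\n";
print "$plain\n";
print "$evaluated\n";
```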
/ee
It is even possible to have more than one /e
. For example:
$data='The function is <5funcA>';
$funcA='*2+4';
print "$data\n";
$data=~s/<(\d)(\w+)>/($1+2).${$2}/; # first time
# $data=~s/<(\d)(\w+)>/($1+2).${$2}/e; # second time
# $data=~s/<(\d)(\w+)>/($1+2).${$2}/ee; # third time
print "$data\n";
To properly appreciate this you need to run it three times, each time commenting out a different line. Only one regex line should be uncommented when the program is run.
The first time round the regex is a dumb variable interpolation. Perl just searches the string for any variables, finds $1 and $2, and replaces them.
Second time round the expression is evaluated, as opposed to just plain variable-interpolated. This means that $1+2 is evaluated. $1 has a value of 5, plus 2 == 7. The other part of the replacement, ${$2} is evaluated only so far as working out that the variable named by $2 (here $funcA) should be placed in the string.
Third time round and Perl now makes a second pass through the string, looking for things to do. After the first pass, and just before that second pass the string looks like this; 7*2+4 . Perl evaluates this, and prints the result.
So the more /e 's you add on the end of the regex, the more passes Perl makes through the replacement string trying to evaluate the code.
This is fairly advanced stuff here, and it is probably not
something you will use every day. But knowing it is there is handy.
A Worked Example: Date Change
Imagine you have a list of dates which are in the US format of
month, day, year as opposed to the rest of the world's logical
notion of day, month year. We need a regex to transpose the day and
month. The dates are:
@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'6/16/1993'
);
The task can be split into steps such as:
- Match the first digit, or two digits. Capture this result.
- Match the delimiter, which appears to be one of / - .
- Match the second two digits, and capture that result
- Rebuild the string, but this time reversing the day and month.
That may not be all the steps, but it is certainly enough for a
start. Planning regex is important. So, first pass:
@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'6/16/1993'
);
foreach (@dates) {
print;
s#(\d\d)/(\d\d)#$2/$1#;
print " $_\n";
}
Hmm. This hasn't worked for the dates delimited with - or . and
the last date hasn't worked either. The first problem is pretty
easy; we are just matching / and nothing else. The second problem
arises because we are matching exactly two digits. Therefore, 5/15/87
is matched on the 15 and 87, not the 5 and 15. The date 6/16/1993 is
matched on the 16 and the 19 of 1993.
We can fix both of those. First, we'll match either 1 or 2
digits. There are a few ways of doing this, such as \d{1,2} which
means either one or two of the preceding character, or perhaps more
easily \d\d? which means match one \d and the other digit is
optional, hence the question mark. If we used \d+ then that would
match 19988883 which is not a valid date, at least not as far as we
are concerned.
Secondly, we'll use a character class for all the possible date
delimiters. Here is just the loop with those amendments:
foreach (@dates) {
print;
s#(\d\d?)[/-.](\d\d?)#$2/$1#;
print " $_\n";
}
which fails. Examine the error statement carefully. The key word
is 'range'. What range? Well, the range between / and . because - is
the range operator within a character class. That means it is a
special character, or a metacharacter. And to negate the special
meaning of metacharacters we have to use a backslash.
But wait! I don't hear you cry. Surely . is a metacharacter too?
It is, but not within a character class so it doesn't need to be
escaped.
foreach (@dates) {
print;
s#(\d\d?)[/\-.](\d\d?)#$2/$1#;
print " $_\n";
}
Nearly there. However, we are always replacing the delimiter with
/
which is messy. That's an easy fix:
foreach (@dates) {
print;
s#(\d\d?)([/\-.])(\d\d?)#$3$2$1#;
print " $_\n";
}
so that fixes that. In case you were wondering, the . dot does not
act as '1 of anything' inside a character class. It would defeat the
object of the character class if it did. So it doesn't need escaping.
There is a further improvement you can make to this regex:
$m='/.-';
foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)#$3$2$1#;
print " $_\n";
}
which is good practice because you are bound to want to change
your delimiters at some point, and putting them inside the regex is
hardcoding, and we all know that ends in tears. You can also re-use
the $m variable elsewhere, which is good practice.
Did you notice the difference between what we assign to $m
and what we had before?
/\-.
$m='/.-';
The difference is that the - is no longer escaped. Why not? Logic.
Perl knows - is the range operator. Therefore, there must be a
character to the immediate left and immediate right of it in order
for it to work, for example e-f. When we assign a string to $m, the
range operator is the last character and therefore has no character
to the right of it, so Perl doesn't interpret it as a range operator.
Try this:
$m='/-.';
and watch it fail.
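If you would rather not depend on the order of the characters in $m, one option is to let Perl escape them for you. The sketch below assumes the \Q...\E quoting escapes (which this tutorial has not covered so far) and a made-up helper called split_date.

```perl
use strict;
use warnings;

# Delimiters in the "unlucky" order: the - sits between two other characters.
my $m = '/-.';

# \Q...\E puts a backslash in front of every non-word character in $m
# before the character class is built, so - is no longer a range operator.
# split_date is a hypothetical helper returning the first two numbers.
sub split_date {
    my ($date) = @_;
    return $date =~ /^(\d\d?)[\Q$m\E](\d\d?)/ ? ($1, $2) : ();
}

my @parts = split_date('8-13-96');
print "@parts\n";
```

The price is a slightly noisier regex. The gain is that nobody has to remember where the - is allowed to live.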
Something else that causes heartache is matching what you don't
mean to. Try this:
@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'/16/1993',
'8/1/993',
);
$m='/.-';
foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)#$3$2$1# or print "Invalid date! ";
print " $_\n";
}
The two invalid dates at the end are let through. If you wanted
to check the validity of every possible date since the start of the
modern
calendar then you might be better off with a database rather than a
regex, but
we can do some basic checking. The important point is that we know
the
limitations of what we are doing.
What we can do is make sure of two things: that there are three
sets of digits separated by our chosen delimiters, and that the last
set of digits is either two digits, eg 99, 98, 87, or four digits,
eg 1999, 1998, 1987.
How can we do this? Extend the match. After the second digit
match we need to match the delimiter again, then either 2 digits or
four digits. How about:
$m='/.-';
foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)[$m](\d\d|\d{4})#$3$2$1$2# or print
"Invalid date! ";
print " $_\n";
}
which doesn't really work out. The problem is it lets 993
through. This is because \d\d will match on the front of 993.
Furthermore, we
aren't fixing the year back on to the end result.
The delimiter match is also faulty. We could match / as the first
delimiter, and - as the second. So, three problems to fix:
foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)\2(\d\d|\d{4})$#$3$2$1$2$4# or print
"Invalid!";
print " $_\n";
}
This is now looking like a serious regex. Changes:
- We are re-using the second match, which is the delimiter,
further on in the regex. That's what the \2 is. This ensures the
second delimiter is the same as the first one, so 5/7-98 gets
rejected.
- The $ on the end means end of string. Nothing allowed after
that. So the regex now has to find either 2 or 4 digits at the end
of the string, or it fails.
- Added the match of the year ($4) to the rebuild section of the
regex.
Regex can be as complex as you need. The code above can be
improved still further. We could reject all years that don't begin
with either 19 or 20 if they are four-digit years. The other problem
with the code so far is that it would reject a valid date such as
02/24/99 if there are any characters after the year. Both can be
fixed:
@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'/16/1993',
'8/1/993',
'3/29/1854',
'! 4/23/1972 !',
);
$m='/.-';
foreach (@dates) {
print;
s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})(?:$|\D)#$3$2$1$2$4# or
print "Invalid!";
print " $_\n";
}
We have now got a nested OR, and the inner OR is non-capturing
for reasons of efficiency and readability. At the end we alternate
between letting the regex match either an end of line or any
non-digit, symbolised with \D.
We could go on. It is often very difficult to write a regex that
matches anything of even minor complexity with absolute certainty.
Think about IP addresses for example. What is important is to build
the regex carefully, and understand what it can and cannot do.
Catching anything supposedly invalid is a good idea too. Test your
regex with all sorts of invalid data, and you'll understand what it
can do.
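One limitation worth testing for: the \D at the end of the last regex eats the character after the year, and because that character is not captured it is not put back, so '! 4/23/1972 !' comes out as '! 23/4/1972!'. The sketch below is one possible way round it, not the only one. It assumes a negative lookahead, (?!\d), which has not been covered here, and a hypothetical helper named swap_day_month.

```perl
use strict;
use warnings;

my $m = '/.-';

# Hypothetical helper: returns the string with day and month transposed,
# or undef if no acceptable date is found in it.
sub swap_day_month {
    my ($date) = @_;
    # (?!\d) insists that no digit follows the year, without consuming anything.
    my $changed = $date =~ s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})(?!\d)#$3$2$1$2$4#;
    return $changed ? $date : undef;
}

foreach my $date ('01/22/95', '! 4/23/1972 !', '8/1/993', '3/29/1854') {
    my $result = swap_day_month($date);
    print "$date => ", (defined $result ? $result : 'Invalid!'), "\n";
}
```

Wrapping the regex in a subroutine also makes it easy to throw a long list of nasty test dates at it, which is the point of the paragraph above.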
Split and Join
Splitting
While you are in the regex mood, a quick look at split and join.
Destruction is always easier (just ask your car mechanic), so let's
start with split.
$_='Piper:PA-28:Archer:OO-ROB:Antwerp';
@details=split /:/, $_;
foreach (@details) {
print "$_\n";
}
Here split is given two arguments. The first one is a regex
specifying what to split on. The next is what to split. Actually, I
could leave $_ out because as usual it is the default if nothing is
specified.
The assignment can either be a scalar variable or a list like an
array (or hash, but at this time 'hash' to you means what you think
the Dutch do or a silly drinking event spoilt by some running). If
it's a scalar variable you get the number of elements the split has
splut. Should that be 'the split has splittered' or 'the split has
splat'. Hmmm. Probably 'the split has split'. You know what I mean.
I think I just generated a Fatal Error in English.dll. Whoops. In
any case, splitting to a scalar variable is not always a Good Thing,
as we'll see later.
If the assignment is an array, then as you can see in the above
example the array is created with the relevant elements in order.
You can also assign to scalars, for example :
$_='Piper:PA-28:Archer:OO-ROB:Antwerp';
($maker,$model,$name,$reg,$location) = split /:/, $_;
(@aircraft[0..1],$aname,@regdetails) = split /:/, $_;
$number=split /:/ ;
# not bothering with the $_ at the end, as it is the default
print "Using the first 'split'\n";
print "$reg is a $maker $model $name based in $location\n";
print "There are $number details available on this aircraft\n\n";
print "Using the second 'split'\n";
print "You can find $regdetails[0], an $aircraft[1], $regdetails[1]\n";
This demonstrates that a list can be a list of scalar variables
(which is basically what an array is anyway), and that you can easily
see how many elements the expression can be split into.
The example below adds a third parameter to split, which is the
maximum number of elements you want returned. The last element holds
whatever is left of the string, unsplit. If you don't want the extra
stuff at the end, pop it.
$_='Piper:PA-28:Archer:OO-ROB:Antwerp';
@details=split /:/, $_, 3;
foreach (@details) {
print "$_\n";
}
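A small sketch of the limit at work, using the same aircraft record. The variable names @all and @three are mine, chosen for the illustration.

```perl
use strict;
use warnings;

my $record = 'Piper:PA-28:Archer:OO-ROB:Antwerp';

my @all   = split /:/, $record;       # no limit: every field is split out
my @three = split /:/, $record, 3;    # limit of 3: at most three elements

print scalar(@all), " fields without a limit\n";
print "Last of three: $three[2]\n";   # the unsplit remainder of the string
```

Note that the limit does not throw anything away. The third element simply keeps its colons.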
In the example below we split on whitespace. Whitespace, in perl
terms, is a space, tab, newline, formfeed or carriage return. Instead
of writing \t\n\f\r for each of the above, you can simply use \s, or
the negated version \S which means anything except whitespace. Think
of whitespace as anything you know is there, but you can't see.
The whitespace split is specially optimised for speed. I've used
spaces, double spaces, a tab and a newline in the list below. Also
note the +, which means one or more of the preceding character, so
it will split on any combination of whitespace. And I think the final
split is useful to know. The split function does not return the
delimiter, so in this case the whitespace will not be returned.
$_='Piper PA-28 Archer
OO-ROB
Antwerp';
@details=split /\s+/, $_;
foreach (@details) {
print "$_\n";
}
@chars=split //, $details[0];
foreach $char (@chars) {
print "$char
!\n";
}
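One wrinkle with /\s+/ shows up when the string starts with whitespace. The sketch below compares it with the special case split ' ' (a single space in quotes), which perldoc -f split describes as skipping leading whitespace. The sample string and variable names are my own.

```perl
use strict;
use warnings;

my $line = "  Piper PA-28\tArcher\n";

my @regex_fields   = split /\s+/, $line;   # leading whitespace yields an empty first field
my @special_fields = split ' ',   $line;   # special case: leading whitespace is skipped

print scalar(@regex_fields), " fields with /\\s+/\n";
print scalar(@special_fields), " fields with ' '\n";
```

If your data might be indented, the ' ' form saves you from an empty element at the front of the array.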
A very FAQ
The following question has come up at least three times in the
Perl-Win32-Users mailing list. Can you answer it ?
"My data is delimited by |, for example:
name|age|sex|height|
Why doesn't
@array=split /|/, $line;
work ?"
Why indeed. If you don't already know the answer, some simple
troubleshooting steps can be applied. First, create a sample program
and run it.
$line='name|age|sex|height';
@array=split /|/,$line;
foreach (@array) { print "$_\n" }
The effect is to split each character. The | is returned. As it
is the delimiter, | should be ignored, not returned.
At this point you should be thinking 'metacharacter'. A little
research (looking at the documentation) will reveal that | is indeed
a metacharacter, which means 'or', when inside a regex. So, in
effect, the regex /|/ means 'nothing, or nothing'. The split is
therefore performed on 'nothings', and there are 'nothings' in
between each character. The solution is easy ; /\|/ .
$line='name|age|sex|height';
@array=split /\|/,$line;
foreach (@array) { print "$_\n" }
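For comparison, here is a sketch that runs the broken split and two working ones side by side. The character-class version, [|], is an alternative I am adding, relying on the earlier point that most metacharacters lose their meaning inside a character class.

```perl
use strict;
use warnings;

my $line = 'name|age|sex|height|';

my @wrong   = split /|/,   $line;   # matches the empty string: one element per character
my @escaped = split /\|/,  $line;   # the backslash removes the special meaning
my @class   = split /[|]/, $line;   # inside a character class | is an ordinary character

print scalar(@wrong), " elements from /|/\n";
print join(',', @escaped), "\n";
print join(',', @class), "\n";
```

Notice too that the trailing | in the data does not produce an empty element at the end. Without a limit, split drops trailing empty fields.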
So that's the fun stuff, destruction. Now to put it back together
again with join.
What Humpty Dumpty needs : Join
$w1="Mission critical ?";
$w2="Internet ready modems !";
$w3="J(insert your cool phrase here)"; # anything prefixed by 'J' is now cool ;-)
$w4="y2k compatible.";
$w5="We know the Web.";
$w6="...the leading product in an emerging market.";
$cool=join ' ', $w1,$w2,$w3,$w4,$w5,$w6;
print $cool;
Join takes a 'glue' operator, which is not a regular expression.
It can be a scalar variable however. In this case it is a space.
Then it takes a list, which can either be a list of scalar
variables, an array or whatever as long as it's a list. And you can
see what the result is. You could assign it to an array, but you'd
end up with everything in the first element of the array.
The example below adds an array into the list, and demonstrates
use of a variable as the delimiter.
$w1="Mission critical ?";
$w2="Internet ready modems !";
$w3="J(insert your cool phrase here)"; # anything prefixed by 'J' is now cool ;-)
$w4="y2k approved, tested and safe !";
$w5="We know the Web.";
$w6="...the leading product in an emerging market.";
@morecool=("networkable","compatible");
$sep=" ";
$cool=join $sep, $w1,$w2,$w3,@morecool,$w4,$w5,$w6;
print $cool;
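Since join and split are opposites, a round trip makes a handy sanity check. This is only a sketch: the ' | ' separator and the variable names are assumptions of mine, and \Q...\E is used so the glue string is treated literally when it is turned back into a regex.

```perl
use strict;
use warnings;

my @phrases = ('Mission critical ?', 'We know the Web.');
my $sep     = ' | ';

# Scalars and arrays are flattened into one list before the glue is applied.
my $joined = join $sep, 'networkable', @phrases;
print "$joined\n";

# The glue is a plain string, so quote it before using it as a pattern.
my @back = split /\Q$sep\E/, $joined;
print scalar(@back), " pieces recovered\n";
```

Without the \Q...\E the | in the separator would be read as 'or', which is the very FAQ from earlier all over again.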
A recap, but with some new functions
Randomness
Aren't you wishing you could mix and match randomly so you too
could get a job marketing vapourware ? Heh.
@cool=(
"networkable directory services",
"legacy systems compatible",
"Mission critical, Business Ready",
"Internet ready modems !",
"J(insert your cool phrase here)",
"y2k approved, tested and safe !",
"We know the Web. Yeah.",
"...the leading product in an emerging market."
);
srand;
print "How many phrases would you like (max ",scalar(@cool),") ?";
while (1) {
chop ($input=<STDIN>);
if ($input <= scalar(@cool) and $input > 0) {
last;
}
print 'Sorry, invalid input, try again :';
}
for (1..$input) {
$index=int(rand @cool);
print "$cool[$index]\n";
splice @cool, $index, 1;
}
A few things to explain. Firstly, while (1) { . We want an
everlasting loop, and this is one way to do it. 1 is always true, so
round it goes. We could test $input directly, but that wouldn't allow
last to be demonstrated.
Everlasting loops aren't useful unless you are a politician being
interviewed. We need to break out at some point. This is done by the
last function. When $input is between 1 and the number of elements in
@cool then out we go. (You can also break out to labels, in case you
were wondering. And break out in a sweat. Don't start now if you
weren't.)
The srand operator initialises the random number generator. Works ok
for us, but CGI programmers should think of something different
because their programs are so frequently run (they hope :-).
rand generates a random number between 0 and 1, or between 0 and a
number it is given, never quite reaching the top value. In this case
it is given @cool, which here means the number of elements in @cool.
That starts at 8, so the result runs from 0 up to just under 8. Hand
it $#cool instead and the last element can only be picked once it is
the only one left.
The int function makes sure it is an integer, that is no messy bits
after the decimal point. That leaves 0 to 7, which is what we want
because the array elements run from 0 to 7.
The splice function removes the printed element from the array so
it won't appear again. Don't want to stress the point.
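The pick-and-remove loop can be wrapped up so it doesn't destroy the original array. A minimal sketch, assuming a helper name (pick_phrases) and a shortened phrase list that I have made up:

```perl
use strict;
use warnings;

# Hypothetical helper: return $count distinct phrases chosen at random.
sub pick_phrases {
    my ($count, @pool) = @_;          # @pool is a copy, so the caller's array is untouched
    my @picked;
    for (1 .. $count) {
        my $index = int(rand @pool);  # rand @pool is >= 0 and less than the element count
        push @picked, splice(@pool, $index, 1);   # splice returns what it removed
    }
    return @picked;
}

my @cool = ('networkable', 'compatible', 'y2k approved', 'We know the Web.');
print "$_\n" for pick_phrases(2, @cool);
```

Because splice shrinks @pool on every pass, rand @pool always covers exactly the elements that are left, and no phrase can come up twice.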
Concatenation
There is another joining operator, this time the humble dot, or
period: . This concatenates (joins) variables:
$x="Hello";
$y=" World";
$z="\n";
print "$x\n";
# print $x and a newline
$prt=$x.$y.$z;
# make a new var $prt out of $x, $y and $z
print $prt;
$x.=$y." again ".$z; # add stuff to $x
print $x;
Files
Opening
Perl is very good at handling files. Create, in your perl scripts
directory c:\scripts, a file called stuff.txt. Copy the following
into it :
The Main Perl Newsgroup:comp.lang.perl.misc
The Perl FAQ:http://www.perl.com/faq/
Where to download perl:http://www.activestate.com/
Now, to open and do things with this file. First, we must open
the file and assign it to a filehandle. All operations will be done
on the file via the filehandle. Earlier, we used <STDIN> as a
filehandle - we read from it.
$stuff="c:\scripts\stuff.txt";
open STUFF, $stuff;
while (<STUFF>) {
print "Line number $. is : $_";
}
What this script does is fail. What it should do is open the file
defined in $stuff, assign it to the filehandle STUFF and then, while
there are still lines left in the file, print the line number $. and
the current line.
An unforgivable error
It fails. That's not so bad, everything fails sometimes. What is
unforgivable is NOT CHECKING THE ERROR CODE !
This is a better version:
open STUFF, $stuff or die "Cannot open $stuff for read :$!";
If the open operation fails, the or means that the code on the RHS
(right hand side) is evaluated. Perl dies. This means it exits the
script and tells you the line number at which it died. The
post-mortem on the failed open is written up in $! . Just because $!
contains useful information doesn't mean to say it is automagically
printed, in true perl fashion; you have to put it in the message
yourself. Usually you will wish to avail yourself of the information
inside as it is of great help when working out why something is not
going according to plan. The moral of the chapter is:
Always check your return codes !
\\ or / in pathnames -- your choice
The problem should now be apparent. The backslashes, being escape
characters, are not displayed. There are two ways to fix this:
- Escape the backslashes, like so
$stuff="c:\\scripts\\stuff.txt";
- Convert backslashes into forward slashes :
$stuff="c:/scripts/stuff.txt";
The forward slashes are the preferred option, even under Win32,
because you can then port the script direct to Unix or other
platforms (assuming you don't use drive letters), and it is less
typing. If you wish to use Perl to start external processes then you
must use the \\ method, but this variable will be used only in a Perl
program, not as a parameter to start an external program. Changing
the $stuff variable results in a working script. Always check
your return codes !
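If you want to see exactly what happens to the backslashes, print the strings. The sketch below also includes a single-quoted version, which is an extra option I am adding: single quotes leave these backslashes alone. Warnings are deliberately not switched on here, because with them Perl complains about the unrecognised \s escape in the first string.

```perl
use strict;

my $double  = "c:\scripts\stuff.txt";     # \s is read as an escape and the backslash vanishes
my $escaped = "c:\\scripts\\stuff.txt";   # each \\ becomes one real backslash
my $single  = 'c:\scripts\stuff.txt';     # single quotes: these backslashes survive as typed
my $forward = "c:/scripts/stuff.txt";     # nothing to escape at all

print "$_\n" for $double, $escaped, $single, $forward;
```

The first line printed is c:scriptsstuff.txt, which explains why the open could not find the file.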
Reading a file
$stuff="c:/scripts/stuff.txt";
open STUFF, $stuff or die "Cannot open $stuff for read :$!";
while (<STUFF>) {
print "Line $. is : $_";
}
A little more detail on what is happening here. The file is
opened for read. You can append and write too. You don't have to use
a variable, but I always do because it is then easy to change later
on and easy to insert into the or die section. Hardcoding things is
not the best way to write a maintainable and flexible program. Just
ask the Year 2000 people about code that lived a little longer than
the authors imagined :-).
open STUFF, "c:/scripts/stuff.txt" or die "Cannot open stuff.txt for read :$!";
is just as good but more work if you want to change anything.
The line input operator (that's the angle brackets <> ) reads from
the beginning of the file up until and including the first newline.
The read data goes into $_ , and you can do what you want with it
there. On the next iteration of the loop data is read from where the
last read left off, up to the next newline. And so on until there is
no more data. When that happens the condition is false and the loop
terminates. That's the default behaviour, but we can change this.
This means that you can open a 200MB file in perl and run through
it without having to load the entire file into memory. 200MB of
memory is quite a bit. If you really want to load the entire 200MB
file into one variable, Perl lets you. Limits are not the Perl Way.
The special variable $. is the current line number, starting at 1.
As usual, there is a quicker way to do the previous program.
$STUFF="c:/scripts/stuff.txt";
open STUFF or die "Cannot open $STUFF for read :$!";
while (<STUFF>) {
print "Line $. is : $_";
}
This saves a little bit of typing, but does tie your filehandle
to the variable name. In fact, that entire program could be
compressed
further, but that's for later.
If you are really into shortness, try this:
$STUFF="c:/scripts/stuff.txt";
open STUFF or die "Cannot open $STUFF for read :$!";
print "Line $. is : $_" while (<STUFF>);
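For completeness, here is the same read loop in a style you will often meet in newer Perl code: a three-argument open and a filehandle held in an ordinary scalar. This is a sketch, not a replacement for the examples above. The helper name numbered_lines is made up, and the File::Temp module is used only so the example creates its own throwaway file instead of depending on c:/scripts.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file to read from.
my ($tmp, $stuff) = tempfile(UNLINK => 1);
print {$tmp} "The Main Perl Newsgroup:comp.lang.perl.misc\n",
             "The Perl FAQ:http://www.perl.com/faq/\n";
close $tmp or die "Cannot close $stuff: $!";

# Hypothetical helper: return each line prefixed with its line number.
sub numbered_lines {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file for read: $!";
    my @lines;
    while (my $line = <$in>) {          # one line per iteration, as before
        push @lines, "Line $. is : $line";
    }
    close $in;
    return @lines;
}

print numbered_lines($stuff);
```

The '<' spells out that the file is opened for read, and the return code is still checked. Some habits are worth keeping whatever the style.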
Writing to a File
A simple write
$out="c:/scripts/out.txt";
open OUT, ">$out" or die "Cannot open $out for write :$!";
for $i (1..10) {
print OUT "$i : The time is now : ",scalar(localtime),"\n";
}
Note the addition of >
to the filename. This opens it for writing. If we want to print
to the file we now just specify the filehandle name. You print to
the
filehandle, which is a gateway to the file.
Filehandles don't have to be capitalised, but it is wise. All
Perl functions are lowercase, and Perl is case-sensitive. So if
you choose uppercase names they are guaranteed not to conflict with
current or future function words.
And a neat way to grab the date sneaked in there too. You should
be aware that writing to a file overwrites the file. It does
not append data! However, you may append:
Appending
$out="c:/scripts/out.txt";
&printfile;
open OUT, ">>$out" or die "Cannot open $out for append :$!";
print OUT 'The time is now : ',scalar(localtime),"\n";
close OUT;
&printfile;
sub printfile {
open IN, $out or die
"Cannot open $out for read :$!";
while (<IN>) {
print;
}
close IN;
}
This script demonstrates subroutines again, and how to append to
a file, that is write additional data at the end. The close function
is introduced here. This, well, closes a filehandle. You don't have
to close a filehandle - just leave it open until the script finishes,
or the next open command to the same filehandle will close it for
you.
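The difference between > and >> is easiest to believe when you see it. A minimal sketch follows, with a made-up helper (put_line) and a temporary file from File::Temp so nothing in c:/scripts gets trampled.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $out) = tempfile(UNLINK => 1);
close $tmp or die "Cannot close $out: $!";

# Hypothetical helper: open $file in the given mode, print one line, close it.
sub put_line {
    my ($mode, $file, $text) = @_;
    open my $fh, $mode, $file or die "Cannot open $file with mode $mode: $!";
    print {$fh} "$text\n";
    close $fh or die "Cannot close $file: $!";   # close also flushes Perl's output buffer
}

put_line('>',  $out, 'first write');    # > empties the file, then writes
put_line('>',  $out, 'second write');   # > again: the first line is gone
put_line('>>', $out, 'appended');       # >> keeps what is there and adds to the end

open my $in, '<', $out or die "Cannot open $out for read: $!";
my @lines = <$in>;
close $in;
print @lines;
```

Only 'second write' and 'appended' survive, which is the overwrite warning from the previous section in action.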
@ARGV: Command Line Arguments
Perl has a special array called @ARGV . This is the list of
arguments passed along with the script name on the command line. Run
the following perl script as:
perl myscript.pl hello world how are you
foreach (@ARGV) {
print "$_\n";
}
Another useful way to get parameters into a program -- this time
without user input. The relevance to filehandles is as follows. Run
the
following perl script as:
perl myscript.pl stuff.txt out.txt
while (<>) {
print;
}
Short and sweet ? If you don't specify anything in the angle
brackets, whatever is in @ARGV is used instead. And after it finishes
with the first file, it will carry on with the next and so on. You'll
need to remove non-file elements from @ARGV before you use this.
It can be shorter still:
perl myscript.pl stuff.txt out.txt
print while <>;
Read it right to left. It is possible to shorten it even further
!
perl myscript.pl stuff.txt out.txt
print <>;
This takes a little explanation. As you know, many things in
Perl, including filehandles, can be evaluated in list or scalar
context. The result that is returned depends on the context.
If a filehandle is evaluated in scalar context, it returns the
next line of whatever file it is reading from. If it is evaluated in
list context, it returns a list, the elements of which are all the
remaining lines of the files it is reading from.
The print function is a list operator, and therefore evaluates
everything it is given in list context. As the filehandle is
evaluated in list context, print is given a list !
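You can watch the two contexts at work without touching the disk. The sketch below assumes an in-memory filehandle (opening a reference to a string), which is not something covered above. It is used here purely to keep the example self-contained.

```perl
use strict;
use warnings;

my $data = "one\ntwo\nthree\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $first = <$fh>;   # scalar context: just the next line
my @rest  = <$fh>;   # list context: every remaining line

print "Scalar context gave: $first";
print "List context gave ", scalar(@rest), " more lines\n";
```

The same operator, the same filehandle, and two quite different results depending on what is on the left of the = sign.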
Who said short is sweet? Not my girlfriend, but that's another
story. The shortest scripts are not usually the easiest to
understand, and not even always the quickest. Aside from knowing
what you want to achieve with the program from a functional point of
view, you should also know whether you are coding for maximum
performance, easy maintenance or whatever -- because chances are
those goals may be to some extent mutually exclusive.
Modifying a File with $^I
One of the most frequent Perl tasks is to open a file, make some
changes and write it back to the original filename. You already have
enough knowledge to do this. The steps would be:
- Make a backup copy of the file
- Open the file for read
- Open a new temporary file for write
- Go through the read file, and write it and any changes to the
temp file
- When finished, close both files
- Delete the original file
- Rename the temp file to the original filename
If you have managed to get this far and assiduously work through
the examples, the above will be child's play. Play if you want, but
there is a Better Way.
Make sure you have data in c:\scripts\out.txt
then run this:
@ARGV="c:/scripts/out.txt";
$^I=".bk";
# let the magic begin
while (<>) {
tr/A-Z/a-z/;
# another new function sneaked in
print;
# this goes to the temp filehandle, ARGVOUT,
# not STDOUT as usual, so don't mess with it !
}
So, what's happening? First, we load up @ARGV with the name of a
file. It doesn't matter how @ARGV is loaded. We could have taken it
from the command line.
The $^I is a special variable. You knew that just by looking at it.
Its name is the Inplace Edit variable, and when it has a value the
effects are:
- The name of the file to be in-place edited is taken from the
first element of @ARGV . In this case, that is c:/scripts/out.txt .
The file is renamed to its existing name plus the value of $^I ,
ie out.txt.bk .
- The file is read as usual by the diamond operator <> , placing a
line at a time into $_ .
- A new filehandle is opened, called ARGVOUT , and no prizes for
guessing it is opened on a file called out.txt . The original
out.txt has been renamed.
- The print prints automatically to ARGVOUT , not STDOUT as it
would usually.
At the end of the operation you have neatly edited the file and
made a backup. If you don't want a backup, assign a null string to $^I
but don't go crying on any mailing lists if you lose
data.
The usual method of in-place editing would involve just printing
everything back where it came from until your regex finds whatever
needs changing. You could of course slurp the whole file into memory
and play with it there, which could be a lot easier but if you are
dealing with files of more than a few megabytes this is probably not
a feasible approach.
Now take a look at out.txt . Notice how all capital letters have
been transliterated into lowercase. This is the tr operator at work,
which is more efficient than regex for changing single characters.
But that's only a small part of the tr function's value to the world.
More later.
You should also have an out.txt.bk file.
And finally, notice the way @ARGV has been created. You don't have
to create it from the command line arguments -- it can be treated
like an ordinary array, for that is what it is.
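If you do this inside a bigger program, it is polite to put $^I and @ARGV back the way you found them. The sketch below uses local for that, an idiom along the lines of the one in the Perl FAQ. The helper names (lowercase_in_place, slurp) are my own, and the file is a temporary one created with File::Temp.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $file) = tempfile(UNLINK => 1);
print {$tmp} "The Time Is NOW\n";
close $tmp or die "Cannot close $file: $!";

# Hypothetical helper: lowercase $target in place, keeping a copy as "$target.bk".
sub lowercase_in_place {
    my ($target) = @_;
    local ($^I, @ARGV) = ('.bk', $target);   # both are restored when the sub returns
    while (<>) {
        tr/A-Z/a-z/;
        print;                               # goes to the rewritten file, not the screen
    }
    return;
}

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path for read: $!";
    my $text = join '', <$in>;
    close $in;
    return $text;
}

lowercase_in_place($file);
my $edited = slurp($file);
my $backup = slurp("$file.bk");
unlink "$file.bk";                           # tidy up the backup we asked for

print "Edited: $edited", "Backup: $backup";
```

The edited file holds the lowercase text and the .bk file holds the original, exactly as in the bullet points above.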
$/ -- Changing what is read into $_
On a different note, what if your input file doesn't look like
this:
Beer
Wine
Pizza
Catfood
which is nicely delimited with a newline each time, but like
this:
shorts
t-shirt
blouse

pizza
beer
wine
catfood

Viz
Private Eye
The Independent
Byte

toothpaste
soap
towel
which is delimited by TWO newlines, not one. You don't have to
save the above as shop.txt , but if you don't, the examples will be
difficult to follow.
Now, if you want each set of items as elements in an array you'll
have to do something like this:
$SHOP="shop.txt";
$x=0;
open SHOP or die "Can't open $SHOP for read: $!\n";
while (<SHOP>) {
if (/^\n/) { # does line begin with newline ?
$x++; # if so, increment $x. Rest of if statement not executed.
} else {
$list[$x].=$_; # glue $_ on the end of whatever is in $list[$x], using a .
}
}
foreach (@list) {
print "Items are:\n$_\n\n";
}
which works, but there is a much easier way to do it. You knew I
was going to say that.
$SHOP="shop.txt";
$/="\n\n";
open SHOP or die "Can't open $SHOP for read: $!\n";
while (<SHOP>) {
push (@list, $_);
}
foreach (@list) {
print "Items are:\n$_\n\n";
}
The $/ variable is a special variable (it even looks special). It
is the Default Input Record Separator. Remember the operation of the
angle brackets being to read a file in up until the next newline ?
Time to come clean. What the angle brackets actually do is read up
until whatever $/ is set to. It is set to a newline by default.
So if we set it to two newlines, as above, then it reads up until
it finds two consecutive newlines, then puts the data into $_ . This
makes the program a lot shorter and quicker. You can set $/ to just
about anything, not just a newline. If you want to hack this list for
example:
Tea:Beer:Wine:Pizza:Catfood:Coffee:Chicken:Salmon:Icecream
you could just leave $/ as a newline and slurp it into memory in
one go, but imagine the above items are a list of clothes that your
girlfriend wants to buy or a list of clothes your boyfriend should
have thrown away by now. Either are going to be really big files, and
you might not want to read it all into memory in one go. So set
$/=":"; and all will be well. There are also read and seek functions,
but they aren't covered here. Those are useful for files where you
read in a precise number of bytes.
We'll go back to the last example for a moment. It is useful to
know how to read just one line (well, up to $/
) at a time:
$SHOP="shop.txt";
$/="\n\n";
open SHOP or die "Can't open $SHOP for read: $!\n";
$clothes=<SHOP>; # everything up until the first occurrence of $/ into $clothes
$food=<SHOP>; # everything from first occurrence of $/ to the second into $food
print "We need...\n",$clothes,"...and\n",$food;
And now we know that, there is a even quicker way to achieve the
aim of the original program :
$SHOP="shop.txt";
$/="\n\n";
open SHOP or die "Can't open $SHOP for read: $!\n";
@list=<SHOP>; # dumps *all* of $SHOP into @list, not just one line.
foreach (@list) {
print "Items are:\n$_\n\n";
}
and you don't need to grab it all : @list[0..2]=<SHOP> . We haven't
mentioned list context for a while. Whether the line input operator
<> returns a single value or a list depends on the context you use it
in. When you supply @xxxxx then this must be a list. If you supply
$xxxxx then that's a scalar variable. You can force it into list
context by using parens.
The two lines below are provided so you can paste them into the
above program. They demonstrate how parens force list context.
Remember to replace the foreach with something that prints the
variables.
($first, $second) = <SHOP>;
$first, $second = <SHOP>;
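One habit worth picking up with $/ is to change it only for as long as you need it. The sketch below assumes two things not covered above: local inside a bare block, so the old value of $/ comes back at the closing brace, and an in-memory filehandle standing in for shop.txt. The shortened shopping list is my own.

```perl
use strict;
use warnings;

my $shop = "shorts\nt-shirt\n\npizza\nbeer\n\nViz\nByte\n\n";
open my $fh, '<', \$shop or die "Cannot open in-memory file: $!";

my ($clothes, $food, @others);
{
    local $/ = "\n\n";                 # the change is undone at the closing brace
    $clothes = <$fh>;                  # scalar context: one record
    $food    = <$fh>;                  # and the next one
    @others  = <$fh>;                  # list context: all remaining records
    chomp($clothes, $food, @others);   # chomp removes a trailing $/, here two newlines
}

print "Clothes:\n$clothes\n";
print "Food:\n$food\n";
print scalar(@others), " more record(s)\n";
```

Anything that reads lines after the block gets plain newline-delimited input again, which saves some very confusing debugging.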
HERE Docs
The problem:
print "This is a long line of text which might be too long
to fit on just one line\n";
print "and I was right, it was too long to fit on one line.
In fact, it looks like it\n";
print "might very well take up to FOUR, yes FOUR lines to
print. That's four print\n";
print "statements, which takes up even more room. But
wait! I'm wrong! It will take\n";
print "FIVE lines to print this statement! Or is that six
lines? I'm not sure....\n";
The solution:
$var='variable interpolated';
print <<PRT;
This is a long line of text which might be too long to fit on just
one line
and I was right, it was too long to fit on one line. In fact,
it looks like
it might very well take up to FOUR, yes FOUR lines to print.
That's four print statements, which takes up even more room.
But wait! I'm
wrong! It will take FIVE lines to print this statement!
Or maybe six lines?
I'm not sure....but anyway, just to prove this can be $var.
PRT
That's called a 'here' document and you don't need to use PRT ,
you can use whatever you like within reason. You don't need to put in
explicit newlines, although if you do they perform as usual. Now you
know about here docs you can stop wearing the print function out by
calling it every couple of lines. Here docs aren't just for printing
to files either; use them anywhere you'd normally put more than one
print statement.
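A here doc is really just another way of quoting a string, so it can be assigned to a variable too, and the quotes round the terminator decide whether variables are interpolated. A short sketch, with the two variable names ($double, $single) made up for the purpose:

```perl
use strict;
use warnings;

my $var = 'variable interpolated';

my $double = <<"PRT";
Double-quoted terminator: this can be $var.
PRT

my $single = <<'PRT';
Single-quoted terminator: $var is left exactly as typed.
PRT

print $double, $single;
```

A bare terminator, as in the example above, behaves like the double-quoted one.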
Reading Directories
Globbing
For this exercise, I suggest creating another directory where you
have at least two text files and two or more binary files. Copy a
couple of .dll files from your WINDIR directory if you need to,
those will do for the binaries, and save a couple of random text
files. Size doesn't matter, in this case.
Then run this, giving the directory as the command line argument:
$dir=shift; # shifts @ARGV, the command line arguments after the script name
chdir $dir or die "Can't chdir to $dir:$!\n" if $dir;
while (<*>) {
print "Found a file: $_\n" if -T;
}
The chdir function changes perl's working directory. You should,
as ever, test to see if it worked or not. In this case we only try
and change directory if $dir is true.
The <*> construct reads all files from a given directory, and
prints if it passes the file test -T , which returns true if the file
is a non-binary, ie text file. You can be more specific:
$dir =shift;
$type='txt';
chdir $dir or die "Can't chdir to $dir:$!\n" if $dir;
while (<*.$type>) {
print "Found a file: $_\n";
}
like so. But, there is a better way to read from directories.
The method above is rather slow and inflexible.
readdir : How to read from directories
Instead, there is readdir . Another version of the previous example:
$dir= shift || '.';
opendir DIR, $dir or die "Can't open directory $dir: $!\n";
while ($file= readdir DIR) {
print "Found a file: $file\n";
}
The first difference is the first line, which essentially says if
shift is false, then $dir = . , which is of course the current
directory. Then, the directory is opened and we have the chance to
trap the error. It is assigned a directory handle, DIR , which works
much like a filehandle.
The readdir function reads each filename into $file , one per
call. There is no while (<DIR>) { construct.
We can also apply the text file test. Run this, once without
entering a directory and the second time with entering a directory
path other than the one the script is in:
$dir= shift || '.';
opendir DIR, $dir or die "Can't open directory $dir: $!\n";
while ($file= readdir DIR) {
print "Found a file: $file\n" if -T $file ;
}
Firstly, because the filename is now not in $_ we have to
explicitly apply the -T test to it with -T $file .
Why did this not work the second time? Look at the code carefully.
You are testing $file . If perl doesn't get a fully qualified
pathname, it assumes you are still in the directory the script was
run from, or that of the last successful chdir . Not necessarily
where you are readdir'ing from. So, to fix it:
print "Found a file: $dir/$file\n" if -T "$dir/$file" ;
where we now specify the pathname, both in the printout and in the
file test itself. The "" are used because otherwise perl tries to
divide $dir by $file .
Try running this on a directory with only a few files in it:
$dir= shift || '.';
opendir DIR, $dir or die "Can't open directory $dir: $!\n";
while ($file= readdir DIR) {
print "Found a file: '$file'\n";
}
Notice that two files are found which have interesting names,
namely . and .. . These two entries are the current and parent
directories respectively. Nothing new, they have always been
there -- run the DOS command dir if you don't believe me. You don't
usually want to know about them, so:
while ($file= readdir DIR) {
next if $file=~/^\./;
print "Found a file: '$file'\n";
}
is the usual workaround. Note that it skips every name that begins
with a dot, not just those two. You can use list context to dump
everything into an array:
$dir= shift || '.';
opendir DIR, $dir or die "Can't open directory $dir: $!\n";
@files=readdir(DIR);
print "@files";
but that includes the . and .. entries, so it is best to
ensure they aren't included:
@files=grep !/^\./, readdir(DIR);
We haven't met grep yet, but for the moment just remember it
searches a list and if the test returns true, lets the element
pass. In this case, if it doesn't begin with . then that's true
so it goes into @files.
There are other commands associated with reading directories:
telldir tells you where in a directory you are, and seekdir and
rewinddir take you back to a saved position or to the start. You
should be aware of their existence, because you never know when you
might need them. The one other command of use is closedir, which
closes a directory. Optional, but recommended for clarity.
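The directory functions mentioned above can be sketched together as follows. This is only an illustration under a few assumptions of mine: it uses strict, warnings and a lexical directory handle ($dh) rather than the bareword DIR used so far, and it filters out only . and .. instead of every name that starts with a dot.

```perl
use strict;
use warnings;

my $dir = shift || '.';

opendir(my $dh, $dir) or die "Can't open directory $dir: $!\n";

# Keep everything except the two special entries.
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;

# rewinddir goes back to the start, so the directory can be read again.
rewinddir $dh;
my @everything = readdir $dh;

closedir $dh;

print "Found ", scalar(@entries), " entries in $dir (",
      scalar(@everything), " counting . and ..)\n";
```

Run it with no argument to list the current directory, or pass a directory path as before.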
Associative Arrays
The Basics
Very, very useful. First, a quick recap on arrays. Arrays are an
ordered list of scalar variables, which you access by their index
number starting at 0. The elements in arrays always stay in the
same order.
Hashes are a list of scalars, but instead of being accessed by
index number, they are accessed by a key. The tables below
illustrate the point:
@myarray
Index No. | Value
0 | The Netherlands
1 | Belgium
2 | Germany
3 | Monaco
4 | Spain

%myhash
Key | Value
NL | The Netherlands
BE | Belgium
DE | Germany
MC | Monaco
ES | Spain
So if we want 'Belgium' from @myarray and also from %myhash,
it'll be:
print "$myarray[1]";
print "$myhash{'BE'}";
Notice that the $ prefix is used, because it is a scalar variable.
Despite the fact it is part of a list, it is still a scalar
variable. The hash syntax is simply to use braces { } instead of
square brackets.
So why use hashes ? When you want to look something up by a
keyword. Suppose we wanted to create a program which returns the
name of the country when given a country code. We'd input ES, and
the program would come back with Spain.
You could do it with arrays. It would be messy however. One
possible approach:
- Create @countries, and give it values such as 'ES,Spain'
- Iterate over the entire array and split each element of the
array, and check the first result to see if it matches the input
- If so, print the matching country
@countries=('NL,The Netherlands','BE,Belgium','DE,Germany','MC,Monaco','ES,Spain');
print "Enter the country code:";
chop ($find=<STDIN>);
foreach (@countries) {
($code,$name)=split /,/;
if ($find=~/$code/i) {
print "$name has the code $code\n";
}
}
Complex and slow. We could also store a reference to another
array in each element of @countries, but that is not efficient.
Whichever way we choose, we still need to search the whole thing.
And what if @countries is a big array ? See how much easier a hash
is:
A Hash in Action
%countries=('NL','The Netherlands','BE','Belgium','DE','Germany','MC','Monaco','ES','Spain');
print "Enter the country code:";
chop ($find=<STDIN>);
$find=~tr/a-z/A-Z/;
print "$countries{$find} has the code $find\n";
Very easy. All we need to do is make sure everything is in
uppercase with tr and we are there. Notice the way %countries
is defined - exactly the same as a normal array, except that the
values are put into the hash in key/value pairs.
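The same lookup can be written a little more defensively. The sketch below is an illustration, not the tutorial's own program: the => 'fat comma', the exists check and the hard-coded inputs (standing in for STDIN) are my additions, chosen so the block runs without typing anything in.

```perl
use strict;
use warnings;

my %countries = (
    NL => 'The Netherlands',
    BE => 'Belgium',
    DE => 'Germany',
    MC => 'Monaco',
    ES => 'Spain',
);

my @answers;
foreach my $find ('es', 'BE', 'xx') {
    my $code = uc $find;    # same job as tr/a-z/A-Z/
    if (exists $countries{$code}) {
        push @answers, "$countries{$code} has the code $code";
    } else {
        push @answers, "No country has the code $code";
    }
}
print "$_\n" foreach @answers;
```

Checking with exists avoids printing an empty country name when the code is not in the hash.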
When you should use hashes
So why use arrays ? One excellent reason is because when an array
is created, its variables stay in the same order you created them
in. With a hash, perl reorders elements for quick access. Add
print %countries; to the end of that program above and run
it. See what I mean ? No recognisable sequence at all. It's like
trying to herd cats. If you were writing code that stored a list of
variables over time and you wanted it back in the order you found it
in, don't use a hash.
Finally, you should know that each key of a hash must be
unique. Stands to reason, if you think about it. You are
accessing the hash via keys, so how can you have two keys named 'NL'
or something ? If you do define a certain key twice, the second
value overwrites the first. This is a feature, and useful. The
values of a hash can be duplicates, but never the keys.
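A short sketch of the overwrite rule, assuming nothing beyond the text above. The key 'XX' is a made-up code used purely to show that two keys may share a value.

```perl
use strict;
use warnings;

# 'NL' appears twice in the list; the later value wins.
my %countries = ('NL', 'Holland', 'BE', 'Belgium', 'NL', 'The Netherlands');

my $count = keys %countries;    # keys in scalar context gives a count
print "$count keys, NL is $countries{NL}\n";

# Duplicate values are fine: two different keys can hold the same value.
$countries{XX} = 'Belgium';     # 'XX' is a hypothetical code
my @belgiums = grep { $countries{$_} eq 'Belgium' } sort keys %countries;
print "Keys holding 'Belgium': @belgiums\n";
```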
If you want to assign to a hash, there is of course no concept of
push, pop and splice etc. Instead:
Hash Hacking Functions
Assigning | $countries{PT}='Portugal';
Deleting | delete $countries{NL};
Accessing Your Hash
Assuming you keep the same %countries hash as above, here are some
useful ways to access it:
All the keys | print keys %countries;
All the values | print values %countries;
A Slice of Hash :-) | print @countries{'NL','BE'};
How many elements ? | print scalar(keys %countries);
Does the key exist ? | print "It's there !\n" if exists $countries{'NL'};
Well, that last one is not an access as such, but useful anyway.
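The operations in the two tables can be run together like this. It is a sketch only: the keys are sorted here purely so that the printed order is predictable, which a bare keys call does not promise.

```perl
use strict;
use warnings;

my %countries = (NL => 'The Netherlands', BE => 'Belgium', DE => 'Germany',
                 MC => 'Monaco', ES => 'Spain');

$countries{PT} = 'Portugal';    # assigning
delete $countries{NL};          # deleting

my @codes = sort keys %countries;       # all the keys, in sorted order
my @pair  = @countries{'BE', 'ES'};     # a slice: note the @ prefix
my $total = keys %countries;            # how many elements

print "Codes: @codes\n";
print "Slice: @pair\n";
print "Total: $total\n";
print "NL has gone\n" unless exists $countries{'NL'};
```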
More Hash Access: Iteration, keys and values
You may have noticed that keys and values return a list. And
we can iterate over a list, using foreach:
foreach (keys %countries) {
print "The key $_ contains $countries{$_}\n";
}
which is useful. Note how any list can be fed to foreach, and off
it goes. As usual, there is another way to do the above:
while (($code,$name)=each %countries) {
print "The key $code contains $name\n";
}
The each function returns each key/value pair of the hash, and is
slightly faster. In this example we assign them to
a list (you spotted the parens ?) and away we go. Eventually
there are no more pairs, which returns false to the while
loop and it stops.
If you are into brevity, both the above can be accomplished in a
single line:
print "The key $code contains $name\n" while ($code,$name)=each %countries;
print "The key $_ contains $countries{$_}\n" foreach keys %countries;
Note -- this won't win any prizes for easily readable code by
non-programmers of Perl.
Sorting
A Simple Sort
If I was reading this I'd be wondering about sorting. Wonder no
more, and behold:
foreach (sort keys %countries) {
print "The key $_ contains $countries{$_}\n";
}
Spot the difference. Yes, sort crept in there. If you want the
list sorted backwards, some cunning is called for. This is
suitably foxy:
foreach (reverse sort keys %countries) {
print "The key $_ contains $countries{$_}\n";
}
Perl is just so difficult at times, don't you think ? This
works because:
- keys returns a list
- sort expects a list -- and gets one from keys, and sorts it
- reverse also expects a list, so it gets one and returns it
reversed
- then the whole list is foreach'd over.
This is a quick example to make sure the meaning of reverse
is clear:
print "Enter string to be reversed: ";
$input=<STDIN>;
@letters=split //,$input; # splits on the 'nothings' in between each character of $input
print join ":", @letters; # joins all elements of @letters with ':', prints it
print reverse @letters; # prints all of @letters, but sdrawkcab )-:
Perl's list operators can just feed directly to each other,
saving many lines of code but also decreasing readability to those
that aren't Perl-literate:
print "Enter string to be reversed: ";
print join ":",reverse split //,$_=<STDIN>;
This section is about sorting, so enough of reverse. Time to go
forwards instead.
Numeric Sorting -- How Sort Really Works
That's easy alphabetical sorting by the keys. If you had a hash
of international access numbers like this one:
%countries=('976','Mongolia','52','Mexico','212','Morocco','64','New Zealand','33','France');
foreach (sort keys %countries) {
print "The key $_ contains $countries{$_}\n";
}
You might want to sort numerically. In that case, you need to
understand how Perl's sort function works.
The sort function compares two variables, $a and $b. They must be
called $a and $b otherwise it won't work. One chap published a book
with stolen code, and he changed $a and $b to $x and $y. He
obviously didn't test the program because it would have failed and
he would have noticed. And this book was really published ! Don't
believe everything you read in books -- but web tutorials are
always 100% truthful :-)
Back to sorting. $a and $b are compared, and the result is:
- 1 if $a is greater than $b
- -1 if $b is greater than $a
- 0 if $a and $b are equal
So as long as the sort function gets
one of those three values back it is happy. This means we can write
our own sort routines, and feed them to sort. For example, we know
the default sort is alphabetical. But if we write this:
%countries=('976','Mongolia','52','Mexico','212','Morocco','64','New Zealand','33','France');
foreach (sort supersort keys %countries) {
print "$_ $countries{$_}\n";
}
sub supersort {
if ($a > $b) {
return 1;
} elsif ($a < $b) {
return -1;
} else {
return 0;
}
}
then it works correctly. Of course, there is an easier way.
The 'spaceship' operator <=>. It does exactly what the supersort
subroutine does, namely return 1, -1 or 0
depending on the comparison of two given values.
So we can write the above much more easily as:
%countries=('976','Mongolia','52','Mexico','212','Morocco','64','New Zealand','33','France');
foreach (sort { $a <=> $b } keys %countries) {
print "$_ $countries{$_}\n";
}
Notice the { } braces, which define the contents as the
subroutine sort must use. Pretty short subroutine. There is a
companion operator to <=>, namely cmp, which does exactly the same
thing but of course compares the values as strings, not numbers.
Remember, if you are comparing numbers, your comparison operator
should contain non-alphas; if you are comparing strings, the
operator should contain alphas only. And don't talk to strangers.
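To see the difference between the default string sort and a numeric sort side by side, here is a small sketch using the access codes from the hash above as a plain list. The variable names are mine.

```perl
use strict;
use warnings;

my @codes = (976, 52, 212, 64, 33);

my @as_strings = sort @codes;                 # default: compares as strings
my @as_numbers = sort { $a <=> $b } @codes;   # numeric, ascending
my @descending = sort { $b <=> $a } @codes;   # numeric, descending

print "As strings: @as_strings\n";   # 212 33 52 64 976
print "As numbers: @as_numbers\n";   # 33 52 64 212 976
print "Descending: @descending\n";   # 976 212 64 52 33

# The operators themselves just return -1, 0 or 1.
print 10 <=> 9, " ", "10" cmp "9", "\n";      # 1 -1
```

The last line shows why the choice matters: 10 is the bigger number, but the string "10" sorts before "9".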
Anyway, you now have enough knowledge to sort a hash by value
instead of keys. Suppose your pointy haired manager bounced up to
you and demanded a hash sorted by value ? What would you do ?
OK, what should you do ?
Well, we could just sort the values.
foreach (sort values %countries) {
But Pointy Hair wants the keys too. And if you only have a value,
there is no direct way to find its key.
So we have to iterate over the keys. But just because we are
iterating over the keys doesn't mean to say we have to compare the
keys inside sort. What about:
%countries=('976','Mongolia','52','Mexico','212','Morocco','64','New Zealand','33','France');
foreach (sort { $countries{$a} cmp $countries{$b} } keys %countries) {
print "$_ $countries{$_}\n";
}
beautifully simple. If you want a reverse sort, transpose $a and $b.
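If you really do need to get from a value back to its key, one common approach is to invert the hash. The sketch below assumes every value is unique (otherwise later pairs overwrite earlier ones); %code_for is a name I made up. It also shows the transposed $a and $b for a reverse sort by value.

```perl
use strict;
use warnings;

my %countries = (976 => 'Mongolia', 52 => 'Mexico', 212 => 'Morocco',
                 64 => 'New Zealand', 33 => 'France');

# reverse flips the key/value list, so the values become the keys.
my %code_for = reverse %countries;
print "Morocco is $code_for{'Morocco'}\n";    # Morocco is 212

# Sorted by country name, descending: $a and $b are transposed.
my @by_name_desc = sort { $countries{$b} cmp $countries{$a} } keys %countries;
print "@by_name_desc\n";                      # 64 212 976 52 33
```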
Sorting Multiple Lists
You can sort several lists at the same time:
%countries=('976','Mongolia','52','Mexico','212','Morocco','64','New Zealand','33','France');
@nations=qw(China Hungary Japan Canada Fiji);
@sorted= sort values %countries, @nations;
foreach (@nations, values %countries) {
print "$_\n";
}
print "#----\n";
foreach (@sorted) {
print "$_\n";
}
This sorts @nations and the values from %countries into a new
array.
The example also demonstrates that you can foreach over more than
one list value -- each list is processed in turn. How
I discovered that particular trick with Perl is instructive. I just
tried it. If you think you should be able to do something with Perl,
try it. Adhere to the syntax and conventions you will be familiar
with from experience, in this case delimiting a list with commas,
and try it. I'm always finding new shortcuts just by
experimentation.
Grep and Map
Grep
If you want to search a list, and create another list of things
you found, grep is one solution. This is an example, which also
demonstrates join again :
@stuff=qw(flying gliding skiing dancing parties racing); # quote-worded list
@new = grep /ing/, @stuff; # Creates @new, which contains the elements of @stuff with 'ing' in them.
print join ":",@stuff,"\n"; # joins the elements of @stuff and "\n" with ':' into one string, then prints it
print join ":",@new,"\n";
Remember qw means 'quote words', so whitespace is used as the
delimiter instead of commas. The grep function must be fed a list
on the right hand side. On the left side, you may assign the
results to a list or a scalar variable. Assigning to a list gives
you each actual element, and to a scalar gives you the number of
matches found:
@stuff=qw(flying gliding skiing dancing parties racing);
$new = grep /ing/, @stuff;
print join ":",@stuff,"\n";
print "Found $new elements of \@stuff which matched\n";
If you decide to modify the elements on their way through grep,
you actually modify the original list. Be careful out there.
@stuff=qw(flying gliding skiing dancing parties racing);
@new = grep s/ing//, @stuff;
print join ":",@stuff,"\n";
print join ":",@new,"\n";
To determine what actually matches you can either use an
expression or a block. Up to now we've been using expressions, but
when things become more complicated use a block:
@stuff=qw(flying gliding skiing dancing parties racing);
@new = grep { s/ing// if /^[gsp]/ } @stuff;
print join ":",@stuff,"\n";
print join ":",@new,"\n";
Try removing the braces and you'll get an error. Notice that
the comma before the list has gone. It is now obvious where the
expression ends, as it is inside a block delimited with { } . The
regex says if the element begins with g, s or p, then remove ing.
The result is only assigned to @new if the expression is completely
true - 'parties' does begin with p, so that works, but s/ing//
fails so the overall result is false, and the value is not assigned
to @new.
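If you want the trimmed words without touching the original list, one option is to work on a copy of each element. This is a sketch rather than the only way to do it; the loop and the $copy variable are my own choice.

```perl
use strict;
use warnings;

my @stuff = qw(flying gliding skiing dancing parties racing);

my @new;
foreach my $word (@stuff) {
    next unless $word =~ /ing/;   # same test as grep /ing/
    my $copy = $word;             # work on a copy, not on the original element
    $copy =~ s/ing//;
    push @new, $copy;
}

print join(":", @stuff), "\n";    # the original list is untouched
print join(":", @new), "\n";      # fly:glid:ski:danc:rac
```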
Map
Map works the same way as grep, in that they both iterate over a
list, and return a list. There is an important difference however:
- grep returns the value of everything it evaluates to be true;
- map returns the results of everything it evaluates.
As usual, an example will assist the penny in dropping, clear the
fog and turn on the light (if not make my metaphors easier to
understand):
@stuff=qw(flying gliding skiing dancing parties racing);
print "There are ",scalar(@stuff)," elements in \@stuff\n";
print join ":",@stuff,"\n";
@mapped = map /ing/, @stuff;
@grepped = grep /ing/, @stuff;
print "There are ",scalar(@stuff)," elements in \@stuff\n";
print join ":",@stuff,"\n";
print "There are ",scalar(@mapped)," elements in \@mapped\n";
print join ":",@mapped,"\n";
print "There are ",scalar(@grepped)," elements in \@grepped\n";
print join ":",@grepped,"\n";
You can see that @mapped is just a list of 1's. Notice that there
are 5 ones whereas there are six elements in the original array,
@stuff. This is because @mapped contains the results of the
expression for each element -- in every case the expression /ing/
is successful and returns 1, except for 'parties'. In that case
the match fails and, in list context, returns an empty list, so
nothing is added to @mapped. Contrast this action with the grep
function, which returns the actual value, but only if the
expression is true. Try this:
@letters=qw(a b c d e);
@ords=map ord, @letters;
print join ":",@ords,"\n";
@chrs=map chr, @ords;
print join ":",@chrs,"\n";
This uses the ord function to change each letter into its ASCII
equivalent, then the chr function to convert the ASCII numbers back
to characters. If you change map to grep in the example above, you
can see that nothing appears to happen. What is happening is that
grep is trying the expression on each element, and if it succeeds
(is true) it returns the element, not the result. The expression
succeeds for each element, so each element is returned in turn.
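The contrast between "the result" and "the element" may be easier to see with an expression that is not a regex. A minimal sketch, using length and uc, which have not been used with map in this tutorial so far:

```perl
use strict;
use warnings;

my @stuff = qw(flying gliding skiing dancing parties racing);

my @lengths = map  { length($_) }     @stuff;   # the result of the expression
my @long    = grep { length($_) > 6 } @stuff;   # the elements where it is true
my @shouted = map  { uc($_) }         @stuff;   # a new list; @stuff is unchanged

print "@lengths\n";   # 6 7 6 7 7 6
print "@long\n";      # gliding dancing parties
print "@shouted\n";
```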
Another example:
@stuff=qw(flying gliding skiing dancing parties racing);
print join ":",@stuff,"\n";
@mapped = map { s/(^[gsp])/$1 x 2/e } @stuff;
@grepped = grep { s/(^[gsp])/$1 x 2/e } @stuff;
print join ":",@stuff,"\n";
print join ":",@mapped,"\n";
print join ":",@grepped,"\n";
Recapping on regex, what that does is match any element
beginning with g, s or p, and replace that first letter with two
copies of itself. The caret ^ forces a match at the beginning of
the string, the [square brackets] denote a character class, and /e
forces Perl to evaluate the RHS as an expression.
The output from this is a mixture of 1 and nothing for map, and a
three-element array called @grepped from grep. Yet another example:
@mapped = map { chop } @stuff;
@grepped = grep { chop } @stuff;
The chop function removes the last character from a string, and
returns it. So that's what you get back from map, the result of the
expression. The grep function gives you the mangled remains of the
original value.
Writing your own grep and map functions
Finally, you can write your own functions:
@stuff=qw(flying gliding skiing dancing parties racing);
print join ":",@stuff,"\n";
@mapped = map { &isit } @stuff;
@grepped = grep { &isit } @stuff;
print join ":",@mapped,"\n";
print join ":",@grepped,"\n";
sub isit {
($word)=/(^.*)ing/;
if (length $word == 3) {
return "ok";
} else {
return 0;
}
}
The subroutine isit first grabs everything up until 'ing', puts it
into $word, then returns 'ok' if there are three characters in
$word. If not, it returns the false value 0. You can make these
subroutines (think of them as functions) as complex as you like.
Sometimes it is very useful to have map return the actual value,
rather than the result. The
answer is easy, but not obvious. Remember that subroutines return
the value of the last expression evaluated? So, in this case, do
blocks. What if the expression was, very simply:
@grepstuff=@mapstuff=qw(flying gliding skiing dancing parties racing);
print join " ",map { s/(^[gsp])/$1 x 2/e } @mapstuff;
print "\n";
print join " ",grep { s/(^[gsp])/$1 x 2/e } @grepstuff;
Now, make sure $_ is the last thing evaluated:
@grepstuff=@mapstuff=qw(flying gliding skiing dancing parties racing);
print join " ",map { s/(^[gsp])/$1 x 2/e;$_} @mapstuff;
print "\n";
print join " ",grep { s/(^[gsp])/$1 x 2/e } @grepstuff;
and there you have it. Now you understand that you can go and
impress your friends, but please don't count on success.
External Commands
Some ways to...
Perl can start external commands. There are five main ways to do
this:
- system
- exec
- Command Input, also known as `backticks`
- Piping data from a process
- Quote execute
We'll compare system and exec first.
Exec
Poor old exec is broken on Perl for Win32. What it should do is
stop running your Perl script and start running whatever you tell
it to. If it can't start the external process, it should return
with an error code. This doesn't work properly under Perl for
Win32. The exec function does work properly on the standard Perl
distribution.
System
This runs an external command for you, then carries on with the
script. It always returns, and the value it returns goes into $?.
This means you can test to see if the program worked. Actually you
are testing to see if it could be started, what the program does
when it runs is outside your control if you use system.
This example demonstrates system in action. Run the 'vol' command
from a command prompt first if you are
not familiar with it. Then run the 'vole' command. I'm assuming you
have no cute furry executables called vole on your system, or at
least in the path. If you do have an executable called 'vole', be
creative and change it.
system("vole");
print "\n\nResult: $?\n\n";
system("vol");
print "\n\nResult: $?\n\n";
As you can see, a successful system call returns 0. An
unsuccessful one returns a value which you need to divide by 256 to
get the real return value. Also notice you can see the output. And
because system returns, the code after the first system call is
executed. Not so with exec, which will terminate your perl script
if it is successful. Perl's usual use of single and double quotes
applies as per variable interpolation.
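The divide-by-256 step can be sketched like this. To keep the example runnable anywhere, the external command is perl itself (its path is in the special variable $^X), told to exit with the value 3; that choice of command is mine and not part of the tutorial.

```perl
use strict;
use warnings;

# Run a child perl that does nothing except exit with the value 3.
system($^X, '-e', 'exit 3');

my $raw = $?;
die "Could not start the command: $!\n" if $raw == -1;

my $exit_code = $raw >> 8;      # shifting right by 8 bits is dividing by 256
print "Raw value $raw, real exit code $exit_code\n";
```

A raw value of -1 means the command could not be started at all, which is why it is checked before the shift.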
Backticks
These `` are different again to system and exec. They also start
external processes, but return the output of the process. You can
then do whatever you like with the output. If you aren't sure where
backticks are on your keyboard, try the top left, just left of the
1 key. Often around there. Don't confuse single quotes '' with
backticks ``.
$volume=`vol`;
print "The contents of the variable \$volume are:\n\n";
print $volume;
print "\nWe shall regexise this variable thus :\n\n";
$volume=~m#Volume in drive \w is (.*)#;
print "$1\n";
As you can see here, the Win32 vol command is executed. We
just print it out, escaping the $ in the variable name. Then a
simple regex, using # as a delimiter just in case you'd
forgotten delimiters don't have to be / .
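Backticks behave differently in scalar and list context, which the sketch below tries to show. As an assumption for portability it runs perl itself (via $^X) instead of vol, and it assumes the path to perl contains no spaces.

```perl
use strict;
use warnings;

my $perl = $^X;    # path of the running perl; assumed to contain no spaces

# Scalar context: all the output arrives as one string.
my $answer = `$perl -le "print 6*7"`;
chomp $answer;
print "Scalar context: '$answer'\n";

# List context: one element per line of output.
my @lines = `$perl -le "print for 1..3"`;
print "List context gave ", scalar(@lines), " lines\n";

# $? is set by backticks in the same way as by system.
print "The command failed\n" if $? != 0;
```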
When to use external calls
Before you get carried away with creating elaborate scripts based
on the output from NT's net commands, note there are plenty of
excellent modules out there which do a very good job of this sort
of thing, and that any form of external process call slows your
script. Also note there are plenty of built in functions such as
readdir which can be used instead of `dir`.
You should use Perl functions where possible rather than calling
external programs because Perl's functions are:
- portable (usually, but there are exceptions). This means you
can write a script on your Mac PowerBook, test it on an NT box
and then use it live on your Unix box without modifying a single
line of code;
- faster, as every external process significantly slows your
program;
- don't usually require regexing to find the result you want;
- don't rely on output in a particular format, which might be
changed in the next version of your OS or application;
- are more likely to be understood by a Perl programmer -- for
example, $files=`ls`; on a Unix box means little to someone that
doesn't know that ls is the Unix command for listing files, as dir
is in Windows.
Don't start using backticks all over the place when system
will do. You might get a very large return value which you don't
need, and will consequently slurp lots of memory. Just use them when
you actually want to check the returned strings.
Opening a Process
The problem with backticks is that you have to wait for the
entire process to complete, then analyse the entire output.
This is a big problem if you have large amounts of output or slow
processes. For example, the DOS command tree. If you aren't
familiar with this command, run a DOS/command prompt, switch
to the root directory (C:\) and type tree.
Examine the wondrous output.
We can open a process, and pipe data in via a filehandle in
exactly the same way you would read a file. The code below is
exactly the same as opening a filehandle on a file, with two
exceptions:
- We use an external command, not a filename. That's the process
name, in this case, tree.
- A pipe, ie |, is appended to the process name.
open TRIN, "tree c:\\ /a |" or die "Can't see the tree :$!";
while (<TRIN>) {
print "$. $_";
}
Note the | which denotes that data is to be piped from the
specified process. You can also pipe data to a process by using |
as the first character.
As usual, $. is the line number. What we can do now is terminate
our tree early. Environmentally unsound, but efficient.
open TRIN, "tree c:\\ /a |" or die "Can't see the tree :$!";
while (<TRIN>) {
printf "%3s $_", $.;
last if $. == 10;
}
As soon as $. hits 10 we shut the process off by exiting the loop.
Easy.
Except, maybe it won't. What if this was a long program, and you
forgot about that particular line of code which exits the loop?
Suppose that $. somehow went from 9 to 11, or was
assigned to? It would never reach 10. So, to be safe
open TRIN, "tree c:\\ /a |" or die "Can't see the tree :$!";
while (<TRIN>) {
printf "%3s $_", $.;
last if $. >= 10;
}
exit your loops in a paranoid manner, unless you really
mean only to exit when at line ten. For maximum safety, maybe you
should create your own counter variable because $. is a global
variable. I'm not necessarily advocating doing any of the above,
but I am suggesting these things be considered.
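The "own counter" suggestion might look like the sketch below. Several things here are assumptions of mine rather than the tutorial's method: the child process is perl itself printing the numbers 1 to 100 (so the example does not depend on tree), the filehandle is lexical, and the list form of a pipe open is used, which may not be available on older Win32 builds of Perl.

```perl
use strict;
use warnings;

# '-|' means "read from this command"; the command is given as a list.
open(my $pipe, '-|', $^X, '-le', 'print for 1..100')
    or die "Can't start the process: $!";

my $count = 0;
my @seen;
while (my $line = <$pipe>) {
    push @seen, $line;
    last if ++$count >= 10;     # our own counter, not the global $.
}
close $pipe;                    # may report failure if the child was cut short

printf "%3s %s", $_ + 1, $seen[$_] for 0 .. $#seen;
```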
You might notice the presence of a new keyword - printf. It works
like print, but formats the string before printing. The formatting
is controlled by such parameters as %3s, which means "pad out to a
minimum width of three characters". After the doublequoted string
comes whatever you want to be printed in the format specified. Some
examples follow. Just uncomment each line in turn to see what it
does. There is a lot of new stuff below, but try and work out what
is happening. An explanation follows after the code.
$windir=$ENV{'WINDIR'}; # yes, you can access the environment variables !
$x=0;
opendir WDIR, "$windir" or die "Can't open $windir !!! Panic : $!";
while ($file= readdir WDIR) {
next if $file=~/^\./; # try commenting this line to see why it is there
$age= -M "$windir/$file"; # -M returns the age in days
$age=~s/(\d*\.\d{3}).*/$1/; # hmmmmm
#### %4.4d - must take up 4 columns, and pad with 0s to make up space
#### and minimum width is also 4
#### %10s - must take up 10 columns, pad with spaces
# printf "%4.4d %10s %45s \n", $x, $age, $file;
#### %-10s - left justify
# printf "%4.4d %-10s %-45s \n", $x, $age, $file;
#### %10.3d - use 10 columns, pad with 0s if less than 3 digits used
# printf "%4.4d %10.3d %45s \n", $x, $age, $file;
$x++;
last if $x==15; # we don't want to go through all the files :-)
}
There are some intentionally new functions there. When you
start hacking Perl (actually, you already started
if you have worked through this far) you'll see a
lot of example code. Try and understand the above, then read
the explanation below.
Firstly, all environment variables can be accessed and set via
Perl. They are in the %ENV hash.
If you aren't sure what environment variables are, refer to your
friendly Microsoft documentation or books. The best known
environment variable is path, and you can see its value
and that of all other environment variables by simply typing set
at your command prompt.
The regex /^\./ bounces out invalid entries before we bother to do
any processing on them. Good programming practice. What it matches
is "anything that begins with '.'". The caret anchors the match to
the beginning of the string, and as . is a metacharacter it has to
be escaped.
Perl has several tests to apply on files. The -M test returns the
age in days. See the documentation for similar tests. Note that
the calls to readdir return just the file name, not the complete
pathname. As you were careful to use a variable for the directory
to be opened rather than hardcoding it (horrors) it is no trouble
to glue it together by using doublequotes.
Try commenting out $age=~s/(\d*\.\d{3}).*/$1/ and note the size of
$age. It could do with a trim. Just for regex practice, we make it
a little smaller. What the regex does is:
- start capturing with (
- look for 0 or more digits \d*
- then a . (escaped)
- followed by three digits \d{3}
- and that's all we want to capture so the parens are closed )
- Finally, everything else in the string is matched .* where . is
any character (almost) and * is 0 or more. This is pretty much
guaranteed to match to the end of the line
- Having matched the entire string (and put part of it into $1
by using parens) we simply replace the string with what we have
captured.
Easy !
Mention should also be made of sprintf, which is exactly like
printf except it doesn't print. You just use it to format
strings, which you can do something with later. For example :
open TRIN, "tree c:\\ /a |" or die "Can't see the tree :$!";
while (<TRIN>) {
$line= sprintf "%3s $_", $.;
print $line;
last if $. == 10;
}
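Because sprintf returns the formatted string, it is a handy way to check what each format in the printf examples actually produces. A small sketch follows; the square brackets are only there to make the padding visible, and %.3f is a format not used elsewhere in this section.

```perl
use strict;
use warnings;

my @formatted = (
    sprintf("[%3s]",    7),         # [  7]         right justified, width 3
    sprintf("[%-10s]",  "abc"),     # [abc       ]  left justified, width 10
    sprintf("[%4.4d]",  42),        # [0042]        at least 4 digits
    sprintf("[%10.3d]", 7),         # [       007]  3 digits, in 10 columns
    sprintf("[%.3f]",   3.14159),   # [3.142]       rounded to 3 decimal places
);
print "$_\n" for @formatted;
```

The last format is another way to trim a number such as $age to three decimal places, although it rounds where the regex simply cuts.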
Quote execute
@opts=qw(w on ad oe b);
for (@opts) {
$result=qx(dir /$_);
print "dir /$_ resulted in:\n$result",'-' x 79;
sleep 1;
}
Anything within qx( ) is executed, and duly variable interpolated.
This sample also demonstrates qw, which is 'quote words', so the
elements of @opts are delimited by whitespace, not the usual commas.
You can also use for instead of foreach if you want to save typing
four characters at the expense of legibility.
You may have noticed that system outputs the result of the command
to the screen whereas qx does not. Each to its own.
Oneliners
A short example
You'll have noticed Perl packs a lot of power into a small amount
of code. You can feed Perl code directly on the command line. This
is known as a oneliner, for obvious reasons. An example:
perl -e"for (55..75) { print chr($_) }"
The -e switch tells Perl that a command is following. The command
must be enclosed in doublequotes, not singles as on Unix. The
command itself in this case simply prints the characters whose
ASCII codes are 55 to 75 inclusive.
File access
This is a simple find routine. As it uses a regex, it is
infinitely superior to NT's findstr:
perl -e"while (<>) {print if /^[bv]/i}" shop.txt
Remember, the while (<>) construct will open whatever is in @ARGV.
In this case, we have supplied shop.txt so it is opened
and we print lines that begin with either 'b' or 'v'.
That can be made shorter. Run perl -h and you'll see a whole list
of switches. The one we'll use now is -n, which puts
a while (<>) { } loop around whatever code you supply with -e. So:
perl -ne"print if /^[bv]/i" shop.txt
which does exactly the same as the previous program, but uses the
-n switch to put a while (<>) loop around whatever other commands
are supplied.
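Written out as a script, the -n oneliner amounts to roughly the loop below. The contents of shop.txt are not shown in this section, so the sketch writes a few made-up lines to a temporary file and puts that file's name in @ARGV, which is what happens when you name a file on the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up sample data standing in for shop.txt.
my ($fh, $name) = tempfile();
print $fh "Bread\nmilk\nvinegar\nApples\n";
close $fh;

@ARGV = ($name);        # as if the file had been named on the command line

my @found;
while (<>) {            # this is the loop that -n supplies
    push @found, $_ if /^[bv]/i;
}
print @found;

unlink $name;
```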
A slightly more sophisticated version:
perl -ne"printf \"$ARGV : %3s : $_\",$. if /^[bv]/i" shop.txt
which demonstrates that doublequotes must be escaped.
Modifying files with a oneliner and $^I
If you don't remember $^I then please review the section on Files
before proceeding. When you're ready, copy shop.txt to shop2.txt.
perl -i.bk -ne"printf \"%4s : $_\",$." shop2.txt
The -i switch primes the inplace edit operator. We still need -n.
If you had a typical quoted email message such as:
>> this is what was said
>> blah blah
> blaaaaahhh
The new text
and you wanted to remove the >, then:
perl -i.bk -pe"s/^>+ ?//" email.txt
does the trick. Regex recap -- the caret anchors what follows to
the beginning of the string, the + means one or more (no, we do not
use * which means 0 or more), then we will match one literal space,
but it is not necessary for the space to be there for the match to
be successful, hence ?.
What is new in terms of oneliners is the use of -p, which does
exactly the same thing as -n except that it adds a print statement
too. In case you were wondering why the previous example used -n
and this one uses -p -- the previous example prints data with
printf, whereas this example doesn't have an explicit print
statement so we provide one with -p.
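To check what the substitution does without editing a file in place, the same regex can be applied to the sample message held in an array. This is a sketch only; the array and the explicit loop stand in for the read-and-print loop that -p supplies.

```perl
use strict;
use warnings;

my @email = (
    ">> this is what was said\n",
    ">> blah blah\n",
    "> blaaaaahhh\n",
    "The new text\n",
);

my @cleaned;
foreach (@email) {
    my $line = $_;          # copy, so the sample message itself is not changed
    $line =~ s/^>+ ?//;     # the substitution from the oneliner
    push @cleaned, $line;
}
print @cleaned;             # -p would print each line as it goes
```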
Some other useful oneliners -- a calculator and an ASCII number
lookup:
perl -e"print 50/200+2"
perl -e"for (50..90) { print chr($_) }"
There are plenty more oneliners, and they are an essential part
of any sysadmin's toolbox. The two examples below are functionally
equivalent but the lower one is perhaps a little more readable:
perl -e"for $i (50..90) { print chr($i),\" is $i\n\" }"
perl -e"for $i (50..90) { print chr($i),qq| is $i\n| }
Whatever follows qq is used as a delimiter, instead of having to
escape the doublequotes with a backslash. I learnt this from the
Perl-Win32-Users mailing list (see top) - I think it was Lennart
Borgman who pointed it out. He also mentioned that you don't need
the closing doublequote. Saves a little typing.
Subroutines and Parameters
In Perl, subroutines are functions are subroutines. If you like,
a subroutine is a user defined function. It's a bit like calling a
script a program, or a program a script. For the purposes of this
tutorial we'll refer to functions as subroutines, except when we
call them functions. Hope that's made the point.
For the purposes of this section we will develop a small program
which, by the end, will demonstrate how subroutines work. It also
serves to demonstrate how many programs are built, namely a little
at a time, in manageable sections. At least, that method works for
me.
The chosen theme is gliding. That's aeroplanes without engines. A
subject close to every glider pilot's heart is how far they can fly
from the altitude they are at. Our program will calculate this. To
make it easy we'll assume the air is perfectly calm. Wind would be a
complication we don't need, especially when in a crowded lift.
What we need in order to calculate the distance we can fly is:
- How high we are (in metres for now; we'll switch to feet later)
- How many metres we travel forward for every metre we drop.
This is the glide ratio, for example 24:1 would mean travelling
24 metres forward for every 1 metre of height lost.
Obviously input is needed. We can either prompt the user or grab
the input from the command line. The latter is easier so we'll just
look at @ARGV for the command line parameters. Like so:
($height,$angle)=@ARGV; # @ARGV is the command line parameters
$distance=$height*$angle; # an easy calculation
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
The above should be executed thus:
perl yourscript.pl 5000 24
or whatever your script is called, with whatever parameters you
choose to use. I'm a poet and I don't even know it.
That works. What about a slight variation? The pilot does have
some control over the glide ratio, for example he can fly faster but
at a penalty of a lesser glide ratio. So we should perhaps give a
couple of options either side of the given parameters:
($height,$angle)=@ARGV;
$distance=$height*$angle;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
$angle++; # add 1 to $angle
$distance=$height*$angle;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
$angle-=2; # subtract 2 from $angle so it is 1 less than the original
$distance=$height*$angle;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
That's cumbersome code. We repeat exactly the same statement.
This wastes space, and if we want to change it there are three
changes to be made. A better option is to put it into a subroutine:
($height,$angle)=@ARGV;
&howfar;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
$angle++;
&howfar;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
$angle-=2;
&howfar;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
sub howfar { # sub subroutinename
    $distance=$height*$angle;
}
This is a basic subroutine, and you could stop here and have
learnt a very useful technique for programming. Now, when changes
are made they are made in one place. Less work, fewer chances of
errors. Improvements can always be made. For example, pilots outside
Eastern Europe generally measure height in feet, and glider pilots
are usually concerned with how many kilometres they travel over the
ground. So we can adapt our program to accept height in feet, as a
first step towards outputting the distance in kilometres:
($height,$angle)=@ARGV;
$height/=3.2; # divide feet by 3.2 to get metres
&howfar;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
$angle++;
&howfar;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
$angle-=2;
&howfar;
print "With a glide ratio of $angle:1 you can fly $distance from $height\n";
sub howfar {
    $distance=$height*$angle;
}
When you run this you'll probably get a result which involves a
fair few digits after the decimal point. This is messy, and we can
fix it with the int function, which in Perl and most other languages
returns a number as an integer, ie without those irritating digits
after the decimal point.
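A quick sketch of what int actually does to a number, using the same 5000 feet and 3.2 factor as the example. Note that int chops the fraction off rather than rounding. The sprintf line is a different technique, shown only for contrast, and is not what the tutorial's program uses.

```perl
use strict;
use warnings;

my $metres  = 5000 / 3.2;                 # roughly 1562.5
my $whole   = int($metres);               # 1562, the fraction is thrown away
my $neg     = int(-3.7);                  # -3, int chops towards zero
my $rounded = sprintf("%.0f", 1562.7);    # 1563, sprintf rounds instead

print "int gives $whole and $neg, sprintf gives $rounded\n";
```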
You might have also noticed a small bit of Bad Programming
Practice slipped into the last example. It was the evil Constant,
the '3.2' used to convert feet to metres. Why, I don't hear you ask,
is this bad? Surely the conversion will never change?
It won't change, but our use of it might. We may decide that it
should be the more accurate 3.28 instead of 3.2. We may decide to
convert from feet to nautical miles instead. You don't know what
could happen. Therefore, code with flexibility in mind and that means
avoiding hard-coded constants.
The new improved version, with int added and the constant removed:
($height,$ratio)=@ARGV;
$cnv1=3.2; # now it is a variable. Could easily be a cmd line
# parameter too. We have the flexibility.
$height =int($height/$cnv1); # divide feet by 3.2 to get metres
&howfar;
print "With a glide ratio of $ratio:1 you can fly $distance from $height\n";
$ratio++;
&howfar;
print "With a glide ratio of $ratio:1 you can fly $distance from $height\n";
$ratio-=2;
&howfar;
print "With a glide ratio of $ratio:1 you can fly $distance from $height\n";
sub howfar {
    $distance=int($height*$ratio);
}
We could of course build the print statement into the subroutine,
but I usually separate output presentation from the calculation.
Again, that means it is easier to modify later on.
Something else we can improve about this code is the use of the
$ratio variable. We are having to keep track of what we do to
it -- first add one, then subtract two in order to subtract one from
the original input. In this case it is fairly easy, but with a
complex program it can be difficult, and you don't want to be
creating lots of variables just to track one input, for example
$ratio1, $ratio2 etc.
Parameters
One solution is to pass the subroutine parameters. In the best
tradition of American columnists, who seem to have a particular
affection for this phrase, 'Here's how:'
($height,$ratio)=@ARGV;
$cnv1=3.2;
&howfar($height,$ratio);
print "With a glide ratio of $ratio:1 you can fly $distance from $height\n";
&howfar($height,$ratio+1);
print "With a glide ratio of ",$ratio+1,":1 you can fly $distance from $height\n";
&howfar($height,$ratio-1);
print "With a glide ratio of ",$ratio-1,":1 you can fly $distance from $height\n";
sub howfar {
    print "The parameters passed to this subroutine are @_\n";
    ($ht,$rt)=@_;
    $ht =int($ht/$cnv1);
    $distance=int($ht*$rt);
}
Quite a few things have changed here. Firstly, the subroutine is
being called with parameters. These are a comma-delimited list in
parens after the subroutine call. The two parameters are $height
and $ratio.
The parameters end up in the subroutine as the @_ array. Being an
array, they are in the same order as passed. All the usual array
operations work. All we will do is assign the contents of the array
to two variables.
We have also moved the conversion calculation into the subroutine,
because we want to put all the code for generating the distance into
one place.
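Since @_ is an ordinary array, you can poke at it like any other. A small sketch, where the subroutine name describe and the three numbers are made up for illustration (it also uses my and return, both of which are explained further on):

```perl
use strict;
use warnings;

sub describe {
    my $count = scalar(@_);    # how many parameters arrived
    my $first = $_[0];         # index into @_ like any other array
    my $last  = $_[$#_];       # $#_ is the index of the last element
    return "$count parameters, first is $first, last is $last";
}

my $report = describe(5000, 24, 60);
print "$report\n";
```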
Namespaces
We cannot use the variable names $height and $ratio because we
modify them in the subroutine and that will affect the main program.
So we choose new ones to do the operation on. Finally, a small change
is made to the print output.
This approach works well enough for our small program here. For
larger programs, having to think of new variable names all the time
is difficult. It would be even more difficult if different
programmers were working on different sections of the program. It
would be impossible if a program were written, then an extension
created by another person somewhere else, and that same extension
had to be used by many people in many different programs. Obviously,
the risk of using the same variable name is too great. There are
only so many logical names out there.
There is a solution. Imagine you own a house with two gardens.
You have two identical dogs, one in the front garden, one in the
back garden. Bear with me, this is relevant. Both dogs are called
Rover, because their owner lacks imagination.
When you go to the front garden and call 'Rover!!!' or open a can
of dog food, the dog in the front garden comes running. Similarly,
you go to the back garden, call your dog and the other dog bounces
up to you.
You have two dogs, both called Rover, and you can change either
one of them. Wash one, neuter the other -- it doesn't matter, but
both are dogs and both have the same name. Changes to one won't
affect the other. You don't get them confused because they are in
different places, in two different namespaces.
Variable Scope
To bring things back to Perl, a short diversion is necessary to
illustrate the point with actual Perl code instead of canine
metaphors:
$name='Rover';
$pet ='dog';
$age =3;
print "$name the $pet is aged $age\n";
{
my $age =4; # run this again, but comment this line out
my $name='Spot'; # and this one
$pet ='cat';
print "$name the $pet is aged $age\n";
}
print "$name the $pet is aged $age\n";
This is pretty straightforward until we get to the { . This marks
the start of a block. One feature of a block is that it can have its
own namespace. Variables declared, in other words initialised,
within that block are just normal variables, unless they are
declared with my.
When variables are declared with my they are visible inside the
block only. Also, any variable which has the same name outside the
block is ignored. Points to note from the example above:
- The two my variables appear to overwrite the variables of the
same name from outside the block.
- The two original variables aren't really overwritten because,
as we prove after the block has ended, they haven't been
touched.
- The variable $pet is accessible inside and outside the block as
usual. Of course, if we declare it with my then things will change.
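The same behaviour holds for arrays and hashes, not just scalars. A short sketch with made-up pet names. Here the outer variables are declared with my as well, which strict requires, but the effect inside the block is the same.

```perl
use strict;
use warnings;

my @pets = ('Rover', 'Spot');
my %age  = (Rover => 3);
my $inside;

{
    my @pets = ('Tiddles');       # a new, private @pets
    my %age  = (Tiddles => 7);    # and a new, private %age
    $inside = "@pets: " . join(',', keys %age);
    print "Inside the block:  @pets\n";
}

print "Outside the block: @pets, Rover is $age{Rover}\n";
```

Once the block ends, the private @pets and %age are gone and the originals are exactly as they were.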
my Variables
So there we have it. Namespaces. They work for all the other
types of variable too, like arrays and hashes. This is how you can
write code and not care about what other people use for variable
names -- you just declare everything with my and have your own
private party. Our original program about gliding can be improved
now:
($height,$ratio)=@ARGV;
$cnv1=3.2;
&howfar($height,$ratio);
print "With a glide ratio of $ratio:1 you can fly $distance from $height\n";
&howfar($height,$ratio+1);
print "With a glide ratio of ",$ratio+1,":1 you can fly $distance from $height\n";
&howfar($height,$ratio-1);
print "With a glide ratio of ",$ratio-1,":1 you can fly $distance from $height\n";
sub howfar {
    my ($height,$ratio)=@_;
    $height =int($height/$cnv1);
    $distance=int($height*$ratio);
}
The only change is that the parameters to the subroutine, ie the
contents of the array @_, are assigned to variables declared with
my. This means they are now only visible within that block. The
block happens to also be a subroutine. Outside of the block, the
original variables are still accessible. At this point I'll
introduce the technical term, which is lexical scoping. That means
the variable is confined to the block -- it is only visible within
the block.
We still have to be concerned with what variables we use inside
the subroutine. The variable $distance is created in the subroutine
and used outside of it. With larger programs this will cause exactly
the same problem as before -- you have to be careful that the
subroutine variables you use are the same ones as outside the
subroutine. For all the same reasons as before, like two different
people working on the code and use of custom extensions to Perl,
that can be difficult.
The obvious solution is to declare $distance with my, and thus
lexically scope it. If we do this, then how do we get the result of
the subroutine? Like so:
($height,$ratio)=@ARGV;
$cnv1=3.2;
$distance=&howfar($height,$ratio); # run this again and delete '$distance='
print "With a glide ratio of $ratio:1 you can fly $distance from $height\n";
$distance=&howfar($height,$ratio+1);
print "With a glide ratio of ",$ratio+1,":1 you can fly $distance from $height\n";
$distance=&howfar($height,$ratio-1);
print "With a glide ratio of ",$ratio-1,":1 you can fly $distance from $height\n";
sub howfar {
    my ($height,$ratio)=@_;
    my $distance;
    $height =int($height/$cnv1);
    $distance=int($height*$ratio/1000); # output result in kilometres not metres
}
First change -- $distance is declared with my. Secondly, the
result of the subroutine is assigned to a variable, which is also
named $distance. However, it is a $distance in a different
namespace. Remember the two gardens. You may wish to delete the
$distance= from the first assignment and re-run the code. The only
other change is one to change the output from metres to kilometres.
We have now achieved a sort of Black Box effect, where the
subroutine is given input and creates output. We pass the subroutine
two numbers, which may or may not be variables. We assign the output
of the subroutine to a variable. We care not what goes on inside
the subroutine, what variables it uses or what magic it performs.
This is how subroutines should operate. The only exception is
the variable $cnv1. This is declared in the main body of the program
but also used in the subroutine. This has been done in case we need
to use the variable elsewhere. In larger programs it would be a good
idea to pass it to subroutines along with the other parameters too.
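Passing the conversion factor in might look something like the sketch below. The parameter name $feet_per_metre is made up for illustration. The rest follows the program above, with fixed sample values in place of @ARGV so it runs on its own.

```perl
use strict;
use warnings;

sub howfar {
    my ($height, $ratio, $feet_per_metre) = @_;
    my $distance;
    $height   = int($height / $feet_per_metre);    # feet to metres
    $distance = int($height * $ratio / 1000);      # metres to kilometres
    # the last value calculated is what the caller receives
}

my $cnv1     = 3.2;    # same rough factor as used in the tutorial
my $distance = howfar(5000, 24, $cnv1);
print "With a glide ratio of 24:1 you can fly $distance from 5000\n";
```

Now the subroutine depends on nothing but its parameters, so it can be lifted into another program without dragging $cnv1 along.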
Multiple Returns
That's all the major learning out of the way. The next step is
relatively easy, but we need to add new functionality to the program
in order to demonstrate it. What we will do is work out how long it
will take the glider pilot to fly the distance. For this
calculation, we need to know his airspeed. That can be a third
parameter. The actual calculation will be part of howfar.
An easy change:
($height,$ratio,$airspeed)=@ARGV;
$cnv1=3.2;
$cnv2=1.8;
($distance,$time)=&howfar($height,$ratio,$airspeed);
print "Glide ratio $ratio:1, $distance from $height taking $time\n";
($distance,$time)=&howfar($height,$ratio+1,$airspeed);
print "Glide ratio ",$ratio+1,":1, $distance from $height taking $time\n";
($distance,$time)=&howfar($height,$ratio-1,$airspeed);
print "Glide ratio ",$ratio-1,":1, $distance from $height taking $time\n";
sub howfar {
    my ($height,$ratio,$airspeed)=@_;
    my ($distance,$time); # how to 'my' multiple variables
    $airspeed*=$cnv2; # convert knots to kmph
    $height =int($height/$cnv1);
    $distance=int($height*$ratio/1000);
    $time =int($distance/($airspeed/60)); # simple time conversion
    # print "Time:$time, Distance:$distance\n"; # uncomment this later
}
This doesn't work correctly. First, the changes. The result from
howfar is now assigned to two variables. Subroutines return a list,
and so assigning to some scalar variables between parens separated
by commas will work. This is exactly the same as reading the command
line arguments from @ARGV.
We are also passing a new parameter, $airspeed. There is another
conversion and a one-line calculation to provide the number of
minutes it will take to fly $distance.
If you look carefully, you can perhaps work out what the problem
is. There was a clue in the Regex section, when /e was explained.
The problem is that Perl returns the result of the last
expression evaluated. In this case, the last expression is the
one calculating $time, so the value of $time is returned, and it is
the only value returned. Therefore, the value of $time is assigned
to $distance in the main program, so the real distance never gets
there and the main program's $time doesn't get a value at all.
Re-run the program but this time uncomment the line in the
subroutine which prints $distance and $time. You'll notice the value
returned is now 1, which means that the print was successful. Perl
is faithfully returning the value of the last expression evaluated.
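Stripped of the gliding arithmetic, the behaviour looks like the sketch below. The two subroutine names and the numbers are made up for illustration. In both cases only one value comes back, so the second variable on the left is left undefined.

```perl
use strict;
use warnings;

sub last_is_assignment {
    my $distance = 37;
    my $time     = 20;       # last expression evaluated, so 20 comes back
}

sub last_is_print {
    my $time = 20;
    print "Time:$time\n";    # print returns 1 on success, so 1 comes back
}

my ($d1, $t1) = last_is_assignment();
my ($d2, $t2) = last_is_print();

print "First call gave $d1 and ",  (defined $t1 ? $t1 : 'nothing'), "\n";
print "Second call gave $d2 and ", (defined $t2 ? $t2 : 'nothing'), "\n";
```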
This is all well and good, but not what we need. What is required
is a method of telling Perl what needs to be returned, rather than
what Perl thinks would be a good idea:
($height,$ratio,$airspeed)=@ARGV;
$cnv1=3.2;
$cnv2=1.8;
($distance,$time)=&howfar($height,$ratio,$airspeed);
print "Glide ratio $ratio:1, $distance from $height taking $time\n";
($distance,$time)=&howfar($height,$ratio+1,$airspeed);
print "Glide ratio ",$ratio+1,":1, $distance from $height taking $time\n";
($distance,$time)=&howfar($height,$ratio-1,$airspeed);
print "Glide ratio ",$ratio-1,":1, $distance from $height taking $time\n";
sub howfar {
    my ($height,$ratio,$airspeed)=@_;
    my ($distance,$time); # how to lexically scope multiple variables
    $airspeed*=$cnv2; # convert knots to kmph
    $height =int($height/$cnv1);
    $distance=int($height*$ratio/1000); # output result in kilometres not metres
    $time =int($distance/($airspeed/60)); # simple time conversion
    return ($distance,$time); # explicit return
}
A simple fix. Now, we tell Perl what to return, with the aptly
named return function. With this function we have complete control
over what is returned and when. It is quite usual to use if
statements to control different return values, but we won't bother
with that here.
There is a subtle flaw in the program above. It is not backwards
compatible with the old method of calling the subroutine. Run this:
($height,$ratio,$airspeed)=@ARGV;
$cnv1=3.2;
$cnv2=1.8;
($distance,$time)=&howfar($height,$ratio,$airspeed);
print "Glide ratio $ratio:1, $distance from $height taking $time\n";
($distance,$time)=&howfar($height,$ratio+1,$airspeed);
print "Glide ratio ",$ratio+1,":1, $distance from $height taking $time\n";
$distance=&howfar($height,$ratio-1); # old way of calling it
print "With a glide ratio of ",$ratio-1,":1 you can fly $distance from $height\n";
sub howfar {
    my ($height,$ratio,$airspeed)=@_;
    my ($distance,$time);
    $airspeed*=$cnv2;
    $height =int($height/$cnv1);
    $distance=int($height*$ratio/1000);
    $time =int($distance/($airspeed/60));
    return ($distance,$time);
}
A division by 0 results the third time around. This is because
$airspeed isn't passed, so it will effectively be 0. Making your
subroutines backwards compatible is important in large programs, or
if you are writing an add-in module for other people to use. You
can't expect everyone to retrofit additional parameters to their
subroutine calls just because you decided to be a bit creative one
day.
There are many ways to fix the problem, and this is just one:
($height,$ratio,$airspeed)=@ARGV;
$cnv1=3.2;
$cnv2=1.8;
($distance,$time)=&howfar($height,$ratio,$airspeed);
print "Glide ratio $ratio:1, $distance from $height taking $time\n";
($distance,$time)=&howfar($height,$ratio+1,$airspeed);
print "Glide ratio ",$ratio+1,":1, $distance from $height taking $time\n";
$distance=&howfar($height,$ratio-1);
print "With a glide ratio of ",$ratio-1,":1 you can fly $distance from $height\n";
print "Direct print: ",join ",",&howfar(5000,55,60)," not bad for no engine!\n";
sub howfar {
    my ($height,$ratio,$airspeed)=@_;
    my ($distance,$time); # how to 'my' multiple variables
    $airspeed*=$cnv2; # convert knots to kmph
    $height =int($height/$cnv1);
    $distance=int($height*$ratio/1000); # output result in kilometres not metres
    if ($airspeed > 0) {
        $time =int($distance/($airspeed/60));
        return ($distance,$time);
    } else {
        return $distance;
    }
}
Here we just test the $airspeed to ensure we won't be doing any
divisions by 0. It also affects what we return. There is also a new
print statement, which shows that you don't need to assign to
intermediate variables, or even pass variables as parameters.
Constants, evil things that they are, work just as well. I already
mentioned this, but a demonstration doesn't hurt. Unless you work
for an electric chair manufacturer.
The astute reader.....:-) Every time I read that I wonder what
I've missed. Usually something obscure which the author knows nobody
will ever notice, but likes to belittle the reader. No exception
here! Anyway, you may be wondering why this would not have sufficed
instead of the if statement:
sub howfar {
    my ($height,$ratio,$airspeed)=@_;
    my ($distance,$time); # how to 'my' multiple variables
    $airspeed*=$cnv2; # convert knots to kmph
    $height =int($height/$cnv1);
    $distance=int($height*$ratio/1000); # output result in kilometres not metres
    $time =int($distance/($airspeed/60)) if $airspeed > 0;
    return ($distance,$time);
}
After all, the first item returned is $distance, so therefore it
should be the first one assigned via:
$distance=&howfar($height,$ratio-1);
and $time should just disappear into the bit bucket.
The answer lies with scalars and lists. We are returning a list,
but assigning it to a scalar. What happens when you do that? The
scalar takes on the last value of the list. The last value of
the list being returned is of course $time, which has been declared
but not otherwise touched. Therefore, it is nothing and appears as
such on the printed statement. A small program to demonstrate that
point:
$word=&wordfunc("Greetings");
print "The word is $word\n";
(@words)=&wordfunc("Bonjour");
print "The words are @words\n";
sub wordfunc {
    my $word=shift; # when in a subroutine, shifts @_ if no target specified
    my @words; # how to my an array
    @words=split //,$word; # splits on the nothings between each letter
    ($first,$last)=($words[0],$words[$#words]); # see section on Arrays if required
    return ($first,$last); # Returns just the first and last
}
As you can see, the first call prints the letter 's', which is
the last element of the list that is returned. You could of course
use a list consisting of just one element:
($word)=&wordfunc("Greetings");
Now we are assigning a list to a list, so perl starts at the
first element and keeps assigning till it runs out of elements. The
parens turn a lonely scalar into an element of a list. You might
consider always assigning the results of subroutines this way, as
you never know when the subroutine might change. I know I've just
evangelised about how subroutines shouldn't change, but if you take
care and the subroutine writer takes care, there definitely won't be
any problems!
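If you are the one writing the subroutine, another option (not used elsewhere in this tutorial) is Perl's built-in wantarray function, which tells a subroutine whether its caller is expecting a list or a single value. A minimal sketch, reworking wordfunc with lexical variables:

```perl
use strict;
use warnings;

sub wordfunc {
    my $word  = shift;
    my @chars = split //, $word;
    my ($first, $last) = ($chars[0], $chars[-1]);
    # wantarray is true when the caller is assigning to a list
    return wantarray ? ($first, $last) : $first;
}

my $one     = wordfunc("Greetings");    # scalar on the left, gets 'G'
my ($f, $l) = wordfunc("Bonjour");      # list on the left, gets 'B' and 'r'

print "Scalar call: $one\n";
print "List call:   $f and $l\n";
```

Whether that is friendlier or just more surprising is a matter of taste. Putting the parens on the left is the safer habit when you don't control the subroutine.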
That's about it for good old my. There is a lot more to learn
about it but that's enough to get started. You now know a little
about variable visibility, and I don't mean changeable weather.
Local
There is one more function that I'd like to draw to your
attention, and we'll launch straight into the demonstration:
@words=@ARGV;
print "Output Field Separator is :$,:\n";
print '1. Words:', @words, "\n";
&change;
$,='_';
print "\nOutput Field Separator is :$,:\n";
print '2. Words:', @words, "\n";
&change;
sub change {
print ' Words:', @words, "\n";
}
which should be executed something like this:
perl test.pl sarcasm is the lowest form of wit
The special variable $, defines what Perl should print between
the items of a list it is given. By default, it is nothing. So the
first two prints should have no spaces between the words. Then we
assign '_' to $, so the next prints have underscores between the
words.
If we want to use a different value for $, in the change
subroutine, and not disturb the main value, we have a little
problem. This problem cannot be solved by my because global
variables like $, cannot at this time be lexically scoped. So, we
could manually do it:
@words=@ARGV;
print "Output Field Separator is :$,:\n";
print '1. Words:', @words, "\n";
&change;
$,="_";
print "\nOutput Field Separator is :$,:\n";
print '2. Words:', @words, "\n";
&change;
sub change {
$save=$,;
$,='*';
print ' Words:', @words, "\n";
$,=$save;
}
That works, but it is messy. Perl has a special function for
occasions of this nature, called local. An example of local in
action:
@words=@ARGV;
print "Output Field Separator is :$,:\n";
print '1. Words:', @words, "\n";
&change;
$,="_";
print "\nOutput Field Separator is :$,:\n";
print '2. Words:', @words, "\n";
&change;
sub change {
local $,="!-!";
print ' Words:', @words, "\n";
}
You can try it with my instead but it won't work. I'm sure you'll
try it anyway, I know you learn things the hard way otherwise you
a) wouldn't be programming computers and b) wouldn't be using this
tutorial to do it.
The local function works in a similar way to my, but assigns
temporary values to global variables. The my function creates new
variables that have the same name. The distinction is important, but
the reasons require perl proficiency beyond the scope of this humble
tutorial. In practice, the difference is:
- lexically scoped variables (those declared with my) are faster
than non-lexically scoped variables.
- local variables are visible to called subroutines.
- my doesn't work on global variables like $, so you must use
local.
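The second point in that list is the one that bites in practice, so here is a sketch of it. The variable and subroutine names are made up, and the our keyword (not covered in this tutorial) is only there because strict insists that global variables are declared.

```perl
use strict;
use warnings;

our $sep = '-';        # a global variable
my  $lex = 'outer';    # a lexically scoped variable

sub show { return "sep=$sep lex=$lex" }

sub with_local {
    local $sep = '*';      # temporary value, seen by anything we call
    return show();
}

sub with_my {
    my $lex = 'inner';     # brand new variable, invisible to show()
    return show();
}

my $via_local = with_local();
my $via_my    = with_my();
my $afterwards = show();

print "$via_local\n$via_my\n$afterwards\n";
```

The local value reaches into show and then vanishes again once with_local has finished. The my variable never leaves with_my at all.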
Returning arrays
So that's the end of subroutines and parameters. Would you
believe I have only scratched the surface? There are closures,
prototypes, autoloading and references to learn. Not, however, in
this tutorial. At least not yet. I'll finish with one last
demonstration. You may have noticed that Perl returns one long list
from subroutines. This is fine, but suppose you want two separate
lists, for example two arrays? This is one way to do it:
($w1,$w2)=&wordfunc("Hello World"); # Assign the array references to scalars
print "@$w1 and @$w2\n"; # dereference, ie access, the arrays referred to
#print "$w1 and $w2\n"; # uncomment this next time round
sub wordfunc {
    my $phrase=shift;
    my (@words,@word1,@word2); # declare three variables lexically
    @words=split /\s+/,$phrase; # split the phrase on whitespace
    @word1=split //,$words[0]; # create array of letters from the first word
    @word2=split //,$words[1]; # and the second
    return (\@word1,\@word2); # return references to the two arrays -- scalars
}
There is a lot going on there. It should be clear up until the
return statement. As we know, Perl only returns a single list. So,
we make Perl return a list of the arrays it has just created. Not
the actual arrays themselves, but references to the arrays. A bit
like a shopping list is just a bit of paper, not the actual goods
themselves. The reference is created by use of the \ backslash.
Having returned two array references they are assigned to scalar
variables. If you uncomment the second print line you'll see two
references to arrays.
The next problem is how to dereference the references, or access
the arrays. The construct @$xxx does that for us. I know I said I
wouldn't cover references, and I haven't -- that is just a useful
trick.
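For the curious, a few more ways of getting at an array through its reference, sketched with a made-up array of letters. The arrow form on the fourth line is the one you will see most often in other people's code.

```perl
use strict;
use warnings;

my @letters = ('H', 'e', 'l', 'l', 'o');
my $ref     = \@letters;       # a reference to the array

my @copy  = @$ref;             # the whole array
my $first = $$ref[0];          # one element
my $also  = $ref->[0];         # the same element, arrow notation
my $count = scalar(@$ref);     # how many elements

print "@copy has $count letters, starting with $first\n";
```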
This little section is not designed as a complete guide, it is
just a taster of things to come. Perl is immensely powerful. If you
think something can't be done, the problem is likely to be it is
beyond your ability, not that of Perl.
Modules
An introduction
Subroutines are oft-used pieces of code. They exist so you can
re-use the code and not have to constantly rewrite it.
A module is, in principle, similar to a subroutine. It is also an
oft-used piece of code. The difference is that modules don't live in
your program, they are their own separate script outside your code.
For example, you might write a routine to send email. You could then
use this code in ten, a hundred, a thousand different programs just
by referencing the original program.
As you would expect, the basic Perl package includes a large
number of modules. These have been written by people who had a need
for the code, made it a module and released it into the big wide
world. Many of these modules have been debugged, improved and
documented by yet more people. To paraphrase the Open Source mantra,
given enough eyeballs, all bugs are shallow.
Aside from the many modules included with Perl there are hundreds
more available on CPAN, the Comprehensive Perl Archive Network.
Refer to your documentation for details.
File::Find -- using a module
An example of a module included with Perl is File::Find. There
are several modules under the File:: section, such as
File::Basename, File::Compare and File::stat. This is an example of
how File::Find can be used:
use File::Find;
$dir1='/some/dir/with/lots/of/files';
$dir2='/another/directory/';
find(\&wanted, $dir1,$dir2);
sub wanted {
print "Found it $File::Find::dir/$_\n" if /^[a-d]/i;
}
The first line is the most important. The use function loads the
File::Find module. Now, all the power and functionality of
File::Find is available for use. Such as the find function. This
accepts two basic parameters:
- A reference to a subroutine, usually called wanted, which
defines what you want to do with each file that is found. The
filename will be in $_.
- A list of directories to be searched. Subdirectories will also
be searched.
The subroutine wanted simply prints the directory and name of
each file whose filename begins with a, b, c or d. Make your own
regex to suit. The expression $File::Find::dir means the $dir
variable in the module File::Find. This is explained further in the
next section.
Note -- the \&wanted parameter is a reference to a subroutine.
Essentially, this means that the code in File::Find knows where to
find the &wanted subroutine. It is basically like shortcuts under
Windows 9x and NT4, instead of actual files (but the UNIX Perl
people would slaughter me for that, so be quiet).
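If you would rather collect the filenames than print them, something like the sketch below should do. It is self-contained: it uses File::Temp (another module that comes with Perl) to make a throwaway directory with three made-up files in it, and it hands find a nameless subroutine, sub { ... }, in place of \&wanted.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);    # throwaway directory, removed at exit
for my $name (qw(apple.txt Banana.txt zebra.txt)) {
    open my $fh, '>', "$dir/$name" or die "Can't create $dir/$name: $!\n";
    close $fh;
}

my @found;
find(sub {
    # $_ holds the bare filename while this subroutine runs
    push @found, $_ if -f $_ && /^[a-d]/i;
}, $dir);

@found = sort @found;
print "Found: @found\n";
```

The -f test skips the directory itself, which find also passes to the subroutine.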
ChangeNotify
Another example is Win32::ChangeNotify. As you might expect there
are a number of Win32-specific modules, and ChangeNotify is one of
them. It waits until something changes in a directory, then acts.
What it waits for and what it does are up to you, for example:
use Win32::ChangeNotify;
$Path='/downloads';
$WatchSubTree=0;
$Events='FILE_NAME';
$browser='E:/progs/netscape/Communicator/program/netscape.exe';
$changes=0;
$notify = Win32::ChangeNotify->new($Path,$WatchSubTree,$Events);
while (1) {
    print "- ",scalar(localtime)," $changes so far to $Path.\n";
    $notify->wait;
    ++$changes;
    print "- ",scalar(localtime), " Launching $browser...\n";
    system("$browser $Path");
    $notify->reset;
}
Again, the module is incorporated into the program with use. An
object referred to by the variable $notify is created. The
parameters passed are the path to be watched, whether we want to
watch subtrees, and what sort of events we want to be notified
about, in this case only filename changes.
Then, we enter a loop which continues while 1 is true -- which
will be forever.
The program pauses when the wait method of the $notify object is
called. Only when there is a change to the directory does the rest
of the loop body run, launching the browser. We then have to reset
the $notify object so that it can wait for the next change.
There is some pretty frightening stuff about objects in the
explanation. But you don't actually need to understand anything
about objects. Just read the documentation, and experiment.
You can use as many modules as you like in one program. As they
are all written with carefully scoped variables you need not worry
about programmers using the same variable names in different
modules. Now you *really* appreciate scoping!
Your Very Own Module
You too can write your own modules. It is easy. First, we will
create the fantastic bit of code that we want to re-use everywhere.
We'll start by writing a normal Perl program:
$name=shift;
print &logname($name);
sub logname {
    my $name=shift;
    my @namebits;
    my ($logon,$initial);
    @namebits=split /\s+/,$name;
    ($initial)=$name=~/(\w)/;
    $logon=$initial.$namebits[$#namebits];
    $logon=lc $logon;
    return $logon;
}
Execute like so:
perl script.pl "Nick Bladon"
The script itself is nothing amazing. The lc function stands for
LowerCase, or probably lOWERcASE -- you can see what it does.
In order to turn it into a module carry out the following steps:
- Find out where your copy of Perl is installed, for example
c:\progs\perl.
- Within that directory there should be a lib directory.
- Make a directory within lib, for example c:\progs\perl\lib\RMP\
Now we'll make the module. Remember, a module is just code you
are going to reuse. So we don't need all of the above example. Just
this bit:
sub logname {
    my $name=shift;
    my @namebits;
    my ($logon,$initial);
    @namebits=split /\s+/,$name;
    ($initial)=$name=~/(\w)/;
    $logon=$initial.$namebits[$#namebits];
    $logon=lc $logon;
    return $logon;
}
1;
The bit that has been added is the 1 at the bottom. Why? Perl
requires that all modules return true. We know that a subroutine
always returns the value of the last expression evaluated, and a
module file does the same. As 1 evaluates to true, that'll do.
You need to save this as logon.pm in your newly created directory
under lib. The pm stands for Perl Module.
That's it. A module created. To use, just make a normal Perl
script such as:
use RMP::logon;
$name=shift;
print logname($name);
and hey presto! Module power is yours!
You don't have to create your own subdirectory within lib
but I would advise it for the sake of neatness. And as you might
expect, there is a lot more to learn about modules but this is
supposed to be a basic tutorial, so that's enough for the time
being.
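If you want to experiment without touching Perl's lib directory, the sketch below writes a tiny module into a temporary directory and loads it with require, which is the function that use relies on underneath. The file name shout_demo.pm and the shout subroutine are made up for illustration, and File::Temp is a module that comes with Perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);      # throwaway directory, removed at exit
my $file = "$dir/shout_demo.pm";       # hypothetical module file

open my $fh, '>', $file or die "Can't write $file: $!\n";
print $fh <<'END_MODULE';
sub shout {
    my $text = shift;
    return uc $text;
}
1;
END_MODULE
close $fh;

require $file;    # loads and runs the file, which must return true
my $loud = shout("module power");
print "$loud\n";
```

Try removing the 1; from the text being written and running it again to see how Perl complains.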
Bondage and Discipline
Perl is a very flexible language. It is designed as a hacking
tool, for quick sysadmin magic. It can do quite a bit more besides,
but being small and powerful is a core Perl feature. Earlier on I
said Perl is not a bondage and discipline language -- to qualify
that, it doesn't have to be. However, there is a time and
place for everything.
For tiny scripts you don't want to be declaring variables,
typecasting and generally spending more time obeying rules than you
do getting the job done. So, Perl doesn't force you to do all of
these good programming practices. However, not all your programs are
going to be five-minute hacks. Some will be pretty large. Therefore,
some Discipline is in order.
Perl has two primary methods of enforcing discipline. They are:
- -w for Warnings
- use strict;
-w
Consider for a moment this little program:
@input=@ARGV;
$outfile='outfile.txt';
open OUT, ">$outfile" or die "Can't open $outfile for write:$!\n";
$input2++;
$delay=2 if $input[0] eq 'sleep';
sleep $delay;
print "The first element of \@input is $input[0]\n";
print OUY "Slept $delay!\n";
It doesn't do much. Just prints out the first argument supplied,
and demonstrates the uninspiring sleep function.
The program itself is full of holes, and it is only a few lines. How
many errors can you spot? Try and count them. When you are finished,
execute the program with error-checks enabled:
perl -w script.pl hello
Perl finds quite a few errors. The -w switch finds, among other
heinous sins:
- Variables used only once. In the example, $input2 is used only
once. It is a useless variable.
- Filehandles used incorrectly. With print OUY I'm trying to print
to a non-existent filehandle. With -w an alarm is raised, as it
would be if I tried to write to a filehandle which was read-only.
- Use of uninitialised variables. The variable $delay is
uninitialised if 'sleep' is not the first parameter. Making
variables spring into the air on demand is not good programming
practice. They should be defined carefully first.
So, generally, -w is a Good Thing. It forces you to write cleaner
code. So use it, but don't be afraid not to for very short programs.
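For comparison, here is one possible cleaned-up version of the little program that should run quietly under -w. It is only a sketch of one way to fix the three problems listed above; the extra variable $first and the '(nothing)' placeholder are additions made for this example.

```perl
# Run with: perl -w script.pl sleep
my @input=@ARGV;
my $outfile='outfile.txt';
open OUT, ">$outfile" or die "Can't open $outfile for write:$!\n";

# $input2 has gone: it was never used for anything.

# Give $delay a value whatever the first argument is.
my $delay=0;
$delay=2 if defined $input[0] and $input[0] eq 'sleep';
sleep $delay;

# Avoid printing an undefined value if no argument was given.
my $first=defined $input[0] ? $input[0] : '(nothing)';
print "The first element of \@input is $first\n";

print OUT "Slept $delay!\n";    # OUT this time, not OUY
close OUT;
```

Declaring the variables with my is not needed to satisfy -w, but it costs nothing and gets you into the habit.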
Shebang
You know that you can turn warnings on with -w on the command
line. You can also turn them on within the script itself. For that
matter, you can give perl any command line option within the script
itself. For example, running:
perl script.pl hello
to execute this:
#!perl -w
@input=@ARGV;
$outfile='outfile.txt';
open OUT, ">$outfile" or die "Can't open $outfile for write:$!\n";
$input2++;
$delay=2 if $input[0] eq 'sleep';
sleep $delay;
print "The first element of \@input is $input[0]\n";
print OUY "Slept $delay!\n";
has the same effect as:
perl -w script.pl hello
It may be more convenient for you to put the flag inside the
script. It doesn't have to be just -w, it can be any argument Perl
supports. Run perl -h for a full list.
The first line, #!perl -w, is the shebang line. This is derived
from UNIX, where Perl was first developed. UNIX systems make a
script executable by changing an attribute. The operating system
then loads the file and works out how to execute it -- in this case
by looking at the first line, then loading the perl interpreter.
Windows systems know that all files with a certain extension must be
passed to a certain program for execution, eg all .bat files are
passed to command.com, and all .xls files are passed to Excel. The
point of all this being that you don't need a shebang line, but it
doesn't hurt.
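One way to check that the switch on the shebang line has been picked up is to look at the special variable $^W, which is true when warnings have been switched on with -w. A minimal sketch, assuming the script is saved with the shebang as its very first line and run as perl script.pl; the subroutine name warnings_state is made up for this example.

```perl
#!perl -w

# $^W is true when -w is in force, whether it came from the
# command line or from the shebang line above.
# warnings_state is a made-up name for this sketch.
sub warnings_state {
    return $^W ? "on" : "off";
}

print "Warnings are ", warnings_state(), "\n";
```

Try deleting the shebang line and running the script again, with and without -w on the command line.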
use strict;
So what's strict and how do you use it? The module strict
restricts 'unsafe constructs', according to the perldocs. The strict
module is a pragma, which is a hint that must be obeyed. Like when
your girlfriend says 'oh, that ring is *far* too expensive'.
There is no need to be frightened about unsafe code if you don't
mind endless hours of debugging unstructured programs. When you
enable the strict module, the three things that Perl becomes strict
about are:
- Variables 'vars'
- References 'refs'
- Subroutines 'subs'
This tutorial doesn't presently cover references (and let's hope
I remember to remove this sentence if I do cover it in later
versions) so we won't worry about refs.
Strict variables are useful. Essentially, this means that all
variables must be declared before use rather than springing into
existence as required. Furthermore, each variable must be declared
with my or fully qualified. This is an example of a program that is
not strict, and should be executed something like this:
perl script.pl "Alain James Smith";
where the "" enclose the string as a single parameter as opposed to
three separate space-delimited parameters.
#use strict; # uncomment after running a couple of times
$name=shift; # shifts @ARGV if no arguments supplied
print "The name is $name\n";
$inis=&initials($name);
$luck=int(rand(10)) if $inis=~/^(?:[a-d]|[n-p]|[x-z])/i;
print "The initials are $inis, lucky number: $luck\n";
sub initials {
my $name=shift;
$initials.=$1 while $name=~/(\w)\w+\s?/g;
return $initials;
}
By now you should be able to work out what the above does. When
you uncomment the use strict; pragma, and re-run the program, you
will get output something like this:
Global symbol "$name" requires explicit package name at n1.pl line 3.
Global symbol "$inis" requires explicit package name at n1.pl line 6.
Global symbol "$luck" requires explicit package name at n1.pl line 8.
Global symbol "$initials" requires explicit package name at n1.pl line 14.
Execution of n1.pl aborted due to compilation errors.
These errors mean Perl is not exactly clear about what the scope of
variables is. If Perl is not clear, you might not be either. So you
need to be explicit about your variables, which means either
declaring them with my so they are restricted to the current block,
or referring to them with their fully qualified name. An example,
using both methods:
use strict;
$main::name=shift; # shifts @ARGV if no arguments supplied
print "The name is ",$main::name,"\n";
my $inis='';
my $luck='';
$inis=&initials($main::name);
$luck=int(rand(10)) if $inis=~/^(?:[a-d]|[n-p]|[x-z])/i;
print "The initials are $inis, lucky number: $luck\n";
sub initials {
my $name=shift;
my $initials;
$initials.=$1 while $name=~/(\w)\w+\s?/g;
return $initials;
}
The my variables in the subroutine are nothing new. The my
variables outside the subroutine are. If you think about it, the
main program itself is also a kind of block, and therefore variables
can be lexically scoped to be visible only within the block.
The other interesting bit is the $main::name business. This, as you
might expect, is the fully qualified name of the variable. The first
part is the package name, in this case main, which is the package
Perl puts your code in unless told otherwise. The second part is the
actual variable name. Personally, I've never needed to refer to a
variable this way. I'm not saying you'll never use the syntax, but I
would suggest that knowing this is not on a Perl student's Top 10
list of Things to Master.
The important thing about use strict is that it does enforce more
discipline than you have been used to, and for all but the smallest
of programs, that is most definitely a Good Thing.
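The classic mistake that strict vars catches is a mistyped variable name. The sketch below puts a deliberately broken snippet in a string and compiles it with eval, so that the complaint can be captured and printed rather than stopping the whole script. The variable names $total and $totl are invented for illustration.

```perl
use strict;

# $totl is a deliberate typo for $total.
my $snippet = 'use strict; my $total = 5; $totl = $total + 1; 1;';

# eval returns undef if the snippet fails to compile,
# and $@ then holds the reason.
my $ok    = eval $snippet;
my $error = $@;

if ($ok) {
    print "Snippet compiled and ran\n";
} else {
    print "strict says: $error";
}
```

The message should be of the same Global symbol ... requires explicit package name kind as shown above, this time naming $totl. Without strict, the typo would quietly create a second variable and the bug would surface much later.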
Debugging
Sooner or later you'll need to do some fairly hairy debugging. It
will be later if you are using strict, -w and writing your
subroutines properly, but the moment will come.
When it does you'll be poring over code, probably late at night,
wondering where the hell the problem is. Some techniques I find
useful are:
- Print your variables and other information out at frequent
intervals.
- Split difficult components of the program out into small,
throwaway scripts. Get these working, then copy the results back
into the main program.
- # Comment frequently.
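The first technique in the list can be as simple as a small helper subroutine that you switch off once the bug is found. A sketch, in which the names $DEBUG and debug are made up for this example:

```perl
use strict;

# Set to 0 to silence the debugging output.
# $DEBUG and debug() are made-up names for this sketch.
my $DEBUG = 1;

sub debug {
    my ($label, @values) = @_;
    # Bracket each value so stray spaces and empty strings are visible.
    my $line = "DEBUG $label: [" . join('] [', @values) . "]\n";
    print STDERR $line if $DEBUG;
    return $line;
}

my $name = "Peter Dakin";
my @namebits = split /\s+/, $name;
debug('name', $name);
debug('namebits', @namebits);
```

Printing to STDERR keeps the debugging chatter separate from the program's real output, which is handy if that output is being redirected to a file.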
Eventually, you'll be stuck. Such is the price of progress. In
this case, Perl's own debugger can be invaluable. Run this code as
normal first:
$name=shift;
print "Logon name creation program\n:Converting '$name'\n";
print &logname($name),"\n\n";
print "Program ended at ", scalar(localtime),"\n";
sub logname {
my $name=shift;
my @namebits;
my ($logon,$inital);
@namebits=split /\s+/,$name;
($inital)=$name=~/(\w)/;
$logon=$inital.$namebits[$#namebits];
$logon=lc $logon;
return $logon;
}
We'll run it with the debugger so you can watch perl at work
while it runs:
perl -d script.pl "Peter Dakin";
and you are into the debugger, which should look something
like this:
c:\scripts\db.pl>perl -d db.pl "Peter Dakin"
Loading DB routines from perl5db.pl version 1.0401
Emacs support available.
Enter h or `h h' for help.
main::(db.pl:1): $name=shift;
DB<1>
The parts of that last line mean:
db.pl        | Name of script being executed
1            | Line number of script that is just about to be executed.
$name=shift; | The code that is just about to be executed.
Type s for a single step and press enter. The code $name=shift;
will be executed, and perl waits for your next command. Keep
inputting s until the program terminates.
This by itself is useful as you see the subroutine flow, but if
you enter h for help you'll see a bewildering range of debug
options. I won't detail them all here, but some of the ones I find
most useful are:
n        | Executes main program, but skips subroutine calls. The subroutine is executed, but you aren't stepped through it. Try using n instead of s.
/xx/     | Searches through program for xx
p        | Prints, for example p @namebits, p $name
Enter    | Pressing the Enter key (inputting a carriage return) repeats the last n or s command.
perlcode | You can type any perl code in and it will be evaluated, and have an effect on your program. In the example below, from a run started with the name "Mark Gray", I remove spaces from $name. Inputs are whatever follows the DB<n> prompt:
main::(db.pl:1): $name=shift;
DB<1> s
main::(db.pl:3): print "Logon name creation program\n:Converting '$name'\n";
DB<1> $name=~s/\s//g;
DB<2> print $name
MarkGray
DB<3>
There are many, many more debugger options which are worth
becoming familiar with. Type h for a full list.
Logical Operators
Logical operators are such things as OR, NOT, AND. They all
evaluate expressions. The expression evaluates to true, or false.
Exactly what criteria for evaluation are used depends on the
operator.
or
The or operator works as follows:
open STUFF, $stuff or die "Cannot open $stuff for read :$!";
This line means -- if the operation for opening STUFF
fails, then do something else. Another example:
$_=shift;
/^R/ or print "Doesn't start with R\n";
If the regular expression does not match, then the code on the
right side of the or is executed, and the message is printed.
As you know, shift works on @ARGV if no target is given, or @_
inside a subroutine.
Perl has two OR operators. One is the now familiar or and the
other is ||.
Precedence: What comes First
To understand the difference between the two we need to talk
about precedence. Precedence means priority, order, importance. A
good example is:
perl -e"print 2+8"
which we know equals 10. But if we add:
perl -e"print 2+8/2"
Now, will this be 2+8 == 10, divided by 2 == 5? Or maybe 8/2 ==
4, plus 2 == 6?
Precedence is about what is done first. In the example above, you
can see that the division is done first, then the addition.
Therefore, division has a higher precedence than addition.
You can force the issue with parens:
perl -e"print ((2+8)/2)"
which forces Perl, kicking and screaming, to evaluate 2+8 then
divide the result by 2.
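The same arithmetic can be put in a script, which avoids any fiddling with quotes on the command line. A small sketch, with variable names chosen just for this example:

```perl
use strict;

# Division binds more tightly than addition.
my $without_parens = 2+8/2;      # 8/2 first, then add 2
my $with_parens    = (2+8)/2;    # parens force 2+8 first

print "2+8/2   = $without_parens\n";
print "(2+8)/2 = $with_parens\n";
```

The first line printed shows 6 and the second shows 5.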
So what has this to do with logical operators? Well, the main
difference between or and || is precedence.
In the example below, we attempt to assign non-existent elements of
an array to two variables. This will fail:
@list=qw(a b c);
$name1 = $list[4] or "1-Unknown";
$name2 = $list[4] || "2-Unknown";
print "Name1 is $name1, Name2 is $name2\n";
print "Name1 exists\n" if defined $name1;
print "Name2 exists\n" if defined $name2;
The output is interesting. The variable $name2 has been given the
fallback value "2-Unknown". However, $name1 is undefined. The reason
is all about precedence. The or operator has a lower precedence
than ||.
This means or looks at the entire expression on its left hand
side. In this case, that is $name1 = $list[4]. The assignment is
carried out, leaving $name1 undefined because $list[4] does not
exist. As the result is false, the right hand side is then
evaluated. All the right side contains is "1-Unknown", which may be
true but is not assigned to anything and doesn't produce any output.
In the case of ||, which has a higher precedence, only the code
immediately on the left of the operator is evaluated. In this case,
that is $list[4]. This is false, so the code immediately to the
right is evaluated. But the code further to the left which has not
yet been evaluated, $name2 =, is not forgotten. Therefore, the
expression evaluates to $name2 = "2-Unknown".
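One way to see the difference is to write in the parentheses that Perl effectively supplies for itself. This is a sketch reusing the @list example; note that the right hand side of the or line has been changed to an assignment so that it has a visible effect, and that $name3 is an extra variable added for comparison.

```perl
use strict;

my @list = qw(a b c);
my ($name1, $name2, $name3);

# 'or' sees the whole assignment on its left. The assignment
# yields a false value, so the right hand side runs.
($name1 = $list[4]) or $name1 = "1-Unknown";

# '||' sees only $list[4] on its left, and the winner is assigned.
$name2 = ($list[4] || "2-Unknown");

# The same || line with an element that does exist.
$name3 = ($list[1] || "3-Unknown");

print "name1 is $name1\n";
print "name2 is $name2\n";
print "name3 is $name3\n";
```

The || form is a handy way of supplying a default value, provided you remember that 0 and the empty string count as false and will be replaced by the default too.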
The example below should help clarify things:
@list=qw(a b c);
$ele1 = $list[4] or print "1 Failed\n";
$ele2 = $list[4] || print "2 Failed\n";
print <<PRT;
ele1 :$ele1:
ele2 :$ele2:
PRT
The two failure messages are both printed, but for different
reasons. The first is printed because we are assigning $ele1
a false value, so the result of the operation is false. Therefore,
the right hand side is evaluated.
The second is printed because $list[4] is itself false.
Yet, as you can see, $ele2 exists. Any idea why?
The reason is that the result of print "2 Failed\n" has been
assigned to $ele2. The print is successful, and therefore returns 1.
Another example:
$file='not-there.txt';
open FILE, $file || print "1: Can't open file:$!\n";
open FILE, $file or print "2: Can't open file:$!\n";
In the first example, the error message is not printed. This is
because || looks only at $file, which evaluates to true, so the
print is never reached. In the second example, or looks at the
entire left hand side, which is the whole open statement, and takes
action on the result of that. The open fails, so the message is
printed.
You can fix things with parens:
$file='not-there.txt';
open FILE, $file || print "1: Can't open file:$!\n";
open FILE, $file or print "2: Can't open file:$!\n";
open (FILE, $file) || print "3: Can't open file:$!\n";
like so, but why bother when you have a perfectly good operator
in or? You could apply parens elsewhere:
@list=qw(a b c);
$name1 = $list[4] or "1-Unknown";
($name2 = $list[4]) || "2-Unknown";
print "Name1 is $name1, Name2 is $name2\n";
print "Name1 exists\n" if defined $name1;
print "Name2 exists\n" if defined $name2;
Now, ($name2 = $list[4]) is evaluated as a complete expression,
not just as $list[4], so we get exactly the same result as if we
used or.
And
now for something similar. And. Logical AND operators evaluate
two expressions, and return true only if both are true.
Contrast this with OR, which returns true if one or more
of the two expressions are true. Perl has a few AND operators.
The first type of AND we will look at is &&:
@list=qw(a b c);
print "List is:@list\n";
if ($list[0] eq 'x' && $list[2]++ eq 'd') {
print "True\n";
} else {
print "False\n";
}
print "List is:@list\n";
The output here is False. It is clear that $list[0] does not
equal x. As AND statements can only return true if both expressions
being evaluated are true, then as the first statement is false this
is an obvious non-starter and perl decides it need not continue to
the second statement. Entirely sensible.
The second type of AND statement is &. This is similar to &&. See
if you can work out what the difference is using this example:
@list=qw(a b c);
print "List is:@list\n";
if ($list[0] eq 'x' & $list[2]++ eq 'd') {
print "True\n";
} else {
print "False\n";
}
print "List is:@list\n";
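The difference is that the second part of the expression is
evaluated no matter what the result of the first part is. Despite
the fact that the AND statement cannot possibly return true, perl
goes ahead and evaluates the second part of the statement anyway,
hence $list[2] ends up as d. Strictly speaking, & is Perl's bitwise
AND operator. It works as a logical test here because each
comparison returns either 1 or an empty string.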
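The two tests can be run side by side to make the difference obvious. In this sketch each test gets its own copy of the list, so one cannot affect the other; the array names @short and @both are made up for this example.

```perl
use strict;

my @short = qw(a b c);
my @both  = qw(a b c);

# && stops as soon as the left side is false, so the ++ never runs.
my $r1 = ($short[0] eq 'x' && $short[2]++ eq 'd') ? "True" : "False";

# & evaluates both sides regardless, so the ++ always runs.
my $r2 = ($both[0] eq 'x' & $both[2]++ eq 'd') ? "True" : "False";

print "&& gives $r1, list is now: @short\n";
print "&  gives $r2, list is now: @both\n";
```

Both tests come out False, but the first list is still a b c while the second has become a b d.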
The third AND which we will look at is and. This behaves in the
same way as && but is lower precedence. Therefore, all the
guidelines about || and or apply.
Other Logical Operators
Perl has not, which works like ! except for low precedence. If
you are wondering where you have seen ! before, what about:
$x !~/match/;
if ($t != 5) {
as two examples. There is also Exclusive OR, or XOR. This means:
- If exactly one expression is true, XOR returns true
- If both expressions are false, XOR returns false
- If both expressions are true, XOR returns false (the crucial
difference from OR)
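The three rules above can be checked with a small truth table. A sketch, in which the subroutine name truth is made up for this example:

```perl
use strict;

# Turn any true or false result into a word for printing.
# truth is a made-up name for this sketch.
sub truth {
    my $result = shift;
    return $result ? "true" : "false";
}

# Try every combination of false (0) and true (1).
foreach my $left (0, 1) {
    foreach my $right (0, 1) {
        my $or  = truth($left or  $right);
        my $xor = truth($left xor $right);
        print "$left $right  or: $or  xor: $xor\n";
    }
}
```

Only the last row, where both sides are true, shows or and xor disagreeing.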
This needs an example. Jane and Sonia are two known
troublemakers, with a reputation for throwing good beer around,
going topless at inappropriate moments and singing out of tune to
the karaoke machine. You only want to let one of them into your
party, and instead of a big muscle-bound bouncer you have this perl
script on the door:
($name1,$name2)=@ARGV;
if ($name1 eq 'Jane' xor $name2 eq 'Sonia') {
print "OK, allowed\n";
} else {
print "Sorry, not allowed\n";
}
I would suggest running it thus:
perl script.pl Jane Karen
(one true, one false)
perl script.pl Jim Sonia
(one true, one false)
perl script.pl Jane Sonia
(both true)
perl script.pl Jim Sam
(both false)
Well, the script is not perfect as a doorman, as all Jane and
Sonia have to do is type their names in lowercase, but hopefully it
demonstrates xor.
One thing to beware of is:
$_=shift;
print "OK\n" unless not(!/r/i || /o/i & /p/ or /q/);
over-complication, and believe me the above is not as complicated
as it could be. Take the time to understand what you want to do.
Perl provides a plethora of logical operators so you really don't
have any excuse for not writing legible code. The above can be
written a lot more concisely and clearly. As well as a lot more
obscurely :-)
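As one possible clearer spelling, the sketch below wraps the original test in a subroutine and sets a step-by-step version beside it, then runs both over a few sample words to check that they agree. The subroutine names complicated and clear, and the sample words, are made up for this example.

```perl
use strict;

# The original test, unchanged apart from returning 1 or 0
# instead of printing.
sub complicated {
    local $_ = shift;
    return 1 unless not(!/r/i || /o/i & /p/ or /q/);
    return 0;
}

# The same test spelled out: OK if there is no r (in either case),
# or there is an o (in either case) together with a lowercase p,
# or there is a lowercase q.
sub clear {
    my $word = shift;
    return 1 if $word !~ /r/i;
    return 1 if $word =~ /o/i and $word =~ /p/;
    return 1 if $word =~ /q/;
    return 0;
}

foreach my $word (qw(perl Rope prop square cat RQ rq)) {
    printf "%-8s complicated: %d  clear: %d\n",
        $word, complicated($word), clear($word);
}
```

The two columns should match for every word. The second version takes more lines, but each line can be read and checked on its own.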
Last words
I hope you have enjoyed this tutorial and learnt something from
it. I would appreciate an email letting me know how it could be
improved. What you have learnt is just a fraction of Perl's
functionality, but you'll find skills like regexes can be applied in
many other places than Perl.
Good luck.
--
Robert
Thanks to...
Everyone that helped in the development of this tutorial. I do
read all the feedback emails, but don't always action them the same
year. What you have just read is better because of the people below.
They fix the bugs, scream when they don't understand and I rewrite
whole sections. Documents like this are written by the authors, but
polished by the readers.
The roll of honour is, in a semi-chronological order:
- Mark Miller for his long email suggesting improvements
and highlighting typos. I cringed when I realised what I'd let
through :-(
- Roland to whom I am eternally grateful for sending in
many typo reports, and pointing out where he didn't understand
an explanation.
- Katya de Vries for finding HTML errors and problems
with the example code.
- Steven Ham for being picky about spelling errors. Good
going, considering English is his second language !
- Carlos Jaramillo Uribe for pointing out where I could
have explained postincrements and regex a little better and for
pointing out a typo or two.
- Sergio Polini who brought an interesting aspect of Perl's
behaviour with arrays to my attention, and helped to improve
parts of the Regex section.
- Leo Durocher for telling me he had trouble with the
regex section. If he did, I'm sure many others did too.
- Paul Trafford for solving the Them/Us problem I was too
lazy to bother with, and doing it so elegantly.
- Eric Smith who was one of many people who made
me a table of contents rather than just tell me I should include
one. I never used any of them, and the one you see now is
auto-generated by a program written in Java (only kidding, it's
not auto-generated :-)
- Mike Conkin who said he didn't understand $^I. Good
point. I'd forgotten to explain it at all. Mike went on to list
several other areas I could do with improving in one of the most
amusing and useful missives I've had on the tutorial. Thanks.
- Vasile Calamuti who picked up on my use of join before I'd
explained it, and a couple more oversights.
- Didier Owono for pointing out my original explanation of /ee
didn't make sense. Hopefully the second version does.
- Keen Meng Lew and Ever Olano who, independently
(I assume) picked up exactly the same two typos. Which are now
fixed.
- Anna in Ohio who sent a polite email with a few errors
she picked up on.
- Ken Teuchler for knowing the difference between = and =~, and
for his long list of improvements which varied from grammar errors
to style suggestions to oversights. A huge help.
- cookie, firstly for his Win9x experiments and error
checks about my explanation of scoping. Secondly for his many
subsequent emails pointing out minor problems which elevated him
to status of #1 bugfixer. Appreciated.
- Ginny for spotting an errant ; which in the best
tradition of teachers I have changed into an exercise for
debugging, of course I meant to leave it out in the first place.
I should also point out that a major motivation for me to put
the effort into this tutorial is the appreciation of the
userbase, and Ginny sent me a particularly motivational missive.
- Jeffery Jackson for noticing my error about 0-based
arrays.
- Kevin Haskins for pointing out Notepad's limitations
and an equality issue.
- Pat McCarthy for picking up a small typo.
- Bob Kauten who noticed that I hadn't explained the
range operator properly. I blame....well, me really.
- Ayhan Tuncer for picking up a mistake where I'd
carelessly cut and pasted pasted pasted. The next day Michael
Kersey found the exact same error, before I'd had a chance
to fix it. Ayhan also found quite a few more errors after that
one during her work on the Turkish translation.
- Ray Price who was another one who found the above
error, and a couple more typos as well.
- Henry Vermeulen, a Dutch chap who noticed I'd mispelled
Heineken. Nothing to do with Perl, just one of my outlandish
examples.
- Everyone that has ever worked on perl, all the hackers
on the perl-win32* mailing lists, ActiveState and
the netizens of clpm.
The original location of this document is:
http://www.netcat.co.uk/rob/perl/win32perltut.html