Aug 24 1999
Software development has changed greatly in the past twenty years. However, the programmer’s tools have not changed nearly as much over the same period. Programmers keep running into the same types of bugs that others have encountered, such as buffer overruns. Most software debugging is a slow manual process that does not scale well. I have developed a Perl script to locate various typographical errors in C or C++ source code. This paper describes the development of the script, the types of bugs that the script will report, auxiliary scripts and applications, and my experience developing my first Perl script.
Typographical errors (typos) can be very frustrating and time-consuming to locate. One famous example of a typo is a period mistyped for a comma in a FORTRAN program:
DO 10 I = 1.10
instead of
DO 10 I = 1,10
Because FORTRAN ignores blanks, the compiler interprets the first line as an assignment of 1.10 to a variable named DO10I instead of as the start of a loop. See <http://www.best.com/~wilson/faq/inicio.html#queIII1> for a full explanation. Another example of a typographical error is from comp.risks, Volume 20, Issue 18 <http://catless.ncl.ac.uk/Risks/20.18.html#subj9>, where there were too few equals signs in an if statement, which turned a comparison into an assignment. It took a programmer two days to find three of these typographical errors.
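To make the if-statement typo concrete, here is a minimal sketch in C. The function names are hypothetical and chosen only for illustration. The typo is written with an explicit comparison against zero so that its real meaning is visible.

```c
/* Intended test: is x equal to 3? */
int equals_three(int x)
{
    return x == 3;
}

/* What a condition such as `if (x = 3)` really evaluates: the single '='
   stores 3 in x and the stored value is then tested, so the result is
   nonzero for every input. */
int equals_three_typo(int x)
{
    return (x = 3) != 0;
}
```

With these definitions, equals_three(5) yields 0 while equals_three_typo(5) yields 1, which is why the body of such an if statement always runs.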
Locating bugs in an application can be a tedious process. A programmer can conduct a time-consuming review of the source code or, if the bug is easily reproducible, use a source code debugger to locate the bug. Otherwise, there are several automated methods for locating bugs. Runtime analyzers like Rational Software’s Purify or NuMega’s BoundsChecker help locate bugs, but they often increase the execution time of the application and the amount of memory used. The more popular runtime analyzers detect many problems with memory accesses since they can easily detect when an application accesses invalid memory. Some runtime analyzers, such as BoundsChecker, also check API usage, e.g. illegal parameters passed to functions. Debug versions of memory allocation functions are also popular and effective.
Source code analyzers are another category of debugging tools. They usually find a different class of bugs than runtime analyzers. Compiling at a high warning level and running lint on your source are examples of source code analysis. Intrinsa’s PREfix is a source code analyzer that can detect many of the bugs that runtime analyzers detect, but its price is generally prohibitive for individuals.
Reviewing source code for typographical errors is a time-consuming, boring, and inefficient process, similar to looking for a needle in a haystack. It is very easy to overlook an errant semicolon or equal sign. This is a perfect job for a computer. Source code analyzers are better tools for locating typographical errors than runtime analyzers, because runtime analyzers usually detect only the symptoms of a typographical error, if they detect anything at all.
One such experience with a typographical error made me search for a better solution. I spent a long time tracking a bug down to a line in the source code that had one too many equal signs. The extra equal sign turned an assignment statement into a comparison. My first solution involved using the Win32 application findstr.exe, similar to the UNIX grep command, to scan source code for the particular typo. The solution worked but was inflexible when handling more than one case. Debugging consisted of trial and error. My second solution involved using a batch file to control scanning the source code. This slightly increased the flexibility but was not a great improvement. I had heard that Perl was good at processing text, so I cracked open the Pink Camel book and read a few chapters. I used some of the sample code as a framework for my third solution.
Once I was satisfied with my typo.pl Perl script, I released the script to my group at Microsoft for their use. A feedback loop developed. I would release a newer and improved script. People would find problems or make suggestions for new features or different classes of bugs. Then I would make more changes and the loop would start over again. After several iterations, I released the script companywide, which resulted in an even greater feedback loop.
The main loop of the script scans files one line at a time. The script removes all C or C++ comments, string constants, and whitespace from the line before checking for possible errors. It searches the line for any keywords or user-specified functions. For keywords that have an associated expression, i.e. if statements, for-loops, and while-loops, the script takes the expression and scans it for any possible errors. For user-specified functions, the script takes the function parameters and any return code and scans them for possible errors. This may require specific Perl code for a particular function. The script keeps track of various statistics such as number of lines scanned, number of comments, number of functions, and the number of statements as measured by semicolons.
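The preprocessing step can be sketched in C as follows. This is an illustration, not the script's actual Perl code. The function name strip_line and the in_comment flag are hypothetical. The sketch also assumes that string constants are reduced to an empty pair of quotes and that character constants are kept verbatim.

```c
#include <string.h>

/* Hypothetical helper: copy src to dst while dropping C and C++ comments,
   the contents of string constants, and whitespace. *in_comment carries
   the "inside a block comment" state from one line to the next.
   dst must have room for at least strlen(src) + 1 bytes. */
void strip_line(const char *src, char *dst, int *in_comment)
{
    size_t i = 0, o = 0;

    while (src[i] != '\0') {
        if (*in_comment) {
            if (src[i] == '*' && src[i + 1] == '/') {
                *in_comment = 0;            /* block comment ends here */
                i += 2;
            } else {
                i++;
            }
        } else if (src[i] == '/' && src[i + 1] == '*') {
            *in_comment = 1;                /* block comment starts */
            i += 2;
        } else if (src[i] == '/' && src[i + 1] == '/') {
            break;                          /* rest of the line is a comment */
        } else if (src[i] == '"' || src[i] == '\'') {
            char quote = src[i];
            int keep = (quote == '\'');     /* keep character constants only */
            dst[o++] = src[i++];
            while (src[i] != '\0' && src[i] != quote) {
                if (src[i] == '\\' && src[i + 1] != '\0') {
                    if (keep)
                        dst[o++] = src[i];
                    i++;                    /* step over the backslash */
                }
                if (keep)
                    dst[o++] = src[i];
                i++;
            }
            if (src[i] == quote)
                dst[o++] = src[i++];        /* closing quote */
        } else if (strchr(" \t\r\n", src[i]) != NULL) {
            i++;                            /* drop whitespace */
        } else {
            dst[o++] = src[i++];
        }
    }
    dst[o] = '\0';
}
```

Emptying string constants before any checks run means that text such as // or == inside a string cannot be mistaken for a comment or an operator.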
The script emits a warning when it thinks that the code may have an error. In most cases it cannot know for sure, so it errs on the side of caution and generates the warning. It is up to the programmer to decide whether the bug is real.
At the end of the scan, the script displays statistics about the scanned files including number of lines, bytes, comments, semicolons, and functions as well as the amount of time that the script took to scan the files.
The script can display statistics about each file that it scans.
Users can direct the script to extract all the strings in the scanned files. Users can spellcheck the list of extracted strings to locate all the misspelled words in the source code.
The possible errors that the script generates can be organized into several categories:
X == Y; instead of X = Y;
memset(buf, nCount, 0); sets 0 bytes to the value nCount
if (x & 3 == 2) is interpreted as if (x & (3 == 2))
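Two of these cases can be demonstrated with a short C sketch. The function names are hypothetical. The precedence case is written with explicit parentheses to show both the way the compiler groups x & 3 == 2 and the grouping that was probably intended. The memset case relies on the standard signature memset(dest, value, count).

```c
#include <string.h>

/* How the compiler groups `x & 3 == 2`: == binds tighter than &,
   so x is ANDed with the result of (3 == 2), which is 0. */
int mask_as_parsed(int x)
{
    return x & (3 == 2);
}

/* What the programmer most likely meant. */
int mask_as_intended(int x)
{
    return (x & 3) == 2;
}

/* Correct argument order: n bytes are set to 0. */
void clear_correct(char *buf, size_t n)
{
    memset(buf, 0, n);
}

/* Transposed arguments: 0 bytes are set to the value n,
   so the buffer is left untouched. */
void clear_transposed(char *buf, size_t n)
{
    memset(buf, (int)n, 0);
}
```

Both mistakes compile as valid C, although some compilers warn about the transposed memset arguments when the final argument is a literal zero.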
See the List of possible errors at the end of this paper for a complete list of possible errors that the script generates.
Perl was easy to learn. It has the fast turnaround time of interpreted BASIC. Its string processing capabilities are very rich. Regular expressions are a feature lacking from most other procedural languages. Arrays and hashes are easy to use and require much less management than in C or C++. Since there are so many ways to accomplish a given task in Perl, you need to profile each solution if you are worried about performance. Solutions that require few characters may or may not consume a lot of time. The Camel book offers several performance tips. I found that I needed to investigate whether each of these was valid for my script. Sometimes they were and sometimes they were not, e.g. following the tip to avoid $&, $', and $` did not make a measurable difference in the script’s runtime. There were the typical problems, e.g. confusing the string and numeric equality operators. Some of the error messages could be cryptic: when I used #$ARGV instead of the correct $#ARGV, the result was a confusing syntax error. It takes some time to get used to the suggested Perl programming style; I still tend to write Perl code as if I were writing C code. However, the problems are minimal compared to the time and effort that the script has saved.
I have developed several auxiliary Perl scripts and applications, which make handling the script’s output easier.
D:\src\zip22>perl c:\typo\typosum.pl -c <typo.txt
0: 11
1: 1
2: 0
3: 2
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
10: 0
11: 1
12: 0
13: 0
14: 0
15: 0
16: 0
17: 8
18: 0
19: 1
20: 2
21: 1
22: 0
23: 0
24: 0
25: 0
26: 3
27: 35
28: 0
29: 0
30: 24
31: 0
32: 6
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 1
45: 0
46: 14
47: 0
48: 0
49: 0
50: 0
51: 0
52: 0
Total= 110
User specifies behaviour of private functions in a text file.
User runs the script from the topmost directory of the source code, directing the output to a file.
User browses the file with TV.EXE or a text editor to check for any valid bugs.
Scan the source code of Info-Zip’s zip2.2 archiver.
We will use a predefined option file that specifies the behaviour of most Win32 functions.
D:\src\zip22>perl c:\typo\typo.pl -optionfile:c:\typo\win32.txt c
// Perl version: 5.001
// TYPO.PL Version 2.45 Jun 15 1999 by Johnny Lee (johnnyl)
// OPTIONS: '-optionfile:c:\typo\win32.txt c'
// START: Tue Jun 15 17:29:14 1999
D:\src\zip22\fileio.c (280): no immediate strchr check 27: =strchr(q,'@') [q]
.
.
D:\src\zip22\windll\windll.c (106): using malloc result w/no check 30: *zcomment = 0; [zcomment]
// FUNCS: 545
// SEMIS: 10,760
// COMMS: 5,615
// LINES: 34,020
// CHARS: 991,749
// START: Tue Jun 15 17:29:14 1999
// STOP: Tue Jun 15 17:29:32 1999
If the user redirects the script output to a file, then the user can browse the output using the TV.EXE application. See Figure 1.
Figure 1. TV.EXE displaying the script output from scanning Info-Zip’s zip2.2 source code.
You can use typosum.pl to generate a listing that displays the number of errors of each type found in the source code. If the user wants to compare the output of separate invocations of the script on the same group of files, then the line numbers have to be removed because modifications to the files may shift code around. Denum.pl removes the line numbers from the script output so you can use a diff-like tool to determine if there are any changes.
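The idea behind removing line numbers can be sketched in C. This is not Denum.pl itself, whose exact behaviour is not shown here. The function name strip_line_number is hypothetical, and the sketch assumes that the line number is the first parenthesized group of digits followed by a colon, as in the sample output above.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical stand-in for the line-number removal step: rewrite
   "file.c (280): message" as "file.c: message" in place. Lines without
   a "(digits):" group are left unchanged. */
void strip_line_number(char *line)
{
    char *open = line;

    while ((open = strchr(open, '(')) != NULL) {
        char *p = open + 1;

        while (isdigit((unsigned char)*p))
            p++;
        if (p > open + 1 && p[0] == ')' && p[1] == ':') {
            /* Also drop the space that precedes the parenthesis, if any. */
            char *start = (open > line && open[-1] == ' ') ? open - 1 : open;
            memmove(start, p + 1, strlen(p + 1) + 1);
            return;
        }
        open++;                             /* not a line number; keep looking */
    }
}
```

Once the numbers are gone, two runs over slightly shifted code produce identical warning lines, so a diff-like tool reports only warnings that were added or removed.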
The Perl script is not a silver bullet. It does not parse C or C++ correctly. The script does not handle #include files or macros, so macros or complex code can fool the script into generating false positives. The script also does not handle if-else control flow correctly, which generates more false positive warnings. The script has evolved over several years, and assumptions that were once valid may not be valid any longer. Developing and maintaining the Perl script is not my main job; I work on it in my spare time (when I was recuperating from a running injury, I had plenty of spare time), so I have not had the time to document all these assumptions or revisit them. The script cannot determine whether the programmer designed the code to execute in a certain manner, e.g. falling through from one case statement to the following case statement.
However, the script can scan source code written for different operating systems. When I ran the script on my PC, I was able to find real bugs in Macintosh and VMS source code. The script is easy to use, runs quickly, and does not require the modification of any makefiles to work. The time required to investigate all the reported warnings is much less than the time required for one or more programmers to review the source code. The script does not get tired, suffer from eyestrain or repetitive-stress injuries, or whine about scanning more source code. Programmers can run the script on their code before they check in to check that no such bugs were introduced.
The homepage for the typo.pl Perl script is http://www.oocities.com/typopl/. I will also submit the typo.pl script to CPAN after the 1999 Perl Conference.
if (x == y);
= in assignment statements. VC98 emits a warning for this. Handles single
X == Y; X - NULL;
if (x = 3)
ASSERT(Z = 4);
x = y && 1;
x = y || 1;
if (x & 1 == 0) ==> if (x & (1 == 0))
Release/AddRef instead of invoking them. MSVC 5+ can detect this case.
Shift operator (<<, >>) followed by +, -, *, / may have undesired result
x = y << 8 + 12; ==> x = y << (8 + 12);
memset may set 0 bytes
memset(buf, nCount, 0);
FillMemory may set 0 bytes
FillMemory(pAction, 0, sizeof(Action));
LocalReAlloc/GlobalReAlloc may fail without
NULL will overwrite the original value
pch = (char *)realloc(pch, cch+20);
LocalReAlloc, it's not an error to the compiler,
case statement without a
case 2: Foo(); case 3: Bar(); break;
If you add a comment with the text fall through or no break before the next case statement, then the script will not emit a warning.
CreateFile's return value vs
HFILE_ERROR, which is the documented return value on failure.
NULL is wrong since _alloca fails by throwing an exception, so check to see if _alloca is within a
CreateThread is checked at the first if-stmt.
if ((x != 0) || (x != 2))
In this case, if x == 0, the second comparison will succeed and the code will enter the body of the if-statement.
if ((x == 0) && (x == 1))
lstrcpy/strcpy and other functions
x = !3;
new operator is used before it has been checked for success
x = y / (10 ^ 7);
HRESULT function result w/no check
VARIANT, should use
(!x & Y), probably meant
(!(x & Y)); C/C++ precedence rules have '!' before '&'
#define for a value instead of existence
'\0', i.e. user meant to test for null terminator instead of number 0
memset(this, 0, sizeof this);
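A few entries from this list can be illustrated with a short C sketch. The function names are hypothetical, and the realloc pattern shown is one common way to avoid the overwrite problem, not something the script prescribes.

```c
#include <stdlib.h>

/* In C, ^ is bitwise XOR, not exponentiation: with a = 10 and b = 7
   this computes 1010b ^ 0111b = 1101b, not ten to the seventh power. */
int xor_not_power(int a, int b)
{
    return a ^ b;
}

/* (x != 0) || (x != 2) holds for every x, because no value can be
   equal to both 0 and 2 at once. */
int always_true(int x)
{
    return (x != 0) || (x != 2);
}

/* (x == 0) && (x == 1) can never hold. */
int always_false(int x)
{
    return (x == 0) && (x == 1);
}

/* Growing a buffer without losing it on failure. Writing
   `p = realloc(p, n)` would replace p with NULL when realloc fails and
   leak the original block; a temporary keeps the old pointer valid.
   Returns 1 on success and 0 on failure. */
int grow_buffer(char **buf, size_t new_size)
{
    char *tmp = realloc(*buf, new_size);

    if (tmp == NULL)
        return 0;                           /* *buf is still valid */
    *buf = tmp;
    return 1;
}
```

Every one of these mistakes compiles as legal C, which is why a scan for suspicious patterns can be useful even on code that builds cleanly.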
I would like to thank the many people at Microsoft who have written to me with suggestions or reported bugs.