A fast and easy way to find bugs in your source code

Johnny Lee

Microsoft Corporation

typo_pl@hotmail.com

Aug 24 1999

Abstract

Software development has changed greatly in the past twenty years. However, the programmer’s tools have not changed nearly as much over the same period. Programmers keep running into the same types of bugs that others have encountered, such as buffer overruns. Most software debugging is a slow manual process that does not scale well. I have developed a Perl script to locate various typographical errors in C or C++ source code. This paper describes the development of the script, the types of bugs that the script will report, auxiliary scripts and applications, and my experience developing my first Perl script.

Introduction

Typographical errors (typos) can be very frustrating and time-consuming to locate. One famous example of a typo is a period mistyped for a comma in a FORTRAN program, i.e.  DO 10 I = 1.10 instead of DO 10 I = 1,10

The FORTRAN compiler interprets the line as an assignment statement instead of a loop: because FORTRAN ignores blanks, the statement assigns 1.10 to a variable named DO10I. See <http://www.best.com/~wilson/faq/inicio.html#queIII1> for a full explanation. Another example of a typographical error is from comp.risks, Volume 20, Issue 18 <http://catless.ncl.ac.uk/Risks/20.18.html#subj9>, where there were too few equal signs in an if statement, which turned a comparison into an assignment. It took a programmer two days to find three of these typographical errors.

Locating bugs in an application can be a tedious process. A programmer can conduct a time-consuming review of the source code or, if the bug is easily reproducible, use a source code debugger to locate it. Otherwise, there are several automated methods for locating bugs. Runtime analyzers like Rational Software’s Purify or NuMega’s BoundsChecker help locate bugs, but they often increase the execution time of the application and the amount of memory used. The more popular runtime analyzers detect many problems with memory accesses since they can easily detect when an application accesses invalid memory. Some runtime analyzers, such as BoundsChecker, also check API usage, e.g. illegal parameters passed to functions. Debug versions of memory allocation functions are also popular and effective.

Source code analyzers are another category of debugging tools. They usually find a different class of bugs than runtime analyzers. Compiling at a high warning level and running lint on your source are examples of source code analysis. Intrinsa’s PREfix is a source code analyzer that can detect many of the bugs that runtime analyzers detect, but its price is generally prohibitive for individuals.

Having programmers review source code for typographical errors is a time-consuming, boring, and inefficient process. It is similar to looking for a needle in a haystack. It is very easy to overlook an errant semicolon or equal sign. This is a perfect job for a computer. Source code analyzers are better tools for locating typographical errors than runtime analyzers, because runtime analyzers usually detect only the symptoms of a typographical error, if they detect anything at all.

Development of an automated bug finder

One experience with a typographical error made me search for a better solution. I had to spend a long time tracking a bug down to a line in the source code that had one too many equal signs. The extra equal sign turned an assignment statement into a comparison. My first solution involved using the Win32 application findstr.exe, similar to the UNIX grep command, to scan source code for the particular typo. The solution worked but was inflexible when handling more than one case. Debugging consisted of trial and error. My second solution involved using a batch file to control scanning the source code. This slightly increased the flexibility but was not a great improvement. I had heard that Perl was good at processing text, so I cracked open the Pink Camel book and read a few chapters. I used some of the sample code as a framework for my third solution.

Once I was satisfied with my typo.pl Perl script, I released the script to my group at Microsoft for their use. A feedback loop developed. I would release a new and improved script. People would find problems or suggest new features or different classes of bugs to detect. Then I would make more changes and the loop would start over again. After several iterations, I released the script companywide, which resulted in an even larger feedback loop.

Description of the Typo.pl Perl script

The main loop of the script scans files one line at a time. The script removes all C or C++ comments, string constants, and whitespace from the line before checking for possible errors. It then searches the line for any keywords or user-specified functions. For keywords that have an associated expression, e.g. if statements, for-loops, and while-loops, the script takes the expression and scans it for possible errors. For user-specified functions, the script takes the function parameters and any return code and scans them for possible errors. This may require specific Perl code for a particular function. The script keeps track of various statistics such as the number of lines scanned, the number of comments, the number of functions, and the number of statements as measured by semicolons.
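The line-cleaning step described above can be sketched in C. This is only an illustration of the idea, not code from typo.pl: the names `strip_line` and `emit` are hypothetical, and keeping character constants intact is an assumption (it lets a later check still see them).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Appends c to out unless the buffer is full (room is kept for the NUL). */
static void emit(char *out, size_t out_size, size_t *n, char c)
{
    if (*n + 1 < out_size)
        out[(*n)++] = c;
}

/* Hypothetical helper: copies line into out without comments, string
   constants, or whitespace. *in_comment carries the "inside a block
   comment" state from one line to the next. */
void strip_line(const char *line, char *out, size_t out_size, bool *in_comment)
{
    size_t n = 0;
    const char *p = line;

    assert(out_size > 0);
    while (*p != '\0') {
        if (*in_comment) {
            if (strncmp(p, "*/", 2) == 0) { *in_comment = false; p += 2; }
            else p++;
        } else if (strncmp(p, "/*", 2) == 0) {
            *in_comment = true;
            p += 2;
        } else if (strncmp(p, "//", 2) == 0) {
            break;                                  /* rest of line is a comment */
        } else if (*p == '"') {
            p++;                                    /* drop the string constant */
            while (*p != '\0' && *p != '"') {
                if (*p == '\\' && p[1] != '\0') p++;    /* skip escaped char */
                p++;
            }
            if (*p == '"') p++;
        } else if (*p == '\'') {
            emit(out, out_size, &n, *p++);          /* keep char constants as-is */
            while (*p != '\0' && *p != '\'') {
                if (*p == '\\' && p[1] != '\0') emit(out, out_size, &n, *p++);
                emit(out, out_size, &n, *p++);
            }
            if (*p == '\'') emit(out, out_size, &n, *p++);
        } else if (isspace((unsigned char)*p)) {
            p++;
        } else {
            emit(out, out_size, &n, *p++);
        }
    }
    out[n] = '\0';
}
```

A caller would keep one `bool` per file, initialise it to `false`, and pass each line through `strip_line` in order before running any checks on the cleaned text.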

The script emits a warning whenever it suspects that the code contains an error. In most cases it cannot be certain, so it errs on the side of reporting. It is up to the programmer to decide whether each reported bug is real.

At the end of the scan, the script displays statistics about the scanned files including number of lines, bytes, comments, semicolons, and functions as well as the amount of time that the script took to scan the files.

The script can display statistics about each file that it scans.

Users can direct the script to extract all the strings in the scanned files. Users can spellcheck the list of extracted strings to locate all the misspelled words in the source code.

The script reports many kinds of possible errors. See the List of possible errors at the end of this paper for the complete list.

Perl was easy to learn. It has the fast turnaround time of interpreted BASIC. Its string processing capabilities are very rich. Regular expressions are a feature lacking from most other procedural languages. Arrays and hashes are easy to use and require much less management than in C or C++. Since there are so many ways to accomplish a given task in Perl, you need to profile each solution if you are worried about performance. Solutions that require few characters may or may not consume a lot of time. The Camel book offers several performance tips. I found that I needed to investigate whether each of these was valid for my script. Sometimes they were valid and sometimes they were not, e.g. following the tip to avoid $&, $', and $` did not make a measurable difference in the script’s runtime. There were the typical problems, e.g. confusing the string and numeric equality operators. Some of the error messages could be cryptic: when I used #$ARGV instead of the correct $#ARGV, the result was a confusing syntax error. It takes some time to get used to the suggested Perl programming style. I still tend to write Perl code as if I were writing C code. However, the problems are minimal compared to the time and effort that the script has saved.

Auxiliary Perl scripts and applications

I have developed several auxiliary Perl scripts and applications, which make handling the script’s output easier.


D:\src\zip22>perl c:\typo\typosum.pl -c <typo.txt

 0:    11   1:     1   2:     0   3:     2   4:     0   5:     0   6:     0

 7:     0   8:     0   9:     0  10:     0  11:     1  12:     0  13:     0

14:     0  15:     0  16:     0  17:     8  18:     0  19:     1  20:     2

21:     1  22:     0  23:     0  24:     0  25:     0  26:     3  27:    35

28:     0  29:     0  30:    24  31:     0  32:     6  33:     0  34:     0

35:     0  36:     0  37:     0  38:     0  39:     0  40:     0  41:     0

42:     0  43:     0  44:     1  45:     0  46:    14  47:     0  48:     0

49:     0  50:     0  51:     0  52:     0



Total= 110
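The listing above is a per-type tally of the warnings in the script's output. A minimal C sketch of such a tally follows. It is not the typosum.pl code: the function names are hypothetical, and the line format (file, line number in parentheses, message, error number, colon) is inferred only from the sample warning lines shown in this paper.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define TYPO_MAX_TYPES 53   /* assumption: error numbers 0..52, as in the listing */

/* Hypothetical helper: returns the error number of one warning line, or -1
   for status lines (those starting with "//") and anything unrecognized. */
int warning_type(const char *line)
{
    assert(line != NULL);
    if (strncmp(line, "//", 2) == 0)
        return -1;
    const char *msg = strstr(line, "): ");      /* end of "file (line): " */
    if (msg == NULL)
        return -1;
    msg += 3;
    const char *colon = strchr(msg, ':');        /* colon after the error number */
    if (colon == NULL)
        return -1;
    const char *digits = colon;
    while (digits > msg && isdigit((unsigned char)digits[-1]))
        digits--;                                /* walk back over the number */
    if (digits == colon)
        return -1;                               /* no number before the colon */
    return atoi(digits);
}

/* Adds one to counts[type] for each recognized line; returns how many
   warnings were counted. The caller zeroes counts beforehand. */
int tally_warnings(const char *const lines[], size_t n, int counts[TYPO_MAX_TYPES])
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int type = warning_type(lines[i]);
        if (type >= 0 && type < TYPO_MAX_TYPES) {
            counts[type]++;
            total++;
        }
    }
    return total;
}
```

The sketch assumes the message text itself contains no colon before the error number; a message that did would need a stricter pattern.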

Usage

  1. The user specifies the behaviour of private functions in a text file.
  2. The user runs the script from the topmost directory of the source code, directing the output to a file.
  3. The user browses the file with TV.EXE or a text editor to check for any valid bugs.

Example:

Scan the source code of Info-Zip’s zip2.2 archiver.

We will use a predefined option file that specifies the behaviour of most Win32 functions.


D:\src\zip22>perl c:\typo\typo.pl -optionfile:c:\typo\win32.txt c

// Perl version: 5.001

// TYPO.PL Version 2.45 Jun 15 1999 by Johnny Lee (johnnyl)

// OPTIONS: '-optionfile:c:\typo\win32.txt c'

// START: Tue Jun 15 17:29:14 1999

D:\src\zip22\fileio.c (280): no immediate strchr check 27: =strchr(q,'@') [q]

.

.

D:\src\zip22\windll\windll.c (106): using malloc result w/no check 30: *zcomment = 0; [zcomment]

// FUNCS: 545

// SEMIS: 10,760

// COMMS: 5,615

// LINES: 34,020

// CHARS: 991,749

// START: Tue Jun 15 17:29:14 1999

// STOP:  Tue Jun 15 17:29:32 1999

If the user redirects the script output to a file, then the user can browse the output using the TV.EXE application. See Figure 1.

Figure 1. TV.EXE displaying the script output from scanning Info-Zip’s zip2.2 source code.

You can use typosum.pl to generate a listing that displays the number of errors of each type found in the source code. If you want to compare the output of separate invocations of the script on the same group of files, the line numbers have to be removed first, because modifications to the files may shift code around. Denum.pl removes the line numbers from the script output so you can use a diff-like tool to determine if there are any changes.
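The line-number removal can be sketched in C as follows. This is an illustration only, not the Denum.pl code: `strip_line_number` is a hypothetical name, and the exact text that Denum.pl leaves behind is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies line into out with the first " (digits)" that
   is directly followed by ':' removed, so
   "fileio.c (280): msg" becomes "fileio.c: msg".
   Lines without such a pattern are copied unchanged. */
void strip_line_number(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out_size > 0);
    const char *open = strstr(line, " (");
    while (open != NULL) {
        const char *p = open + 2;
        while (isdigit((unsigned char)*p))
            p++;
        if (p > open + 2 && p[0] == ')' && p[1] == ':') {
            /* Prefix before " (", then everything from the ':' onward. */
            snprintf(out, out_size, "%.*s%s", (int)(open - line), line, p + 1);
            return;
        }
        open = strstr(open + 1, " (");   /* e.g. skip "(x86)" in a path */
    }
    snprintf(out, out_size, "%s", line);
}
```

Running every line of two output files through such a filter before comparing them keeps a diff from flagging warnings that merely moved to a different line.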

Pros and Cons of the Perl script

The Perl script is not a silver bullet. It does not parse C or C++ correctly. The script does not handle #include files or macros. Macros or complex code can fool the script into generating false positives. The script does not handle if-else control flow correctly, which generates more false positive warnings. The script has evolved over several years, and assumptions that were once valid may not be valid any longer. My main job is not developing and maintaining the Perl script. I work on the script in my spare time; when I was recuperating from a running injury, I had plenty of it. I have not had the time to document all these assumptions or revisit them. The script cannot determine whether the programmer designed the code to execute in a certain manner, e.g. falling through from one case statement to the following case statement.

However, the script can scan source code written for different operating systems. When I ran the script on my PC, I was able to find real bugs in Macintosh and VMS source code. The script is easy to use, runs quickly, and does not require the modification of any makefiles to work. The time required to investigate all the reported warnings is much less than the time required for one or more programmers to review the source code. The script does not get tired, suffer from eyestrain or repetitive-stress injuries, or whine about scanning more source code. Programmers can run the script on their code before they check in, to catch newly introduced bugs of the kinds the script detects.

Where to get the Typo.pl Perl script

The homepage for the typo.pl Perl script is http://www.oocities.org/typopl/. I will also submit the typo.pl script to CPAN after the 1999 Perl Conference.

List of possible errors

  1. Semicolon appended to an if statement. VC98 emits a warning for this.
    
    if (x == y);
    exit(1);
  2. Use of == instead of = in assignment statements. VC98 emits a warning for this. The script also handles a single + or - typed in place of the =.
    
    X == Y;
    
    X - NULL;
    
    
  3. Assignment of a number in an if statement, probably meant a comparison. VC98 emits a warning for this.
    
    if (x = 3)
    
    
  4. Assignment within an Assert
    
    ASSERT(Z = 4);
    
    
  5. Increment/decrement of ptr, ptr's contents not modified.
    Programmer may have meant to modify ptr's contents
    
    *ptr++;
    
    
  6. Logical AND with a number
    
    x = y && 1;
    
    
  7. Logical OR with a number
    
    x = y || 1;
    
    
  8. Bitwise-AND/OR/XOR of number compared to another value.
    This may have an undesired result due to C precedence rules since
    bitwise-AND/OR/XOR has lower precedence than the comparison operators.
    
    if (x & 1 == 0) ==> if (x & (1 == 0))
    
    
  9. Referencing Release/AddRef instead of invoking them. MSVC 5+ can detect this case.
    
    pFoo->Release;
    
    
  10. Whitespace following a line-continuation character
  11. Shift operator ( <<, >> ) followed by +,-,*,/ may have undesired result
    due to C precedence rules. The shift operator has lower precedence. VC98 emits a warning for this.
    
    x = y << 8 + 12; ==> x = y << (8 + 12);
    
    
  12. Very basic check for uninitialized variables in for-loops
  13. Misspelling the word Microsoft
  14. Swapping the last two args of memset may set 0 bytes
    
    memset(buf, nCount, 0);
    
    
  15. Swapping the last two args of FillMemory may set 0 bytes
    
    FillMemory(pAction, 0, sizeof(Action));
    
    
  16. LocalReAlloc/GlobalReAlloc may fail without MOVEABLE flag
  17. Assigning result of realloc function to same variable that's realloc'ed
    may result in leaked memory if realloc fails since NULL will overwrite the original value
    
    pch = (char *)realloc(pch, cch+20);
    
    
  18. ReAlloc flags in wrong place or using ReAlloc flags for a different realloc API,
    e.g. passing GMEM_MOVEABLE to LocalReAlloc. This is not an error to the compiler,
    but I'd say you were playing with fire.
  19. case statement without a break/return/goto/exit
    
        case 2:
            Foo();
    
        case 3:
            Bar();
            break;

    If you add a comment with the text fall through or no break before the next case statement, then the script will not emit a warning.

  20. Comparing CreateFile's return value vs NULL for failure
    Problem is that CreateFile returns INVALID_HANDLE_VALUE on failure.
  21. Casting a 32-bit number (may not be 64-bit safe)
  22. Casting a 7-digit hex number with high-bit set in first digit.
    Programmer may have meant to add an extra digit.
  23. Comparing functions that return handles to INVALID_HANDLE_VALUE for failure,
    problem is that these functions return NULL on failure
  24. Comparing OpenFile/_lopen/_lclose/_lcreat return value
    to anything other than HFILE_ERROR, which is the documented return value on failure.
  25. Comparing _alloca result to NULL is wrong since _alloca
    fails by throwing an exception, not returning NULL.
  26. MSVC's _alloca fails by throwing an exception, so check to see if _alloca is within a try {}
  27. Check to see if the result from functions that return a value
    like CreateWindow or CreateThread is checked at the first if-stmt.
  28. Check for multiple inequality comparisons of the same var separated by ||,
    e.g. if ((x != 0) || (x != 2)). In this case, if x == 0, the second comparison succeeds, so the condition is always true and the code always enters the body of the if-statement.
    Programmer probably meant && instead of ||.
  29. Similar to 28, check for cases of the form: if ((x == 0) && (x == 1)), which can never be true.
  30. If a function result is used before it has been checked for success
  31. Check for use of lstrcpy/strcpy and other functions
    that can overflow buffers.
  32. Check to see if function result was stored somewhere
  33. Trying to take the logical inverse of a number.
    
    x = !3;
    
    
  34. If the result from the new operator is used before it has been checked for success
  35. Function that throws exception on error is not in a try {}.
  36. Check for misspelled defined symbols. User must do most of the investigative work.
    The script will note all the symbols used in #ifdef,#ifndef,#if,#elif statements and
    print them out at the end.
  37. Check for bitwise-XORing one number with another number
    
    x = y / (10 ^ 7);
    
    
  38. Wrong flags used with MapViewOfFile.
  39. Wrong flags used with CreateFile
  40. Duplicate flags passed to CreateFile
  41. Complain about returning unchecked function results
  42. Using HRESULT function result w/no check
  43. Double semicolon at the end of a statement
  44. Incorrectly calculating memory needed by using strlen(X+1) instead of strlen(X)+1
  45. Assigning TRUE to boolVal field of VARIANT, should use VARIANT_TRUE (= -1)
  46. Empty statement after while/for loop
  47. Use of (!x & Y), probably meant (!(x & Y)); C/C++ precedence rules have '!' before '&'
  48. Testing a #define for a value instead of existence
  49. Test a char for '0' instead of '\0', i.e. user meant to test for null terminator instead of number 0
  50. Use of a disallowed function
  51. Use of a disallowed string
  52. Filling an object with zeros, i.e. memset(this, 0, sizeof this);
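
Several of the items above come down to C operator precedence or a misplaced parenthesis. The following C sketch puts the buggy and corrected forms of items 8, 11, 44, and 47 side by side. The function names are hypothetical and chosen only for illustration, and a compiler may warn about the deliberately buggy expressions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Item 8: == binds tighter than &, so the buggy form is x & (1 == 0),
   which is always 0. */
int low_bit_clear_buggy(unsigned x) { return x & 1 == 0; }
int low_bit_clear_fixed(unsigned x) { return (x & 1) == 0; }

/* Item 11: + binds tighter than <<, so the buggy form is y << (8 + 12). */
unsigned shift_add_buggy(unsigned y) { return y << 8 + 12; }
unsigned shift_add_fixed(unsigned y) { return (y << 8) + 12; }

/* Item 47: ! binds tighter than &, so the buggy form is (!x) & y. */
int no_common_bits_buggy(unsigned x, unsigned y) { return !x & y; }
int no_common_bits_fixed(unsigned x, unsigned y) { return !(x & y); }

/* Item 44: strlen(s + 1) measures the string starting at its second
   character, so the result is two less than the bytes needed for a copy. */
size_t copy_size_buggy(const char *s)
{
    assert(s != NULL && s[0] != '\0');   /* s + 1 must still be inside the string */
    return strlen(s + 1);
}

size_t copy_size_fixed(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;                /* characters plus the terminating NUL */
}
```

For example, `low_bit_clear_buggy(2)` yields 0 even though the low bit of 2 is clear, and `copy_size_buggy("abc")` yields 2 where a copy of the string needs 4 bytes.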

Acknowledgments

I would like to thank the many people at Microsoft who have written to me with suggestions or reported bugs.