Illegal processor exceptions that user-mode applications cause usually result in application termination and a Dr. Watson message, but the rest of the system continues.
When a kernel-mode device driver or subsystem causes an illegal exception, NT faces a difficult dilemma. It has detected that a part of the operating system with the ability to access any hardware device and any valid memory has done something it wasn't supposed to do.
NT could just ignore the exception and let the device driver or subsystem continue as if nothing had happened. The possibility exists that the error was isolated and that the component will somehow recover, letting NT limp along. What's more likely is that the detected exception resulted from deeper problems. Permitting the system to continue operating will probably result in more exceptions, and data stored on disk or other peripherals can become corrupt--a risk that's too high to take.
A device driver or subsystem also might realize that something is not quite right.
To stop a system in the face of kernel-mode exceptions and to provide
a systems administrator or developer information about what has happened,
NT exports the KeBugCheck function for use by kernel-mode device drivers,
subsystems, and the Microkernel. This function takes a Stop Code and four
more parameters that are interpreted on a per-Stop Code basis. After KeBugCheck
masks out all interrupts on all processors of the system, it switches the
display into blue screen mode (80 columns by 50 lines text mode), paints
a blue background, and begins to print information about the system's state.
Mapping the Blue Screen
The blue screen contains five areas of text from top to bottom: the Stop Code, system information, a list of loaded drivers, the stack trace, and an administrative message. Blank lines separate these areas. Some areas might be missing in a blue screen if the system state is too corrupt for NT to fill them in.
The Stop Code is a number that represents the nature of the detected problem. The bugcodes.h file in the Windows NT Device Driver Kit contains a complete list of the 150 or so Stop Codes. However, you will typically encounter only 4 or 5 of them. The text line below the Stop Code provides the text equivalent of the Stop Code numeric identifier.
The values in the parentheses give more specific information:
By looking at the Stop Code and the third and fourth parameters,
you can sometimes determine what caused the error condition. In practice,
however, interpreting the additional Stop Code parameters rarely provides
any insight into a problem for anybody other than a device driver writer
(or a member of the Microsoft NT development team). Fortunately, NT does
some interpretation for us.
KeBugCheck scans the parameters for one that looks like it might be
an address pointing to the memory image of an Executive subsystem or a
device driver. When KeBugCheck finds one, it prints the parameter,
the base address of the module the parameter is in, and the name of the
module. This last piece of information is crucial.
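The lookup described above can be illustrated with a minimal sketch. It is not NT's implementation: the STOP line layout in the regular expression is an assumed format, and the `Module` type, `parse_stop_line`, `find_module`, and the sample module table are hypothetical names and values introduced only for this example.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Module(NamedTuple):
    """One loaded driver or subsystem image (hypothetical record layout)."""
    name: str
    base: int   # load address of the image
    size: int   # size of the image in bytes


# Assumed layout of the top line: "*** STOP: 0x0000000A (p1,p2,p3,p4)"
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line: str) -> Tuple[int, List[int]]:
    """Return the Stop Code and its parameters as integers."""
    match = STOP_RE.search(line)
    if match is None:
        raise ValueError("no STOP code found in line")
    code = int(match.group(1), 16)
    params = [int(p, 16) for p in match.group(2).replace(",", " ").split()]
    return code, params


def find_module(address: int, modules: List[Module]) -> Optional[Module]:
    """Return the module whose image contains the address, if any."""
    for mod in modules:
        if mod.base <= address < mod.base + mod.size:
            return mod
    return None


if __name__ == "__main__":
    # Illustrative values only; real base addresses come from the
    # loaded-driver list on the blue screen.
    loaded = [
        Module("ntoskrnl.exe", 0x80100000, 0x000C0000),
        Module("example.sys", 0xFC870000, 0x00008000),
    ]
    code, params = parse_stop_line(
        "*** STOP: 0x0000000A (0x00000000,0x0000001A,0x00000000,0xFC873D6C)"
    )
    for p in params:
        mod = find_module(p, loaded)
        if mod is not None:
            print(f"{p:#010x} is in {mod.name} (base {mod.base:#010x})")
```

The same range check works by hand: compare each parameter against the base addresses in the loaded-driver list and pick the module with the highest base that does not exceed the parameter.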
Interpreting the Blue Screen Information
Sometimes an important clue is lurking in the Stop Code area or stack trace that can help you take a more proactive approach to ridding the system of the blue screen.
Sometimes the Stop Code alone provides all the information you need to
identify the problem. Here are several Stop Codes, their causes, and some
suggestions about what to do if you encounter one:
IRQL_NOT_LESS_OR_EQUAL | 0x0A | This is probably the most frequently seen Stop Code, and it usually
results from a buggy driver. The most common source of the problem is that
the Virtual Memory Manager has detected a kernel-mode component's attempt
to access pageable memory when the IRQL is Dispatch Level or higher and
the memory is in the paging file. The IRQL must be below Dispatch Level
for this operation to be legal. Look at the modules listed in the Stop
Code and stack trace areas of the screen for a possible candidate. This
code can also be a side effect resulting from a driver not shown in either
area that scribbled on memory it shouldn't have.
In short, a kernel-mode component attempted to access pageable memory at an interrupt request level (IRQL) that was too high. Page faults cannot be serviced at Dispatch Level or above, so such an access is illegal. A device driver using improper addresses usually is the cause of this error. |
KMODE_EXCEPTION_NOT_HANDLED | 0x1E | In this case, the Microkernel's processor exception handler has detected that a driver or subsystem has tried to execute an illegal processor instruction, or a software instruction that NT cannot interpret. The cause can be a faulty memory module or a driver that has corrupted memory. Although the module information on the blue screen is usually misleading in this case, making it difficult to identify the source of the problem, sometimes the exception address (the second parameter) pinpoints the driver or function that caused the problem. Always note this address and the link date of the driver or image that contains this address. |
NTFS_FILE_SYSTEM | 0x24 | All file system bug checks encode the source file and the line within that file that generated the bug check in their first parameter, a ULONG (unsigned long value). The upper 16 bits identify the file; the lower 16 bits identify the source line in the file where the bug check occurred. |
NO_MORE_IRP_STACK_LOCATIONS | 0x35 | With this code, if you've added a new virus scanner or someone has accessed a shared volume over the network for the first time on the machine, the Server device driver can be at fault. The Server device driver constructs I/O request packets with a slot for every device driver on the path to the disk. Sometimes the number of slots the Server device driver allocates in each packet is insufficient, resulting in this Stop Code. Try increasing the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\IRPStackSize setting to a number higher than 4 (or whatever it's set to) and see whether the problem goes away. |
REGISTRY_ERROR | 0x51 | Something has gone terribly wrong with the Registry. It might have received an I/O error while attempting to read one of its files as a result of a hardware problem or file system corruption. |
KERNEL_STACK_INPAGE_ERROR | 0x77 | The system could not read in the requested page of kernel data. A bad block in a paging file or a disk controller error might be the cause. If the error is a result of a bad block, AUTOCHK will attempt to map out the bad block when you restart the system. The second parameter identifies the cause of the error. |
INACCESSIBLE_BOOT_DEVICE | 0x7B | |
UNEXPECTED_KERNEL_MODE_TRAP | 0x7F | In this case, the Microkernel's processor exception handler has detected that a driver or subsystem has tried to execute an illegal processor instruction, or a software instruction that NT cannot interpret. The cause can be a faulty memory module or a driver that has corrupted memory. A trap occurred in privileged processor (kernel) mode that the kernel is not allowed to have or is unable to catch. The message may signify a computer RAM problem (mismatched SIMMs), a BIOS problem, or corrupted file system drivers. Although the module information on the blue screen is usually misleading in this case, making it difficult to identify the source of the problem, the first number in the bug check is sometimes the number of the trap. Consult an Intel x86 Family manual for the trap codes. |
NMI_HARDWARE_FAILURE | 0x80 | A hardware error has occurred. The HAL reports whatever information it can identify and directs the user to call the hardware vendor. |
0xC000009A | Signifies a lack of nonpaged pool resources | |
0xC000009C | A bad block on the drive | |
0xC000016A | A bad block on the drive | |
0xC0000185 | Signifies improper termination of a SCSI device, bad SCSI cabling, or two devices attempting to use the same IRQ |
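Two of the parameter encodings described in the table lend themselves to a small worked example: the file/line split for file system bug checks (0x24) and the trap number for 0x7F. The sketch below is illustrative only. The function names are hypothetical, the sample parameter value is made up, and the trap table lists just a few well-known x86 exception vectors rather than the full set in the Intel manuals.

```python
from typing import Tuple


def decode_fs_bugcheck_param(param1: int) -> Tuple[int, int]:
    """Split the first parameter of a file system bug check.

    Returns (file_id, line): the upper 16 bits identify the source file,
    the lower 16 bits the line in that file.
    """
    return (param1 >> 16) & 0xFFFF, param1 & 0xFFFF


# A few standard x86 exception vectors (not a complete list).
X86_TRAPS = {
    0x00: "Divide error",
    0x06: "Invalid opcode",
    0x08: "Double fault",
    0x0D: "General protection fault",
    0x0E: "Page fault",
}


def describe_trap(number: int) -> str:
    """Name the trap reported as the first parameter of Stop Code 0x7F."""
    return X86_TRAPS.get(number, "unlisted trap; consult the Intel manual")


if __name__ == "__main__":
    file_id, line = decode_fs_bugcheck_param(0x00020123)  # made-up value
    print(f"file id {file_id:#06x}, source line {line}")
    print(describe_trap(0x08))
```

The file identifier is meaningful only to someone with the file system's source code, but noting both halves makes a support call or Knowledge Base search more precise.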
The Microsoft Windows NT Workstation Resource Kit contains more information about Stop Codes.
Often, you begin seeing blue screens after you install a new software product or piece of hardware. If you've just added a driver, rebooted, and got a blue screen early in system initialization, you can reset the machine and press the space bar when instructed, to get the Last Known Good configuration. Enabling Last Known Good causes NT to revert to a copy of the Registry's device driver registration key (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services) from the last successful boot (before you installed the driver).
If you keep getting blue screens, an obvious approach is to uninstall the things you added just before the appearance of the first blue screen. If some time has passed since you added something new or you added several things at about the same time, you need to note the names of the modules you see in both the Stop Code and stack trace areas. Note that ntoskrnl.exe refers to the image that contains all NT's core kernel-mode subsystems as well as the Microkernel.
If you recognize any of the module names as being related to something you just added (such as scsiport.sys if you put on a new drive), you've possibly found your culprit. Many device drivers have cryptic names, so one thing you can do to figure out which application or hardware device is associated with a name is to run the Regedit Registry viewing tool the next time you boot the system or on a similarly equipped machine. Search for the name of the driver under the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services key. This branch of the Registry is where NT stores registration information for every device driver in the system. If you find a match, look for a value called DisplayName. Some drivers fill in this value with a name descriptive of the device driver's purpose. For example, you might find Virus Scanner, which can implicate the antivirus software you have running.
You can also search Microsoft's online Knowledge Base (http://www.microsoft.com)
for the Stop Code and the name of the suspect hardware or application.
You might find information about a workaround, an update, or a Service
Pack that fixes the problem you're having.
Setting the Blue Screen Options
Instead of just halting the system with a blue screen, you can have:
You can use another utility, dumpchk.exe, to examine the integrity of the dump file and verify that the system created the file correctly. With dumpchk, you can view basic information about the dump file, such as which NT version was running and the STOP error codes.
Another useful utility is dumpexam.exe, which converts the memory dump file into a readable text file. You need three files to run dumpexam: dumpexam.exe, imagehlp.dll, and a platform-specific debugger extension (kdextx86.dll on the Intel platform). The three files must be in the same directory. You can find them on the NT Server or NT Workstation CD-ROM in the directory \support\debug\<platform>, where platform is i386, alpha, mips, or ppc.
This noninteractive debugging method is ideal for users who don't want to debug the driver, but just want to figure out which one is at fault. To run dumpexam, you need to load the symbol files, which contain NT system debugging information. Make sure that the symbol files are for the version of NT you're running, including any installed service packs. For the Intel version of NT, the symbol files are in the \support\debug\i386\symbols directory on the NT resource kits' CD-ROMs.
Syntax for dumpexam:
dumpexam [options] [CrashDumpFile]
where
-? displays the command syntax
-v specifies verbose mode
-p prints the header only
-f filename specifies the output file name
-y path sets the symbol search path
Example: suppose you want to analyze a dump file from a computer running NT Workstation 4.0. The symbols are in the directory d:\symbols, and the dump file, server.dmp, is in the directory d:\dump. The command line reads
dumpexam -y d:\symbols d:\dump\server.dmp
The results of the exam will be in %SystemRoot%\MEMORY.TXT.
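If you examine dump files regularly, you can assemble the command line in a script. The following is a minimal sketch that uses only the options listed above; the function name `dumpexam_command` and its parameters are hypothetical, and it assumes dumpexam.exe and its companion files are on the search path of the machine where you run it.

```python
from typing import List, Optional


def dumpexam_command(dump_file: str,
                     symbol_path: str,
                     output_file: Optional[str] = None,
                     verbose: bool = False) -> List[str]:
    """Build a dumpexam argument list: dumpexam [options] [CrashDumpFile]."""
    cmd = ["dumpexam"]
    if verbose:
        cmd.append("-v")                 # verbose mode
    if output_file is not None:
        cmd += ["-f", output_file]       # output file name
    cmd += ["-y", symbol_path]           # symbol search path
    cmd.append(dump_file)                # crash dump file goes last
    return cmd


if __name__ == "__main__":
    cmd = dumpexam_command(r"d:\dump\server.dmp", r"d:\symbols")
    print(" ".join(cmd))
    # To actually run it on a machine with dumpexam installed:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the arguments as a list rather than a single string avoids quoting problems if a path contains spaces.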