BSOD: Blue Screen of Death

Illegal processor exceptions that user-mode applications cause usually result in application termination and a Dr. Watson message, but the rest of the system continues.

When a kernel-mode device driver or subsystem causes an illegal exception, NT faces a difficult dilemma. It has detected that a part of the operating system with the ability to access any hardware device and any valid memory has done something it wasn't supposed to do.

NT could just ignore the exception and let the device driver or subsystem continue as if nothing had happened. The possibility exists that the error was isolated and that the component will somehow recover, letting NT limp along. What's more likely is that the detected exception resulted from deeper problems. Permitting the system to continue operating will probably result in more exceptions, and data stored on disk or other peripherals can become corrupt--a risk that's too high to take.

A device driver or subsystem also might realize that something is not quite right.

To stop a system in the face of kernel-mode exceptions and to provide a systems administrator or developer information about what has happened, NT exports the KeBugCheck function for use by kernel-mode device drivers, subsystems, and the Microkernel. This function takes a Stop Code and four more parameters that are interpreted on a per-Stop Code basis. After KeBugCheck masks out all interrupts on all processors of the system, it switches the display into blue screen mode (80 columns by 50 lines text mode), paints a blue background, and begins to print information about the system's state.

Mapping the Blue Screen

The blue screen contains five areas of text from top to bottom: the Stop Code, system information, a list of loaded drivers, the stack trace, and an administrative message. Blank lines separate these areas. Some areas might be missing in a blue screen if the system state is too corrupt for NT to fill them in.

The most useful portion of the display is usually the Stop Code area. This area lists the Stop Code and the four additional parameters passed to KeBugCheck.
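As a rough illustration of how the Stop Code area can be read mechanically, the following Python sketch pulls the Stop Code and its four parameters out of the first line of a blue screen. The line layout in the regular expression and the sample line are assumptions based on a typical NT 4.0 display, and parse_stop_line is a hypothetical helper name, not part of any NT tool.

```python
import re

# Assumed layout of the first blue-screen line; spacing may differ on a real screen.
STOP_LINE = re.compile(r"\*\*\*\s*STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (stop_code, [parameters]) as integers, or None if the line does not match."""
    match = STOP_LINE.search(line)
    if match is None:
        return None
    stop_code = int(match.group(1), 16)
    # The parameters are comma-separated hexadecimal values inside the parentheses.
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return stop_code, params


# Illustrative sample line (values are made up for the example).
sample = "*** STOP: 0x0000000A (0x00000000,0x00000002,0x00000000,0x8013BD4B)"
code, params = parse_stop_line(sample)
print(hex(code), [hex(p) for p in params])
```

Keeping the values as integers rather than strings makes it easy to compare them against module address ranges or to split them into bit fields later.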

The Stop Code is a number that represents the nature of the detected problem. The bugcodes.h file in the Windows NT Device Driver Kit contains a complete list of the 150 or so Stop Codes. However, you will typically encounter only 4 or 5 of them. The text line below the Stop Code provides the text equivalent of the Stop Code numeric identifier.

The values in the parentheses give more specific information. For IRQL_NOT_LESS_OR_EQUAL (0x0A), for example:

First value is the address that the driver referenced improperly.
Second value is the IRQL that was required to access the memory.
Third value specifies whether the driver was doing a read or a write.
Fourth value is the address of the instruction that attempted the access.
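The four values above can be turned into a one-line summary. This is a minimal sketch, assuming the parameters belong to IRQL_NOT_LESS_OR_EQUAL (0x0A) and that a third value of 0 means a read and any other value means a write; the function name is hypothetical.

```python
def describe_irql_not_less_or_equal(params):
    """Summarize the four Stop Code 0x0A parameters in plain words."""
    address, irql, access, instruction = params
    # Assumption: 0 = read, non-zero = write.
    operation = "read" if access == 0 else "write"
    return (
        f"Code at 0x{instruction:08X} attempted a {operation} of address "
        f"0x{address:08X}; IRQL parameter = {irql}"
    )


# Illustrative values only.
print(describe_irql_not_less_or_equal([0x00000000, 0x2, 0x0, 0x8013BD4B]))
```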

By looking at the STOP code and the third and fourth parameters, you can possibly determine what caused the error condition. Interpreting the additional Stop Code parameters rarely provides any insight into a problem for anybody other than a device driver writer (or a member of the Microsoft NT development team). Fortunately, NT does some interpretation for us. KeBugCheck scans the parameters for one that looks like it might be an address pointing to the memory image of an Executive subsystem or a device driver. When KeBugCheck finds one, it prints the parameter, the base address of the module the parameter is in, and the name of the module. This last piece of information is crucial.
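The address-to-module matching described above amounts to a range check against the list of loaded images. The sketch below shows the idea only; it is not NT's actual implementation. The module names, base addresses, and sizes are invented for illustration, and the output format approximates the line NT prints.

```python
# Hypothetical loaded-module table: (name, base address, image size in bytes).
# All values below are made up for illustration.
MODULES = [
    ("ntoskrnl.exe", 0x80100000, 0x000D0000),
    ("hal.dll", 0x80010000, 0x0000C000),
    ("exampledrv.sys", 0xFC500000, 0x00008000),
]


def find_module(address, modules=MODULES):
    """Return (name, base) of the module whose image contains address, else None."""
    for name, base, size in modules:
        if base <= address < base + size:
            return name, base
    return None


def report(params, modules=MODULES):
    """Build one line per parameter that falls inside a known module image."""
    lines = []
    for value in params:
        hit = find_module(value, modules)
        if hit is not None:
            name, base = hit
            lines.append(f"Address {value:08X} has base at {base:08X} - {name}")
    return lines


for line in report([0x00000000, 0x2, 0x0, 0x8013BD4B]):
    print(line)
```

Parameters that are not addresses (such as an IRQL of 2) simply fail the range check, which is why scanning all four values is safe.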

The system information area of the screen is below the Stop Code area, and it simply identifies the system's processor type (e.g., Pentium, x486) and NT's base build number (no Service Pack information appears).

Below the system information on the blue screen is the loaded driver area. Here you'll see a listing of all the registered device drivers at the time of the stop. KeBugCheck prints the name, base memory address, and date-stamp (the time a driver was built). Unless you develop device drivers, this information is useless.

Finally, just below the loaded driver area is a snapshot of the system stack at the time of the call to KeBugCheck. Each module (except the first one) in the list had invoked the module printed on the line above it and was waiting for a result. The system detected a problem while the module on the first line was executing, and often this module matches the module shown in the Stop Code area.

The administrative message tells you to contact your systems administrator if you have a chronic blue screen problem on your system.

Interpreting the Blue Screen Information

Sometimes an important clue is lurking in the Stop Code area or stack trace that can help you take a more proactive approach to ridding the system of the blue screen.

First, the Stop Code can provide all the information you need to identify the problem. Here are several Stop Codes, their causes, and some suggestions about what to do if you encounter one:

IRQL_NOT_LESS_OR_EQUAL 0x0A This code is probably the most frequently appearing code, and it usually results from a buggy driver. The most common source of the problem is that the Virtual Memory Manager has detected a kernel-mode component's attempt to access pageable memory when the IRQL is Dispatch Level or higher and the memory is in the paging file. The IRQL must be below Dispatch Level for this operation to be legal. Look at the modules listed in the Stop Code and stack trace areas of the screen for a possible candidate. This code can also be a side effect resulting from a driver not shown in either area that scribbled on memory it shouldn't have.
Put another way, a process attempted to access pageable memory at an interrupt request level (IRQL) that was too high. A process can access only objects that have an IRQL equal to or lower than its own. A device driver using improper addresses is usually the cause of this error.

KMODE_EXCEPTION_NOT_HANDLED 0x1E In this case, the Microkernel's processor exception handler has detected that a driver or subsystem has tried to execute an illegal processor instruction, or a software instruction that NT cannot interpret. The cause can be a faulty memory module or a driver that has corrupted memory. Although the module information on the blue screen is usually misleading in this case, making it difficult to identify the source of the problem, the exception address (the second parameter) sometimes pinpoints the driver or function that caused the problem. Always note this address and the link date of the driver or image that contains it.

NTFS_FILE_SYSTEM 0x24 All file system bug checks encode the source file, and the line within that source file, that generated the bug check in their first ULONG (unsigned long value). The upper 16 bits identify the file; the lower 16 bits identify the source line in the file where the bug check occurred.

NO_MORE_IRP_STACK_LOCATIONS 0x35 With this code, if you've added a new virus scanner or someone has accessed a shared volume over the network for the first time on the machine, the Server device driver can be at fault. The Server device driver constructs I/O request packets with a slot for every device driver on the path to the disk. Sometimes the number of slots (stack locations) the Server device driver allocates in each packet is insufficient, resulting in this Stop Code. Try increasing the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\IRPStackSize setting to a number higher than 4 (or whatever it's set to) and see whether the problem goes away.

REGISTRY_ERROR 0x51 Something has gone terribly wrong with the Registry. It might have received an I/O error while attempting to read one of its files as a result of a hardware problem or file system corruption.

KERNEL_STACK_INPAGE_ERROR 0x77 The system could not read in the requested page of kernel data. A bad block in a paging file or a disk controller error might be the cause. If the error is a result of a paging error, AUTOCHK will attempt to map out the bad block when you restart the system. The second parameter identifies the cause of the error.

INACCESSIBLE_BOOT_DEVICE 0x7B If you see this Stop Code, NT is very early in a boot and cannot access the disk partition that boot.ini is pointing to for the location of the system files (where your \winnt directory resides). The disk containing that partition is faulty, or the data on the disk or partition has become corrupt.

UNEXPECTED_KERNEL_MODE_TRAP 0x7F In this case, the Microkernel's processor exception handler has detected that a driver or subsystem has tried to execute an illegal processor instruction, or a software instruction that NT cannot interpret. The cause can be a faulty memory module or a driver that has corrupted memory. A trap that the kernel is not permitted to have, or cannot catch, occurred in privileged processor (kernel) mode. The message may signify a RAM problem (such as mismatched SIMMs), a BIOS problem, or corrupted file system drivers. Although the module information on the blue screen is usually misleading in this case, making it difficult to identify the source of the problem, the first number in the bug check is sometimes the number of the trap. Consult an Intel x86 family manual for the trap codes.

NMI_HARDWARE_FAILURE 0x80 A hardware error has occurred. The HAL reports whatever information it can identify and directs the user to call the hardware vendor.

0xC000009A Signifies a lack of nonpaged pool resources.

0xC000009C A bad block on the drive.

0xC000016A A bad block on the drive.

0xC0000185 Signifies improper termination of a SCSI device, bad SCSI cabling, or two devices attempting to use the same IRQ.
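For the file system bug checks in the list above (such as NTFS_FILE_SYSTEM), splitting the first parameter into its two 16-bit halves is simple bit arithmetic. A minimal sketch follows; the function name and the sample value are hypothetical, and mapping the file identifier back to a source file name requires information this document does not provide.

```python
def split_fs_bugcheck_param(first_param):
    """Split a file system bug check's first ULONG into (file_id, source_line)."""
    file_id = (first_param >> 16) & 0xFFFF  # upper 16 bits identify the source file
    source_line = first_param & 0xFFFF      # lower 16 bits identify the line number
    return file_id, source_line


# Made-up parameter value for illustration.
file_id, source_line = split_fs_bugcheck_param(0x001902FE)
print(f"file id 0x{file_id:04X}, line {source_line}")
```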

Microsoft Windows NT Workstation Resource Kit contains more information about Stop Codes.
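When the first parameter of UNEXPECTED_KERNEL_MODE_TRAP (0x7F) is a trap number, a small lookup table saves a trip to the processor manual. The sketch below lists only some of the standard Intel x86 exception vectors; the table is deliberately incomplete, and trap_name is a hypothetical helper.

```python
# A subset of the standard Intel x86 exception vectors.
X86_TRAPS = {
    0x00: "Divide Error",
    0x01: "Debug",
    0x02: "NMI Interrupt",
    0x03: "Breakpoint",
    0x04: "Overflow",
    0x05: "BOUND Range Exceeded",
    0x06: "Invalid Opcode",
    0x07: "Device Not Available",
    0x08: "Double Fault",
    0x0A: "Invalid TSS",
    0x0B: "Segment Not Present",
    0x0C: "Stack-Segment Fault",
    0x0D: "General Protection",
    0x0E: "Page Fault",
}


def trap_name(first_param):
    """Return a readable name for the trap number, or a placeholder if it is not listed."""
    return X86_TRAPS.get(first_param, f"unlisted trap 0x{first_param:X}")


print(trap_name(0x08))
```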

Often, you begin seeing blue screens after you install a new software product or piece of hardware. If you've just added a driver, rebooted, and got a blue screen early in system initialization, you can reset the machine and press the space bar when instructed, to get the Last Known Good configuration. Enabling Last Known Good causes NT to revert to a copy of the Registry's device driver registration key (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services) from the last successful boot (before you installed the driver).

If you keep getting blue screens, an obvious approach is to uninstall the things you added just before the appearance of the first blue screen. If some time has passed since you added something new or you added several things at about the same time, you need to note the names of the modules you see in both the Stop Code and stack trace areas. Note that ntoskrnl.exe refers to the image that contains all NT's core kernel-mode subsystems as well as the Microkernel.

If you recognize any of the module names as being related to something you just added (such as scsiport.sys if you put on a new drive), you've possibly found your culprit. Many device drivers have cryptic names, so one thing you can do to figure out which application or hardware device is associated with a name is to run the Regedit Registry viewing tool the next time you boot the system or on a similarly equipped machine. Search for the name of the driver under the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services key. This branch of the Registry is where NT stores registration information for every device driver in the system. If you find a match, look for a value called DisplayName. Some drivers fill in this value with a name descriptive of the device driver's purpose. For example, you might find Virus Scanner, which can implicate the antivirus software you have running.

You can also search Microsoft's online Knowledge Base (http://www.microsoft.com) for the Stop Code and the name of the suspect hardware or application. You might find information about a workaround, an update, or a Service Pack that fixes the problem you're having.

Setting the Blue Screen Options

Instead of just halting the system with a blue screen, you can have NT do any of the following:

Log an event to the system log, which you can view with the Event Viewer administrative tool.
Send you an administrative alert.
Write a dump of the machine's physical memory to disk. You won't want the dump unless you have a chronic problem that a particular hardware vendor or Microsoft will help you debug.
Automatically reboot the computer, which is useful if your machine is performing a task for which you want to minimize downtime.

The Windows NT Server and NT Workstation CD-ROMs contain some tools to help you with the memory dump file. dumpflop.exe writes the memory file to floppies in compressed form (a 32MB memory file fits on about 10 disks). Unfortunately, Microsoft does not accept the memory file on any other physical medium. Once you have created the dump file, you can make it available to a Microsoft Product Support Specialist either by sending the floppies to Microsoft or by preparing a Remote Access Service (RAS) connection for Microsoft Product Support to dial in and view the file contents remotely. Alternatively, you can submit the file to Microsoft over the Internet by connecting to ftp.microsoft.com and copying the file to /transfer/incoming/bussys/winnt.

You can use another utility, dumpchk.exe, to examine the integrity of the dump file and verify that the system created the file correctly. With dumpchk, you can view basic information about the dump file, such as which NT version was running and the STOP error codes.

Another useful utility is dumpexam.exe, which converts the memory file into a readable text file. You need three files to run dumpexam: dumpexam.exe, imagehlp.dll, and a third file that depends on the platform (kdextx86.dll for the Intel platform). The three files must be in the same directory. You can find them on the NT Server or NT Workstation CD-ROM in the directory \support\debug\<platform>, where platform is i386, alpha, mips, or ppc.

This noninteractive debugging method is ideal for users who don't want to debug the driver, but just want to figure out which one is at fault. To run dumpexam, you need to load the symbol files, which contain NT system debugging information. Make sure that the symbol files are for the version of NT you're running, including any installed service packs. For the Intel version of NT, the symbol files are in the \support\debug\i386\symbols directory on the NT resource kits' CD-ROMs.

Syntax for dumpexam:
dumpexam [options] [CrashDumpFile]

where
-? displays the command syntax
-v specifies verbose mode
-p prints the header only
-f filename specifies the output file name
-y path sets the symbol search path

Example: suppose you want to analyze a dump file from a computer running NT Workstation 4.0. The symbols are in the directory d:\symbols, and the dump file, server.dmp, is in the directory d:\dump. The command line reads

dumpexam -y d:\symbols d:\dump\server.dmp

The results of the exam will be in %SystemRoot%\MEMORY.TXT.
