Journaling and Logging
The traditional Linux file systems were based on the legacy Unix file
systems. Such file systems (e.g., ext2fs) are static, which means they
do not track changes applied to files and directories, so there is no
guarantee that all updates are performed safely. Furthermore, ext2fs
works asynchronously. Information about a file -- for example, its
permissions, creation date, and ownership -- is written in a delayed
fashion and, often, in several distinct operations.
This approach yields a noticeable performance gain; however, it also
creates data consistency problems. If a power failure occurs exactly
when the file system has updated the contents of a file but before it
has managed to update the file's header, the file becomes corrupted.
Worse yet, if the disk is highly fragmented, other files may be
corrupted as well, and the entire directory may need to be restored.
Traditionally, a program called fsck (file system check) would examine
the file system during reboot and detect corrupt files. In some cases
it would manage to fix them, too, but usually you would have to restore
the files from a backup set. In the Internet age, when servers are
required to stay up for months, this approach is unacceptable. The
demand for a more reliable file system and faster recovery led to the
development of several journaling and logging file systems.
What is journaling?
The concept, introduced about a decade ago in database systems, ensures
data consistency and integrity in the event of a failure during a
transaction. A typical database journaling system records every
operation applied to the database records. If a transaction can't be
completed due to a hardware fault or a network failure, then the
database system restores the records to their original state. A
journaling file system uses a similar method by constantly monitoring
inode changes.
Logging, as opposed to journaling, keeps track of both inode changes
and file content changes. Each approach has advantages and drawbacks:
journaling imposes less performance overhead, while logging enables
faster recovery. In either case, recovery is much faster than with a
static file system, and it doesn't require a reboot.
Vulnerabilities of traditional static file systems
Under a static file system, each file consists of two logical units: a
metadata block (commonly known as the inode) and the file's data. The
inode (information node) contains information about the file, such as
the physical locations of its data blocks and its modification time.
The second logical unit consists of one or more blocks of data, which
needn't be contiguous. Thus, when an application changes the contents
of a file, ext2fs modifies the file's inode and its data in two
distinct, asynchronous write operations. If an outage occurs in
between, the file system's state is unknown and must be checked for
consistency. A metadata logging file system overcomes this
vulnerability by using a wrap-around, append-only log area on the
disk.
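To make the two logical units concrete, here is a minimal sketch in C.
The structure and field names are hypothetical and far simpler than the
real ext2 on-disk inode (struct ext2_inode in the kernel sources); the
sketch only illustrates the metadata/data split.

#include <stdint.h>

#define DIRECT_BLOCKS 12

/* First logical unit: the metadata block ("information node"). */
struct toy_inode {
    uint16_t mode;                  /* permissions and file type */
    uint16_t uid;                   /* owner */
    uint32_t size;                  /* file size in bytes */
    uint32_t mtime;                 /* last modification time */
    uint32_t block[DIRECT_BLOCKS];  /* physical locations of the data blocks */
};

/* Second logical unit: the data blocks themselves, which needn't be
   contiguous. Updating a file touches both units -- the data blocks
   and the inode that points at them -- in two distinct writes that a
   crash can separate. */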
The logging system records the state of each disk transaction in the
log area. Before any change is applied to the file system, an
intent-to-commit record is appended to the log. When the change has
been completed, the log entry is marked as complete. When recovering
from a failure, the system replays the log and looks for
intent-to-commit records without a matching completion mark. Since
every modification to the file system is recorded in the log, the file
system only needs to read the log rather than perform a full file
system scan. If an intent-to-commit record without a completion mark
is found, the change logged in that record is undone.
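The replay logic can be illustrated with a short C sketch. The record
layout and transaction IDs here are hypothetical in-memory stand-ins
for the on-disk log area, not any particular file system's format.

#include <stdio.h>

enum rec_state { INTENT, COMPLETE };

struct log_record {
    int txn_id;              /* which transaction this record belongs to */
    enum rec_state state;    /* intent-to-commit or completion mark */
};

/* Replay the log: any intent-to-commit record without a matching
   completion mark names a transaction that must be undone. */
static void replay(const struct log_record *log, int n)
{
    for (int i = 0; i < n; i++) {
        if (log[i].state != INTENT)
            continue;
        int done = 0;
        for (int j = i + 1; j < n; j++)
            if (log[j].txn_id == log[i].txn_id && log[j].state == COMPLETE)
                done = 1;
        if (!done)
            printf("undo transaction %d\n", log[i].txn_id);
    }
}

int main(void)
{
    struct log_record log[] = {
        { 1, INTENT }, { 1, COMPLETE },   /* transaction 1 committed */
        { 2, INTENT },                    /* crash before completion */
    };
    replay(log, sizeof log / sizeof log[0]);  /* prints: undo transaction 2 */
    return 0;
}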
Let's look at a concrete example. Suppose we have a file that contains
three data blocks: 1, 2, and 3. The first two blocks are contiguous:
bbb12bbb3Hbbb
The b areas represent unrelated data blocks (blocks that don't belong
to this file), and H is the file header. Now an application updates
blocks 2 and 3. Consequently, the file system looks as follows (the a
areas mark obsolete data blocks that previously contained blocks 2 and
3 and the header):
bbb1abbbaabbb23H
Notice that the modified data was appended to the end: first blocks 2
and 3, and finally the header. The previous locations of blocks 2 and
3 and of the header were marked obsolete. This approach has several
advantages. It's faster, because the system doesn't need to seek all
over the disk to write parts of the file. It's also safer, because the
parts of the file that have changed aren't discarded until the log has
successfully written the new blocks. Finally, recovery after a crash
is much faster, because the logging system needs to check only the
updates that took place after the last checkpoint.
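The append-to-the-end discipline can be sketched in a few lines of C.
The fixed-size array standing in for the wrap-around log area and the
block contents are purely illustrative.

#include <stdio.h>
#include <string.h>

#define LOG_SLOTS 16
#define BLOCK_SZ  4

/* A wrap-around, append-only log area: new versions of blocks are
   appended at the head; obsolete copies are not overwritten until the
   log wraps around. */
static char log_area[LOG_SLOTS][BLOCK_SZ];
static int head;

static int log_append(const char *block)
{
    int slot = head;
    memcpy(log_area[slot], block, BLOCK_SZ);
    head = (head + 1) % LOG_SLOTS;   /* wrap around when the area is full */
    return slot;
}

int main(void)
{
    /* Data blocks first, header last: if a crash hits before the
       header is written, the old version of the file stays intact. */
    int b2  = log_append("blk2");
    int b3  = log_append("blk3");
    int hdr = log_append("hdr ");
    printf("block 2 -> slot %d, block 3 -> slot %d, header -> slot %d\n",
           b2, b3, hdr);
    return 0;
}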
At present, several journaling file systems are available for Linux.
SGI's XFS file system is an open source product; it's a reliable,
fast, 64-bit file system. IBM's JFS is another highly acclaimed open
source product; its 1.0.0 version was released recently.
By Danny Kalev
System Clock
This week, we'll explore the notion of time measurement
and processing under Linux. We will start with a quick
overview of the low-level hardware clocks and their
interrupts, and then we will discuss associated device
drivers and synchronization with external time sources.
Real-Time Clocks
All modern PCs possess an internal real-time clock
(RTC), typically built into the machine's chipset;
some machines instead have an on-board Motorola
MC146818 clock (or a compatible chip). Real-time clocks
can send periodic signals at frequencies ranging
from 2 Hz to 8192 Hz, and they can also function as an
alarm, raising IRQ (interrupt request) 8 when a timer
countdown completes. Linux's /dev/rtc driver, a
read-only character device, controls the system's RTC
and reports the current value as an unsigned long whose
low-order byte contains the interrupt type. The
interrupt type can be update-done, alarm-rang, or
periodic. The remaining three bytes hold the number of
interrupts since the last read. You can access status
information of the /dev/rtc driver via the pseudo-file
/proc/driver/rtc (if the /proc file system is active).
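Here is a minimal sketch of how a program talks to the driver,
assuming /dev/rtc exists and is readable. RTC_UIE_ON enables the
once-per-second update-done interrupt; read() then blocks until the
next interrupt arrives.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
    unsigned long data;
    int fd = open("/dev/rtc", O_RDONLY);
    if (fd < 0) { perror("open /dev/rtc"); return 1; }

    /* Enable update interrupts: one each time the RTC finishes
       updating its time registers (once per second). */
    if (ioctl(fd, RTC_UIE_ON, 0) < 0) { perror("RTC_UIE_ON"); return 1; }

    /* Blocks until the next interrupt; the low-order byte holds the
       interrupt type, the remaining bytes the interrupt count. */
    if (read(fd, &data, sizeof data) < 0) { perror("read"); return 1; }
    printf("interrupt type 0x%lx, count since last read: %lu\n",
           data & 0xff, data >> 8);

    ioctl(fd, RTC_UIE_OFF, 0);
    close(fd);
    return 0;
}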
Time-related Interrupt Requests
On a congested system, the IRQ load can affect the
system's performance: several interrupts may pile
up, causing an "IRQ jam". Applications reading /dev/rtc
must therefore check the number of interrupts
accumulated since the last read, as it may be higher
than one. Modern hardware architectures can handle
clock signals at rates of up to 2048 Hz; higher
frequencies, however, might cause IRQ jams. By design,
a non-privileged process may enable interrupts and
signals at a rate of 64 Hz or lower. For higher
frequencies, the process must have root privileges.
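The same device interface exposes the periodic interrupts discussed
above. A sketch, again assuming /dev/rtc is available: it requests the
64 Hz ceiling that applies to non-privileged processes (a higher rate
would require root) and checks the count returned with each read for
piled-up interrupts.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
    int fd = open("/dev/rtc", O_RDONLY);
    if (fd < 0) { perror("open /dev/rtc"); return 1; }

    /* 64 Hz: the highest periodic rate a non-privileged process may set. */
    if (ioctl(fd, RTC_IRQP_SET, 64) < 0) { perror("RTC_IRQP_SET"); return 1; }
    if (ioctl(fd, RTC_PIE_ON, 0) < 0) { perror("RTC_PIE_ON"); return 1; }

    for (int i = 0; i < 5; i++) {
        unsigned long data;
        if (read(fd, &data, sizeof data) < 0) { perror("read"); break; }
        /* More than one interrupt per read means they piled up
           between reads -- an "IRQ jam". */
        printf("interrupts since last read: %lu\n", data >> 8);
    }

    ioctl(fd, RTC_PIE_OFF, 0);
    close(fd);
    return 0;
}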
Synchronization with External Time Sources
Certain systems are synchronized with an external
time-measuring device. Using an external time source is
common practice in hard real-time processing, embedded
systems, and clusters. Synchronizing the kernel with
the Network Time Protocol (NTP) enables Linux to keep
up with very accurate atomic clocks around the world
via the Internet. In such systems, the kernel writes
the time to the CMOS clock every 11 minutes. When doing
so, the kernel disables the RTC's periodic interrupts
for a short time.
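Whether the kernel currently considers itself synchronized (and is
therefore running the 11-minute CMOS update) can be queried with the
adjtimex(2) system call; a minimal sketch:

#include <stdio.h>
#include <sys/timex.h>

int main(void)
{
    struct timex tx = { 0 };            /* modes == 0: query only */
    if (adjtimex(&tx) < 0) { perror("adjtimex"); return 1; }

    /* STA_UNSYNC is clear while an NTP daemon disciplines the clock. */
    printf("clock is %ssynchronized (estimated error: %ld us)\n",
           (tx.status & STA_UNSYNC) ? "not " : "", tx.esterror);
    return 0;
}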
By Danny Kalev