Journaling and Logging
The traditional Linux file systems were based on the legacy Unix file
systems. Such file systems (e.g., ext2fs) are static, which means they
do not track changes applied to files and directories, so there is no
guarantee that all updates are performed safely. Furthermore, ext2fs
works asynchronously. Information about a file -- for example, its
permissions, creation date, and ownership -- is written in a delayed
fashion and, often, in several distinct operations.
This approach yields a noticeable performance gain; however, it also
creates data consistency problems. If a power failure occurs exactly
when the file system has updated the contents of a file but before it
has managed to update the file's header, the file becomes corrupted.
Worse yet, if the disk is highly fragmented, other files may be
corrupted as well, and the entire directory may need to be restored.
Traditionally, a program called fsck (file system check) would examine
the file system during reboot and detect corrupt files. In some cases
it would manage to fix them, too, but usually you would have to restore
the files from a backup set. In the Internet age, when servers are
required to stay up for months, this approach is unacceptable. The
demand for a more reliable file system and faster recovery led to the
development of several journaling and logging file systems.
What is journaling?
The concept, introduced about a decade ago in database systems, ensures
data consistency and integrity in the event of a failure during a
transaction. A typical database journaling system records every
operation applied to the database records. If a transaction can't be
completed due to a hardware fault or a network failure, then the
database system restores the records to their original state. A
journaling file system uses a similar method by constantly monitoring
inode changes.
Logging, as opposed to journaling, keeps track of both inode changes
and file content changes. Each approach has advantages and drawbacks:
journaling imposes less performance overhead, while logging enables
faster recovery. In either case, recovery is much faster than with a
static file system, and it doesn't require a reboot.
Vulnerabilities of traditional static file systems
Under a static file system, each file consists of two logical units: a
metadata block (commonly known as the inode) and the file's data. The
inode (information node) contains information about the file, such as
the physical locations of its data blocks and its modification time.
The second logical unit consists of one or more blocks of data, which
needn't be contiguous. Thus, when an application changes the contents
of a file, ext2fs modifies the file's inode and its data in two
distinct, asynchronous write operations. If an outage occurs in
between, the file system's state is unknown and must be checked for
consistency. A metadata logging file system overcomes this
vulnerability by using a wrap-around, append-only log area on the
disk.
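To make the two logical units concrete, here is a minimal sketch in C.
The structure and field names are hypothetical and far simpler than the
real ext2 on-disk inode (struct ext2_inode in the kernel sources); the
sketch only illustrates the metadata/data split.

#include <stdint.h>

#define DIRECT_BLOCKS 12

/* First logical unit: the metadata block ("information node"). */
struct toy_inode {
    uint16_t mode;                  /* permissions and file type */
    uint16_t uid;                   /* owner */
    uint32_t size;                  /* file size in bytes */
    uint32_t mtime;                 /* last modification time */
    uint32_t block[DIRECT_BLOCKS];  /* physical locations of the data blocks */
};

/* Second logical unit: the data blocks themselves, which needn't be
   contiguous. Updating a file touches both units -- the data blocks
   and the inode that points at them -- in two distinct writes that a
   crash can separate. */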
The logging system records the state of each disk transaction in the
log area. Before any change is applied to the file system, an
intent-to-commit record is appended to the log. When the change has
been completed, the log entry is marked as complete. When recovering
from a failure, the system replays the log and looks for
intent-to-commit records without a matching completion mark. Since
every modification to the file system is recorded in the log, the file
system only needs to read the log rather than perform a full file
system scan. If an intent-to-commit record without a completion mark
is found, the change logged in that record is undone.
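The replay logic can be illustrated with a short C sketch. The record
layout and transaction IDs here are hypothetical in-memory stand-ins
for the on-disk log area, not any particular file system's format.

#include <stdio.h>

enum rec_state { INTENT, COMPLETE };

struct log_record {
    int txn_id;              /* which transaction this record belongs to */
    enum rec_state state;    /* intent-to-commit or completion mark */
};

/* Replay the log: any intent-to-commit record without a matching
   completion mark names a transaction that must be undone. */
static void replay(const struct log_record *log, int n)
{
    for (int i = 0; i < n; i++) {
        if (log[i].state != INTENT)
            continue;
        int done = 0;
        for (int j = i + 1; j < n; j++)
            if (log[j].txn_id == log[i].txn_id && log[j].state == COMPLETE)
                done = 1;
        if (!done)
            printf("undo transaction %d\n", log[i].txn_id);
    }
}

int main(void)
{
    struct log_record log[] = {
        { 1, INTENT }, { 1, COMPLETE },   /* transaction 1 committed */
        { 2, INTENT },                    /* crash before completion */
    };
    replay(log, sizeof log / sizeof log[0]);  /* prints: undo transaction 2 */
    return 0;
}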
Let's look at a concrete example. Suppose we have a file that contains
three data blocks: 1, 2, and 3. The first two blocks are contiguous:
bbb12bbb3Hbbb
The b areas represent unrelated data blocks (blocks that don't belong
to this file), and H is the file header. Now an application updates
blocks 2 and 3. Consequently, the file system looks as follows (the a
areas mark obsolete data blocks that previously contained blocks 2 and
3 and the header):
bbb1abbbaabbb23H
Notice that the modified data was appended to the end: first blocks 2
and 3, and finally the header. The previous locations of blocks 2 and
3 and of the header were marked obsolete. This approach has several
advantages. It's faster, because the system doesn't need to seek all
over the disk to write parts of the file. It's also safer, because the
parts of the file that have changed aren't discarded until the log has
successfully written the new blocks. Finally, recovery after a crash
is much faster, because the logging system needs to check only the
updates that took place after the last checkpoint.
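The append-to-the-end discipline can be sketched in a few lines of C.
The fixed-size array standing in for the wrap-around log area and the
block contents are purely illustrative.

#include <stdio.h>
#include <string.h>

#define LOG_SLOTS 16
#define BLOCK_SZ  4

/* A wrap-around, append-only log area: new versions of blocks are
   appended at the head; obsolete copies are not overwritten until the
   log wraps around. */
static char log_area[LOG_SLOTS][BLOCK_SZ];
static int head;

static int log_append(const char *block)
{
    int slot = head;
    memcpy(log_area[slot], block, BLOCK_SZ);
    head = (head + 1) % LOG_SLOTS;   /* wrap around when the area is full */
    return slot;
}

int main(void)
{
    /* Data blocks first, header last: if a crash hits before the
       header is written, the old version of the file stays intact. */
    int b2  = log_append("blk2");
    int b3  = log_append("blk3");
    int hdr = log_append("hdr ");
    printf("block 2 -> slot %d, block 3 -> slot %d, header -> slot %d\n",
           b2, b3, hdr);
    return 0;
}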
At present, several journaling file systems are available for Linux.
SGI's XFS file system is an open source product; it's a reliable,
fast, 64-bit file system. IBM's JFS is another highly acclaimed open
source product; its 1.0.0 version was released recently.
By Danny Kalev
System Clock
This week, we'll explore the notion of time measurement
and processing under Linux. We will start with a quick
overview of the low-level hardware clocks and their
interrupts, and then we will discuss associated device
drivers and synchronization with external time sources.
Real-Time Clocks
All modern PCs possess an internal real-time clock
(RTC), typically built into the machine's chipset;
some machines instead have an on-board Motorola
MC146818 clock (or a compatible chip). Real-time clocks
can send periodic signals at frequencies ranging
from 2 Hz to 8192 Hz, and they can also function as an
alarm, raising IRQ (interrupt request) 8 when a timer
countdown completes. Linux's /dev/rtc driver, a
read-only character device, controls the system's RTC
and reports the current value as an unsigned long whose
low-order byte contains the interrupt type. The
interrupt type can be update-done, alarm-rang, or
periodic. The remaining three bytes hold the number of
interrupts since the last read. You can access status
information of the /dev/rtc driver via the pseudo-file
/proc/driver/rtc (if the /proc file system is active).
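Here is a minimal sketch of how a program talks to the driver,
assuming /dev/rtc exists and is readable. RTC_UIE_ON enables the
once-per-second update-done interrupt; read() then blocks until the
next interrupt arrives.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
    unsigned long data;
    int fd = open("/dev/rtc", O_RDONLY);
    if (fd < 0) { perror("open /dev/rtc"); return 1; }

    /* Enable update interrupts: one each time the RTC finishes
       updating its time registers (once per second). */
    if (ioctl(fd, RTC_UIE_ON, 0) < 0) { perror("RTC_UIE_ON"); return 1; }

    /* Blocks until the next interrupt; the low-order byte holds the
       interrupt type, the remaining bytes the interrupt count. */
    if (read(fd, &data, sizeof data) < 0) { perror("read"); return 1; }
    printf("interrupt type 0x%lx, count since last read: %lu\n",
           data & 0xff, data >> 8);

    ioctl(fd, RTC_UIE_OFF, 0);
    close(fd);
    return 0;
}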
Time-related Interrupt Requests
On a congested system, the IRQ load can affect the
system's performance: several interrupts may pile
up, causing an "IRQ jam". Applications reading /dev/rtc
must therefore check the number of interrupts
accumulated since the last read, as it may be higher
than one. Modern hardware architectures can handle
clock signals at rates of up to 2048 Hz; higher
frequencies, however, might cause IRQ jams. By design,
a non-privileged process may enable interrupts and
signals at a rate of 64 Hz or lower. For higher
frequencies, the process must have root privileges.
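The same device interface exposes the periodic interrupts discussed
above. A sketch, again assuming /dev/rtc is available: it requests the
64 Hz ceiling that applies to non-privileged processes (a higher rate
would require root) and checks the count returned with each read for
piled-up interrupts.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
    int fd = open("/dev/rtc", O_RDONLY);
    if (fd < 0) { perror("open /dev/rtc"); return 1; }

    /* 64 Hz: the highest periodic rate a non-privileged process may set. */
    if (ioctl(fd, RTC_IRQP_SET, 64) < 0) { perror("RTC_IRQP_SET"); return 1; }
    if (ioctl(fd, RTC_PIE_ON, 0) < 0) { perror("RTC_PIE_ON"); return 1; }

    for (int i = 0; i < 5; i++) {
        unsigned long data;
        if (read(fd, &data, sizeof data) < 0) { perror("read"); break; }
        /* More than one interrupt per read means they piled up
           between reads -- an "IRQ jam". */
        printf("interrupts since last read: %lu\n", data >> 8);
    }

    ioctl(fd, RTC_PIE_OFF, 0);
    close(fd);
    return 0;
}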
Synchronization with External Time Sources
Certain systems are synchronized with an external
time-measuring device. Using an external time source is
common practice in hard real-time processing, embedded
systems, and clusters. Synchronizing the kernel with
the Network Time Protocol (NTP) enables Linux to keep
up with very accurate atomic clocks around the world
via the Internet. In such systems, the kernel writes
the time to the CMOS clock every 11 minutes. When doing
so, the kernel disables the RTC's periodic interrupts
for a short time.
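Whether the kernel currently considers itself synchronized (and is
therefore running the 11-minute CMOS update) can be queried with the
adjtimex(2) system call; a minimal sketch:

#include <stdio.h>
#include <sys/timex.h>

int main(void)
{
    struct timex tx = { 0 };            /* modes == 0: query only */
    if (adjtimex(&tx) < 0) { perror("adjtimex"); return 1; }

    /* STA_UNSYNC is clear while an NTP daemon disciplines the clock. */
    printf("clock is %ssynchronized (estimated error: %ld us)\n",
           (tx.status & STA_UNSYNC) ? "not " : "", tx.esterror);
    return 0;
}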
By Danny Kalev