Hybrid Cluster Documentation

Hybrid Clustering
A pair of machines that share process load, data, and services between them,
using only commodity hardware and free software.
Clustering on a Budget!
by Tom Kunz

Definitions
History
Purpose
Credits
Design Philosophy
Hardware Components
Software
Setting Up
Step One: Compute Clustering
Step Two: Data Replication
Step Three: Fail-Over
TKCluster Configuration
Things Left To Do
Conclusion
About the Author

Definitions

A few definitions are in order before starting a discussion on clusters:

Cluster

A group of one or more machines connected to one another and sharing one or more services. The two most popular types of clusters are generally known as computational clusters, or high-performance computing (HPC) where processes and/or specific computations are split up and shared among many machines, and high-availability (HA) clusters where access to a certain service or group of services is shared so that if any single machine in the cluster is taken offline for any reason, it does not affect the ability of users to use that service. Computational clusters are also sometimes called "Beowulf clusters". Early Beowulf clusters used tools like rsh to execute parts of a parallel computation on various nodes. The problem with this was that no single node knew what was happening on the other nodes, or if other nodes were heavily loaded or not. MOSIX is a much more modern computational clustering tool allowing multiple processes to be moved to other systems in the cluster, and keeps track of the loads on various nodes.

The cluster described in this document is designed to be a synthesis of both cluster types. Both MOSIX HPC clustering and data-replicating cluster types are merged in this cluster design.

(FOOTNOTE: In the full spectrum of HPC clustering, just having two machines connected for sharing process load is no great achievement. However, for small workgroup or small business loads, having double the computing horsepower available is a side benefit. If you are really looking for true HPC clustering in the mega- or tera-FLOP range, the MOSIX clustering described here can be expanded to hundreds or thousands of nodes.)

Service Clustering

Service clustering is a very specific type of cluster which refers to using multiple machines to give the illusion of "high availability" for a particular server. One of the most popular illustrations of this is a webserver cluster. A typical way to set this up is to configure up a number of machines identically, all running a local copy of the webserver, and then using a "trick" with a specialized DNS server or router which, essentially, round-robbins the connections from the outside world into a different machine (or the least-loaded machine) on each access from the outside. A machine that dies or is taken offline is noted by the DNS server or router and is ignored until it shows it is ready again. This "trick" works fine when each of the webservers themselves are accessing files and databases on a shared "backend" server, however the question then arises, "Well, what kind of cluster is that database engine living on?" The answer is that the "backend" server is either not a cluster at all (ie, if it fails, the whole web cluster fails), or relies upon some expensive, proprietary hardware and/or software. In some respects, service clustering almost doesn't really qualify as a true cluster because it still has a single point of failure, the "backend" server. But it certainly looks good for Linux distribution vendors to advertise "Linux cluster" products, without addressing the real issue of data replication and fail-over. This marketing technique was very popular in the later 1990's among certain Linux distribution vendors seeking to make some quick bucks on the buzzword.

Device-Level Replication / Filesystem-Level Replication

Device-level refers to data replication occurring on a block-device. Block writes occur on, for example, /dev/hda1, a block device. This block device need not contain any specific data structure or be formatted in any way. In contrast to this is filesystem-level where files on a formatted filesystem are being read and written, for example, to the /opt filesystem. Device-level replication may be used for duplicating the data from something like an Oracle or Informix database partition. Oracle, Informix, and other database engines sometimes boast performance advantages over what they would be if used on formatted filesystems. In recent years, it seems this belief has been challenged, with filesystem-based engines giving comparable performance to raw-device based engines.

Filesystem-level replication means that the two replicated filesystems are probably not identical at the block-level, however they contain the same files. Periodic synchronization via rsync, which walks the directory structure tree looking for changed file contents, is the most common way to achieve filesystem-level replication. However, this is an extremely inefficient way to attempt clustering, and is fraught with peril. Using rsync does not work well with large files or files with contents that change frequently. Also, should rsync be interrupted for any reason, the file it was duplicating will be left as a "stray" on the system (ie, the next invocation of rsync won't realize that the temp file the previous invocation created is actually an rsync file, and will duplicate it over again). Rsync uses temp file during duplication, and so an interruption of rsync means a duplicate of a duplicate is left behind. Also, because rsync must was the entire filesystem tree specified looking for changes, it's highly inefficient and ends up looking for changes it need not worry about. It works well for duplicating data over slow links, such as the internet, and with long intervals between refreshes, such as once a day. Rsync can be relied upon to synchronize online archives and mirrors periodically, but it doesn't seem suited to clustering. If it's set up to work as a cluster with long delays between rsync invocations, data is lost if failure occurs between rsyncs. Also, if rsync fails on large files (ie CD-ROM images) the failure leaves behind large temporary files which need to be cleaned away by hand after recovery. If the delay between rsyncs is shortened to be nearly-continuous, the machine is wasting most of its time examining files which need no examination. When large files are written to disk on the master, continuous rsyncs separated with short delays will cause the disk to thrash heavily until the file is written on the master. If someone has found a way to use rsync more efficiently, please let me know.

There were a small number of proprietary vendors in the mid to late 1990's who advertised commercial systems which performed either device-level or filesystem-level replication. Two such products were SavWareHA (formerly known as Sentinel) for SCO UNIX and Legato (formerly Vinca) Co-Standby Server for Windows NT.

History

In ages past, there have been several attempts to build cluster systems, of varying degrees of success and failure. Solutions from vendors like DEC, IBM and Sun relied upon specialized, proprietary, expensive hardware and software. Within the Linux community, there has always been a lack of almost any kind of true cluster, although the earliest successes in Linux clustering was specifically the early MPI / PVM computational clusters and in the realm of webserver clusters. Unfortunately, both MPI/PVM style clusters and webserver clusters are "niche" products which do not really fully address all aspects of clustering (ie fail-over and data replication).

The expensive route to go with clustering and shared-storage and/or data replication has been in the form of a pair of machines with an external array of disks between them. To do this, the servers must be in the same proximity (within a few meters of each other, typically in the same rack or neighboring racks), and each server must have redundant connections to the external storage array. This usually meant using either Fibre Channel or SCSI controllers, at least two per machine, both of which plugged directly into the external disk array. The external array must have at least two available, independent connections into it per server in the cluster. Shared SCSI and Fibre Channel solutions using this technique are not uncommon, but also not cheap. A minimal configuration would consist of two servers, two SCSI RAID controllers in each for a total of 4 controllers, an external JBOD case with 2 electrically-independent SCSI buses and at least 2 disks in it. This is very specialized hardware. Usually this kind of cluster is sold by high-margin server vendors such as IBM, HP, or Sun. The following graphic illustrates the typical layout:

Shared SCSI Cluster Image

These types of clusters are truly effective, they work splendidly. But only if you have the cash to set them up in the first place. A 5-figure pricetag would be typical for an entry-level system, and a 6-figure pricetag would conceivably be the next step up in the technology. A drawback to the high-availablility was both cost and the fact that the machines had to be located in the same physical area. With high cost comes the questions asked by those who hold the purse strings in the organization: "How much is my data really worth? Can I afford to be out of service for a little while until the server is fixed? Is it worthwhile to spend 6-figures on this server? How do I prioritize which services and data are important enough to get a slice of this server and which is left out in the cold on a single, non-clustered server?". By using shared SCSI, servers had to sit very near each other in order to work, and stepwise degradation in performance and reliability was the cost of moving just a few feet too far. Ethernet, on the other hand, can connect two PCs relatively cheaply, across very long distances without speed or reliability problems. Despite the ubiquity of relatively cheap PC's, Linux, and ethernet hardware, until recently there were no really good data replication clusters that took advantage of the power of commodity hardware.

Purpose

The purpose of this document is to fully describe and enumerate the steps involved in building a hybrid cluster which takes advantage of both computational clustering and device-level, high-availability clustering using nothing more than standard, open-source software and commodity hardware. None of the software used in this costs a thing, and the hardware used in this arrangement should be available for relatively cheap prices from most computer parts stores or online retailers. Basically, this document should allow you to take two commodity PC's running Linux and make them think and act as one highly reliable storage cluster that can handle a pretty hefty user load. Also, because ethernet is being used as the interconnect instead of SCSI, servers can be physically isolated from each other in the event of catastrophe, and placed hundreds of yards apart, or even further with the proper switch or repeater in place.

Credits

This cluster would not be even remotely possible without the efforts of these people:

Philipp Reisner: author of DRBD, the soul of the data-replication portion of the cluster. Many thanks for his work on DRBD.
The MOSIX Team: Headed up by Professor Amnon Barak at the Hebrew University in Jerusalem, Israel, the MOSIX project gives ordinary people and businesses the ability to build their own supercomputing cluster from commodity PC's at a tiny fraction of the cost of a proprietary system. Thanks for creating such a cool clustering technology!
Peter Braam: Although I don't believe he wrote any of the software directly, he pointed me in the direction of using DRBD. He is THE source for info on real filesystem clustering. He is the author of the Lustre filesystem from Cluster File Systems, Inc. Lustre is still relatively young and deserves much exploration. I heard of it at Linux World in Jan 2003 and attended lectures by Peter Braam and his colleagues, and I would encourage anyone else interested in clustered filesystem to take a look at his work. Lustre powers MAJOR clusters, while my attempts here are to describe a small, low-powered cluster I put together from other people's works. If you need serious enterprise-capable clustering, go take a look at Lustre and see if it is better suited for your work.
Simon Horman: Simon is the author of heartbeat, the package which contains some of the nifty utilities used in this cluster.
The entire Linux kernel community: Without you all, none of this would be possible. Your work is of the utmost importance.

Design Philosophy

In recent years, there has been a paradigm shift from the uage of large, expensive, proprietary systems to clusters of smaller, cheaper, open components. Industries which relied upon large systems, such as HP N-class, IBM z-Series, or Sun Enterprise servers are moving from the concept of single, large systems (often the size of a household refrigerator, or group of refrigerators) that handle the workloads of numerous processes, to the concept of multi-node clusters, say 2 nodes to 128 or even more, to share workload and storage.

This type of shift from big and proprietary to small and open is not new. Since the days of Thomas Edison, electrical and electronic devices have gotten smaller, cheaper, faster, and better. The shift in the 1980's was from mainframe IBM machines to smaller, faster DEC VAX machines. The shift in the 1990's was from DEC VAX to UNIX-based servers and workstations. Today, the proprietary UNIXes are giving way to free Linux distributions, which draw from the best parts of the UNIX tradition. All previous server-quality systems had high price tags associated with them, and could maintain that level of pricing because only larger companies required the help of computers to transact business. But now, computing is essential for everyone from the corporate megalith down to the mom-and-pop-shop corner business. Today, Linux has emerged as the server platform of choice for many businesses of all sizes, largely due to two fundamental properties of Linux: price and stability.

However, no matter how robust the software and operating system, the Second Law of Thermodynamics takes its toll on all Creation: everything wears out and breaks down, no exceptions. Motherboards, hard drives, CPUs, memory chips, everything has a lifetime. In this article, I discuss the specifics of a cluster design which minimizes the reliance on any single point of a system, and allows users on a network to realize uninterrupted service for a variety of applications including filesharing, web serving, application hosting, network firewalls, database connectivity, and almost any other service.

Hardware Components

The hardware components used in this type of cluster are generally cheap and easily available. The minimal hardware required to setup a usable demonstration of this cluster is as follows:

A pair of identical servers.
The level of "identicalness" is one only of convenience for the administrator. I find that using exactly identical hardware, down to board revision level and firmware level, makes life much easier when administrating the servers. The only thing to keep in mind is that because you are duplicating data between partitions on two different machines, special care must be taken to ensure that both partitions are the same size, down to the exact number of blocks. If they are not, it is possible to force the cluster to use the smaller-size of the two partitions as the "actual" size. The bottom line is, yes it is possible to have wildly disparate machines as the cluster with only a few points of commonality, but I strongly advise you to just bite the bullet and get identical machines with identical hardware. The best way to do this is to build them yourself, as you can personally inspect the hardware to ensure same make, model number, board revision, and firmware are the same. If you are leery about doing this yourself, I offer reasonably priced clusters which are configured specifically for this purpose. As this is not necessarily an ad for my business, you may also contact me for more info if desired.

For this article, I will suggest that you have a physically separate disk or disk array for data sharing. You can do data sharing on the same disk as your system disk, but for the sake of this article, lets say /dev/sda is your boot and system disk and /dev/sdb is entirely devoted to data sharing between nodes of the cluster. In your particular installation, you can use whatever you want. I've even done clusters where the data lived on software RAID partitions in an external JBOD rack shelf. What exactly you do with the shared parition(s) will be determined by your budget and performance needs.
Network cards, cabling, and a switch or crossover cable to connect the two servers.
I recommend gigabit ethernet for this, although any network connection could, technically, work. The necessity of gigabit or not depends upon your usage and needs. Most modern servers and workstations come with either 10/100 or gigabit already built into the motherboard. If you want to use gigabit and don't already have it, the Intel MT 1000 cards can be picked up for somewhere in the neighborhood of $60, and they work well in either 33 or 66 MHz PCI slots. Unless you're on the strictest of budgets, $60 for a gigabit card is a wise purchase. Intel also makes more expensive 66MHz PCI versions of the same gigabit card in the $100-$120 range. You could even try out cheaper Netgear or similar gigabit cards, which are down in the $40-$50 price range. I have generally obtained best results with the roughly-$60 Intel Pro 1000 MT card, using Intel's own open-source drivers for it. The drivers in the canonical Linux kernel are older version of the same as the ones Intel has on their website, and drivers downloaded directly from Intel will be newer. Last I checked, Intel had frequent revisions and development on their gigabit driver, so you will probably get better performance and reliability from those drivers.

The purpose of the card is to transport the data from the master to the slave during data replication and synchronization, as well as to shuffle process load to be shared between the machines. Something to keep in mind is the speed of each related component in the system. The speed of data replication and process sharing is a function of how well-matched all the components of the entire system are. While it's possible to build a reasonably effective cluster from PC's you pick up from Wal-Mart or Dell for $499, you won't be reaching the full potential of what could be done with the equipment available for not much more cost.

The data replication process between the nodes of the cluster exercises most of the system other than the CPU and RAM. It's a good way to find out, really, what kind of system you have. Manufacturers love to advertise the speed of the CPU and the RAM, but these are the easiest things to boost the speed on, and generally don't participate very much in the data replication process. The more important numbers are the bus and disk speeds, which are frequently not advertised, and usually quite misunderstood. I will outline the roles of each here:
- CPU: The speed of the CPU will really only participate in the data replication process if software RAID is being used. In general, I recommend hardware RAID, although I have made serveral systems with software RAID that had acceptable performance. Modern CPU speeds are so vastly high in comparison to the rest of the computer that it spends little time doing real work. Even when given full IO loads, most modern CPUs are really doing almost nothing of what they *could* be doing.
- RAM: The speed of the RAM is also vastly faster than anything the disk subsystem or IO bus can produce. Like the CPU speed, RAM speed usually makes little to no difference to the rate of data replication.
- Bus speed: This is an important factor to the overall performance of the data replication. If within budgetary limits, a faster PCI bus should be used to maximize the capabilities of your cluster. This will affect the efficiency of the gigabit card as well as the disk subsystem.
- Disk and Controller speed: This is probably the most important piece of the equation. The bus is faster than the disk speed, but not by the same size margin as the CPU & RAM are. Maximal throughput across the bus can only be realized with a fast disk controller and comparably fast disks, arranged in a RAID for maximal parallelization of IO. Some will automatically assume this means SCSI RAID, but this is not necessarily true. Given an infinite budget, perhaps SCSI RAID isn't a bad idea, but most of us have budgetary limits. Entry-level ATA RAID and SATA RAID cards such as the 3Ware, Promise, and LSI cards are incredibly cheap by comparison to most SCSI RAID cards, and can provide performance in the same range as the expensive SCSI RAID cards.
A note about disk performance: no matter what kind of disk you buy, whether ATA, SATA, or SCSI, the sustained throughput of almost any disk on the market today is limited to about 65 MB/s. "What?" you say, "I just spent a mint on a 320MB/s SCSI RAID controller for each cluster node!". Yup, that's right. That 320MB/s only applies to the data transfer from the buffer on the controller card to the cache on the disk itself. The actual transfer of data from the cache of the disk to its physical platters is the limiting factor. To my knowledge, and someone please correct me if I'm wrong, but after reading the detailed spec sheets on numerous Western Digital, Seagate, and Hitachi/IBM drives, I believe the highest sustained throughput rate was on the order of about 65 MB/s. The 320MB/s SCSI controller is advertising only the burst rate that is possible until the cache of the drive fills up. Once cache memory on the drive is full (which is usually 2MB, 8MB or sometimes even 16MB on the best drives) the data can only be processed at the throughput speed of the cache-to-platter speed. However, don't fret. Using the proper RAID configuration can basically remove this limitation. On many controllers, using RAID 0, 1, or 0+1 will provide faster performance than using RAID 5 because a RAID 5 configuration requires on-the-fly checksumming routines to be processed by the RAID controller's CPU. Software RAID-5 calculations are done in the MMX high-speed registers of Pentium-class CPU's. RAID 0, 1, or 0+1 allows for parallelizing of read and write operations between multiple disks. Disk blocks can be handled in parallel between each disk, which multiplies that 65MB/s limitation. Data is still flowing in and out of each disk at 65MB/s, but if you have several disks striped together, that 65MB/s can be doubled. At about 133MB/s, bus speed becomes a factor for regular PCI. For the purpose of this paper, we'll just assume you have a typical PCI bus, and nothing overly "special".

It should be no surprise that if you want to push the limits on speed, just know that ultimately, the specifics of how fast you'll really be able to go with the data replication within the cluster will depend on the entire assembly of gigabit card, disk controller card, RAID layout, PCI bus, and disks you choose. If budget limits are stopping you from getting the very best and fastest in hardware, do not worry. You can still have quite a nice cluster without tremendous expense using basically commodity PC parts you should find at almost any retailer or online parts store. If you're just building this for fun or for a small business cluster, having even just 30MB/sec of replication throughput is still respectable. You couldn't even get that level of redundancy and performance from IBM, Sun, or HP without spending about 5 digits, so do not be overly concerned if your cluster works but isn't all that fast.

Software

If you're on a limited budget, you're in luck. Not a single byte of the software in this cluster costs anything. At least not for just using it: if you want support from someone on a particular piece, you have to pay, but otherwise it's all GPL'd software. MOSIX's license isn't 100% GPL. If you are a GPL zealot, you will want to check out openMosix and compare it. It has some neat features to it, but this cluster is using regular MOSIX. I'm sure it could be done with openMosix as well, but I haven't had the time to do so.

This cluster is not specific to a particular distribution. I have tested it on Red Hat 7.3 and Red Hat Enterprise Server 3.0, however it should work on virtually any Linux distribution. My cluster manager is all written in Perl, and I had to tweak some of the calls for forking when switching from RH 7.3's Perl 5.6 to RH ES's Perl 5.8. I'm sure a much more skilled programmer could generalize the calls in my cluster manager to work right regardless of Perl flavor. As of this writing, it works well with Perl 5.8. You will also need to do some kernel configuration and recompilation, so if you're squeamish about doing that, this is the perfect opportunity to learn it and get comfortable with it.

Here is the list of packages you will want to download:

MOSIX: As of this writing, the latest version of MOSIX is 1.10.1, for kernel 2.4.22. You may wish to download the MOSIX RPM file, however I will cover this installation from the perspective of someone wishing to do all the compilation themselves. RPM files may or may not be available for your particular distribution, so compiling from source gives you the opportunity to build it and have it work regardless of the distribution you are using. MOSIX is the part of the cluster which does the process-load sharing, as well as being an integral part of maintaining contact with the other machine(s) in the cluster and keeping track of which is up and which is down.
Linux Kernel: You will need to make sure the version of MOSIX you downloaded above matches the version of the kernel. The MOSIX patches are specific to kernel version, so make sure you get the right one. MOSIX patches are always a few versions behind the latest kernel version, so make sure you get the right one.
DRBD: DRBD is the piece of the cluster which will be doing the data replication. As of this writing, the most current version of DRBD is 0.6.10. I have built clusters with 0.6.5, 0.6.6, and 0.6.10, and DRBD just keeps getting better. DRBD is a block device driver which sits at a layer above the usual block devices. For example, a typical disk layout on an IDE-based machine will have /dev/hda as the main system disk. Partitions will be /dev/hda1, /dev/hda2, /dev/hda5, etc. Normally, writes would occur to those partitions directly in a non-clustered system. In a cluster with a shared partition, however, the block device is the network block device, /dev/nb0, which is mapped to "lower" block device. In drbd, you tell DRBD to map one of its devices, /dev/nb0, to a physical device, say /dev/hdb1, as well as to a remote network address. When writes occur on the master node to /dev/nb0, DRBD writes the data to the local, physical device /dev/hdb1, and copies the data to the specified remote network connection, where the slave node writes it to its own local device. DRBD is elegant and well-maintained, and commercial support contracts are available for reasonable rates from Linbit, if desired.
Heartbeat: Although I'm not specifically using the heartbeat software in this cluster, heartbeat has some nice utilities in it for IP address takeover.
TKCluster: This is my own cluster manager. It's a pile of Perl scripts which talk to each other within the cluster. On the master node, it makes sure that DRBD maintains connectivity with the slave. If the slave dies and the master node goes into a StandAlone mode, it instructs DRBD to go back into Primary mode and wait for connection from the slave. It also attempts to force a reboot of the master node should the slave node detect failure of the master, and resets DRBD to push the slave into master mode, and restarts necessary services. Basically, it glues all the other above packages into a more unified system.

Setting Up

At this point, you should have a pair of servers with all the above software downloaded. I will write the rest of this assuming you have basically identical hardware. More experienced Linux administrators can play around with having disparate hardware, and it's quite possible to do that, but I won't go into all the details of maintaining different drivers and platforms.

I will assume you have /dev/sda as your boot and system drive, /dev/sdb1 is the partition that will be replicated between the servers, eth0 will be your connection to your 10/100 LAN that the rest of the office lives on, and eth1 will be your gigabit connection via crossover cable between the two servers. Your LAN on eth0 will be 192.168.0.* with a 24-bit netmask of 255.255.255.0. The gigabit connection will be 192.168.1.*, also with a 24-bit netmask. Lets make it easy to remember and have server1 set up with 192.168.0.1 on eth0 and 192.168.1.1 on eth1. Likewise, server2 should be brought up with 192.168.0.2 and 192.168.1.2.

The first thing to do is bring up your favorite Linux distribution on what I will call "server1". Make sure you install the development tools for your distribution, including header files for kernel development. We will setup server1 to be the master node of the cluster, and the partition it replicates to the slave, server2, will be used for NFS and/or Samba filesharing. Any number of other types of services, including web, email, DNS, Postgres, MySQL, or any number of other server, could be set up on this partition as well. Also, DRBD comes ready for sharing 2 devices when you compile it, however this can be expanded to as many as 256 devices by changing the value of minor_count in drbd/drbd_main.c of the DRBD source package. I have used DRBD for sharing up to 16 devices between servers, and this small change to drbd_main.c is easily done.

Step One: Compute-Clustering

The very first thing to do is install the kernel source and if you're not familiar with the process of compiling it, take some time and compile yourself up a kernel. If you are new to this, don't be afraid to try it on your own. There's an abundant supply of documentation on how to do this all over the web, and the details you'll need for your particular server are outside the scope of this document. Currently, you would want the file linux-2.4.22.tar.gz. Download this to your home directory and untar it in /usr/src and start up the configuration script:


# cd /usr/src

# tar xvfz ~/linux-2.4.22.tar.gz

... <tar output here>...

# cd linux-2.4.22

# make menuconfig

If you have X installed and running, you can replace "make menuconfig" with "make xconfig" above. If you have neither X nor a modern terminal and are stranded on a terminal which can't do cursor positioning with the NCurses calls found in "menuconfig", you can always fall back to "make config", which is just line-by-line configuration of every option in the kernel. I've done it, it's painful, you'd much rather configure your kernel with menuconfig or xconfig.

My only "tips" for this are few. Make sure you get the canonical kernel source and not just the distribution-packaged source. Although distributions carry the full kernel source packaged up with them, I tend to not trust those packages for use with MOSIX. Not for any other reason than that the distribution-packaged kernels tend to have extra, non-canonical source added in. This is good for the vendor and good for the end-user because sometimes hardware which would otherwise have been unrecognizable will work with those extra patches applied. The vendor of the distribution has something to boast over, and the end user gets to use their new gizmo, which would have been unusable if plain canonical source had been used. Usually, the patches for that hardware will make it into the canonical kernel eventually. The problem is that MOSIX patches may not apply properly to kernel sources that have non-canonical patches already applied. Once you have successfully compiled and booted your own kernel a few times with different options and have familiarized yourself with how your bootloader (probably either GRUB or LILO) works, you're probably ready to go further. You will probably want to install the source under /usr/src/, as this is where MOSIX looks for source by default. Lastly, make sure the the option for "Network Block Device" is set to be a module. You do not want it to be compiled into the kernel or the DRBD module will not be able to acquire the /dev/nb* logical devices.

For individuals who are very new to kernel compilation, one pitfall that catches many unaware is the initrd image. When you compile your kernel to have lots of modules in it, you'll need to list them in /etc/modules.conf. Especially, modules critical to booting (ie disk controllers) will have to go into there. When your new kernel boots, it will expect a matching initrd image, which is a boot-time RAM disk that is loaded with the drivers you need to boot up. This can be somewhat complex, so I won't go into the long and detailed description here, but just beware that this is a potential pitfall to kernel-compilation-newbies. In fact, I've been compiling kernels myself since about 1994 or 1995 and I still forget about the initrd sometimes, only to find my kernel demanding drivers it can't find. Alternatively to using modules, which are flexible by being able to insert and remove them at will, you could also just not use modules at all and build everything into the kernel itself so you don't needmodules. There is technically nothing really wrong with this, although this is less flexible, especially if you ever have to add or change hardware, so I always recommend going the route of using modules.

The next thing to do is to install MOSIX. Make sure you download both the kernel patches and the userland utilities, because as of this writing they are packaged separately. Currently, you would download MOSKRN-1.10.1.tar.gz and MOSIX-1.10.1.tar.gz. Untar these and run the installation script. Assuming you have the source tarballs in your home directory, you will do something like this:

# tar xvfz ~/MOSKRN-1.10.1.tar.gz ... <tar output here>... # tar xvfz ~/MOSIX-1.10.1.tar.gz ... <tar output here>... # cd MOSIX-1.10.1 # ./install.mosix

Then answer the questions from the script. These are very simple questions, like where the kernel source lives and whether you want to watch the compilation happen or not. If you're unsure when it asks you about the "MFS mount point", just say yes. It will take care of the details for you.

MOSIX's install script will ask for your desired kernel config type and bring it up. The only thing you'll need to worry about there is to double-check to make sure that MOSIX process migration is turned on. This allows processes running on one machine to transparently migrate over to the other machine if the load on the first machine, or home node, becomes too high. This is one of the coolest features of MOSIX. No applications have to be recompiled, the user will never see a difference in the execution or debugging of a process. It is 100% transparent to the user.

The /proc pseudo-filesystem is integral to MOSIX, so you'll also want to make sure that is turned on. When the kernel compilation tool returns to the install script, the script checks to make sure that the critical pieces of MOSIX are turned on, and will ask to make sure if it finds something amiss.

If you've already been successful at compiling your own kernel, you should feel confident once the compilation completes that MOSIX will work for you.

Once the compile completes, my natural temptation is to immediately bounce the server and try to boot that new kernel. However, in this case, I would encourage some file editing before rebooting. MOSIX uses /etc/hosts and /etc/mosix.map to figure out who it is in the cluster and where others in the cluster live. In our example, we want to make sure we have the following:

In /etc/hosts:


127.0.0.1        localhost localhost.localdomain

192.168.0.1      server1 server1.mydomain

192.168.1.1      server1p

192.168.0.2      server2 server2.mydomain

192.168.1.2      server2p

In /etc/mosix.map


1    192.168.1.1   2

The change that might surprise you in /etc/hosts is 192.168.1.1 being mapped to "server1p". The "p" on the end stands for "private". Same for 192.168.1.2 pointing to "server2p". Each of the two servers will talk amongst themselves as "server1p" and "server2p", while talking to the rest of the office network as "server1" and "server2". If you have a DNS server in your office, make sure these changes go in.

The format of /etc/mosix.map is: the first field is a unique MOSIX number, the second field is the starting address, and the third field is the number of nodes. In the above /etc/mosix.map, this means "I belong to a MOSIX cluster which starts numbering at MOSIX id # 1, starting with 192.168.1.1, and containing 2 nodes." The same /etc/mosix.map file will be used on both nodes, and when MOSIX looks up the address in /etc/hosts of the machine it is running on, it automatically recognizes which MOSIX ID number it is. When MOSIX runs on the second node with the same /etc/mosix.map, it will see that it has address 192.168.1.2, and must therefore be MOSIX ID #2. Quite elegant and effective. Also, if you're going to do something complicated with multiple MOSIX nodes on different subnets, that's possible just by adding another line to /etc/mosix.map. And, if you choose to have a multi-homed MOSIX node where MOSIX is allowed to talk out both interfaces, you can also use the ALIAS keyword in /etc/mosix.map to denote an alternative address for a node. Details on this are outside the scope of this document, but do keep in mind it's possible to accomplish.

At this point you should have an updated kernel, modified /etc/hosts, and a new /etc/mosix.map. Again, the temptation is to boot it and see how it works. But not just yet...

DRBD should be downloaded and installed next. It is kernel-specific, and must be compiled for a given kernel. This will not participate in the computational cluster, but because we already have a kernel built (and it probably works) we might just want to get DRBD built and out of the way now.

The current DRBD file is drbd-0.6.10.tar.gz. Untarring and building it will go something like this:


# tar xvfz ~/drbd-0.6.10.tar.gz

... <tar output here>...

# cd drbd-0.6.10

# make

... <make output here>...

# make install

... <make output here>...

The last step of "make install" should place a copy of drbd.o into your module directory for this kernel. DRBD is surprisingly small, so the build won't take very long. Other utilities for using DRBD will be installed under /usr/bin and/or /usr/sbin.

Now, double-check and make sure you have ALL the previouly-listed packages for this cluster downloaded and on your system disk. If so, go ahead and do a reboot and bring up your new kernel!

If all went as planned, you now have a brand new kernel running, and MOSIX is listening and waiting for the other node to come up. You can see what MOSIX is thinking by doing a "mon" command. The output will be a screen that shows the configured nodes across the bottom, and the loads on each. In this case, you will see numbers 1 and 2 briefly appear on the bottom of the screen, and then when it discovers that 2 is not yet up, 2 will disappear leaving only 1. If you open up another window and do something that is computationally-intensive or IO-intensive, and then watch the "mon" window, you should see a bar rising above the "1" to indicate load.

If you've gotten this far, you're but minutes away from having a working compute cluster. Only minutes? Of course! All you have to do is duplicate the work you've already done on server1 using "dd" onto the other server's disk, and then you'll be ready!

Of course, if you haven't gotten this far successfully, go back and check to make sure you have booted the right kernel, make sure you MOSIX kernel build worked properly without errors, and just overall check and make sure everything succeeded the way it was supposed to. If you're a kernel-compilation-newbie, make sure you have all the drivers for all your hardware configured in properly. Check out the websites, newsgroups, and mailing lists for the kernel, MOSIX, and DRBD to see if you have done things right and if maybe there's a bug in one of the packages. When you've booted MOSIX successfully and you can run "mon" successfully and see the load, you're ready to move on.

Now, back to duplicating that system disk. It's quite simple. Bring down server1 and pop in the disk you would have used as the system disk for server2. If this is a SCSI machine, make sure you change the SCSI ID to point to an unused ID. If this is an IDE machine, make sure you have set the master/slave jumpers appropriately. If you are using a RAID array as your system disk, you may or may not be able to attach your group of disks to the controller as a secondary logical group. That depends entirely upon the controller.

If you are using SCSI, IDE, or a compatible RAID controller that can handle it, boot up server1 with the system disk(s) from server2 attached as a secondary. Login as root. In this example case, I am using /dev/sda as the system disk, /dev/sdb as the shared disk, and so I attach the system disk from server2 as /dev/sdc. (I could have also yanked /dev/sdb and replaced it with server2's system disk, you could do that if you are short on space, power connectors, or whatever.) To duplicate server1's system disk (/dev/sda) over to server2's system disk (/dev/sdc), I do this at the prompt:


# dd if=/dev/sda of=/dev/sdc

...and then wait. If you're not on a RAID 0, 1, or 0+1 system, keep in mind that maximum throughput rate of about 65MB/second while you wait.

Please note that "dd" is a powerful tool. If you mix up the "if" and "of" designations ("if" stands for "input file" and "of" stands for "output file"), you WILL destroy all your hard work of installation thus far. Read the man/info page for more details about "dd", but just make sure you don't mix things up. It's so heartbreaking to issue a "dd" on something you worked so hard on, only watch it evaporate into oblivion when you thought it was being duplicated. Trust me, I know.

Just for giggles, you might want to put the command "time" before "dd", such as "time dd if=...". This will give you an idea of the speed of the disk's raw throughput if you don't already know. Likewise, the "hdparm" tool is a fun item that can give you performance numbers as well. Try "hdparm -Tt /dev/sda" on an idle machine and see what you get. The numbers may very well surprise you. The man/info page for hdparm has more details.

You might be able to improve the performance of "dd" a bit by passing it the "bs" parameter, which specifies block size. By specifying a larger block size, you can optimize the read and write operations. The "bs" parameter also takes a suffix like "k" or "M" to give multiples of kilo or megabytes. If you're on a system with say 512M, you could give append "bs=256M" on the end of the dd command and get better performance than without it.

When the dd is done, bring the machine down, put the disk back into server2, and boot up server2. Leave server1 down for the moment. If all went well with the dd, server2 will come up thinking it is server1. All you have to do is change it's IP address and hostname and you should be set. Each distribution can be different in how it sets hostname and IP, so you'll need to figure that out for your particular distribution, but I can tell you that Red Hat systems will want to have stuff found in /etc/sysconfig modified. On Red Hat 7.3, 8.0, and 9, you will want to modify the following files:


/etc/sysconfig/network-scripts/ifcfg-eth0

/etc/sysconfig/network-scripts/ifcfg-eth1

/etc/sysconfig/network

Again, YMMV from one distribution to another. Once finished, a reboot of server2 should bring it up with eth0 as 192.168.0.2, eth1 as 192.168.1.2, and hostname "server2". If this is true, boot up server1 and we'll see our first "real" demo of MOSIX in action.

Over on server1, login and start up "mon". You should now have both numbers "1" and "2" showing on the bottom of the screen, meaning that MOSIX sees both machines as being members of the cluster. Leave "mon" running here while you go over to server2 and login as root. Do something like a recursive directory listing of the whole system, "cd / ; ls -aglR". Watch the "mon" screen on server1, and you'll see the load come up on server2. This verifies that server1 knows about what server2 is doing. While this is running, open another terminal on server2 and issue a similar command, only this time force it to execute on server1 with the "runon" command:


# cd /  ; runon 1 ls -aglR

You'll see that even though you are typing in the commands on server2, the "runon" command forced server1 to start executing the recursive directory listing, using the contents of server2's hard drive. You should see the activity light on the gigabit card stay on solid, or at least blinking wildly fast. Now, directory listings are not all that interesting, however if you had a computationally-intensive process running, MOSIX would automatically recognize this and internally do a "runon" and shuffle that process over to the other machine transparently. I do not have a decent example to give you to do this, but with a little imagination I'm sure you can come up with something. My first few tests of MOSIX and my demonstration of it for a company I worked for when I built my first MOSIX cluster in 1998 or so involved using it to call a "cracklib"-like homebrewed program for checking crackability of passwords on our servers. The crypt() function is purely mathemtical with no IO, you might try using a program to call crypt() continuously and watch the execution migrate as a purely computational migration. It's really neat to watch.

Note that MOSIX also gives you some performance-tuning tools. Using these, you can tune MOSIX to be responsive to your specific speed of network card. In this case, we have just a plain gigabit connection, which although it's fast for ethernet, isn't really all that fast compared with proprietary networking technologies like Myrinet. Myrinet cards start off at sustained throughputs of about 250MB/sec, with much lower latency than anything in the ethernet world, and go up from there. Likewise, the price of Myrinet reflects this fact. If you're really interested, MOSIX allows you to fine-tune your installation to the cards you have whether you have plain 10/100 or Myrinet, but because this installation is using $60 gigabit cards and not $1100 Myrinet cards, you won't need to worry about it until someone hands you a budget that can accomodate those kinds of expenses.

If all you wanted was to brag to your friends or put something on your resume that says "Built Linux Cluster", you could quit here and legitimately call it a success. You did it. What you have is, truly, a computational cluster. You could buy up a zillion other PC's just like these, lay down a copy of the kernel you just built in each one, and use them to predict the weather, calculate prime numbers, find the next digit of Pi, or play chess. In fact, the only change you'd have to make is to change the /etc/mosix.map to say it has some "X" nodes starting at 192.168.1.1 instead of just 2, set the IP address on each one, and away you go. You now have the very embryo of a major supercomputer, all thanks to the fine gentlemen in Jerusalem and their brilliance and hard work. It measures "cool" on the geekometer. But not quite way cool until you tap into doing more with it than just being a rocket-fast Quake engine. Next stop: high-availability data!

Step Two: Data Replication

If you have not already done so in the setup of the machines, fdisk and format the device you want to share between them. In this case, that is /dev/sdb. You only need to do this once, on server1 the partition /dev/sdb1 will be mastered onto server2 by DRBD. Neat, eh? All you have to do is prepare one disk, and the other will be handled automatically. This is, of course, assuming identical disks. The most critical part of the "identicalness" of the disks is the number of blocks. You can mix disparate disks together, but for the purpose of thie article, everything is fully identical between server1 and server2.

To partition and format /dev/sdb, run "fdisk /dev/sdb". Below is the dialog between me and fdisk, with my input shown in bold italics. The disk being worked on is a 9G SCSI.


# fdisk /dev/sdb



The number of cylinders for this disk is set to 1106.

There is nothing wrong with that, but this is larger than 1024,

and could in certain setups cause problems with:

1) software that runs at boot time (e.g., old versions of LILO)

2) booting and partitioning software from other OSs

   (e.g., DOS FDISK, OS/2 FDISK)



Command (m for help): p



Disk /dev/sdb: 9100 MB, 9100044288 bytes

255 heads, 63 sectors/track, 1106 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes



   Device Boot    Start       End    Blocks   Id  System



Command (m for help): n

Command action

   e   extended

   p   primary partition (1-4)

p

Partition number (1-4): 1

First cylinder (1-1106, default 1):   1  

Last cylinder or +size or +sizeM or +sizeK (1-1106, default 1106): 1106



Command (m for help): p



Disk /dev/sdb: 9100 MB, 9100044288 bytes

255 heads, 63 sectors/track, 1106 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes



   Device Boot    Start       End    Blocks   Id  System

/dev/sdb1             1      1106   8883913+  83  Linux



Command (m for help): w

The partition table has been altered!



Calling ioctl() to re-read partition table.

Syncing disks.

Once you have done this, you're ready to format the disk. For most general purpose usage, the ext3 filesystem is ideal, however you may choose any filesystem that meets your needs. ReiserFS is ideally suited to large mail and news servers where there are many thousands of files in a single directory. XFS and JFS are also good choices as journalling filesystems. Unlike other operating systems which may only support just one type of journalling filesystem, Linux supports several journalling filesystems to choose from. In this case, we will format /dev/sdb1 with ext3. Use the following command:


# mke2fs -j /dev/sdb1

The -j parameter means to turn on journalling. The ext3 fliesystem is backward compatible with ext2, hence the need to distinguish between plain, unjournalled ext2 and journalled ext3 with the same executable to format either.

When this completes, you will have a partition that could be mounted and files placed on it. I would recommend that you do so, and place some files on there so that you can better see the replication occur. Once you have copied some files onto there, unmount the partition. It will be remounted under DRBD later.

Now, the next step will be to verify that DRBD is installed and working correctly. To do this, execute:


# modprobe drbd

This should complete without any output or errors. Then, to see what modules are installed, execute:


# lsmod

In the output, you should see "drbd" in the list of installed modules. You can see what DRBD is thinking by inspecting the contents of /proc/drbd. If you're not familiar with /proc, it's a pseudo-filesystem (not really files on the local disk) that is a sort of viewport into the running kernel. You can treat entries under /proc as if they were real files and directories, and use "cat /proc/drbd" to print out what the DRBD driver is currently thinking. If you have modified the file drbd_main.c in the DRBD package and modified minor_count with a larger number of devices, you will get more output, but otherwise you will get just these two devices:


#  cat /proc/drbd

version: 0.6.10 (api:64/proto:62)



0: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0

1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0

Now, we have to tell DRBD that you wish to use the device /dev/sdb1 on server1 as the source drive, and /dev/sdb1 on server2 as the destination drive. The way we do this is to with the "drbdsetup" utility that comes with drbd. It will install as /usr/sbin/drbdsetup when you install it with "make install".

You may see all the options "drbdsetup" allows by simply executing drbdsetup without any arguments:


# drbdsetup 

USAGE:

 drbdsetup device command [ command_args ] [ comand_options ]

Commands:

  primary [-h|--human]

  secondary

  secondary_remote

  wait_sync [-t|--time val]

  wait_connect [-t|--time val]

  replicate

  syncer --min val --max val

  down

  net local_addr[:port] remote_addr[:port] protocol

     [-t|--timeout val]

     [-r|--sync-rate|--sync-max val] [--sync-min val]

     [--sync-nice val] [-k|--skip-sync]

     [-s|--tl-size val] [-c|--connect-int] [-i|--ping-int]

     [-S|--sndbuf-size val]

 disk lower_device [-d|--disk-size val] [-p|--do-panic]

 disconnect

 show

Version: 0.6.10 (api:64)

First, tell DRBD to map the physical device /dev/sdb1 to /dev/nb0 using "drbdsetup":


# drbdsetup /dev/nb0 disk /dev/sdb1

The next thing to do is tell DRBD that you wish to make /dev/nb0 replicate to another node on the network. The default port number for drbd to talk on is 7788. If you're not familiar with port numbers and what they mean, you might like to read over the Linux Networking Overview document or one of the excellent books on networking available from O'Reilly.

At this point, if you would like to access the files you copied on to /dev/sdb1, you need to access them through the block device /dev/nb0. Because of the above drbdsetup command, DRBD will be managing the writes to /dev/sdb1 through /dev/nb0. You can mount it again as /dev/nb0 and verify that you can still see those files you placed on there back when you initially formatted it. In this case, mount it to /mnt/testarea with the command "mount /dev/nb0 /mnt/testarea". If /mnt/testarea does not exist before you mount it, execute "mkdir /mnt/testarea" first and then mount it. Make sure you do this ONLY on server1. Server2 must NOT attempt to access /dev/nb0 except with DRBD itself.


# drbdsetup /dev/nb0 net 192.168.1.1:7788 192.168.1.2:7788 C

# drbdsetup /dev/nb0 primary

The parameter "192.168.1.1:7788" above sets the local address on which to bind drbd. The next parameter, "192.168.1.2:7788" instructs drbd to expect the other half of the mirror at that address on the other machine. The final parameter, "C", means to use "protocol C" of drbd to transfer the data. You may try other protocols (A and B are also available) and read the documentation on DRBD to find out which protocol is right for your needs. In my experience with DRBD thus far, protocol "C" has afforded me the highest throughput between the two machines. The second command line forces this to be the "primary" node, the source from which server2's /dev/sdb1 will be mastered.

Once you have executed the above drbdsetup commands, you should have output that looks like the following:


version: 0.6.10 (api:64/proto:62)



0: cs:WFConnection st:Primary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0

1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0

You can see that the first device has changed and shows "WFConnection". This means "Wait For Connection". It's waiting for server2 to start talking back to it. So, we now move to server2, boot it up and issue similar drbdsetup commands:


# modprobe drbd

# drbdsetup /dev/nb0 disk /dev/sdb1

# drbdsetup /dev/nb0 net 192.168.1.2:7788 192.168.1.1:7788 C

Now, find out what DRBD is thinking:


# cat /proc/drbd 

version: 0.6.10 (api:64/proto:62) 

 

0: cs:SyncingAll st:Secondary/Primary ns:0 nr:960 dw:960 dr:0 pe:0 ua:0 

        [>...................] sync'ed:  0.1% (8674/8675)M 

        finish: 7:25:08h speed: 318 (318) K/sec 

1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0

Congratulations, you now have data flowing from server1's /dev/sdb1 over to server2's /dev/sdb1! It is currently replicating data over the eth1 network cards.

However, take a look at the "finish" and "speed" fields. The speed is only about 300KB/sec, and it's going to take over 7 hours complete. This is much too slow to be useful. DRBD's default rate is 250KB/sec, however this can be bumped up to the physical limits of the hardware involved. DRBD's hard limit at the time of this writing is 600MB/sec, however to my knowledge there are no commercially available storage arrays that will give you this kind of sustained performance, so the limitation of 600MB/sec is not really a limitation.

The drbdsetup program can be used to set the sync rate. It's important to note that the "primary" node (server1) determines the sync rate. The following command will do nothing if run on the "secondary" (server2). The sync rate value can be set to the maximum with the following command line (note that these numbers are in units of KB):


# drbdsetup /dev/nb0 syncer --min 599999 --max 600000

Now, take a look at the contents of /proc/drbd:


# cat /proc/drbd 

version: 0.6.10 (api:64/proto:62)



0: cs:SyncingAll st:Primary/Secondary ns:1451064 nr:0 dw:0 dr:1451100 pe:1129 ua:0

        [===>................] sync'ed: 16.4% (7258/8675)M

        finish: 0:04:14h speed: 38,121 (6,877) K/sec

1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0

You can see two important changes. First, the sync rate is now much higher. In fact, most likely, it approaches the maximum sustained throughput of your chosen hardware. Secondly, the time to resync has dropped dramatically. In the "speed" field are two numbers. The first number represents the instantaneous speed of the data being replicated at that instant. The second, parenthesized number is the average over the duration of the synchronization. Here, the instantaneous speed is about 38MB/sec, while the average is under 7MB/sec because the earlier period of 250KB/sec is averaged in.

On server1, during the sync, you should test the replication a little bit by creating a new directory or two under /mnt/testarea, and copying some files into that area. Do this during the sync because it will demonstrate how DRBD duplicates the data continuously and transparently as it changes. As you are writing files and directories to /mnt/testarea on server1, those write commands are being caught by drbd and sent over to server2 where those write commands are duplicated.

I would now recommend waiting for the two nodes to complete their synchronization. The time involved will depend on the size of the disk to be replicated and the speed of the hardware you have in each node. Ponder for a moment the numbers you see as the sync rate, as well as the numbers reported by hdparm, if you ran that before. If you have the hardware and time to do so, I would encourage you to rebuild this exact same cluster on machines of various hardware platforms. Compare a pair of $700 desktop PC's with this configuration, a pair of $5000+ "server" class machines, and some older hardware. My guess is that you'll find a relatively small difference in performance between the desktop and "server" class hardware, but marked difference between the older and the newer hardware. Keep in mind that no matter how expensive and sophisticated your system, hard drives can only push a maximum of about 65MB/sec, regardless of interface controller. If you are someone responsible for hardware recommendation or purchase in your organization, this may be the opportune time to scrutinize your requisitions, and you may be able to realize future savings by not overspending unnecessarily on hardware. Examine whether or not your server truly demands SCSI, which realizes extra speed only in short write-bursts that do not completely fill the drive's cache, or if you may be able to get along far more cheaply with a SATA disk, or perhaps a cheap ATA/SATA RAID controller. Because both plain SATA controllers and SATA RAID controllers are only on the order of $50 each, and good SCSI or SCSI RAID controllers are typically an order of magnitude more expensive, I would strongly recommend testing them side-by-side.

You will know that the sync has completed when you see this kind of output, showing that the status has changed from "SyncingAll" to "Connected":


# cat /proc/drbd

version: 0.6.10 (api:64/proto:62)



0: cs:Connected st:Primary/Secondary ns:8883924 nr:0 dw:12 dr:8883933 pe:0 ua:0

1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0

Anyway, by the time the replication completes, you now have a pair of systems with completely identical /dev/sdb1 partitions. Identical in every way, down to the block level. At this point, you can bring down drbd on each machine, and mount /dev/sdb1 and verify that the files you copied onto /dev/sdb1 on server1 are now found on both server1 and server2. To bring DRBD completely down and mount it back up as a "normal" block device:


# umount /dev/nb0 (do this only on server1)

# DRBD /dev/nb0 disconnect

# DRBD /dev/nb0 down

# rmmod drbd

# mount /dev/sdb1 /mnt/testarea

# ls -l /mnt/testarea

On either machine, your output should be identical, and both machines should show the exact same data. If you are not yet at this point, you might have incorrectly typed in the drbdsetup commands that set up the replication, or perhaps there is a problem with your network cabling, network cards, or something else. Make sure you can establish connections between the machines over the gigabit cards using either a crossover cable or gigabit switch. If you are experiencing a situation where DRBD is not talking over your gigabit cards, turn on some services such as ftp or telnet, and make sure you can connect between them. Also, make sure your ipchains/iptables configuration is such that port 7788 is open.

If you are now here, you have successfully replicated data from server1 to server2. The files that are present on server1 are also present on server2, in their entirety. If your goal was only to achieve nearly-real-time replication of data, you could stop here. You could bring up the servers in this fashion and leave them run indefinitely. All writes to server1 would transparently and rapidly be replicated on server2. We could call this a kind of "manual cluster" (isn't that an oxymoron?), in that if server1 were to fail, we could manually power off server1 and put server2 into primary mode, and then manually change its IP addresses. The next section will describe how to set this up for automated fail-over without manual interaction.

Because we're using gigabit ethernet that is substantially faster than most disk controllers, and provided the bus speeds are fast enough to do so on both machines, we can assume that even full-throughput writes to the disk of server1 are being replicated at the same speed over on server2. If we were using 100Mbit instead of gigabit, we'd be limited by a maximum of perhaps 12MB/sec, as that is theoretical maximum of 100Mbit ethernet. Instead, you should probably be realizing throughput on the order of 20-55 MB/sec, depending on hardware. At the time of this writing, I have experimented mostly with older Pentium III systems, all of which have achieved 25-40 MB/sec on older SCSI controllers and older disks. I would be extremely interested to hear of your experience if you have successfully set this up on more modern hardware such as Pentium 4/Xeon or Opteron/Athlon64 systems with faster bus speeds, faster controllers, and faster disks.

Step Three: Fail-Over

The final hurdle is establishing complete fail-over between the servers. This means that when server1 dies or is taken down, server2 detects this change and follows a pre-defined course of action :

Force server1 to reboot or physically turn off, if it's not fully "out of the way" yet.
Force DRBD to go into "primary" mode.
If needed, run fsck on the shared DRBD filesystems.
Mount the shared filesystems
Takeover the IP address of server1.
Start the services which access the data in the DRBD partitions.

To this end, I have written my own cluster manager, TKCluster, which (mostly) does just this. It has some bugs, but I'm glad to share it with everyone to see if anyone out there is interested in seeing it improve.

To install TKCluster, you will first need to install Heartbeat. Heartbeat also requires libnet from The Packet Factory. Also, depending on your distribution, you might also have to install some other packages, but libnet is probably the most important piece to have in order to compile Heartbeat.

The piece of Heartbeat I find most useful to use in TKCluster is the IPaddr script. If you do the usual "./configure && make && make install" of Heartbeat, this script is probably located in /usr/local/etc/ha.d/resource.d/. If not, it might be in /etc/ha.d/resource.d/. This script is used to takeover the IP address of the failed machine by sending ARPs and causing the other machines in the LAN to refresh their ARP caches with the new hardware address for that IP.

Until this point, we have no discussed the method by which the data would be highly available across two servers with different IP addresses. Linux supports the concept of IP aliasing: putting more than one IP address onto a single network card. To make our services appear highly available to the rest of the LAN, we need to allocate an address for other machines to access the cluster, which we will call our "cluster address". Right now, server1 and server2 live on 192.168.0.1 and 192.168.0.2 respectively within the LAN. Let's allocate 192.168.0.250 and configure up eth0 on server1 to hold that address:


# ifconfig eth0:0 192.168.0.250

It's that simple! The interface "eth0:0" is an aliased interface: the "real" interface is eth0, and allows for multiple IP's on the same card. If any machine on the LAN attempts to connect to 192.168.0.250, it will be talking to server1 directly. The reason this is necessary is to be able to bring up each cluster node, bringing up their own un-aliased addresses initially, and be accessible on the network with their un-aliased addresses. When a decision is made by either a user or the cluster manager for a node to sieze control of the "primary" role, that machine will assume the cluster address (192.168.0.250) and will send ARP packets by calling Heartbeat's IPaddr script. The aliased address can be brought up and down without interrupting the ability to connect to the server's non-aliased address. When services such as Samba or NFS are brought up on the cluster, the clients on the LAN will be told to point to the cluster address, 192.168.0.250. If the primary cluster node goes down, the secondary will assume control of that address. In an ideal fail-over condition, users on the LAN who are accessing the cluster will notice a pause while accessing the dead or dying server, and then when the secondary takes over and grabs the address, the service will resume. They should notice nothing more than a delay or pause as the secondary assumes the primary role.

Once Heartbeat is installed, you will probably want to test it and observe how it functions within the cluster environment. Bring down eth0:0 on server1:


# ifconfig eth0:0 down

Now, bring up the same address on server2 using the IPaddr script:


# /usr/local/etc/ha.d/resource.d/IPaddr 192.168.0.250 start

You will see output which shows it is sending a lot of ARPs on the LAN. All the machines which had previously cached 192.168.0.250 and mapped it to the hardware address of server1 will now have the cache refreshed to show the hardware address of server2's eth0 card. IPaddr does all the work of determining which interface it should be using for the address so that the end user doesn't need to care about that. You can bring down this interface safely in the same fashion it was brought up:


# /usr/local/etc/ha.d/resource.d/IPaddr 192.168.0.250 stop

Note that if a machine goes down unexpectedly while it is holding an aliased address that was brought up with IPaddr, there is a lockfile in /usr/local/var/lib/heartbeat/rsctmp/IPaddr/ which is the same as the alias name. I would recommend adding the following line to your initialization scripts, such as /etc/rc.sysinit or /etc/rc.local. These may or may not be the right scripts for you, depending on your distribution. What you want is to modify or create an init script which is executed after /usr/local/ is mounted (and before the cluster manager is started, which will be covered shortly) that contains the following:


rm -f /usr/local/var/lib/heartbeat/rsctmp/IPaddr/*

This will clear out all lockfiles from previous boots, so that the cluster will always get eth0:0.

To install TKCluster, untar the package and copy the files to the proper locations:


# cp storagecluster /etc/init.d/

# cp *.pm /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi/ (this location depends on your version of Perl and its @INC path)

# cp *.pl /usr/local/bin/

# cp cluster.conf /usr/local/etc/

It's now time to configure TKCluster so that it knows about your particular setup.

TKCluster Configuration

If you know Perl, you will immediately recognize the internals of cluster.conf as Perl. This config file is meant to be called by eval() from Perl. Yes, there are potential security concerns with having a config file that is directly eval'd by Perl. A malicious local user with write access to the file could put something evil in there. This is good reason to set permissions so only root can write to the file and not give out your root password to anyone. Perhaps a future release will change this behavior so that there are no security concerns.

The config file itself, cluster.conf, contains comments to explain each piece. Please remember that TKCluster is still young and in development, and so some of the configuration parameters may be relics of previous, unreleased versions. However, it has a decent start and can be customized easily. The cluster.conf found in the package is already configured with reasonably sane defaults for this article. Perhaps only paths might need to be modified to work on your particular distribution, but it's also possible that the paths are already right for you. Double-check and make sure all the paths are set up right. With this config file, the cluster will attempt to mount /dev/nb0 to /opt and start NFS sharing it. You may want to edit /etc/exportfs to contain the following line, or a similar line based on your own security policies:


/opt    *(rw,no_root_squash,async)

Note that this line, as it stands, leaves this NFS mount point "flapping in the breeze" on your LAN, so use this only for testing and then restrict it further as needed.

If you wish to experiment with a clusters containing multiple DRBD devices, all you will need to do is add extra lines to the hashes in the config file which are used on a per-device basis.

TKCluster package has several files in it. The purpose of each of these pieces is outlined here:

storagecluster is the init script which starts the clustering. It accepts "start" and "stop" parameters. During normal operation, the cluster administrator will need to only run this with the start command and the rest of the cluster is fired up automatically.
storaged.pl is the process which feeds info about the local system to the other node when asked. It is also used to set certain parameters and perform certain tasks. For example, a secondary that is coming up will attempt to contact the storaged.pl on the remote side with a "reboot" command. If the primary is in a confused state, but its copy of storaged.pl is still running, storaged.pl will process the request for reboot from the remote and reboot itself.
getstoragestatus.pl is used to dialog with a local or remote copy of storaged.pl. All interface to storaged.pl is done through getstoragestatus.pl.
becomestorage.pl is used to force the machine into a particular state. It accepts parameters of "start", "stop" and "grab". The "start" parameter is used when the machine will need to dialog with the other server and determine if it should become primary or secondary. The "stop" parameter is used to remove itself from the cluster. The "grab" parameter is used to force the machine into primary role and attempt to reboot the other machine.
hispeed.pl is run continuously in the background and issues drbdsetup commands on the primary node periodically to force maximum sync rate. It also checks up on the status of the monitorstorage.pl script and restarts it if necessary. In addition, it checks to see if any of the DRBD devices have fallen into "StandAlone" status, which means that the node lost communication with the other side and assumes it will not be receiving any more contact. "StandAlone" is a potentially-fatal situation, because future attempts by the remote side to reconnect will go unheeded. Typically it's the primary node that goes to "StandAlone". "StandAlone" means the mapping that was set up by the drbdsetup "net" command to map a DRBD device to a net address has been reset, just for that particular device. When hispeed.pl sees that a device has gone to "StandAlone", it re-issues the drbdsetup commands which re-map the device to an IP address and port number.
monitorstorage.pl monitors the other node of the cluster. It periodically calls a MOSIX utility, mosctl, and getstoragestatus.pl to find out the status of the other machine. If monitorstorage.pl detects that neither MOSIX nor getstoragestatus.pl can reach the other side, it takes ones of two courses of action. If the failed machine was the primary node, it initiates a "becomestorage.pl grab" command and assumes the primary node's identity. If the failed machine was the secondary node, it falls through the script and exits without error. The hispeed.pl process restarts monitorstorage.pl at this point. If the failure is still occurring on the secondary node, monitorstorage.pl again falls through and exits. This occurs until the secondary comes back up, when monitorstorage.pl goes into a wait-loop. Even though monitorstorage.pl doesn't seem to really "do" anything when it's running on a primary node, in reality what it's doing is waiting for the role of either machine to change. In this design of cluster, neither machine is necessarily the "primary" or the "secondary" all the time. They can switch roles as failures occur. So, if a machine boots and right now it's primary (and monitorstorage.pl is doing not much more than waiting), it may fail, get fixed, and reboot and become secondary, and have work to do in monitoring the new primary node.
TKCluster.pm is a Perl module that contains cluster-specific functions which are called by the above Perl scripts.
TKgenericfuncs.pm is a Perl module that contains generic functions I use in this and other Perl scripts.

Now that you have copied the TKCluster files to where they are supposed to go and checked over cluster.conf, you want to test the cluster by firing it up with the init.d script. This should be done first on server1. In one terminal window, touch the logfile and watch the output:


# touch /var/log/cluster.log
# tail -f /var/log/cluster.log

The "tail -f ..." command will show output to the terminal window as it is written to the file. Leave this run while you switch over to another terminal windowto start the cluster itself. You can switch back and forth between the two terminal windows to watch what the cluster is doing as it starts:


# /etc/init.d/storagecluster start

You should get output in the "tail ..." window showing that storaged.pl has started. Shortly after that, monitorstorage.pl and hispeed.pl should have started as well.

Now you may notice that monitorstorage.pl on server1 cycles once by stopping and restarting NFS. This should only happen once, as the first invocation of monitorstorage.pl dies off at a time when it is not yet sure of the internal state of this cluster node being primary or secondary. Another invocation of monitorstorage.pl will start up momentarily, and this one should recognize that the server is primary and go into its wait-loop. The original monitorstorage.pl goes zombie at this point, the "Z" status will show up if you do a "ps auxwww | grep monitorstorage.pl". This is a known bug, I should probably investigate that and make it exit more cleanly.

If all went well, you should do a "cat /proc/drbd" and see that it's in the WFConnection state. Now go over to server2 and do the same thing, running storagecluster with the start parameter.

After this, the result should be that now a "cat /proc/drbd" shows you that data is being synchronized between the servers, and that server1 is the master. Initially, sync rate will be relatively low at about 250KB/sec, but that will change within 5 minutes. If this has not been successful, go back through the steps of setting up, making sure that you tweak the configuration file for any difference you might have between this example and your own particular setup.

If you have successfully brought the cluster up to this point, you might just want to try NFS mounting it from one of your client machines in the network. If you do not have an NFS client, you may substitute Samba in. Configuration of Samba itself is beyond the scope of this document, however it should be easy for you to configure it to work in most relatively simple Windows networks. You may also wish to run both Samba and NFS at the same time, sharing the same DRBD device. This is compeltely acceptable, and all you need to do is write a script to start or stop both simultaneously and put that script into the PRIMARYSERVICES section of cluster.conf. Because of the nature of cluster.conf, it is read when the Perl scripts start, so if you make changes to it while the cluster is running, you will need to stop and restart the cluster.

One thing to watch for with NFS is that most NFS init scripts pass the "-au" option to exportfs in the "stop" section of the script. This is not what we want. By specifying "-au", the NFS server notifies connected clients that the NFS share is going down. In a normal, non-clustered system, this is the right idea, however for a cluster, the service should never be told to inform clients it is going down. As far as the client machines are concerned, NFS is merely just slow during the time when one machine dies and the other takes over.

Once the data replication has completed, the next thing to do is test the cluster and make sure it works. To do this, bring down server1. You can do this in any way you like: allocate all the memory, hit the reset button, do a "reboot -f", anything that will immediately bring an end to "normal" operation. If you are watching the cluster.log file over on server2, you should see that it detects "minor" failures (ie server1 is not responding momentarily) and then ultimately recognizes this as a "MAJOR" failure and begins its own sequence of taking control of the DRBD device.

Using the default cluster.conf settings, there will be a period of 30 seconds or so where server2 attempts to contact server1, giving "minor" error messages. If server1 were to recover during this time, server2 would go back to normal monitoring mode. However, if server1 is truly down, it only takes another 30 seconds or so (possibly less with faster hardware) to bring server2 into the primary role. This is a tunable parameter which can be altered by changing the delay values in cluster.conf. However, these shouldn't be set too low, because normal network latency due to high load during replication may cause the cluster to fail-over without there being a real problem. Based on the speed and throughput of your particular hardware, you can tune the cluster.conf timings to meet your needs.

By this time, server2 has taken over and is serving NFS and/or Samba shares to the clients on the network. If any of your clients were connected during this time and operating on the files in the share, they saw a pause of about 1 to perhaps 1.5 minutes while server2 took over. Otherwise, however, their operation was seamless, or at least should have been. Because both NFS and Samba are disconnect-tolerant to some degree, the client machines never really saw a problem. As far as they know, whatever data they were reading or writing during that time was just slow.

You now have a fully functioning cluster. You can expand on this by adding more services that share the same DRBD device, or you may add new DRBD devices to the cluster for separate services.

For use in a web environment, you can now put a webserver, or more than one webserver, out in front of the cluster, and have them mount the storage cluster via NFS and put your database engine onto the cluster as well. This can be combined with web clustering via the NFS or router "tricks" and NOW you have a truly redundant, high-availability system to bank your services on. If you've successfully reached this point, let your imagination do some exploration and think up new ways to apply this clustering to your particular application.

Things Left To Do

I will be the very first to admit that TKCluster has bugs and limitations. Some are easy to overcome, some will require some more work. Here is a not-too-comprehensive list:

Fork/exec/waitpid stuff in storaged.pl is specific to Perl 5.8.0. I originally wrote this using Perl 5.6.1, and then it broke on 5.8.0. Perhaps someone can find how to generalize this and "do the right thing" on a wider variety of Perls.
Make cluster.conf into a more "normal" config file that is parsed more normally instead of eval'd, thus removing anyone's security concerns.
Bundle up the TKCluster package into something more easily installed without having to manually cp stuff.
Test on a wider variety of hardware and distributions, and construct an install script or RPM/DEB which will install it properly into those distributions as needed.
Allow for multiple NICs in the private cluster so that if one gigabit interface fails, it allows for falling over to a standby interface. Right now, all communication goes over the single gigabit card. A dedicated 10/100 or a pair of dedicated 10/100's could be used for storaged.pl to talk over instead of the gigabit.
A web interface would be nice. It would be nice to have a third node, not necessarily a cluster member, which could be used to administrate either node of the cluster and their operation remotely. Since everything is Perl and most functionality lives in the Perl modules, building a web admin app out of it shouldn't be rocket science. This would make it more convenient for those deploying this into a rack in a server room to administrate from their desktop or a dedicated node in the rack.
Cleaning up of the code. I admit my Perl is somewhat sloppy in places. It should be beautified and cleaned up.
Better/more documentation and comments within the code.
There are probably more bugs and limitations.

Conclusion

Clusters used for high-performance computing and data-replication used to be high-margin items available only from large vendors. However, with the freedom of Linux we have the opportunity to build these clusters ourselves now. TKCluster is one attempt at merging both HPC and HA clusters into a single system that is free, easily installed, and runs on commodity PC hardware.

I welcome your comments and suggestions. If you think this might be useful, feel free to use it. If you have some ideas and code you think would improve it, go right ahead. Please send me your thoughts, suggestion, code, diffs, whatever that could be improved.

About the Author

Tom Kunz lives with his amazing wife and children in the Pocono region of northeastern Pennsylvania. Tom has recently opened a web/mail hosting and custom software business, and as always, is looking for customers. His primary goal right now is to build his business to the point where he can adopt an entire family of orphan children from Russia (international adoption is expensive!). His website is http://www.SolidRockTechnologies.com/ and can be reached via email at tkunz - at - SolidRockTechnologies.com

Table Of Contents