HOWTO install OpenGFS No-pool

Author: Ben Cahill

This document describes how to build/install/start/stop OpenGFS so a cluster
of computers can share a single storage device.  It provides a simple example
configuration for one shared drive with 3 partitions, shared between 2
computers, using one internal and one external journal.  The external journal
capability is new, and supports certain OpenGFS features without depending on
OpenGFS' legacy "pool" volume manager module.

NOTE:  This document applies to current (June, 2003 and later) CVS code, and
to releases after that time.  For release 0.2.1 (or earlier), see the
HOWTO-generic document.

NOTE:  Use kernel 2.4.20 or later for no-pool support.  2.5.x and 2.6.x are
not yet supported by OpenGFS.

NOTE:  It is now (for CVS code since July 30, 2003, and any releases after
that) a *requirement* that patches for the DM device mapper be applied to
your 2.4.x kernel!  See step 2 in the instructions below.

This document describes how to set up OpenGFS *without* using OpenGFS' legacy
"pool" clustered volume manager module and utilities (see the "Pool and
Utilities" document, ogfs-pool).  Instead of pool, you may now use other
volume managers/mappers (or even raw, unmapped devices, if each computer node
sees them with a consistent name).  The no-pool code achieves this by
supporting "external" journals on specific devices/partitions separate from
the main filesystem volume, in addition to internal journals within the
filesystem volume.

Background:  mkfs.ogfs, in conjunction with pool (via a private ioctl), has
traditionally supported the ability to assign particular journals (associated
with particular computer nodes) to particular devices within the makeup of
the filesystem volume, which can consist of many physical devices.  Such
assignment can enhance system performance, reliability, and manageability.
To do this, mkfs.ogfs needed to know the details of how the filesystem volume
was assembled by the pool volume manager.
The no-pool code can now create journals on particular external devices
(outside of the filesystem volume), so it does not need to know the makeup of
the filesystem volume at all.  This avoids the need for the private ioctl
(supported only by pool), and thus supports the use of "generic" volume
managers/mappers.

You may still use OpenGFS with pool (see OpenGFS HOWTO-generic).  You may
even create external journals on separate pools.  After all, pool *is* a
volume mapper.  The OpenGFS software base, build configuration, and binaries
are identical for pool vs. no-pool usage; pool vs. no-pool usage is
determined by options to mkfs.ogfs when creating the filesystem, and
(obviously) by whether and how you use pool.  However, the OpenGFS project
will not be maintaining pool, so we recommend using other commonly available
volume managers/mappers instead.

Depending on your setup, you may want to see other documentation as well:

OpenDLM (Distributed Lock Manager):  This can now be used as the inter-node
locking protocol for OpenGFS, avoiding the single point of failure
characteristic of the OpenGFS legacy memexpd lock storage server.  See the
OpenGFS HOWTO-opendlm document.

EVMS (Enterprise Volume Management System):  This is a good cluster-aware
volume manager for use with OpenGFS.  See the OpenGFS HOWTO-evms document,
and use it in conjunction with this document for setting up OpenGFS with
EVMS.

IEEE-1394 (firewire-attached) devices:  See the OpenGFS HOWTO-1394 document
for instructions on patching and re-building the kernel, including some
1394-specific patches.  Then return to this document for the remaining steps
of setting up OpenGFS.

iSCSI (Internet SCSI):  See the OpenGFS HOWTO-iSCSI document for
instructions on setting up iSCSI, then return to this document for setting
up OpenGFS.

UML (User-Mode Linux):  See the OpenGFS HOWTO-uml document.  This provides
an environment for debugging kernel code in user space.
All instructions for building and installing OpenGFS under UML are in the
HOWTO-uml document, but you may want to return to this document in addition,
for more information.

No-Pool Features/Requirements:

-- The filesystem device can be any device, real (e.g. /dev/sdc1) or virtual
   (mapped, e.g. /dev/evms/whatever).  For clustered usage, the filesystem
   device must appear as /dev/something identically on each computer in the
   cluster.  This identity must stay consistent over time/reboots.  You may
   want to use a volume manager/mapper (e.g. EVMS) to assure compliance with
   this requirement.

-- External journals can likewise use any devices, real or virtual.  You
   must assign each external journal to a unique device or partition
   (external journals cannot currently share a partition).  As with the
   filesystem device, each journal device must appear identically on each
   computer in the cluster, over the course of time and reboots.

-- Internal journals (within the filesystem device) are still supported.

-- Pool is still supported (but not required, unless you want to keep using
   a pre-existing filesystem that used pool).

-- Filesystem expansion (after enlarging the available space on the
   filesystem device by means of a volume manager) is supported via the
   ogfs_expand utility.

-- Journal addition is supported via the ogfs_jadd utility.  Internal
   journals require enlarging the available space on the filesystem device.
   External journals require one dedicated device or partition for each
   journal.

Restrictions when not using pool:

-- No support via pool for hardware-based (DMEP) locking.  As an
   alternative, OpenGFS supports(?) hardware DMEP via the kernel's generic
   SCSI support.  Currently, DMEP support is not maintained in
   OpenGFS . . . good luck!

-- No support for striping.  Pool has support for that, but we'll rely on
   other volume managers/mappers for that from now on.
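The consistent-naming requirement above can be sanity-checked by comparing
each node's view of /proc/partitions.  This is a hedged sketch, not an
OpenGFS tool; the sample listings below stand in for the partition-name
column of /proc/partitions as collected from two hypothetical nodes:

```shell
# Hypothetical check: device names must match on every node.
# In real use, generate each list on its own node with:
#   awk 'NR > 2 {print $4}' /proc/partitions | sort
# and copy the results to one machine for comparison.
printf 'sdc\nsdc1\nsdc2\nsdc3\n' > /tmp/parts.node1   # sample list from node 1
printf 'sdc\nsdc1\nsdc2\nsdc3\n' > /tmp/parts.node2   # sample list from node 2
if diff /tmp/parts.node1 /tmp/parts.node2 > /dev/null; then
    echo "device names consistent"
else
    echo "device names differ -- consider a volume manager such as EVMS"
fi
```

If the lists differ (or could change across reboots), that is the signal to
put a volume manager/mapper in front of the raw devices.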
INSTALLATION PRELIMINARIES
--------------------------

This document assumes that you already have set up a shared storage device.
Typically, this means installing interface hardware (e.g. a fibre channel
host bus adapter) and drivers (e.g. the qla2300 driver for a QLogic 23x0
HBA) on each computer that will be sharing the storage device(s), and
connecting each computer to the storage device (e.g. with fibre cable).  You
do not need to partition or initialize the storage device/drives (yet).
Installation success is indicated by the drive(s) appearing in
/proc/partitions on each computer in the shared storage cluster.

HINT:  Take a look at the "big convenience" patches described in step 2A
below.  They contain some popular fibre channel drivers, among other things.
If you use a "big" patch, your fibre channel drives will not show up in
/proc/partitions until after you apply the patch and build your kernel in
step 2 below.

Check for success:  cat /proc/partitions shows shared drive(s)

This document also assumes that your cluster member computers are
interconnected by a LAN, and that you know the IP address of each computer.
You will *probably* want to use static IP addresses for this purpose!

Most of the steps below must be executed on *all* computers in the cluster,
while some (the steps that write to the shared storage device) need to be
executed just once, on only one computer.

You may find it easiest to follow the instructions by building the code
separately on each computer in the cluster (especially if different
computers run different kernel versions for some reason).  However, if you
want to do one build (e.g. on your development platform), and install on
multiple target computers, there are two options:

-- The opengfs.spec file in the CVS download version allows you to create an
   RPM package for easy installation on multiple computers.
-- Use the --prefix and --with-linux_moduledir options when running
   ./configure (see below), to install the build results into isolated
   directories, which you can copy to the other machines.

BUILDING AND INSTALLING OpenGFS
-------------------------------

1.  Get the OpenGFS source, either from CVS or via tarball (some releases
    are also available as source RPMs).  Copy it to each computer in the
    cluster.

    A.  Retrieve (check out) the OpenGFS source from CVS:

        export CVSROOT=:pserver:email@example.com:/cvsroot/opengfs
        cvs login               (just hit "enter" key for the password)
        cvs -z3 co opengfs      (-z3 invokes compression, if desired)

        HINT:  Use "cvs up opengfs" to retrieve subsequent updates to files
        already checked out, or "cvs co opengfs" to make sure that you pick
        up any new files.

    *OR*

    B.  Download a release tarball from http://opengfs.sourceforge.net.
        Click on "Downloads", then on "Sourceforge Download Page", and
        download the release you want, then:

        tar -xvzf opengfs-n.n.n.tar.gz    (substitute version for n.n.n)

        or use bunzip2 if the tarball has a .bz2 suffix.

        HINT:  The no-pool code was checked into CVS on 08 May, 2003.
        Support for filesystem expansion and journal addition for no-pool
        was checked into CVS between 23 May and 30 May, 2003.  Please keep
        this in mind when selecting tarballs!  Release 0.2.1 (13 May, 2003)
        does *not* support no-pool operation; even though its release date
        overlaps the CVS checkin, it does not contain no-pool support!

2.  Patch/rebuild the kernel with the OpenGFS and DM patches (on each
    computer).

    The OpenGFS patches only modify the existing kernel code.  They do not
    contain the OpenGFS kernel modules.  Those will be built in step 3.

    OpenGFS now (as of July 30, 2003, and any releases after that date)
    requires patches for the DM device mapper to be in your kernel.  You
    have (at least) 2 choices for obtaining them:

    A.  Apply a "big" patch.
        If you are going to use EVMS, or want to use QLogic or Feral drivers
        for a QLogic host bus adaptor with kernel 2.4.20 or later, you may
        wish to use one of the big "convenience" patches available from the
        OpenGFS download site (*not* in CVS!):

        http://opengfs.sourceforge.net/download.php

        These tarballs contain kernel patches for OpenGFS, EVMS, DM, KDB
        (kernel debugger), and drivers for QLogic HBAs (read the comments on
        the download page for the specific contents of each patch).
        Depending on your kernel, and features desired, look for the
        following, or other appropriate patches as they may appear:

        linux-2.4.20-ogfs3.patch.bz2
        linux-2.4.21-ogfs1.patch.bz2

        Use bunzip2 to uncompress, then apply patches to your kernel:

        cd /usr/src/linux    (or wherever your kernel source lives)
        cat /path/to/big/patch/linux-2.4.21-ogfs1.patch | patch -p1

        HINT:  To avoid patch conflicts, you may want to start with pristine
        kernel source in a new directory.  You may want to copy your .config
        file from your old source directory to your new one, to save a lot
        of reconfiguration time.  The big patches place an EXTRAVERSION
        value (e.g. "-ogfs1") in the Makefile, to differentiate the build
        from your normal build.

        HINT:  When reconfiguring, select no more than one QLogic driver
        (2100/2200/2300).

        HINT:  The big patches on the OpenGFS website may not contain the
        latest patches for various non-OpenGFS components.  If you want to
        use the latest, you may need to round up all the patches yourself.
        If you do, please consider contributing your results (i.e. a new big
        patch) to our download page.

    B.  OR, apply minimal (only the OpenGFS and DM) patches.
        If you will not use EVMS, want to keep your patching compact, or
        want to make sure you're using the latest patches, obtain DM patches
        directly from the DM project website:

        http://sources.redhat.com/dm/

        Select the latest patches for your kernel, and apply them:

        cd /usr/src/linux    (or wherever your kernel source lives)
        cat /path/to/dm/patches/*.patch | patch -p1

        In addition, you must apply the OpenGFS kernel patches:

        cd /usr/src/linux    (or wherever your kernel source lives)
        cat /path/to/opengfs/kernel_patches/2.4.x/*.patch | patch -p1

        HINT:  You may want to edit your kernel Makefile to put in a value
        for EXTRAVERSION (e.g. "-ogfs") to differentiate this kernel from
        your normal kernel.

    Then (for A or B):

        Reconfigure kernel (e.g. make oldconfig) for new patches
        Rebuild kernel (make bzImage, make modules, make modules_install)
        Install kernel (e.g. mkinitrd, lilo, etc.)
        Reboot

    HINT:  With, for example, linux-2.4.21, and EXTRAVERSION=-ogfs,
    make modules_install will place kernel modules in:
    /lib/modules/2.4.21-ogfs/ogfs

    HINT:  If you don't add the patches for DM to the kernel, you will get
    complaints during the build about "b_journal_head".

    (Return here from HOWTO-1394 . . . )

3.  Build and install OpenGFS kernel modules/tools/man-pages (on each
    computer).

    The bootstrap step is required only if you downloaded from CVS.  It is
    not required for the tarball.  Make sure you provide a correct path to
    your patched(!) kernel source, using --with-linux_srcdir, when running
    ./configure.  Some of the patches modify kernel include files needed for
    building ogfs.

    cd /path/to/opengfs
    ./bootstrap              (ONLY FOR CVS DOWNLOADS)
    ./configure [options]

    HINT:  If the bootstrap script spits out cryptic warnings, double check
    that you have the right versions of autoconf and automake installed.
    Some versions of RedHat and Debian Linux (and maybe others) came with a
    flawed wrapper script that tried to allow installing old and new
    versions of the auto... tools in parallel.  Using them usually ends in
    disaster.
    Deinstall all autoconf and automake versions from your system, download
    the latest autoconf tarball and the automake-1.6.x tarball, install the
    tools from them, then try again.

    HINT:  ./configure --help shows options.  Some interesting ones:

    --with-linux_srcdir=/some/path, location of patched(!) linux source
    --with-opendlm_includes=/some/path, OpenDLM source, see HOWTO-opendlm
    --prefix=/some/path, installs user binaries and man pages under here
    --with-linux_moduledir=/some/path, installs kernel modules under here
    --enable-extras, builds OpenGFS test tools
    --enable-*-debug, enables debug features in various OpenGFS components
    --enable-*-stats, enables statistics in various OpenGFS components
    --enable-uml, compile for User-Mode Linux, see HOWTO-uml
    --enable-opendlm, compile with OpenDLM lock module, see HOWTO-opendlm

    make
    su             (you need root privilege for all following steps)
    make install

    Check for success*:  /sbin contains "ptool" and many other ogfs tools
    Check for success*:  /lib/modules/2.4.x/ogfs contains ogfs.o + others
    Check for success*:  /usr/man/man8 contains ogfs.8, memexpd.8 + others

    * Default locations without --prefix or --with-linux_moduledir options.

4.  Insert OpenGFS modules (on each computer):

    modprobe memexp
    modprobe ogfs

    HINT:  If modprobe says it can't find memexp, try running depmod to
    update /lib/modules/(version)/modules.dep

    Check for success:  cat /proc/filesystems shows "ogfs", among others
    Check for success:  cat /proc/modules shows "memexp", "ogfs" + other
    modules used by memexp and ogfs, among others

5.  Partition the shared drive (using only one computer) into 3 partitions.

    A small partition (~4MB) will be used for Cluster Information (ci).  A
    medium partition (~128MB) will be used for an external journal.  A large
    partition (the rest of the drive) will be used for the filesystem data
    and an internal journal.

    These instructions assume you are using a "real" device, but you may
    instead want to use a volume manager to partition your drive.
    See OpenGFS HOWTO-evms for information on doing this with EVMS.

    Use the dev name of your shared drive in place of "sdX" below.  See the
    man page for sfdisk for more information:

    sfdisk -R /dev/sdX    (make sure no drive partitions are in use)
    sfdisk /dev/sdX       (this partitions the disk; follow the prompts)

    HINT:  sfdisk works in units of "cylinders".  When partitioning my
    drive, sfdisk showed that each cylinder was 1048576 bytes.  So, for the
    first (small) partition, I entered 0,4 to start at cylinder 0, with a
    size of 4 cylinders (~4MB).  For the second (medium) partition, I
    entered 4,128 to start at cylinder 4, with a size of 128 cylinders
    (~128MB).  For the third (large) partition, I entered nothing (except
    the Enter key), which defaulted to use the rest of the drive.  For the
    next one (sfdisk asks for 4 partitions), I also entered nothing (except
    the Enter key).  This last "partition", of course, is empty.  After you
    enter all 4 partitions, sfdisk asks you if you really want to write to
    the disk, so you can experiment a bit before committing.

    Check for success:  cat /proc/partitions shows the new partitions

    HINT:  These partitions must show up on all computers in the cluster,
    and must be named consistently.  After you create new partitions using
    one machine, you may need to find a way for the other machines to
    re-scan for partitions, or simply reboot the other machines (don't
    forget to re-modprobe memexp and ogfs).

    HINT:  If you find that the other computers see the partitions with
    different names, or if there is any chance you will be re-configuring
    your cluster computers and thereby affecting the device names, you will
    *need* to use a volume manager such as EVMS (see OpenGFS HOWTO-evms for
    information).  If using EVMS, you will need to create "native volumes".
    These provide consistent naming from machine to machine.
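The interactive session described above can also be scripted: sfdisk accepts
the same "start,size" entries on standard input.  The following is a sketch
matching the example sizes (the file name is arbitrary, and input syntax can
vary between sfdisk versions, so verify against your sfdisk man page):

```
0,4
4,128
,
,
```

Saved as, e.g., layout.sfdisk, this could be applied with
"sfdisk /dev/sdX < layout.sfdisk".  A line containing only "," takes the
defaults, so the third partition gets the rest of the drive and the fourth
stays empty, as in the interactive example.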
    Check for success:  cat /proc/partitions shows the new partitions on
    every machine, with identical names

    In the following instructions, we'll call the three partitions
    /dev/sdx1, /dev/sdx2, and /dev/sdx3, but you will need to substitute
    appropriate names in their stead.

6.  Make the OpenGFS filesystem (using only one computer).

    Use the OpenGFS tool "mkfs.ogfs" to create the file system on disk.
    This step writes a superblock and resource group (a.k.a. block group)
    headers for the filesystem, and creates journals.  In the superblock, it
    writes the default locking protocol (e.g. "memexp") and the cluster
    information device, as specified on the mkfs.ogfs command line.  See the
    man page for mkfs.ogfs for more information on options.

    A.  Edit a configuration file named journal.cf to read like the
        following.  You can find a copy of this file, with comments, as
        opengfs/docs/journal.cf.  This file, and all other opengfs
        configuration files, may reside anywhere on your computer.
        Remember, you will need to substitute appropriate names for sdx3
        (the filesystem device, large partition) and sdx2 (the external
        journal device, medium partition).

        fsdev /dev/sdx3
        journals 2
        journal 0 int 256
        journal 1 ext /dev/sdx2

    B.  Run the following command to make sure everything is okay.  The -v
        prints extra information, and -n prevents mkfs.ogfs from writing
        anything to disk.  Check the output to verify device sizes, etc.
        The external journal should start at a *very* high number.
        Remember, you will need to substitute an appropriate name for sdx1
        (the cluster information device, small partition).

        mkfs.ogfs -p memexp -t /dev/sdx1 -c journal.cf -v -n

        HINT:  If you want even more output, try the -d option.

    C.  Run the following command to write the filesystem and internal
        journal onto the filesystem device, and the external journal onto
        the external journal device.

        mkfs.ogfs -p memexp -t /dev/sdx1 -c journal.cf

        HINT:  This can take some time to write to disk, and shows no output
        until it is done.  Be patient.
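For comparison, here is a hypothetical journal.cf variant that keeps both
journals internal, so no medium external-journal partition is needed.  The
syntax and the size field follow the example above; see the commented
opengfs/docs/journal.cf for the details of each field:

```
fsdev /dev/sdx3
journals 2
journal 0 int 256
journal 1 int 256
```

With this layout, both journals live inside the filesystem device, at the
cost of the placement flexibility that external journals provide.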
7.  Configure the OpenGFS cluster, and write the configuration to disk
    (using only one computer).

    Now that the file system has been created on the storage media, we need
    to describe the computers (nodes/machines) that will be sharing the
    media.  This includes the STOMITH ("Shoot The Other Machine In The
    Head") method(s) used for resetting a computer when it fails, and the
    heartbeat period used for detecting failure.  For more information, see
    the man page for ogfsconf.  Also see opengfs/docs/ogfscftemplate to
    understand the configuration file.

    You need to choose one computer to be the lock storage server.  This
    computer stores the inter-node lock status of all OpenGFS locks within
    the cluster.  We're assuming a two-computer cluster in this example.
    You may use one of these two computers as the lock storage server, or
    use a third computer instead.

    HINT:  When using manual STOMITH, after resetting a dead computer (but
    *not* now, though, while you initially set up the cluster!), you must
    run the following OpenGFS tool to continue the recovery process.  See
    the man page for do_manual for more information:

    do_manual -s $IP_ADDR_OF_DEAD_NODE    (don't do this now, though)

    A.  Edit a configuration file named ogfscf.cf to read like the
        following.  You'll need to substitute appropriate IP addresses for
        the lock storage server computer and the cluster member computers
        (nodes).  However, you'll want to use the same port numbers (15697
        and 3001).  See the ogfsconf man page.  Remember, you will also need
        to substitute appropriate names for sdx3 (the filesystem device,
        large partition) and sdx1 (the cluster information device, small
        partition).

        datadev: /dev/sdx3
        cidev: /dev/sdx1
        lockdev: 192.168.0.37:15697
        cbport: 3001
        timeout: 30

        STOMITH: manual
        name: manual

        node: 192.168.0.37 0
        SM: manual 1

        node: 192.168.0.203 1
        SM: manual 2

    B.  Write the config info to the Cluster Information (ci) partition.

        ogfsconf -c ogfscf.cf

8.  Start the lock server (on only one computer, the lock storage server).
    This daemon provides the centralized lock storage facility for the
    memexp locking protocol.  Launch the OpenGFS memexpd lock storage server
    daemon (see the memexpd man page):

    memexpd &

9.  Mount the filesystem (on each computer).

    The following commands mount the ogfs file system at the /ogfs mount
    point.  For hostdata, you'll need to substitute the IP address of the
    computer on which you are currently mounting the filesystem (i.e. *this*
    computer).  Remember, you will also need to substitute an appropriate
    name for sdx3 (the filesystem device, large partition).

    mkdir /ogfs
    mount -t ogfs /dev/sdx3 /ogfs -o hostdata=192.168.0.x

    HINT:  If you see an error from mount like "mount: /dev/sdc3 is not a
    valid block device":  a) make sure you are using a valid /dev/* in the
    command line, or b) you may be trying the mount from a machine *other*
    than the one you used for creating the new partitions.  Try rebooting so
    *this* machine can detect the new partitions.

That's it, you are done!  You should now be able to use this like any other
filesystem.

SHUTTING DOWN CLEANLY
---------------------

As an alternative to manually executing the steps below, look in
opengfs/scripts for the pool.* and ogfs.* startup scripts for Debian and Red
Hat distributions.  These may require modification for your particular
setup.

If you want to shut down your computers, you will need to unmount OpenGFS.

1.  Unmount the filesystem (on each computer).

    umount /ogfs    (this assumes you mounted onto /ogfs)

2.  If you find that you need to unload the kernel modules for some reason
    (usually not necessary for shutting down the computer), you may need to
    run the following to remove all knowledge of pools from the pool module,
    and reduce its usage count to 0:

    passemble -r all    (not necessary unless you're using pool)

    Then unload the modules:

    modprobe -r ogfs
    modprobe -r memexp

3.
    If you want to uninstall the tools/modules/man-pages, go to your build
    directory and run:

    make uninstall

    If you deleted the tree (oops!), it can be rebuilt from scratch, but
    make sure you use the same paths.

STARTING OpenGFS (e.g. after boot-up)
-------------------------------------

As an alternative to manually executing the steps below, look in
opengfs/scripts for the pool.* and ogfs.* startup scripts for Debian and Red
Hat distributions.  These may require modification for your particular
setup.

Once OpenGFS has been installed on your computers and storage media, only a
few steps are needed to get it going after a boot-up.  The following steps
assume that your storage media hardware and drivers are installed and
visible by all computers in the cluster.

Check for success:  cat /proc/partitions shows shared drive(s)

You will need root privilege for all steps below.

1.  Load OpenGFS kernel modules (on each computer).

    modprobe memexp
    modprobe ogfs

    Check for success:  cat /proc/filesystems shows "ogfs", among others
    Check for success:  cat /proc/modules shows "memexp", "ogfs" + other
    modules used by memexp and ogfs, among others

2.  Start the lock server (on only one computer, the lock storage server).

    This must be started before mounting the filesystem on *any* of the
    computers in the cluster.

    memexpd &    (on lock storage server computer only!)

3.  Mount the filesystem (on each computer).

    Remember, you will need to substitute an appropriate name for sdx3 (the
    filesystem device, large partition).  Remember also, "hostdata" is the
    IP address of the computer on which you are currently mounting the
    filesystem (i.e. *this* computer).

    mount -t ogfs /dev/sdx3 /ogfs -o hostdata=192.168.0.x

That's it, you are done!  You should now be able to use this like any other
filesystem.

Copyright 2002-2003 The OpenGFS Project