WHAT IS OpenGFS?

The Open Global File System (OpenGFS, or OGFS) is a journaled filesystem
that supports simultaneous sharing of a common storage device by multiple
computer nodes.  This is one way to implement a "clustered file system".

As an example, consider a cluster of 3 computers that share a single array
of Fibre Channel (FC) drives.  Each of the 3 computers can directly access
the drives (perhaps via a FC hub or switch), and pretty much treat the
whole array as if the computer had sole access.  OpenGFS coordinates the
storage accesses so that the different computers don't clobber (overwrite)
each other's data, while providing simultaneous read access for sharing
data among the computers.

         ___________
    ____| computer  |\
   |    |___________| \
   |                   \            _________
   |     ___________    \          |         |
   |____| computer  |--------------| Storage |
   |    |___________|   /          |  array  |
   |                   /           |_________|
   |     ___________  /
   |____| computer  |/
        |___________|

   lock coordination          direct access to
   via LAN                    shared storage

How is this different from a Network File System (NFS)?  In NFS, all
storage access goes through the NFS server; the client computers do not
have direct access to the storage device.  This adds network communication
overhead to the process of storing and retrieving data.

OpenGFS provides all of the components necessary for clustered operation.
The original code was written before many of these components were commonly
available, either within the Linux kernel or from other open source
projects, so the original authors provided *everything* needed for
clustering.  Since then, many of these components have become available
from other sources, and much of the current work on OpenGFS involves
stripping away redundant functionality and taking advantage of the newly
available components.

As an example, we recently (Spring '03) "liberated" OpenGFS from relying
on the OpenGFS "pool" kernel module and user-space utilities.  This
component provided volume management and device mapping for OpenGFS.
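The coordination just described can be modeled as a toy reader-writer lock
table.  Everything below (the LockSpace class, the node and block names) is
invented for illustration; it is NOT OpenGFS code, just a sketch of the
shared-read / exclusive-write semantics:

```python
# Toy model of cluster-wide lock coordination (illustration only).
class LockSpace:
    """Tracks, per storage block, which nodes hold it and in what mode."""

    def __init__(self):
        self.holders = {}   # block -> (mode, set of holding nodes)

    def acquire(self, node, block, mode):
        """mode is 'read' (shared) or 'write' (exclusive).
        Returns True if granted, False if it would conflict."""
        current = self.holders.get(block)
        if current is None:
            self.holders[block] = (mode, {node})
            return True
        cur_mode, nodes = current
        # Many readers may share a block; a writer excludes everyone else.
        if mode == 'read' and cur_mode == 'read':
            nodes.add(node)
            return True
        return False

    def release(self, node, block):
        mode, nodes = self.holders[block]
        nodes.discard(node)
        if not nodes:
            del self.holders[block]

locks = LockSpace()
assert locks.acquire('node-a', 'block-7', 'read')       # shared read: ok
assert locks.acquire('node-b', 'block-7', 'read')       # second reader: ok
assert not locks.acquire('node-c', 'block-7', 'write')  # writer must wait
locks.release('node-a', 'block-7')
locks.release('node-b', 'block-7')
assert locks.acquire('node-c', 'block-7', 'write')      # now exclusive: ok
```

In real OpenGFS the equivalent state lives in a lock protocol module (see
the locking discussion below) and requests travel over the LAN, but the
grant/deny rules are the same idea.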
Now, OpenGFS can use virtually any volume manager (preferably
cluster-aware, such as the Enterprise Volume Management System, EVMS)
and/or device mapper (e.g. the DM device mapper).

We have also recently (Fall '03) added the OpenDLM lock module.  This
locking protocol is more efficient than OpenGFS' memexp protocol, and
avoids the single point of failure (SPOF) exhibited by memexp's lock
storage server.

Here's a high-level block diagram of OpenGFS and its locking support.
Note that the lock module (a.k.a. locking backend) attaches to OpenGFS
via a plug-in interface.  Refer to this diagram when reading the next
section.

  Inter-node support           ogfs.o                 Linux        (other
  (other nodes)                                                     nodes)

        _____________        _______________
       |    Lock     |      |   |           |        _________
   ____|   Module    |--> >-| g |           |_______|         |
  |    |_____________|      | - |  file-    |       |   VFS   |
  |                         | l |  system   |       |_________|
  | LAN                     | o |  and      |        _________      _________
  |     _____________       | c |  journal  |_______|         |    |         |
  |____|   Cluster   |      | k |           |       |  BlkIO  |____| Shared  |
       |   Manager   |______|   |           |       |_________|    | Storage |
       |   Driver    |      |___|___________|                      |_________|
       |_____________|

WHAT ARE THE ELEMENTS OF CLUSTERING?

Sharing and journaling storage among a cluster of computers relies on:

1).  A consistent view of storage from all computers.  /dev names and
sizes, and file names and attributes, must appear identical to each
computer.  As a nice-to-have feature, it may be desirable to map several
hardware storage devices so that they appear as a single large filesystem
device.  These features have been provided by the OpenGFS "pool" volume
manager, and may now be provided by other volume managers.

2).  Locking services, so that two computers don't try to write to the
same storage location at the same time, and one computer doesn't try to
read a file while it is being modified by another computer.  OpenGFS has
an embedded locking system, and a locking harness that allows the use of
different lock protocols (e.g.
OpenGFS' "memexp" and "nolock" protocols).  The memexp protocol uses a
local area network (LAN) to communicate among the cluster's computers.
The recently added OpenDLM protocol also uses the LAN, is more efficient
than memexp, and avoids memexp's single point of failure (the lock
storage server).

3).  Individual journals for each computer.  Journals allow recovery to a
consistent state when a computer node crashes in the middle of a write
operation.  Independent journals for each computer provide isolation, so
that one computer does not corrupt the journal of another, no locking
conflicts occur for journal writes, and it is easier to play back a single
computer's journal if that computer dies.  Journals may be "internal"
(within the filesystem device) or, with recent changes, "external" (stored
on a shared device other than the filesystem device).  The journaling
provided in OpenGFS protects only the filesystem metadata (not the file
data itself), so data within a file may be corrupted by a crash, but the
overall filesystem layout will remain intact.

4).  Cluster membership services, to assign a journal to a given computer
as it joins the cluster, to let the locking service know about other
computers in the cluster, and to trigger journal and locking recovery
operations when a computer leaves the cluster unexpectedly (dies).  This
service is only loosely coupled with the OGFS filesystem code, which
responds to it by way of the locking service interface.  The membership
service may be implemented in different ways for different locking
protocols.

5).  Fencing, or "Shoot The Other Machine In The Head" (STOMITH), to
protect the filesystem from corruption by a computer node that dies or
goes crazy.  Fencing isolates the computer from the storage device, while
STOMITH powers down or reboots the computer.  OpenGFS provides several
methods for doing this, using a variety of hardware devices (e.g. power
switches, Fibre Channel isolation switches) or human intervention
("meatware").
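The interplay between fencing/STOMITH and recovery can be sketched in a
few lines.  The agent functions and node names below are invented for
illustration (real agents are hardware-specific, see the list that
follows); this is NOT OpenGFS code.  The key ordering constraint it shows:
the dead node must be isolated from storage before its journal is
replayed, so it cannot scribble on the disk mid-recovery.

```python
# Toy sketch of fencing/STOMITH dispatch on node death (illustration only).
def power_cycle(node):       # STOMITH-style agent (e.g. a network power switch)
    print(f"power-cycling {node}")
    return True

def disable_fc_port(node):   # fencing-style agent (e.g. an FC switch port)
    print(f"disabling FC port for {node}")
    return True

AGENTS = [power_cycle, disable_fc_port]   # tried in order until one succeeds

def handle_node_death(node, replay_journal):
    for agent in AGENTS:
        if agent(node):
            break
    else:
        raise RuntimeError(f"could not isolate {node}; manual help needed")
    # Only now is it safe to replay the dead node's journal.
    replay_journal(node)

replayed = []
handle_node_death('node-b', replayed.append)
assert replayed == ['node-b']
```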
The memexp locking protocol provides hooks for triggering these:

   APC Masterswitch power switch                        (STOMITH)
   WTI NPS network power switch                         (STOMITH)
   VACM Nexxus (serial port IPMI?)                      (STOMITH)
   Brocade Fibre Channel Switch, port and zone based    (Fencing)
   Vixel Fibre Channel Hub                              (Fencing)

This service is quite loosely coupled with the OGFS filesystem code, which
really has no idea that the STOMITH features exist.  The cluster
management/membership service is responsible for triggering STOMITH.
There is no requirement to use memexp as the cluster manager.

6).  Resizing of the filesystem.  This is not a strict requirement, but is
a nice-to-have that is often expected in a clustered environment.  OpenGFS
user-space utilities provide for expanding the filesystem data storage
space (ogfs_expand, by writing new resource group headers to disk) and for
adding new journals (ogfs_jadd, by writing new journal headers).  OpenGFS
does not currently provide for shrinking storage or removing journals.
You will need to use a volume manager (e.g. pool or, preferably, EVMS) to
expand the filesystem *device* size before adding new OpenGFS data storage
space or internal journals.

WHAT IS IN THE CVS TREE?

The CVS tree and distribution tarballs contain the code and documentation
for kernel-space modules and user-space utilities for all OpenGFS
features.  The build process uses autoconf/automake to configure the build
for your computer and your preferences.  See opengfs/docs/HOWTO-generic or
HOWTO-nopool for more information.  All code, both kernel-space and
user-space, gets built with a single "make" command.

The code currently maintained and in use exists in the opengfs directory
within CVS.  One other directory, gnbd, contains deprecated code that
implements "an OGFS-friendly network block device".  Within the CVS
opengfs directory, several directories are empty, but are there to support
autoconf/automake when building.  The other directories (in alphabetical
order) contain the docs and source:

1).
docs:  Documentation on usage and design.  Many of these are mirrored
on the OpenGFS web site's Docs/Info page, as well.

2).  kernel_patches:  Currently, OpenGFS requires a relatively small patch
to the 2.4 series of Linux kernels (2.5 kernels are not yet supported).

3).  man:  man pages for user-space utilities, and for mount options.  You
may view these pages using the "man" command, even without
building/installing OpenGFS.

4).  scripts:  A variety of scripts for a variety of purposes, including
applying patches, creating .h files to support debug modes in the
filesystem code, setting up the EVMS volume manager, installing OGFS in
Debian and Red Hat environments, etc.  Some of these are old and
unmaintained; your mileage may vary.

5).  src:  Source for all user-space and kernel-space code.  See below for
a tour of the subdirectories.

The opengfs/src directory, in alphabetical order:

1).  divdi:  Architecture-specific division support (boring).

2).  fs:  Filesystem code; the heart of OpenGFS.  This is the kernel
module ogfs.o.  Includes basic filesystem operation, journaling, and
locking (the part within the filesystem).  Architecture-specific
subdirectories include support for 2.2 and 2.4 kernels, and for user space
(used by certain user-space utilities, e.g. ogfs_jadd).

3).  gnbd:  Deprecated "GFS-friendly network block device" code.

4).  include:  Some .h files with global relevance (user-space and
kernel).

5).  locking:  Kernel modules (other than ogfs.o) that support locking.
Also, the user-space memexpd lock server daemon.  See below for a breakout
of the subdirectories.

6).  pool:  Kernel module, pool.o, for the "pool" device mapper.
User-space utilities for pool are in the opengfs/src/tools/ptools
directory.  Pool has a few rough edges, and with recent changes to the CVS
tree, we are trying to get away from using it, trying other volume
managers / device mappers instead (e.g. EVMS/DM).

7).  stomith:  Support for fencing, both kernel and user-space.
The "agents" subdirectory contains support for various isolation methods.
The "daemon" subdirectory contains the user-space stomithd daemon that
invokes the methods when needed.  The "module" subdirectory contains the
kernel module stomith.o, which provides support for kernel components to
communicate with the daemon.  In the future, we would like to relegate
this stomith functionality to a separate, non-OGFS cluster manager.

8).  tools:  User-space utilities for a variety of purposes.  See below
for a breakout of the subdirectories.

The opengfs/src/locking subdirectories, in alphabetical order:

1).  harness:  Kernel module lock_harness.o.  Registers available lock
protocol implementation modules, and connects one of them to the
filesystem at mount time.

2).  modules:  Lock protocol implementation modules.  Includes
subdirectories for the memexp (clustered), opendlm (clustered), nolock
(non-clustered), and stats (stacks on top of another protocol) protocols.

3).  servers:  The memexpd user-space daemon, the central lock storage
server for the memexp locking protocol.  Uses memory- or disk-based
storage.

The opengfs/src/tools subdirectories contain user-space utilities.  Many
of these have man pages in the opengfs/man directory.  In alphabetical
order:

1).  dmep_tools:  Utilities dmep_conf and do_dmep for Direct Memory
Export Protocol (DMEP) hardware-based lock storage support for memexp.
DMEP support exists within memexp and pool, but is not being maintained,
as we don't know anyone who is using it.

2).  hexedit:  Editor for binary files.  Seems a little buggy; the hexedit
in the Red Hat distro works better.

3).  initds:  Initialize Disk Store utility.  Prepares a disk-based
storage area for the memexpd lock storage server.

4).  mangle_fest:  A whole bunch of test and stress utilities for OpenGFS.

5).  mkfs:  Makes the filesystem on disk by writing the superblock, data
resource group headers, journal headers, and any other needed metadata to
disk.

6).  ogfsck:  Filesystem check utility.

7).
ogfsconf:  Writes the cluster configuration onto a cluster information
device (cidev).  Locking and journaling use this information to know which
machines (IP addresses) are potential members of the cluster.

8).  ogfs_expand:  Adds data resource group headers to unused device
space, to increase the data storage capacity of an existing filesystem.

9).  ogfs_jadd:  Adds new journals to an existing filesystem.

10).  ogfs_tool:  Debugging and statistics tool for communicating with the
filesystem during operation.

11).  ptools:  Utilities for creating, enlarging, and reading pool
configuration.

12).  test_dmep:  Test utility for DMEP (hardware-based locking) device (?)

13).  test_mmap:

14).  ucmemexp:  User-space client for the memexp locking protocol.

PROJECT HISTORY:

OpenGFS has its roots in the GFS project, which was originally sponsored
by the University of Minnesota from 1995-2000 (according to copyright
notices in the code).  Around 2000, U. of M. professor Matthew O'Keefe
founded Sistina Software based on the U. of M. research work.  Sistina
kept the project open source through GFS 4.x, but decided to take GFS
proprietary around 2001, and GFS is now a commercial product (and we wish
Sistina well; they have contributed significantly to the Linux community,
and continue to do so).

OpenGFS was started shortly thereafter, based on the 4.x source.  Notes in
the CVS base indicate that it was imported from Compaq's SSIC-Linux 0.5.1
in August of 2001.  This work was done by Linux heavy hitters such as
Christoph Hellwig and Alan Cox.  According to Alan, they had time to
"clean it up and make it basically sane, but not to then tackle the big
jobs" such as "sorting out all the pool mess".  Since then, the project
has been maintained mostly by Brian Jackson and Dominik Vogt.
2003 has seen a lot of effort towards understanding and documenting the
code (Dominik, Stefan Domthera, and Ben Cahill), removing the dependence
on pool (Ben Cahill), supporting OpenGFS with EVMS (the IBM EVMS team),
and developing a new lock module for OpenDLM (Stanley Wang).  Many other
folks are monitoring the list and providing helpful suggestions and
discussions.

The OpenGFS team is NOT monitoring the commercial GFS development process,
and is NOT intentionally implementing similar upgrades or features.

REFERENCES:

OpenGFS web site:

   http://opengfs.sourceforge.net
      (see especially the "Docs/Info" page, via the menu)

OpenGFS CVS (browsable/downloadable via the OpenGFS website):

   opengfs/docs/*
      Many docs are also mirrored to http://opengfs.sourceforge.net/docs.php

   opengfs/man/*
      man pages are viewable within the CVS tree, without build/install,
      e.g.:
         cd /path/to/opengfs/man
         man ./ogfs_mount.8

Mail lists:

   Archives (Sept. 2001 - present) and lists on our project page:
      http://sourceforge.net/mail/?group_id=34688

   Older GFS mail list archives (May 2000 - present), no search facility:
      http://www.spinics.net/lists/gfs

   More recent, searchable archives (Aug. 2003 - present):
      http://marc.theaimsgroup.com/?l=opengfs-users&r=1&w=2
      http://marc.theaimsgroup.com/?l=opengfs-devel&r=1&w=2
      http://marc.theaimsgroup.com/?l=opengfs-bugs&r=1&w=2
      http://marc.theaimsgroup.com/?l=opengfs-announce&r=1&w=2

Kernel docs:

   Documentation/filesystems/ext2.txt
   Documentation/filesystems/ext3.txt
      These don't discuss OpenGFS, but are interesting reading anyway, due
      to many similarities.

   Documentation/DocBook/journal-api.*
      Kernel interface between the filesystem and the Linux Journal Block
      Device (JBD), the journaling service used for ext3.  Again, this
      doesn't discuss OpenGFS, but is interesting reading anyway, due to
      many similarities.
To create viewable files in the kernel source tree, do:

   cd /path/to/linux
   make pdfdocs     (or: make htmldocs)

Interesting internet resources and other projects:

   http://opendlm.sourceforge.net
      Open Distributed Lock Manager (OpenDLM) project

   http://oss.software.ibm.com/dlm
      Distributed Lock Manager documentation

   http://evms.sourceforge.net
      Enterprise Volume Management System (EVMS) project

   ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/journal-design.ps.gz
      Great paper on journaling for ext3.  Very similar to journaling in
      OpenGFS.

   http://www.linuxsymposium.org/2003/audio.php
      Audio presentations from 2000 and 1999.  A bit dated, but still
      helpful.  Recommended:
         2000, GFS for Linux (by one of the original authors)
         2000, EXT3, Journalling FS (by Stephen Tweedie, ext3 author) *
         2000, XFS for Linux *
         2000, Intelligent I/O for Linux
         1999, Intermezzo: Distributed Filesystem
      * transcript available at http://olstrans.sourceforge.net

   http://www.linuxsymposium.org/2003/proceedings.php
      Chock full of lots of good stuff: filesystems, clustering, you name
      it.

   http://www.namesys.com/v4/v4.html
      Interesting paper (with distinctive graphics) on the Reiser 4
      filesystem.  Note that OpenGFS does not work like Reiser 4; just
      mind-expanding reading.

Books:

   "Understanding the Linux Kernel", 2nd edition, Bovet and Cesati,
   O'Reilly
      Good for information on Linux kernel components that support
      OpenGFS, notably the Virtual File System (chapter 12), block
      handling (in chapter 13), and the page and buffer caches (in
      chapter 14).  It also discusses the ext2/ext3 filesystem (chapter
      17), which does many things in ways similar to (but different from)
      OpenGFS.

   "Linux Device Drivers", 2nd edition, Alessandro Rubini, O'Reilly
      Good complement to "Understanding the Linux Kernel" (neither book
      tells the whole story).  Good for background on Linux drivers in
      general; specifically, chapter 12, "Loading Block Drivers", talks
      about, well, block drivers such as the OpenGFS pool kernel module.
   "Linux File System", Moshe Bar (publisher?)
      Recommendation courtesy of Andrea Glorioso.

   "Managing RAID on Linux", Derek Vadala, O'Reilly
      In addition to a discussion of RAID, this book has a good (quick)
      overview of filesystem operation, including a little about journals,
      and a quick look at ext2, ext3, ReiserFS, IBM JFS, and SGI XFS.