Design notes for "no-pool" OpenGFS.

DESIGN GOALS:

Enable successful operation of OpenGFS without the use of the pool module and its associated utilities.

Provide ability to assign OpenGFS journals to specific hardware devices, while not requiring tight coupling of OpenGFS with a specific volume manager (user space) or device mapper (kernel space). Also continue legacy support for internal journals.

Provide filesystem and journal re-sizing support (filesystem growth and journal addition only, unless a good way can be found to shrink the filesystem, and/or remove journals).

Background and rationale:

The pool module provides:

1) Mapping of one or more hardware devices (e.g. disk drive partitions) into a contiguous pool, appearing as /dev/something (e.g. /dev/pool/pool0) to the filesystem.
2) Consistent naming of the /dev/something across all computers in a cluster.
3) Stacking of the mapping with other mapping modules (e.g. kernel's md RAID).
4) Two types of "subpools": data or journals.
5) Striping for individual data subpools, for enhanced performance.
6) Information, queried by mkfs.ogfs, regarding the devices comprising the pool. This enables mkfs.ogfs to assign specific journals to specific hardware devices within the pool.
7) Ability to grow the pool, adding new partitions.
8) Hardware-based (DMEP) locking support.

OpenGFS was written before many of these features were commonly found in the kernel or other open source projects. Currently, however, many of these features are redundant with components such as the DM device mapper, the md software RAID support, and EVMS (Enterprise Volume Management System). These features include 1) mapping, 2) consistent names, 3) stacking, 5) striping.

The primary feature not provided by other components is the tight coupling of the pool layout knowledge to mkfs.ogfs. The primary usage of this has been to provide the journal assignment to a specific hardware device, for performance or reliability reasons.
By providing support for assigning each external journal to a specific device, feature 6) is no longer required.

Feature 4) subpools will continue to be supported within the filesystem, for the purpose of internal journals. However, there will be no reason for a user/admin, nor for a volume manager or a device mapper, to be aware of subpools. There will be no provision for a user/admin to place internal journals at specific locations within the filesystem.

Feature 7) re-sizing will be a phase-2 venture, after basic operation of external journals is secured.

Feature 8) hardware DMEP will not be addressed. There may be support for DMEP built into OpenGFS already, using kernel "sd" (SCSI Device) component as an alternative to the pool support, but I haven't really looked into this.

THEORY OF OPERATION:

Internal to OpenGFS, the filesystem is seen as a linear 64-bit space of blocks. In legacy OpenGFS, this space has been contiguous from 0 to the total capacity of the filesystem. In no-pool OpenGFS, the data and internal journals still appear this way, but external journals appear at the top of the 64-bit space.

As an example, assume that there is a shared drive called /dev/fsdrive, that contains the filesystem, including 30 GB (7,864,320 blocks, assuming block size = 4096 bytes) of data, and 2 internal journals of 128 MB (32,768 blocks) each. Assume also that there are two partitions on another drive, called /dev/jrnl0 and /dev/jrnl1, each of which is 1 GB (262,144 blocks), to be used as external journals.

The filesystem drive will appear to OpenGFS as blocks:
  0 - 7,929,855 (7,864,320 + 2 * 32,768 = 7,929,856 blocks).

One of the external journals will appear as blocks:
  18446744073709289472 (2^64 - 262,144) - 18446744073709551615 (2^64 - 1).

The other will appear as blocks:
  18446744073709027328 (2^64 - 524,288) - 18446744073709289471 (2^64 - 262,144 - 1).
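
The address arithmetic in the example above can be checked with a small user-space sketch. This is only an illustration, not OpenGFS code: the function name assign_ext_journal_bases() is hypothetical, and the top-down assignment order (first journal listed gets the highest addresses) is an assumption, since the notes do not say which journal ends up on top.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of OpenGFS): assign base block addresses
 * to external journals, stacking them downward from the top of the
 * 64-bit block space.  len[i] is the size of journal i in blocks.
 * Returns the lowest base assigned, which is the kind of value the
 * filesystem could use as its external-journal threshold.  Returns 0
 * if njournals is 0, so callers must treat that case separately.
 */
uint64_t assign_ext_journal_bases(const uint64_t *len, uint64_t *base,
                                  unsigned int njournals)
{
    uint64_t top = 0;   /* stands for 2^64: unsigned arithmetic wraps */
    unsigned int i;

    for (i = 0; i < njournals; i++) {
        /* Journal must be non-empty and must leave its base above 0. */
        assert(len[i] > 0 && len[i] <= top - 1);
        top -= len[i];  /* 2^64 minus the sum of lengths so far */
        base[i] = top;
    }
    return top;
}
```

With two journals of 262,144 blocks each, this yields bases 2^64 - 262,144 and 2^64 - 524,288, matching the ranges listed above. The subtraction relies on unsigned wraparound, which C defines for unsigned types, so 0 - 262,144 is exactly 2^64 - 262,144.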

IMPLEMENTATION:

Journal indexes:

The legacy journal index (one for each journal) contains information on the start address (in blocks) and the length (in journal segments) of a journal. For external journals, these fields will continue to be used in the same way, but the start address will be a very large number (2^64 - n). In addition, the journal index will contain an ASCII string representing the /dev/something name of the external journal device. This makes use of 48 bytes of "reserved" space in the legacy journal index, leaving 16 bytes unused. Internal journals will leave all 64 of these bytes at 0.

New journal index structure:

    #define OGFS_DEVNAME_LEN  48

    struct ogfs_jindex {
            uint64 ji_addr;         /* starting block of the journal */
            uint32 ji_nsegment;     /* number of segments in journal */
            uint32 ji_pad;

            char ji_device[OGFS_DEVNAME_LEN];  /* external journal dev name */
            char ji_reserved[64 - OGFS_DEVNAME_LEN];
    };
    typedef struct ogfs_jindex ogfs_jindex_t;

mkfs.ogfs:

mkfs.ogfs is responsible for writing the journal indexes out to the filesystem device, and for writing segment headers into the journal storage. For external journals, mkfs.ogfs will write segment headers to the external journal devices or partitions, starting at sector 0 of each device/partition.

mkfs.ogfs will use a journal configuration file to accept user/administrator input regarding the journal layout, but will also continue to accept all command line options used in legacy OpenGFS, checking for conflicts between command line and configuration file entries.

The configuration file will contain the following types of entries:

fsdev -- the /dev/something name of the filesystem device (this can also be entered on the command line, as with legacy mkfs.ogfs).

journals -- the total number of journals, internal and external, used for sanity checking the journal descriptions, and allocating a bit of memory before reading the journal descriptions (this can also be entered on the command line with the -j option, as with legacy mkfs.ogfs).

journal -- a description of a single journal, in one of the following formats:

    journal 0 ext /dev/journal0

        0 is the journal index, used to map a journal to a cluster member computer.
        ext declares that this journal is external.
        /dev/journal0 represents the /dev name of the journal device.

    journal 1 int 256

        1 is the journal index, used to map a journal to a cluster member computer.
        int declares that this journal is internal.
        256 is the journal size, in MBytes.

Filesystem:

The filesystem is responsible for accessing the journals appropriately. During operation, the filesystem must determine whether a given block belongs within the main filesystem, or within one of the external journals. If external, the filesystem must determine which device, and translate the block number into a suitable block number within the device.

During mount, the filesystem sets up several in-core values that support these translation operations. The mount sequence is driven from _ogfs_read_super(), in src/fs/arch_linux_2_4/super_linux.c.

Filesystem mounting and unmounting:

When mounting, the filesystem reads all of the journal indexes, and determines the total number of journals, both internal and external (fs/super.c, ogfs_ji_update()). It allocates an array of struct block_device pointers, stores the array's address in the in-core superblock structure's sdp->sd_jdevs, and fills each array element with either a valid pointer (for an external journal device) or a NULL (for an internal journal).

In more detail, when ogfs_ji_update() encounters an external journal (i.e. a journal with a non-zero value at ji_device[0]), it calls ogfs_get_jdev_info() (src/fs/arch_linux_2_4/dio_arch.c).
Based on code stolen from the kernel's get_sb_bdev() (kernel's fs/super.c), ogfs_get_jdev_info() does the following:

1) path_lookup() on the device name, to find inode for /dev/something.
2) bd_acquire() on the inode, to find a struct block_device pointer.
3) inserts the pointer into the array of journal device pointers (sd_jdevs[]).
4) blkdev_get() on the block_device, to reserve it for ongoing use.
5) path_release() to release the path obtained with path_lookup().

Continuing, ogfs_ji_update() compares the base block addresses of the external journals, and sets the lowest of these as the threshold for deciding whether a given block is stored within the filesystem (block number < threshold), or in one of the external journals (block number >= threshold). This block number is stored in the in-core superblock structure, sdp->sd_jext_base.

_ogfs_read_super() fills in two other superblock values, once it has mounted a locking protocol (and thereby knows which cluster member *this* machine is), and has read in the journal info. sdp->sd_my_jdesc and sdp->sd_my_jdev point to the jindex and block_device for *this* machine. sdp->sd_my_jdev duplicates info within the sdp->sd_jdevs block_device array for journals, but was added as a convenience to the block translation code (see below).

When unmounting, for each external journal, the filesystem calls ogfs_put_jdev_info(), which does:

1) blkdev_put() on the journal's block_device, to release it.

And, of course, the filesystem de-allocs the sd_jdevs[] array.

Filesystem operation (block translation):

Discernment of filesystem vs. external journal blocks is built into the lowest level of interface between the filesystem and the operating system, just before the filesystem requests a buffer head (bh) from the OS.
There are three points where this occurs, each of which is in src/fs/arch_linux_2_4/dio_arch.c:

1) ogfs_dgetblk_arch(), the general purpose block "getter" routine.
2) try_readahead(), as the name implies, called from ogfs_dreread(), the general purpose read routine.
3) ogfs_init_temp_bh(), which gets and inits a "temporary" bh used for writing journal data (apparently a duplicate of the file data).

Each of these routines compares the requested block number against the sdp->sd_jext_base value. If the block number is at or above this threshold value (block number >= sd_jext_base), then it belongs in an external journal, and the routine calls translate_block(), a static function within dio_arch.c.

translate_block() first checks *this* machine's journal index info at sdp->sd_my_jdesc to determine if the requested block is within "my" journal. If so, it copies the journal device's kdev_t into the passed "dev" parameter, and subtracts the journal's base address from the "blkno" parameter, to translate the "ogfs linear" address into a device-local block number.

If the block is not in "my" journal, translate_block() looks through the list of journals sdp->sd_jindex[] to find the right journal, and fills in the dev and blkno parameters.

Filesystem implementation summary:

src/fs/super.c:

ogfs_ji_update() -- Refreshes in-core info about journals. Added:
  - alloc an array of kernel block_device pointers
  - loop to call ogfs_get_jdev_info() for each external journal
  - determine lowest base address, set as external journal threshold

ogfs_clear_journals() -- Clears in-core info about journals. Added:
  - hook to call ogfs_put_jdev_info() if a journal is external.

src/fs/arch_linux_2_4/dio_arch.c:

ogfs_get_jdev_info() -- Finds a struct block_device corresponding to an external journal's /dev/something. Reserves device for ongoing use. New function.

ogfs_put_jdev_info() -- Releases device from ongoing use. New function.

translate_block() -- Translates external journal block, replacing dev and blkno. New function.

ogfs_dgetblk_arch() -- General purpose "get bh" function. Added:
  - hook to call translate_block() if external journal

try_readahead() -- Tries to read ahead, called by ogfs_dreread(). Added:
  - hook to call translate_block() if external journal

ogfs_init_temp_bh() -- Gets and inits bh for journal data. Added:
  - hook to call translate_block() if external journal
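
To make the block translation step described earlier concrete, here is a minimal user-space sketch of the logic attributed to translate_block(). It is not the actual OpenGFS function. The struct names, the integer "dev" stand-in for kdev_t, the use of -1 to mark an internal journal, the seg_blocks field (blocks per journal segment), and the return codes are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the in-core structures. */
struct jdesc_sketch {
    uint64_t ji_addr;      /* base block in the "ogfs linear" space */
    uint32_t ji_nsegment;  /* journal length, in segments */
    int      dev;          /* stand-in for the device; -1 = internal */
};

struct sb_sketch {
    uint64_t jext_base;                   /* like sdp->sd_jext_base */
    uint32_t seg_blocks;                  /* assumed blocks per segment */
    const struct jdesc_sketch *my_jdesc;  /* like sdp->sd_my_jdesc */
    const struct jdesc_sketch *jindex;    /* like sdp->sd_jindex[] */
    unsigned int njournals;
};

/* Nonzero if blkno lies inside external journal j. */
static int in_ext_journal(const struct sb_sketch *sb,
                          const struct jdesc_sketch *j, uint64_t blkno)
{
    uint64_t nblocks = (uint64_t)j->ji_nsegment * sb->seg_blocks;

    return j->dev >= 0 && blkno >= j->ji_addr &&
           blkno - j->ji_addr < nblocks;
}

/*
 * Returns 0 if *blkno belongs to the main filesystem (nothing changed),
 * 1 if it was translated to a (dev, device-local block) pair, and
 * -1 if it is at or above the threshold but inside no known journal.
 */
int translate_block_sketch(const struct sb_sketch *sb, int *dev,
                           uint64_t *blkno)
{
    const struct jdesc_sketch *j = sb->my_jdesc;
    unsigned int i;

    assert(dev != NULL && blkno != NULL);

    if (*blkno < sb->jext_base)
        return 0;

    /* Check this machine's own journal first. */
    if (j != NULL && in_ext_journal(sb, j, *blkno))
        goto found;

    /* Otherwise search the whole journal index. */
    for (i = 0; i < sb->njournals; i++) {
        j = &sb->jindex[i];
        if (in_ext_journal(sb, j, *blkno))
            goto found;
    }
    return -1;

found:
    *dev = j->dev;
    *blkno -= j->ji_addr;   /* linear address -> device-local block */
    return 1;
}
```

In this sketch a caller would invoke translate_block_sketch() just before asking for a buffer, as the three hooks listed above do. Checking "my" journal first reflects the convenience noted for sd_my_jdev: most journal traffic on a machine targets its own journal, so the search through the index is the uncommon path.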