OpenGFS Pool and PTools
Copyright 2003 The OpenGFS Project.

Introduction
------------

This document contains details of the internals of the OpenGFS "pool"
kernel module, and describes how it works with some of the OpenGFS PTool
utilities. It also maps these and other utilities to the ioctls supported
by the pool module. This document is aimed at developers, potential
developers, students, and anyone who wants to know about the low level
details of the pool module.

This document is not intended as a user guide to OpenGFS. Look in the
OpenGFS HOWTO-generic for details of configuring and setting up OpenGFS.

At the time of this writing (March 2003), there is a desire to replace the
OpenGFS pool module with another mapping layer, namely the dm device
mapper. This document will serve as a checklist for pool functionality as
we ponder the replacement. See DESIGN-nopool for the results of the
pondering.

Terminology
-----------

The terms "device", "storage device", "disk", "drive", and "partition" are
used somewhat interchangeably in this document. Depending on context, they
may mean an actual piece of hardware, a portion of the storage area
provided by the hardware, or a Linux /dev node that provides access to the
storage. In most cases, the /dev node meaning is appropriate. Much of the
OpenGFS source code uses the term "device" to indicate a chunk of storage
(a "partition", which may or may not be an entire "disk", "drive", or
"storage device"), accessed via a /dev node.

Concepts Behind Pool
--------------------

When sharing common external storage devices among several computers in a
cluster, there is a very strong possibility that the computers may view a
given external device in different ways. For example, computer node A may
see a given shared drive as "sdc", while node B sees it as "sdd", and node
C sees it as "sdb".
The pool kernel module, in conjunction with the "ptool" family of utility
apps (passemble, ptool, pgrow, pinfo; see the man 8 pages for each
utility), provides an abstraction layer so that all computers in a cluster
view the shared storage in a consistent way. On each computer in an OpenGFS
cluster, the "passemble" tool:

1). Scans all drive partitions to discover OpenGFS pool information.
    passemble reads the first sector of each disk partition listed in
    /proc/partitions, looking for OpenGFS drive pool partition
    headers/labels. These headers/labels must be previously written to
    drives by either the "ptool" or the "pgrow" tool, when configuring
    OpenGFS pools (see OpenGFS HOWTO-generic). The label information
    includes, among other things, the name of the pool to which the drive
    belongs. This name provides the key to the common view shared by the
    computers. Of course, several drives/partitions may belong to the same
    pool, and some drives/partitions may not belong to any OpenGFS pool at
    all.

2). Creates "/dev/pool/xxxx" device nodes, one for each OpenGFS pool
    discovered ("xxxx" is the pool name discovered in step 1).

3). Sets up the pool kernel module to map the abstract pool devices to the
    physical drive/partition devices (or perhaps to another layer of
    abstraction, if such a layer showed up in /proc/partitions).

A pool is viewed by the file system as a single device containing a
contiguous group of sectors. A single pool can encompass anywhere from a
small partition on a single drive (e.g. a 4 MB OpenGFS cluster information
partition; see OpenGFS HOWTO-generic) up to a sizeable group of drives
(e.g. a data storage array). The mapping can concatenate drives, or it can
stripe data across several drives (within a subpool, see below), which may
provide higher storage performance.

Subpools are groups of drives or partitions within a pool.
Each subpool is viewed by the file system as a contiguous group of sectors
within a pool, beginning at a particular sector, and ending at the sector
immediately before the next subpool. No gaps are allowed between subpools,
and a single subpool cannot be scattered into non-contiguous pieces within
the pool, as viewed by the filesystem. The physical view may be scattered,
however, with several drives in a subpool, either concatenated or striped.

The sector offset is the only method that the filesystem uses to access a
particular subpool. The pool module, in turn, uses the sector offset to map
I/O to the particular device partitions that comprise the subpool (see the
"pool Kernel Module" section below for more information).

Subpools can be used to (see opengfs/docs/pooltemplate):

1). Subdivide a large number of drives into smaller striping groups.

2). Isolate file system journals from file system data, by assigning, for
    example, one drive to an "ogfs_journal" subpool type, and assigning
    other drives to an "ogfs_data" subpool type.

"ogfs_data" subpools contain data exclusively. "ogfs_journal" subpools
contain journals exclusively, and the filesystem requires that one journal
subpool be created for *each* of the computer nodes in a cluster (e.g.
create 3 journal subpools if you expect 3 computers to share the pool).

When making the filesystem, mkfs.ogfs normally queries the pool module to
find out about the locations and types of subpools within the pool (after
passemble has been run, the pool module holds all of this information in
kernel memory, which is much more convenient than having mkfs.ogfs scan all
the disks for labels). See more about this in the Pool Label Details
section.

At this point, you may be wondering about the example data pool created by
the pool0.cf configuration file in the HOWTO-generic. It breaks the rules,
creating only a single data subpool, and no journal subpools.
The example works, however, by calling mkfs.ogfs with two important
options:

  -i    tells mkfs.ogfs to ignore (*not* query for) subpool layout
        information
  -j 2  tells mkfs.ogfs to create 2 journal subpools

mkfs.ogfs then creates its own subpool layout, without regard to any
physical device mapping (i.e. it doesn't care where the subpools fall on
physical drives). It sets up the filesystem to write journals to the
self-created journal subpools, and data to the self-created data subpools
(see opengfs/docs/ogfs-ondisk-layout). This special method provides no
ability to isolate journals onto particular partitions, because mkfs.ogfs
does not know (and does not care, in this case) where the particular
partitions are.

When writing journals into their own subpool, the filesystem uses the
sector offset as the only indication of where the journals belong within a
pool. The pool module uses no knowledge of which subpools contain journal
vs. data files when performing a mapping. This is true in all cases,
whether or not the -i and -j options are used for mkfs.ogfs.

Pool Label Details
------------------

If a drive partition is a member of a pool, it must contain a pool label in
the first sector (512 bytes) of the partition. This label identifies the
pool and subpool to which the partition belongs. Note that a single
partition (and it must be the entire partition) can belong to only one
subpool within only one pool.

The pool labels are written to storage devices by the "ptool" utility, when
initially configuring a pool, or the "pgrow" utility, when adding subpools
to an existing pool. Both tools rely on input from a pool configuration
file. This file describes the following for each pool:

1). pool name
2). number of subpools in pool

And for each subpool:

1). subpool index (ID) (start from 0, increment by 1)
2). stripe size (in 512-byte sectors) (0 for no striping)
3). type (ogfs_data or ogfs_journal)
4). number of devices in subpool

And for each device/partition:

1). device index (ID) (start from 0, increment by 1)
2). subpool to which you are assigning this device
3). physical partition (e.g. /dev/sdb1) as seen by the computer running
    ptool
4). DMEP weight for this device (see opengfs/docs/pooltemplate)

The subpool and device indexes control the order of the pool layout, as
seen by the filesystem. Device 0 of subpool 0 will contain sector 0 of the
pool. The last device in the last subpool will contain the last sector of
the pool. Journals can be directed to a separate physical device by
creating a journal subpool and assigning the desired device to it. There is
no rule governing (at least from the perspective of pool or the ptool
utilities) the order (determined by subpool index) in which journal
subpools appear within a pool. They may be first, last, or in the middle.

Most of the values (with the exception of the physical device) from the
configuration file are reflected directly in the pool label written to
disk. The ptool and pgrow utilities add the other fields (magic, version,
pool ID, and partition size; see below) to the label before writing to
disk.

Pool labels contain the following information, in the following categories.
All values are 32-bit, unless otherwise indicated:

1) General
   a). pool magic number -- 64-bit, says this partition is a pool partition
   b). pool version -- indicates label version

2) Pool
   a). pool ID number -- 64-bit unique pool identifier
   b). pool name -- 256 bytes ASCII, this is the xxxx in /dev/pool/xxxx

3) Subpool
   a). total number of subpools within this pool
   b). index number of this subpool within this pool
   c). striping size (0 if no striping) within this subpool
   d). subpool type (data or journal)

4) Partition
   a). total number of partitions within this subpool
   b). index number of this partition within this subpool
   c). partition size -- 64-bit # of (512-byte?) blocks in this partition

5) DMEP
   a). total number of DMEP devices in this subpool
   b). index number of this DMEP device within this subpool
   c). DMEP weight -- an indication of preference for using this device

Please refer to Appendix A for detailed information about the data
structure on disk. Note that each pool label will be different, since each
partition will have a unique combination of subpool, partition, and perhaps
DMEP information.

Labels are read by passemble (when initializing a cluster member computer),
pgrow (when growing a pool while in use), or by the pool module (when
growing a pool while in use, after another computer ran pgrow). Once all
the labels are gathered from all the storage devices, this information is
sufficient for "assembling" the entire pool into one contiguous mapping of
pool sector number to partition sector number. This information is also
sufficient to detect missing partitions, by checking the subpool and
partition index numbers for contiguity, and by checking to see that the
greatest index numbers correlate with the expected number of subpools and
devices.

Note that there is no information that self-names the physical device as
seen by the computer. It is up to each computer to read the pool label and
associate that label with the physical device on which the label exists.

When growing pools, existing subpools must be left the way they are, and
growth is done only by adding subpools at the end of the pool.

Utility Apps -- ptool family
----------------------------

The ptool utilities manipulate, and serve as a bridge between, two
entities:

-- On-disk pool labels. The labels provide persistent storage and the
   central (albeit distributed) repository for pool configuration.

-- Pool kernel module. This driver provides the mapping between the
   abstract pool devices and the physical storage devices. A pool is
   considered "active" when this driver knows about it and is mapping it.
Note that *pool* configuration information is different from *cluster*
configuration information; pool information describes the
*drives/partitions* that can be shared by a cluster, while cluster
information describes the *computers* that are members of an OpenGFS
cluster. Cluster information is written to disk by the "ogfsconf" utility,
which is not a pool utility.

The ptool family includes the following (see man 8 pages for more
information on each one):

1). ptool writes the partition labels to storage devices (and does not deal
    with the pool kernel module). Only one computer needs to run ptool,
    when initially configuring a pool.

2). passemble reads partition labels from storage devices, and sets up the
    pool kernel module to map pools to physical partitions (as described in
    "Concepts Behind Pool" in this document). Each computer in a cluster
    must run passemble each time the computer reboots, to discover
    available pools.

3). pgrow adds subpools to an active pool by writing partition labels to
    storage devices, *and* by setting up the pool kernel module to map the
    new subpools (note that pgrow is kind of a ptool and passemble
    all-in-one). Only one computer needs to run pgrow, when initially
    growing a pool.

    The kernel module mapping update occurs initially *only* on the
    computer that runs pgrow. The pool drivers on each of the other
    computers "automagically" re-map themselves if/when I/O requests fall
    beyond their old capacity. This is the only case in which the pool
    driver accesses the pool labels on the storage devices, to scan for and
    read the new pool labels.

    After growing a pool, one computer must also run ogfs_expand to make
    the filesystem on the new subpool devices. (Do other member computers
    need to run a utility to tell their filesystems that things have
    grown?)

4). pinfo queries the pool kernel module for currently active pool
    information (and does not read/write any pool labels on storage
    devices).
    Caution: there is also a "pinfo" utility that is an info page browser:

      man 8 pinfo -- OpenGFS pinfo; the default OpenGFS installation puts
                     it in /sbin
      man 1 pinfo -- info doc browser, found as /usr/bin/pinfo in RedHat
                     8.0

Note that none of the ptool utilities reads/writes disks via the pool
kernel module. The utilities read/write partition headers directly from/to
the physical devices. See elsewhere in this document for a list of pool
ioctls and the utility tools that use each one.

Utility Apps -- others
----------------------

The pool kernel module is used by a few other tools that are not members of
the ptool family. See man 8 pages for more info on each one:

"dmep_conf" configures DMEP (Device Memory Export Protocol) segments of
certain types of shared storage devices for purposes of hardware-based
locks.

"do_dmep" executes DMEP commands on the storage devices. (No man page.)

"mkfs.ogfs" makes an OpenGFS file system on a block device (e.g. a pool).

"ogfs_expand" expands an OpenGFS file system after a block device (e.g. a
pool) expands.

"ogfs_jadd" adds journals to an OpenGFS file system after a block device
(e.g. a pool) expands.

See elsewhere in this document for a list of pool ioctls and the utility
tools that use each one.

The "pool" Kernel Module
------------------------

The pool kernel module has three main purposes:

1). Mapping blocks from /dev/pool devices to real devices. In this mode,
    pool is hooked into the kernel's block handling subsystem, and
    translates block requests as they are being "made".

2). Implementing ioctls for setup, control, statistics monitoring, etc. In
    this mode, pool responds to ioctl calls from ptool and other utility
    apps.

3). Sending DMEP (hardware-based file locking) commands to pool devices.
    In this mode, pool responds to ioctl calls from the dmep utility apps.
    Note that many (probably most) devices supported by OpenGFS do not need
    or use this capability.
Note that the pool module has the ability to duplicate much of what is done
by the passemble utility (i.e. assembling pools). However, this capability
is used only in a special situation: when a mapping request exceeds the
capacity of a given pool. In this case, the pool module re-assembles the
pool, hoping to discover that the pool capacity has grown to a larger size.
This is the method used to propagate knowledge of grown pools to cluster
member computers *other* than the computer that ran the pgrow utility (see
info on pgrow elsewhere in this doc).

The descriptions below relate to pool behavior with the 2.4 kernels.

Pool code can be found in the OpenGFS source:

  opengfs/src/pool/
  opengfs/src/pool/arch_linux_2_4/

Pertinent kernel code can be found in the kernel source:

  drivers/block/ll_rw_blk.c

** Mapping **

The pool module intercepts kernel block requests before they are placed
onto kernel request queues. This interception is set up by the following
call in pool_init() (.../pool/arch_linux_2_4/pool_linux.c):

  blk_queue_make_request(BLK_DEFAULT_QUEUE(POOL_MAJOR),
                         pool_make_request_fn);

This tells the kernel to use pool_make_request_fn() (also in
pool_linux.c), instead of the kernel's default __make_request(), when
preparing a buffer head for placement in the kernel request queue for the
pool major device. The kernel calls the substituted function from within
generic_make_request(), which contains a loop to iteratively pass the
buffer head between stacked block drivers, to perform layers of device
mapping.

pool_make_request_fn() does the following for each buffer head (bh):

1) Checks pool status to make sure it is currently valid to access.

2) Checks the bh start/size to see if it extends beyond the end of the
   pool. If so, calls assemble_pools() (.../pool/common.c) to reassemble
   the pool in the hope that the pool has grown.

3) Calls pool_blkmap() (.../pool/common.c) to do the actual mapping
   (linear or striped).
4) Substitutes the newly mapped "real" device and sector number into the
   buffer head structure.

5) Returns 1 to the kernel, to tell it to pass the bh to the next driver
   in the stack (e.g. the driver for the "real" disk drive).

With this mapping scheme, pool should never receive a request from a queue
to do actual I/O . . . all requests should be intercepted and mapped (by
pool) before they get onto the pool's queue. Nonetheless, pool needs to
register a "dummy" I/O request handler function. It does so in pool_init(),
with the following:

  #define DEVICE_REQUEST do_pool_request
  blk_init_queue(BLK_DEFAULT_QUEUE(POOL_MAJOR), DEVICE_REQUEST);

do_pool_request() is the dummy function, which printks an error message if
called.

** Ioctls **

The pool module supports ioctls via the /dev/poolcomm device (pool minor
number 0), or via the /dev/pool/xxxx devices (minor numbers > 0).

The /dev/poolcomm device node is created by one of two means:

1) The pool module creates it via devfs during module initialization (but
   only if devfs, which is not required, is compiled into your kernel).

2) The "passemble" utility creates it each time it executes (whether or
   not it successfully assembles a pool).

The /dev/pool/xxxx nodes are created by "passemble" as it discovers and
successfully assembles pools.

The following lists all ioctls supported by the pool kernel module. It
includes the call used within the kernel module to execute each ioctl, and
an attempt at an exhaustive list of tools that use each ioctl (this is
largely for evaluating portability to a new device mapper):

POOL_MEMEXP:  Sends a DMEP command to DMEP hardware, using pool_memexp().
              Used by: do_dmep, dmep_conf

POOL_CLEAR_STATS:  Clears pool statistics.
              Used by: pinfo

POOL_DUMP_TOTALSTATS:  Copies pool stats to user space.
              Used by: pinfo

BLKGETSIZE:  Copies pool capacity (in blocks) to user space. (Note:
             capacity is a 64-bit value internal to pool, but the
             interface passes only 32 bits.)
             Used by: mkfs.ogfs, ogfs_expand, ogfs_jadd
             (Not used by: initds, ptool, pgrow, which access real devices)

POOL_ADD:  Brings a new pool to the attention of the pool module, using
           uadd_pool() -> add_pool().
           Used by: passemble

POOL_REMOVE:  Removes a pool from the attention of the pool module, using
              remove_pool().
              Used by: passemble

POOL_GROW:  Grows a pool already known by the pool module, using
            ugrow_pool() -> grow_pool().
            Used by: pgrow

POOL_COUNT:  Copies the number of pools known by the pool module to user
             space, using count_pool().
             Used by: passemble, ptool, pgrow, pinfo

POOL_LIST:  Copies the list of pools known by the pool module to user
            space, using list_pool().
            Used by: passemble, ptool, pgrow, pinfo

POOL_SPLIST:  Copies the list of subpools in a given pool to user space,
              using list_spool().
              Used by: passemble, pgrow, pinfo, dmep_conf, do_dmep,
              mkfs.ogfs

POOL_DEVLIST:  Copies the list of devices in a given pool to user space,
               using list_pooldevs().
               Used by: passemble, pgrow, pinfo, dmep_conf, do_dmep

POOL_IND_CACHE:  Syncs and flushes, using fsync() and
                 invalidate_buffers().
                 Used by: nothing

POOL_DEBUG_ON:  Sets flags for enabling pool module debugging, using
                turn_on_pool_debug_flag().
                Used by: ptool

POOL_DEBUG_OFF:  Resets flags for enabling pool module debugging, using
                 turn_off_pool_debug_flag().
                 Used by: ptool

HDIO_GETGEO:  No-op, printk.
              Used by: nothing

default:  blk_ioctl(), the kernel's ioctl handler.

Porting to Device Mapper
------------------------

There is currently a desire to use Device Mapper (DM) as the kernel
component (replacing the pool module) to do the mapping between the
/dev/pool/xxxx abstract devices and the physical devices. This should be
do-able by one of two methods:

1). Porting the ptool utilities to use DM
2). Using other tools that currently sit on top of DM

The preferred method is #2, of course! We're looking into EVMS to provide
this functionality. There may be other tools that could provide this as
well.

Issues:

1).
Device mapper does not know how to "automagically" scan and read pool
    labels when growing. We will need to provide this ability in user
    space. (Does EVMS provide this capability?)

2). Device mapper does not support sending DMEP commands to devices. (We
    may drop DMEP support entirely, or use another channel to send DMEP
    commands to DMEP SCSI hardware. Source code for the OpenGFS DMEP
    utilities says that commands can be sent via sg, the SCSI generic
    driver.)

3). mkfs.ogfs uses the POOL_SPLIST (get subpool list) ioctl provided by
    the pool kernel module, but not provided by Device Mapper. (We'll need
    another way to provide subpool info to the filesystem.)

APPENDIX A. Structure Details
-----------------------------

This section contains a list of the structures stored in the filesystem.
Unless otherwise stated, all types are unsigned integers of the given
length.

1). Pool labels (pool_label_t)

      0 +-------------------+
        | 0x011670          |  Magic number
      8 +-------------------+
        | Pool id           |  Unique pool identifier
     16 +-------------------+
        | Pool name         |  Name of pool (char)
    272 +-------------------+
        | Pool version      |  Pool version
    276 +-------------------+
        | Num. subpools     |  Number of subpools in this pool
    280 +===================+
        | Subpool number    |  Subpool number within pool
    284 +-------------------+
        | SP Num. data part.|  Number of data partitions in this subpool
    288 +-------------------+
        | SP Partition num. |  Partition number within subpool
    292 +-------------------+
        | Subpool type      |  Data (ogfs_data) or journal (ogfs_journal)
    296 +-------------------+
        | Num. part. blocks |  Number of blocks in this partition
    304 +-------------------+
        | Striping blocks   |  Stripe size within subpools in blocks
    308 +-------------------+    (0 = off)
        | SP Num. DMEP dev. |  Number of DMEP devices in this subpool*
    312 +-------------------+
        | SP DMEP dev. num. |  DMEP device number within subpool*
    316 +-------------------+
        | SP DMEP weight    |  Preference weight for using this device
    320 +-------------------+    (see opengfs/docs/pooltemplate)

  * If the number of DMEP devices is zero, then the DMEP device is the
    subpool id of a subpool that handles DMEP instead.