OpenGFS Block Allocation

Copyright 2003 The OpenGFS Project
Author:  Ben Cahill (bc), ben.m.cahill@intel.com

1.0 Introduction
-----------------

This document contains details of the OpenGFS on-disk block allocation
methods.  This document is aimed at developers, potential developers,
students, and anyone who wants to know about the details of block
allocation in OpenGFS.

This document is not intended as a user guide to OpenGFS.  Look in the
OpenGFS WHATIS-ogfs document for an overview of OpenGFS.  Look in the
OpenGFS HOWTO-generic or HOWTO-nopool for details of configuring and
setting up OpenGFS.

This document may contain inaccurate statements, based on the author's
limited understanding.  Please contact the author (bc) if you see anything
wrong or unclear.

1.1. Terminology
----------------

"Locks" and "g-locks" refer to locks supported by the OpenGFS intra- and
inter-node locking system.  See the ogfs-locking document for details.

2.0. Block Allocation Overview
------------------------------

OpenGFS must allocate disk blocks for two major "worlds":

-- filesystem data and metadata
-- journal data

Most of this document focuses on the filesystem, because it is much more
complex than the journal.  The filesystem may eventually develop many holes
of unused blocks across a disk drive, and filesystem block allocation must
coordinate with all of the other computers in the cluster, since they all
share the filesystem space.  In contrast, journal block allocation is
(almost) strictly sequential, without regard to other computers, since each
journal is dedicated to a single machine.

2.1. Filesystem block allocation basics
----------------------------------------

2.1.0. Resource groups
-----------------------

Filesystem block allocation relies on dividing the filesystem into a number
of "resource groups".  Each resource group (rgrp or RG) is a contiguous
area of blocks within the filesystem space.  Each rgrp begins with an rgrp
header block, which includes the rgrp header along with a bitmap describing
the usage of each data block in the rgrp.  Additional bitmap blocks may
follow the header block, if the rgrp contains more data blocks than can be
represented within the header block's bitmap bytes.  See the ogfs-ondisk
doc for details.

Resource groups are similar to, for example, ext2 block groups, but with at
least one important difference:  each resource group keeps its own set of
block usage statistics in its header.  In contrast, ext2 keeps all block
usage statistics within its single superblock.  Such a central statistics
bank would be a bottleneck for a shared filesystem, since *all* computers
in the cluster could be competing for write access to the same information
as they allocate data blocks and change the statistics.

In a sense, then, each rgrp can be viewed as a "mini-filesystem", providing
distributed block usage statistics.  If each computer in the cluster can
focus its block allocation on a single resource group that is different
from all other computers' resource groups, the computers will avoid
interfering with one another, almost as if they were working in separate
filesystems.

OGFS tries to do just that.  At mount time, each computer assigns itself a
unique resource group in which to allocate new directories.  After that,
each computer tries to allocate blocks, for file data and metadata, within
the parent directory's rgrp.
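To make the header/bitmap arithmetic concrete, here is a minimal sketch
(not actual OpenGFS code) that estimates how many additional bitmap blocks
an rgrp would need.  It assumes 2 bitmap bits per block, as described in
section 2.1.2. below; BLOCK_SIZE, HEADER_RESERVED, and META_RESERVED are
made-up placeholders for the real sizes defined in fs/ogfs_ondisk.h.

    #include <stdio.h>

    #define BLOCK_SIZE       4096U  /* example filesystem block size */
    #define HEADER_RESERVED   128U  /* hypothetical: rgrp header bytes in block 0 */
    #define META_RESERVED      64U  /* hypothetical: meta header bytes per bitmap block */

    /* Estimate how many bitmap blocks (beyond the header block) an rgrp needs. */
    unsigned int bitmap_blocks_needed(unsigned int data_blocks)
    {
        /* 2 bits per block => each bitmap byte describes 4 blocks */
        unsigned int bitmap_bytes = (data_blocks + 3) / 4;
        unsigned int bytes_in_header = BLOCK_SIZE - HEADER_RESERVED;
        unsigned int bytes_per_bitmap_block = BLOCK_SIZE - META_RESERVED;

        if (bitmap_bytes <= bytes_in_header)
            return 0;   /* the header block's bitmap bytes are enough */

        bitmap_bytes -= bytes_in_header;
        return (bitmap_bytes + bytes_per_bitmap_block - 1) / bytes_per_bitmap_block;
    }

    int main(void)
    {
        /* e.g. an rgrp covering 65536 blocks, with 4 KB blocks */
        printf("extra bitmap blocks needed: %u\n", bitmap_blocks_needed(65536));
        return 0;
    }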
With luck, the directory will have been allocated by *this* machine, and
*this* machine will already have the rgrp's inter-node glock in its glock
cache, along with current rgrp statistics in its memory, and things will go
quickly and smoothly.

This is not always possible, of course, and that's where things get more
interesting.  Once the filesystem has accumulated a number of directories,
different computers will likely be using each others' rgrps as they need to
add new files to each others' directories.  This requires exchanging
inter-node locks, and reading new (current) block statistics and block
usage bitmaps from disk.  And, as the filesystem grows, rgrps may become
full, causing block allocation to look for another rgrp.

OGFS has three algorithms for changing a computer's target directory
allocation rgrp.  One is based on rgrp capacity, while the other two use
new directory creation as the event to trigger an rgrp change:

-- Single:       Use an rgrp until full, then change to the next rgrp with
                 space.
-- Round-Robin:  Use a new rgrp (increment) for each new directory.
-- Random:       Use a new rgrp (random) for each new directory.

The user may choose one of these methods when mounting the filesystem.
"Single" is the default.  This method relies on the block reservation logic
to determine when an rgrp is "full", and then search for the next rgrp that
contains "enough" space.  "Full" and "enough" depend on the size of the
object being written, so a particular rgrp might be found "full" when
trying to write a large file, but might have sufficient capacity for a
smaller file.

Note that, when used with a volume manager, a single rgrp may straddle two
or more disk drives, especially if striping is used.  The filesystem has no
requirement for a resource group to align with a disk drive boundary.

Especially note that rgrps are used *only* for block allocation and
de-allocation; this means *writes* (including rm and mv, of course).
*Reads* of files, directories, etc., do not need to allocate or de-allocate
blocks, so they do not need to read or modify rgrp information, and do not
need locks on rgrps; they get all that they need (i.e. which blocks belong
to which directory entries or files) from the inodes in the directory
trees, and the associated metadata blocks containing block pointer lists.
Specifically, this means that computers in the cluster will not conflict
with one another over rgrp locks when reading; any reading conflict would
occur only at the inode (file or directory) level.

2.1.1. Usage statistics
------------------------

An rgrp's usage statistics include the following numbers:

-- total free blocks (any type of block)
-- used dinodes
-- free dinodes
-- used meta blocks (not including dinodes)
-- free meta blocks

Note that there is no number for free data blocks, and no number indicating
the total number of blocks within the rgrp.

Some rgrps may be different sizes than others.  When creating the
filesystem, the first rgrp may be a different size than the rest.  When
expanding the filesystem, the new rgrps may be a different size than the
original filesystem's rgrps.

Usage statistics are used for two purposes:

-- block allocation, when searching for an rgrp with sufficient capacity.
   Stored in the rgrp descriptor's rd_rg member.
-- statfs() operations, when queried by the operating system about block
   usage.  Stored in the rgrp descriptor's rd_rs member.

The statfs() operations are supported by Lock Value Blocks (LVBs), which
can quickly and efficiently share usage data among all nodes (see section
2.1.3., Lock Value Blocks).
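As an illustration of the statistics listed above and the two in-core
copies just mentioned, the sketch below shows roughly how the counters and
the rd_rg/rd_rs split might look.  The structure and field names are
invented for this document; the real definitions are in the OpenGFS sources
(see fs/ogfs_ondisk.h and the in-core headers).

    #include <stdint.h>

    /* Per-rgrp usage counters, as listed above (names are illustrative). */
    struct example_rgrp_stats {
        uint32_t free;           /* total free blocks (any type of block) */
        uint32_t used_dinodes;   /* used dinodes */
        uint32_t free_dinodes;   /* free dinodes */
        uint32_t used_meta;      /* used meta blocks (not including dinodes) */
        uint32_t free_meta;      /* free meta blocks */
    };

    /* In-core rgrp descriptor, showing only the two statistics copies. */
    struct example_rgrp_descriptor {
        struct example_rgrp_stats rd_rg;  /* from-disk copy, used for block allocation */
        struct example_rgrp_stats rd_rs;  /* LVB copy, used to support statfs() */
        /* ... glock, bitmap buffers, etc. omitted ... */
    };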
OGFS block allocation is *not* supported by the LVBs.  Block allocation
requires not only the usage statistics, but the block allocation bitmaps as
well.  The block bitmaps are much too large to share via LVB.  Therefore,
OGFS uses the on-disk information, reading the header and bitmap blocks
when locking an rgrp, then flushing them back to disk when releasing the
inter-node lock.  The rationale for this seems to be that when
reading/flushing the bitmap blocks, we might as well read/flush the header
block (which contains some bitmap data, anyway).  However, there would seem
to be an untaken opportunity to use the LVB data when planning block
allocation (see section 2.1.4.0., Reservations).

2.1.2. Block types and bitmap
------------------------------

There are 3 categories of block types within OpenGFS:

-- data (file contents)
-- meta (OGFS-specific data describing filesystem layout)
-- dinode (one dinode block for each inode)

When meta blocks are needed, OGFS creates them in a 64-block (not
necessarily contiguous) "clump" by converting data blocks to meta blocks
(by toggling the block type bit in the bitmap).  See clump_alloc() in
fs/blklist.c.  If 64 data blocks are not available, it will convert as many
blocks as are available.

When dinode blocks are needed, OGFS looks in the rgrp's list of free
dinodes (see next paragraph), or creates them from meta blocks.  If not
enough meta blocks are available, it will attempt to create a new clump of
meta blocks, then allocate dinodes from them.

When dinodes are freed, they are placed into an on-disk linked list of free
dinodes.  A member of the on-disk rgrp header structure contains the block
number of the first free dinode block.  This dinode contains the block
number of the next, and so forth.  Because this list is on-disk, it
persists and is shareable.  The list keeps these blocks separate from
regular free meta blocks.  There is no difference in representation within
the rgrp's block bitmap between dinodes and meta blocks.

The block bitmap contains 2 bits for each block in the rgrp (except the
header and bitmap blocks):

-- "data/metadata"
-- "used"

See ogfs-ondisk-layout and fs/ogfs_ondisk.h for details.

2.1.3. Lock Value Blocks (LVBs)
--------------------------------

For efficiency, OpenGFS passes rgrp block usage statistics back and forth
among the compute nodes by use of a Lock Value Block (LVB) that is
supported by the locking system.  The LVB data is the same as the
statistics flushed to disk in the rgrp header blocks.  However, it can be
shared much more quickly via the locking system (typically LAN based) than
by writing it to disk (then reading it from another node), particularly
when a number of nodes need the data.

The LVB data is used for supporting statfs() calls from the operating
system, handled by the ogfs_statfs(), ogfs_stat_ogfs(), and
ogfs_stat_rgrp() calls.  statfs() can happen fairly often, and requires
adding the statistics from each and every rgrp each time it occurs on a
node.  This would require a cumbersome number of disk reads to get the
latest data, especially if the cluster is large (lots of nodes doing
statfs() calls) or the filesystem contains a lot of rgrps (lots of disk
reads per statfs()).  Sharing via LVB is far more efficient.

Note that the LVB data is *not* used for supporting block allocation
itself, since block allocation requires the latest block bitmaps in
addition to usage data.  See section 2.1.1., Usage Statistics.
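To show why sharing these statistics via LVB pays off, here is a hedged
sketch of the summation that every statfs() call implies:  the counters of
each and every rgrp are added together.  It reuses the illustrative
structure from the sketch in section 2.1.1., and is not the actual
ogfs_stat_ogfs() code.

    #include <stddef.h>
    #include <stdint.h>

    struct example_rgrp_stats {
        uint32_t free;           /* total free blocks */
        uint32_t used_dinodes;
        uint32_t free_dinodes;
        uint32_t used_meta;
        uint32_t free_meta;
    };

    /* Sum the per-rgrp (LVB-cached) statistics into filesystem-wide totals. */
    void example_sum_rgrp_stats(const struct example_rgrp_stats *stats,
                                size_t num_rgrps,
                                uint64_t *total_free_blocks,
                                uint64_t *total_dinodes)
    {
        *total_free_blocks = 0;
        *total_dinodes = 0;

        for (size_t i = 0; i < num_rgrps; i++) {
            *total_free_blocks += stats[i].free;
            *total_dinodes += (uint64_t)stats[i].used_dinodes +
                              stats[i].free_dinodes;
        }
    }

With the LVB mechanism, the numbers summed here can come from the locking
system rather than requiring a disk read per rgrp.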
The g-lock layer provides a glops hook, acquire_rgrp(), to read in the LVB
data from the lock's LVB when a node acquires an rgrp lock.  See the
ogfs-locking doc for details on glops.  The LVB data resides in the in-core
resource group descriptor structure's rd_rs member, separate from the
from-disk data used for block allocation, which is stored in the rd_rg
member.

The LVB (rd_rs) data may not be valid if:

-- the rgrp has never been locked before (LVB never initialized)
-- another node owned the rgrp lock in EX mode, but crashed before
   successfully writing updated data to the lock's LVB.

In either case, OGFS calls ogfs_rgrp_lvb_init(), which will:

-- lock the rgrp, causing rd_rg to be filled with on-disk rgrp header data
   (see the lock_rgrp() glops function)
-- copy the rd_rg data to rd_rs, in ogfs_rgrp_lvb_fill()
-- copy rd_rs out to the glock's LVB, in ogfs_rgrp_save_out()
-- sync the LVB data out to the inter-node lock system, in ogfs_sync_lvb()
-- unlock the rgrp

The "other node crashed" situation is currently sensed by an "owner" field
in the LVB data.  A node sets itself as the owner after it locks an rgrp in
EX mode, expecting to do some block alloc/de-alloc.  It resets the owner
field to zero just before syncing the LVB.  If a node finds another node's
ownership (node) ID set within the LVB, it knows that the other node
crashed before successfully syncing the LVB.

When using OpenDLM as a lock manager, the LVB data will be reset to 0 when
a node crashes.  Current OGFS code would interpret this as "no owner", a
valid situation.  Therefore, we will need another method to check the
sanity of the LVB data.  OpenGFS currently uses the test
rs_magic == OGFS_MAGIC within ogfs_rgrp_lvb_init() to check whether the LVB
content is valid.  But this function is called only after testing for LVB
sanity, and therefore would not be called if the owner were "0".  We will
need to use the same test elsewhere to verify the LVB content.

2.1.4. Allocation Procedure
----------------------------

Block allocation is done in two passes:

-- Reservation, which looks at (but doesn't change) current block usage
   statistics
-- Actual allocation, which modifies block usage statistics and the block
   bitmap

2.1.4.0. Reservations
----------------------

The reservation pass finds an rgrp that has capacity for the greatest
number of blocks that might be needed for a given write operation.  This
includes both data and metadata (e.g. inode block list) blocks.

Reservation Structure
---------------------

The reservation structure, ogfs_ipreserv_t, contains 3 sections:

-- Request, filled in by the caller of ogfs_inplace_reserv()
-- Reservation (fulfillment plan), filled in by ogfs_inplace_reserv()
-- Allocation stats, filled in by the actual block alloc routines

The request includes the maximum number of blocks that the caller expects
it could possibly need, based on the size of the write, in 3 categories:

-- data blocks
-- meta blocks (excluding dinode blocks)
-- dinode blocks (one block for each distributed inode)

The request also includes the "target" rgrp.  This is the rgrp in which the
caller thinks the blocks should be reserved.  For a new inode, this is the
rgrp of the new inode's parent directory.  For a pre-existing inode, this
is the last rgrp used for its block allocation.  For a new directory inode,
it might be a new rgrp altogether, depending on the rgrp assignment scheme.

The reservation includes the actual rgrp found that will fit (or best fit)
the request.  If there is enough room in the target rgrp, that's the one it
will use.  If not, it will search for another.
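Before continuing with the remaining fields, here is an illustrative sketch
of the three sections described above.  It is a stand-in for
ogfs_ipreserv_t with invented field names, not the real definition from the
OpenGFS sources.

    #include <stdint.h>

    /* Illustrative stand-in for ogfs_ipreserv_t (field names invented). */
    struct example_reservation {
        /* Request: worst-case block counts, filled in by the caller */
        uint32_t req_data;        /* data blocks */
        uint32_t req_meta;        /* meta blocks (excluding dinodes) */
        uint32_t req_dinodes;     /* dinode blocks */
        uint64_t target_rgrp;     /* rgrp the caller would like to use */

        /* Reservation (fulfillment plan), filled in by ogfs_inplace_reserv() */
        uint64_t chosen_rgrp;     /* rgrp that fits (or best fits) the request */
        uint32_t rsv_data;        /* data blocks reserved in that rgrp */
        uint32_t rsv_meta;        /* meta blocks (including dinodes) reserved */

        /* Allocation stats, bumped one by one by the actual alloc routines */
        uint32_t alloced_data;
        uint32_t alloced_meta;
        uint32_t alloced_dinodes;
    };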
The reservation also includes the number of meta (including dinode) and
data blocks now reserved in the rgrp.  The allocation stats are usually
incremented one by one, as blocks are actually allocated by the routines
doing the write.

Reservation Logic
-----------------

The reservation logic locks, in exclusive mode, each rgrp that it examines.
Usually, if the rgrp is too full, the reservation logic releases the lock
immediately.  Once the right rgrp is found, the reservation logic keeps
that rgrp locked.  The lock stays held until all block allocation for the
write operation is complete.  This ensures that no other node will allocate
blocks from this rgrp while we examine statistics, allocate blocks, and
update statistics within this rgrp.  Note that the rgrp lock does not
prevent other nodes from reading files within the rgrp (the lock protects
only the rgrp header and block bitmap), nor even from writing to files that
stay the same size.

The reservation logic is implemented in fs/blklist.c.
ogfs_inplace_reserv() is the high level reservation call.  It tries a few
different methods for finding an appropriate rgrp, easiest first:

-- look in the target rgrp (contained in the reservation structure).
-- look in the next rgrp, then the next, etc.
-- if *no* rgrp can be found to fit the entire request, look for a
   "best fit".

The second method tries to find an rgrp that can fit the entire request.
However, since it may be that *no* rgrp can fit the entire request, it also
keeps track of which rgrp might provide a "best fit".  Currently, this
means that the rgrp can fit the entire data block request, but might not be
able to handle the entire meta block request.

Currently (November 2003), there may be room for improvement in the block
allocation scheme, including:

-- partial allocs for data (that won't fit entirely in a single rgrp)
-- automatic recovery of metadata and dinode blocks (back to data blocks)
   in rgrps that are getting full

2.1.4.1. Actual allocation
---------------------------

Actual allocation involves searching through the rgrp's block bitmap for
the individual blocks to allocate, then changing bit state within the block
bitmap.  Allocation is done by the ogfs_blkalloc() and ogfs_metaalloc()
routines, both of which are in fs/blklist.c.  The allocation routines are
responsible for updating the allocation stats in the reservation structure,
as well as in the in-core image of the rgrp header.

Once a write operation is complete, the rgrp's statistics must be updated
within the rgrp header block and the LVB.  The rgrp header and bitmap
blocks must be flushed to disk, to support alloc/de-alloc by other nodes.
The LVB must be flushed to the locking back end, to support statfs()
functions on all nodes.
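As a footnote to the bitmap manipulation described above, here is a minimal
sketch of 2-bits-per-block state handling.  Section 2.1.2. says only that
there are two bits per block ("data/metadata" and "used"); the packing
order, state values, and helper names below are assumptions made for
illustration, and the real encoding and helpers are in the OpenGFS sources
(see fs/ogfs_ondisk.h).

    #include <stdint.h>

    #define EX_BLKST_FREE      0x0  /* hypothetical: free data block */
    #define EX_BLKST_USED      0x1  /* hypothetical: used data block */
    #define EX_BLKST_FREEMETA  0x2  /* hypothetical: free meta block */
    #define EX_BLKST_USEDMETA  0x3  /* hypothetical: used meta block */

    /* Read the 2-bit state of block 'blk' within a bitmap buffer. */
    unsigned int example_get_state(const uint8_t *bitmap, uint32_t blk)
    {
        uint32_t byte = blk / 4;             /* 4 blocks per bitmap byte */
        unsigned int shift = (blk % 4) * 2;  /* 2 bits per block */

        return (bitmap[byte] >> shift) & 0x3;
    }

    /* Set the 2-bit state of block 'blk', e.g. when allocating or freeing it. */
    void example_set_state(uint8_t *bitmap, uint32_t blk, unsigned int state)
    {
        uint32_t byte = blk / 4;
        unsigned int shift = (blk % 4) * 2;

        bitmap[byte] &= ~(0x3 << shift);
        bitmap[byte] |= (state & 0x3) << shift;
    }

In these terms, actual allocation searches a bitmap buffer for a block in
the wanted state, changes that block's state, and then updates the
allocation stats in the reservation structure and the in-core rgrp header,
as described above.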