Block_Allocation_and_rgrps (Jul 29 2004)


                     OpenGFS Block Allocation

Copyright 2003 The OpenGFS Project

Ben Cahill (bc),

1.0  Introduction

This document contains details of the OpenGFS on-disk block allocation methods.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of shared file system locking in

This document is not intended as a user guide to OpenGFS.  Look in the OpenGFS
WHATIS-ogfs document for an overview of OpenGFS.  Look in the OpenGFS
HOWTO-generic or HOWTO-nopool for details of configuring and setting up OpenGFS.

This document may contain inaccurate statements, based on the author's limited
understanding.  Please contact the author (bc) if you see anything wrong or
unclear.

1.1. Terminology

"Locks" and "g-locks" refer to locks supported by the OpenGFS intra- and
inter-node locking system.  See ogfs-locking document for details.

2.0. Block Allocation Overview

OpenGFS must allocate disk blocks for two major "worlds":

-- filesystem data and metadata
-- journal data

Most of this document will focus on the filesystem, because that is a lot more
complex than the journal.  Consider that the filesystem may eventually develop
a lot of holes of unused blocks across a disk drive, and that filesystem
block allocation must coordinate with all of the other computers in the
cluster, since they all share the filesystem space.

In contrast, journal block allocation is (almost) strictly sequential, and
proceeds without regard to other computers, since each journal is dedicated
to a single computer (node).

2.1.  Filesystem block allocation basics

2.1.0.  Resource groups

Filesystem block allocation relies on dividing the filesystem into a number
of "resource groups".  Each resource group (rgrp or RG) is a contiguous area of
blocks within the filesystem space.  Each rgrp begins with an rgrp header block,
which includes the rgrp header along with a bitmap describing the usage of each
data block in the rgrp.  Additional bitmap blocks may follow the header block,
if the rgrp contains more data blocks than can be represented within the
header block's bitmap bytes.  See the ogfs-ondisk doc for details.
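
The sketch below shows the general shape of an rgrp header, plus a worked
example of bitmap coverage.  The field names are illustrative, patterned
after the description here rather than copied from fs/ogfs_ondisk.h:

    #include <stdint.h>
    #include <stdio.h>

    struct rgrp_header_sketch {
        uint32_t rg_flags;         /* state flags                        */
        uint32_t rg_free;          /* free blocks, any type              */
        uint32_t rg_useddi;        /* used dinode blocks                 */
        uint32_t rg_freedi;        /* free dinode blocks                 */
        uint64_t rg_freedi_blkno;  /* head of on-disk free dinode chain  */
        uint32_t rg_usedmeta;      /* used meta blocks, minus dinodes    */
        uint32_t rg_freemeta;      /* free meta blocks                   */
        /* bitmap bytes for the rgrp's blocks follow, spilling into
           additional bitmap blocks when the rgrp is large */
    };

    int main(void)
    {
        /* Two bitmap bits per block (section 2.1.2), so one 4096-byte
           bitmap block describes 4096 * 8 / 2 = 16384 blocks. */
        unsigned block_size = 4096;
        printf("blocks per bitmap block: %u\n", block_size * 8 / 2);
        return 0;
    }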

Resource groups are similar to, for example, ext2 block groups, but with at
least one important difference:  each resource group keeps its own set of
block usage statistics in its header.  In contrast, ext2 keeps all block
use statistics within its single superblock.  A central statistics bank, such
as that, would be a bottleneck for a shared filesystem, since *all* computers
in the cluster could be competing for write access to the same information,
as they allocate data blocks and change the statistics.

In a sense, then, each rgrp can be viewed as a "mini-filesystem", providing
distributed block usage statistics.  If each computer in the cluster can focus
its block allocation on a single resource group that is different from all
other computers in the cluster, the computers will avoid interfering with one
another, almost as if they were working in separate filesystems.

OGFS tries to do just that.  At mount time, each computer assigns itself a
unique resource group in which to allocate new directories.  After that,
each computer tries to allocate blocks, for file data and metadata, within
the parent directory's rgrp.  With luck, the directory will have been
allocated by *this* machine, and *this* machine will already have the rgrp's
inter-node glock in its glock cache, along with current rgrp statistics in its
memory, and things will go quickly and smoothly.

This is not always possible, of course, and that's where things get more
interesting.  Once the filesystem has accumulated a number of directories,
different computers will likely be using each others' rgrps as they need to
add new files to each others' directories.  This requires exchanging inter-node
locks, and reading new (current) block statistics and block usage bitmaps
from disk.

And, as the filesystem grows, rgrps may become full, causing block allocation
to look for another rgrp.

OGFS has three algorithms for changing a computer's target directory
allocation rgrp.  One is based on rgrp capacity, while the other two use
new directory creation as the event to trigger an rgrp change:

-- Single:  Use an rgrp until full, then change to the next rgrp with space.
-- Round-Robin:  Use a new rgrp (increment) for each new directory.
-- Random:  Use a new rgrp (random) for each new directory.

The user may choose one of these methods when mounting the filesystem.
"Single" is the default.  This method relies on the block reservation logic to
determine when an rgrp is "full", and then search for the next rgrp that
contains "enough" space.  "Full" and "enough" depend on the size of the object
being written, so a particular rgrp might be found "full" when trying to write
a large file, but might have sufficient capacity for a smaller file.
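
A minimal sketch of the three policies, assuming each node tracks a
current rgrp index; the enum and function names are illustrative, not the
actual mount option handling:

    #include <stdlib.h>

    enum rgrp_policy { RGRP_SINGLE, RGRP_ROUND_ROBIN, RGRP_RANDOM };

    /* Pick the rgrp for a new directory.  'cur' is this node's current
       target rgrp, 'n' the number of rgrps in the filesystem. */
    static unsigned next_dir_rgrp(enum rgrp_policy p, unsigned cur,
                                  unsigned n, int cur_is_full)
    {
        switch (p) {
        case RGRP_SINGLE:
            /* Stay put until the reservation logic finds the rgrp
               "full", then move on (the real search skips ahead to
               the next rgrp with enough space). */
            return cur_is_full ? (cur + 1) % n : cur;
        case RGRP_ROUND_ROBIN:
            /* Advance on every new directory. */
            return (cur + 1) % n;
        case RGRP_RANDOM:
            /* Jump to a random rgrp on every new directory. */
            return (unsigned)rand() % n;
        }
        return cur;
    }

Round-robin and random trade locality for spreading directories (and the
file allocations that follow them) across rgrps, which tends to reduce
inter-node competition for any one rgrp's lock.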

Note that, when used with a volume manager, a single rgrp may straddle two or
more disk drives, especially if striping is used.  The filesystem has no
requirement for a resource group to align with a disk drive boundary.

Especially note that rgrps are used *only* for block allocation and
de-allocation; this means *writes* (including rm and mv, of course).  *Reads*
of files, directories, etc., do not need to allocate or de-allocate blocks, so
they do not need to read or modify rgrp information, and do not need locks on
rgrps; they get all that they need (i.e. which blocks belong to which directory
entries or files) from the inodes in the directory trees, and the associated
metadata blocks containing block pointer lists.  Specifically, this means that
computers in the cluster will not conflict with one another over rgrp locks
when reading; any reading conflict would occur only at the inode (file or
directory) level.

2.1.1.  Usage statistics

An rgrp's usage statistics include the following numbers:

-- total free blocks (any type of block)
-- used dinodes
-- free dinodes
-- used meta blocks (not including dinodes)
-- free meta blocks

Note that there is no number for free data blocks, and no number indicating
the total number of blocks within the rgrp.  Some rgrps may be different
sizes than others.  When creating the filesystem, the first rgrp may be a
different size than the rest.  When expanding the filesystem, the new rgrps
may be a different size than the original filesystem's rgrps.

Usage statistics are used for two purposes:

-- block allocation, when searching for an rgrp with sufficient capacity.
   Stored in the rgrp descriptor's rd_rg member (read from disk).

-- statfs() operations, when queried by the operating system about block usage.
   Stored in the rgrp descriptor's rd_rs member (shared via LVB).

The statfs() operations are supported by Lock Value Blocks (LVBs), which can
quickly and efficiently share usage data between all nodes (see section 2.1.3.
Lock Value Blocks).

OGFS block allocation is *not* supported by the LVBs.  Block allocation
requires not only the usage statistics, but the block allocation bitmaps as
well.  The block bitmaps are much too large to share via LVB.  Therefore,
OGFS uses the on-disk information, reading the header and bitmap blocks when
locking an rgrp, then flushing them back to disk when releasing the inter-node
lock.

The rationale for this seems to be that, when reading/flushing the bitmap
blocks, we might as well read/flush the header block (which contains some
bitmap data anyway).  However, there would seem to be an untaken opportunity
to use the LVB data when planning block allocation (see Reservations, under
section 2.1.4).

2.1.2.  Block types and bitmap

There are 3 categories of block types within OpenGFS:

-- data (file contents)
-- meta (OGFS-specific data describing filesystem layout)
-- dinode (one dinode block for each inode)

When meta blocks are needed, OGFS creates them in a 64-block (not necessarily
contiguous) "clump" by converting data blocks to meta blocks (by toggling the
block type bit in the bitmap).  See clump_alloc() in fs/blklist.c.  If 64
data blocks are not available, it will convert as many blocks as are available.

When dinode blocks are needed, OGFS looks in the rgrp's list of free dinodes
(see next paragraph), or creates them from meta blocks.  If not enough
meta blocks are available, it will attempt to create a new clump of meta
blocks, then allocate dinodes from them.

When dinodes are freed, they are placed into an on-disk linked list of free
dinodes.  A member of the on-disk rgrp header structure contains the block
number of the first free dinode block.  This dinode contains the block number
of the next, and so forth.  This is on-disk, so it persists and is shareable.

The list keeps these blocks separate from regular free meta blocks.  There is no
difference in representation within the rgrp's block bitmap between dinodes and
meta blocks.
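
As a sketch, the chain can be walked like a conventional singly-linked
list.  Here an in-memory array stands in for the on-disk blocks (in real
OGFS each "next" pointer is read from the free dinode block itself):

    #include <stdint.h>

    /* Stand-in: dinode_next[b] holds the block number recorded inside
       free dinode b, naming the next free dinode; 0 ends the chain.
       Block numbers are kept small here so they can index the array. */
    static uint64_t dinode_next[1024];

    /* Count free dinodes by chasing the chain from the head block
       number kept in the rgrp header. */
    static unsigned count_free_dinodes(uint64_t head_blkno)
    {
        unsigned n = 0;
        uint64_t blk = head_blkno;

        while (blk != 0) {
            n++;
            blk = dinode_next[blk];
        }
        return n;
    }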

The block bitmap contains 2 bits for each block in the rgrp (except the header
and bitmap blocks):

-- "data/metadata"
-- "used"

See ogfs-ondisk-layout and fs/ogfs_ondisk.h for details.
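
With two bits per block, each bitmap byte holds four entries.  The
helpers below sketch the arithmetic; the actual accessors and bit
assignments are those in the OGFS source, not these illustrative ones:

    #include <stdint.h>

    #define BLK_USED 0x1   /* illustrative: low bit = "used"           */
    #define BLK_META 0x2   /* illustrative: high bit = "data/metadata" */

    static unsigned bitmap_get(const uint8_t *bmp, uint32_t blk)
    {
        return (bmp[blk / 4] >> ((blk % 4) * 2)) & 0x3;
    }

    static void bitmap_set(uint8_t *bmp, uint32_t blk, unsigned state)
    {
        unsigned shift = (blk % 4) * 2;

        bmp[blk / 4] = (uint8_t)((bmp[blk / 4] & ~(0x3 << shift)) |
                                 ((state & 0x3) << shift));
    }

The data-to-meta conversion done by clump_alloc() amounts to this kind of
two-bit update applied to each of the (up to) 64 converted blocks.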

2.1.3.  Lock Value Blocks (LVBs)

For efficiency, OpenGFS passes rgrp block usage statistics back and forth
among the compute nodes by use of a Lock Value Block that is supported by the
locking system.  The LVB data is the same as the statistics flushed to disk
in the rgrp header blocks.  However, it can be shared much more quickly via
the locking system (typically LAN based), than by writing to disk (then
reading it from another node), particularly when a number of nodes need the

The LVB data is used for supporting statfs() calls from the operating system,
supported by the ogfs_statfs(), ogfs_stat_ogfs(), and ogfs_stat_rgrp() calls.
statfs() can happen fairly often, and requires adding the statistics from each
and every rgrp each time it occurs on a node.  This would require a cumbersome
amount of disk reads to get the latest data, especially if the cluster is large
(lots of nodes doing statfs() calls) or the filesystem contains a lot of rgrps
(lots of disk reads per statfs()).  Sharing via LVB is far more efficient.
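
The arithmetic behind that claim is a simple sum over all rgrps.  A
sketch, with an illustrative structure standing in for the LVB contents:

    #include <stdint.h>

    struct rgrp_stats {              /* per-rgrp numbers carried in the LVB */
        uint32_t free;               /* free blocks, any type               */
        uint32_t useddi, freedi;     /* dinode counts                       */
        uint32_t usedmeta, freemeta; /* meta counts, minus dinodes          */
    };

    /* statfs() support: total the per-rgrp statistics.  With LVBs,
       each element of lvb[] arrives via the lock module; without
       them, each would cost a disk read of that rgrp's header. */
    static struct rgrp_stats fs_totals(const struct rgrp_stats *lvb,
                                       unsigned n)
    {
        struct rgrp_stats t = {0};
        unsigned i;

        for (i = 0; i < n; i++) {
            t.free     += lvb[i].free;
            t.useddi   += lvb[i].useddi;
            t.freedi   += lvb[i].freedi;
            t.usedmeta += lvb[i].usedmeta;
            t.freemeta += lvb[i].freemeta;
        }
        return t;
    }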

Note that the LVB data is *not* used for supporting block allocation itself,
since block allocation requires the latest block bitmaps in addition to usage
data.  See section 2.1.1. Usage Statistics.

The g-lock layer provides a glops hook, acquire_rgrp(), to read in the LVB data
from the lock's LVB when a node acquires an rgrp lock.  See ogfs-locking doc
for details on glops.  The LVB data resides in the in-core resource group
descriptor structure's rd_rs member, separate from the from-disk data used for
block allocation, stored in the rd_rg member.

The LVB (rd_rs) data may not be valid if:

-- the rgrp has never been locked before (LVB never initialized)
-- another node owned the rgrp lock in EX mode, but crashed before successfully
   writing updated data to the lock's LVB.

In either case, OGFS calls ogfs_rgrp_lvb_init(), which will:

-- lock the rgrp, causing rd_rg to be filled with on-disk rgrp header data
     (see lock_rgrp() glops function)
-- copy the rd_rg data to rd_rs, in ogfs_rgrp_lvb_fill()
-- copy rd_rs out to the glock's LVB, in ogfs_rgrp_save_out()
-- sync LVB data out to the inter-node lock system, in ogfs_sync_lvb()
-- unlock the rgrp
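
In outline, the sequence looks like the sketch below; the descriptor and
lock wrappers are simplified stand-ins for the real helpers named in the
list above:

    #include <string.h>

    struct rgrpd_sketch {
        char rd_rg[64];    /* from-disk stats (stand-in for real type) */
        char rd_rs[64];    /* LVB-side stats (stand-in for real type)  */
        char lvb[64];      /* the glock's lock value block             */
    };

    static int lock_rgrp_ex(struct rgrpd_sketch *rgd) { (void)rgd; return 0; }
    static void unlock_rgrp(struct rgrpd_sketch *rgd) { (void)rgd; }
    static int sync_lvb(struct rgrpd_sketch *rgd)     { (void)rgd; return 0; }

    static int rgrp_lvb_init_sketch(struct rgrpd_sketch *rgd)
    {
        int error;

        error = lock_rgrp_ex(rgd);   /* glops fill rd_rg from disk     */
        if (error)
            return error;
        memcpy(rgd->rd_rs, rgd->rd_rg, sizeof(rgd->rd_rs)); /* lvb_fill */
        memcpy(rgd->lvb, rgd->rd_rs, sizeof(rgd->lvb));     /* save_out */
        error = sync_lvb(rgd);       /* push LVB to inter-node locks   */
        unlock_rgrp(rgd);
        return error;
    }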

The "other node crashed" situation is currently sensed by an "owner" field
in the LVB data.  A node sets itself as the owner after it locks an rgrp
in EX mode, expecting to do some block alloc/de-alloc.  It resets the owner
field to zero just before synching the LVB.  If a node finds another node's
ownership (node) ID set within the LVB, it knows that the other node crashed
before successfully synching the LVB.
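
A sketch of the check, with an illustrative name for the owner slot in
the LVB:

    #include <stdint.h>

    struct rgrp_lvb_sketch {
        uint32_t rs_owner;   /* node ID of an in-progress EX holder, or 0 */
        /* ... usage statistics ... */
    };

    /* The owner field is set after locking EX and cleared just before
       the LVB is synced, so any nonzero owner seen by a later locker
       means the previous holder crashed mid-update. */
    static int lvb_needs_reinit(const struct rgrp_lvb_sketch *lvb)
    {
        return lvb->rs_owner != 0;
    }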

When using OpenDLM as a lock manager, the LVB data will be reset to 0 when
a node crashes.  Current OGFS code would interpret this as "no owner", a valid
situation.  Therefore, we will need another method to check sanity of
the LVB data.  OpenGFS currently uses the test of rs_magic==OGFS_MAGIC
within ogfs_rgrp_lvb_init() to check whether LVB content is valid.  But, this
function is called only after testing for LVB sanity, and therefore would
not be called if the owner were "0".  We will need to use the same test
elsewhere to verify LVB content.

2.1.4.  Allocation Procedure

Block allocation is done in two passes:

-- Reservation, looks at (but doesn't change) current block usage statistics
-- Actual allocation, modifies block usage statistics and block bitmap

Reservations

The reservation pass finds an rgrp that has capacity for the greatest number
of blocks that might be needed for a given write operation.  This includes
both data and metadata (e.g. inode block list) blocks.

Reservation Structure

The reservation structure, ogfs_ipreserv_t, contains 3 sections:

-- Request, filled in by caller to ogfs_inplace_reserv()
-- Reservation (fulfillment plan), filled in by ogfs_inplace_reserv()
-- Allocation stats, filled in by actual block alloc routines

The request includes the max number of blocks that the caller expects it could
possibly need, based on the size of the write, in 3 categories:

-- data blocks
-- meta blocks (excluding dinode blocks)
-- dinode blocks (one block for each distributed inode)

The request also includes the "target" rgrp.  This is the rgrp in which the
caller thinks the blocks should be reserved.  For a new inode, this is the rgrp
of the new inode's parent directory.  For a pre-existing inode, this is the
last rgrp used for its block allocation.  For a new directory inode, it might
be a new rgrp altogether, depending on the rgrp assignment scheme.

The reservation includes the actual rgrp found that will fit (or best fit) the
request.  If there is enough room in the target rgrp, that's the one it will
use.  If not, it will search for another.

The reservation also includes the number of meta (including dinodes) and data
blocks now reserved in the rgrp.

The allocation stats are usually incremented one by one, as blocks are actually
allocated by the routines doing the write.
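
Putting the three sections together, the structure has roughly the
following shape; the field names are patterned on the description above,
not copied from the actual ogfs_ipreserv_t definition:

    #include <stdint.h>

    struct ipreserv_sketch {
        /* Request: filled in by the caller of ogfs_inplace_reserv() */
        uint32_t req_data;     /* max data blocks possibly needed     */
        uint32_t req_meta;     /* max meta blocks, minus dinodes      */
        uint32_t req_dinodes;  /* max dinode blocks                   */
        uint64_t target_rgrp;  /* rgrp to try first                   */

        /* Reservation (fulfillment plan): filled in by
           ogfs_inplace_reserv() */
        uint64_t rgrp;         /* rgrp actually reserved within       */
        uint32_t res_data;     /* data blocks reserved there          */
        uint32_t res_meta;     /* meta (incl. dinode) blocks reserved */

        /* Allocation stats: bumped one by one by the actual block
           allocation routines as the write proceeds */
        uint32_t alloc_data;
        uint32_t alloc_meta;
        uint32_t alloc_dinodes;
    };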

Reservation Logic

The reservation logic locks, in exclusive mode, each rgrp that it examines.
Usually, if the rgrp is too full, the reservation logic releases the lock
immediately.  Once the right rgrp is found, the reservation logic keeps that
rgrp locked.  The lock stays held until all block allocation for the write
operation is complete.  This assures that no other node will allocate blocks
from this rgrp while we examine statistics, allocate blocks, and update
statistics within this rgrp.

Note that the rgrp lock does not prevent other nodes from reading files 
within the rgrp (the lock protects only the rgrp header and block bitmap), nor
even from writing to files that stay the same size.

The reservation logic is implemented in fs/blklist.c.  ogfs_inplace_reserv()
is the high level reservation call.  It tries a few different methods for
finding an appropriate rgrp, the easiest first:

-- look in the target rgrp (contained in the reservation structure).
-- look in the next rgrp, then the next, etc.
-- if *no* rgrp can be found to fit the entire request, look for a "best fit".

The second method tries to find an rgrp that can fit the entire request.
However, since it may be that *no* rgrp could fit the entire request, it
also keeps track of which rgrp might provide a "best fit".  Currently, this
means that the rgrp can fit the entire data block request, but might not be
able to handle the entire meta block request.
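
The search order can be sketched as follows, with locking elided (the
real code locks each candidate rgrp in EX mode while examining it, and
keeps only the winner locked); the capacity tests are illustrative:

    #include <stdint.h>

    struct rgrp_space {        /* capacity view used during the search */
        uint32_t free_data;    /* data blocks available                */
        uint32_t free_meta;    /* meta + dinode blocks available       */
    };

    /* Try the target rgrp first, then scan forward for a full fit,
       remembering the first "best fit" (all data fits, meta may fall
       short) as a fallback.  Returns an rgrp index, or n if even a
       best fit could not be found. */
    static unsigned pick_rgrp(const struct rgrp_space *rg, unsigned n,
                              unsigned target, uint32_t need_data,
                              uint32_t need_meta)
    {
        unsigned i, best = n;             /* n means "none found"     */

        for (i = 0; i < n; i++) {
            unsigned idx = (target + i) % n;

            if (rg[idx].free_data >= need_data &&
                rg[idx].free_meta >= need_meta)
                return idx;               /* full fit: use this rgrp  */
            if (best == n && rg[idx].free_data >= need_data)
                best = idx;               /* remember best-fit option */
        }
        return best;
    }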

Currently (November 2003), there may be room for improvement in the block
allocation scheme, including:

--  partial allocs for data (that won't fit entirely in a single rgrp)
--  automatic recovery of metadata and dinode blocks (back to data blocks)
      in rgrps that are getting full

Actual allocation

Actual allocation involves searching through the rgrp's block bitmap for the
individual blocks to allocate, then changing bit state within the block bitmap.
Allocation is done by the ogfs_blkalloc() and ogfs_metaalloc() routines,
both of which are in fs/blklist.c.

Allocation routines are responsible for updating the allocation stats in the
reservation structure, as well as in the incore image of the rgrp header.
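
A sketch of the claim-one-block step, reusing the illustrative two-bit
bitmap layout from section 2.1.2 (the real routines in fs/blklist.c are
more involved, handling block types, search start points, and the
statistics updates described above):

    #include <stdint.h>

    #define BLK_FREE      0x0   /* illustrative states, as in 2.1.2 */
    #define BLK_USED_DATA 0x1

    static unsigned bitmap_get(const uint8_t *bmp, uint32_t blk)
    {
        return (bmp[blk / 4] >> ((blk % 4) * 2)) & 0x3;
    }

    static void bitmap_set(uint8_t *bmp, uint32_t blk, unsigned state)
    {
        unsigned shift = (blk % 4) * 2;

        bmp[blk / 4] = (uint8_t)((bmp[blk / 4] & ~(0x3 << shift)) |
                                 ((state & 0x3) << shift));
    }

    /* Find and claim one free block as data; returns its index within
       the rgrp, or nblocks if the rgrp is full. */
    static uint32_t alloc_one_data_block(uint8_t *bmp, uint32_t nblocks)
    {
        uint32_t b;

        for (b = 0; b < nblocks; b++) {
            if (bitmap_get(bmp, b) == BLK_FREE) {
                bitmap_set(bmp, b, BLK_USED_DATA);
                return b;
            }
        }
        return nblocks;
    }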

Once a write operation is complete, the rgrp's statistics must be updated
within the rgrp header block and LVB.  The rgrp header and bitmap blocks
must be flushed to disk, to support alloc/de-alloc by other nodes.  The LVB
must be flushed to the locking back end, to support statfs() operations on
all nodes.
