NoPool_Design_and_Implementation (May 9 2003)
Design notes for "no-pool" OpenGFS.


DESIGN GOALS:

Enable successful operation of OpenGFS without the use of the pool module
and its associated utilities.  

Provide ability to assign OpenGFS journals to specific hardware devices, 
while not requiring tight coupling of OpenGFS with a specific volume manager 
(user space) or device mapper (kernel space).  Also continue legacy support
for internal journals.

Provide filesystem and journal re-sizing support (filesystem growth and journal
addition only, unless a good way can be found to shrink the filesystem and/or
remove journals).


BACKGROUND AND RATIONALE:

The pool module provides: 

1) Mapping of one or more hardware devices (e.g. disk drive partitions) into 
   a contiguous pool, appearing as /dev/something (e.g. /dev/pool/pool0) to 
   the filesystem.

2) Consistent naming of the /dev/something across all computers in a cluster.

3) Stacking of the mapping with other mapping modules (e.g. kernel's md RAID).

4) Two types of "subpools":  data or journals.

5) Striping for individual data subpools, for enhanced performance.

6) Information, queried by mkfs.ogfs, regarding the devices comprising the pool.
   This enables mkfs.ogfs to assign specific journals to specific hardware
   devices within the pool.

7) Ability to grow the pool, adding new partitions.

8) Hardware-based (DMEP) locking support.

OpenGFS was written before many of these features were commonly found in the
kernel or other open source projects.  Currently, however, many of these
features are redundant with components such as the DM device mapper, the md
software RAID support, and EVMS (Enterprise Volume Management System).  The
redundant features are 1) mapping, 2) consistent naming, 3) stacking, and
5) striping.

The primary feature not provided by other components is the tight coupling of
the pool layout knowledge to mkfs.ogfs.  The primary use of this coupling has
been to assign a journal to a specific hardware device, for performance or
reliability reasons.  Because no-pool OpenGFS supports assigning each external
journal to a specific device directly, feature 6) is no longer required.

Feature 4) subpools will continue to be supported within the filesystem,
for the purpose of internal journals.  However, there will be no reason for
a user/admin, nor for a volume manager or a device mapper, to be aware of
subpools.  There will be no provision for a user/admin to place internal 
journals at specific locations within the filesystem.

Feature 7) re-sizing will be a phase-2 venture, after basic operation of
external journals is secured.

Feature 8) hardware DMEP will not be addressed.  There may be support for 
DMEP built into OpenGFS already, using kernel "sd" (SCSI Device) component
as an alternative to the pool support, but I haven't really looked into this.


THEORY OF OPERATION:

Internal to OpenGFS, the filesystem is seen as a linear 64-bit space of blocks.
In legacy OpenGFS, this space has been contiguous from 0 to the total capacity
of the filesystem.  In no-pool OpenGFS, the data and internal journals still
appear this way, but external journals appear at the top of the 64-bit space. 

As an example, assume that there is a shared drive called /dev/fsdrive, that
contains the filesystem, including 30 GB (7,864,320 blocks, assuming block size
= 4096 bytes) of data, and 2 internal journals of 128 MB (32,768 blocks) each.
Assume also that there are two partitions on another drive, called /dev/jrnl0
and /dev/jrnl1, each of which is 1 GB (262,144 blocks), to be used as external
journals.
The filesystem drive will appear to OpenGFS as blocks:
0 - 7,929,855 (7,864,320 + 2 * 32,768 = 7,929,856 blocks in total).
One of the external journals will appear as blocks:
18446744073709289472 (2^64 - 262,144) - 18446744073709551615 (2^64 - 1).
The other will appear as blocks:
18446744073709027328 (2^64 - 524,288) - 18446744073709289471 (2^64-262,144-1).
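
The arithmetic in this example can be reproduced with a few lines of C.  This
is only a sketch: the names blk_range, fs_range() and ext_journal_range() are
hypothetical and do not appear in the OpenGFS source, and it assumes that all
external journals are the same size and are stacked downward from the top of
the 64-bit space, with journal 0 topmost.

```c
#include <stdint.h>

/* Inclusive block range within the OpenGFS linear 64-bit block space. */
struct blk_range {
	uint64_t first;
	uint64_t last;
};

/* Range of the filesystem device: data plus internal journals. */
static struct blk_range fs_range(uint64_t data_blocks,
				 uint64_t int_journals,
				 uint64_t int_journal_blocks)
{
	struct blk_range r;

	r.first = 0;
	/* The last block number is one less than the block count. */
	r.last = data_blocks + int_journals * int_journal_blocks - 1;
	return r;
}

/*
 * Range of the n-th external journal (assumption: n = 0 is topmost and
 * every external journal holds jblocks blocks).  Unsigned arithmetic
 * wraps modulo 2^64, so 0 - x evaluates to 2^64 - x.
 */
static struct blk_range ext_journal_range(unsigned int n, uint64_t jblocks)
{
	struct blk_range r;

	r.first = (uint64_t)0 - ((uint64_t)n + 1) * jblocks;
	r.last = r.first + jblocks - 1;
	return r;
}
```

Calling ext_journal_range(0, 262144) and ext_journal_range(1, 262144) yields
the two external journal ranges listed above.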


IMPLEMENTATION:

Journal indexes:

The legacy journal index (one for each journal) contains information on the 
start address (in blocks) and the length (in journal segments) of a journal.
For external journals, these fields will continue to be used in the same way,
but the start address will be a very large number (2^64 - n).

In addition, the journal index will contain an ASCII string representing
the /dev/something name of the external journal device.  This makes use of
48 bytes of "reserved" space in the legacy journal index, leaving 16 bytes
unused.  Internal journals will leave all 64 of these bytes at 0.

New journal index structure:

#define OGFS_DEVNAME_LEN	48

struct ogfs_jindex {
	uint64 ji_addr;		/* starting block of the journal */
	uint32 ji_nsegment;	/* number of segments in journal */
	uint32 ji_pad;

	char ji_device[OGFS_DEVNAME_LEN];	/* external journal dev name */
	char ji_reserved[64 - OGFS_DEVNAME_LEN];
};
typedef struct ogfs_jindex ogfs_jindex_t;
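
As a sanity check on the layout, the following user-space sketch mirrors the
structure with <stdint.h> types and asserts at compile time that the device
name fits inside the formerly reserved area.  It assumes the compiler inserts
no padding, and the helper jindex_is_external() is hypothetical; it only
illustrates the "non-zero ji_device[0]" test described under mounting below.

```c
#include <stddef.h>
#include <stdint.h>

#define OGFS_DEVNAME_LEN	48

/* User-space mirror of the journal index, for illustration only. */
struct ogfs_jindex {
	uint64_t ji_addr;	/* starting block of the journal */
	uint32_t ji_nsegment;	/* number of segments in journal */
	uint32_t ji_pad;

	char ji_device[OGFS_DEVNAME_LEN];	/* external journal dev name */
	char ji_reserved[64 - OGFS_DEVNAME_LEN];
};

/* 16 bytes of fixed fields, then 48 + 16 = 64 formerly reserved bytes. */
_Static_assert(offsetof(struct ogfs_jindex, ji_device) == 16,
	       "device name must start where the reserved area started");
_Static_assert(sizeof(struct ogfs_jindex) == 16 + 64,
	       "adding ji_device must not change the structure size");

/* Hypothetical helper: a journal is external if it names a device. */
static int jindex_is_external(const struct ogfs_jindex *ji)
{
	return ji->ji_device[0] != '\0';
}
```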


mkfs.ogfs:

mkfs.ogfs is responsible for writing the journal indexes out to the filesystem
device, and for writing segment headers into the journal storage.  For
external journals, mkfs.ogfs will write segment headers to the external
journal devices or partitions, starting at sector 0 of each device/partition.

mkfs.ogfs will use a journal configuration file to accept user/administrator
input regarding the journal layout, but will also continue to accept all
command line options used in legacy OpenGFS, checking for conflicts between
command line and configuration file entries.

The configuration file will contain the following types of entries:

fsdev -- the /dev/something name of the filesystem device (this can also be
   entered on the command line, as with legacy mkfs.ogfs).

journals -- the total number of journals, internal and external, used for
   sanity checking the journal descriptions, and allocating a bit of memory
   before reading the journal descriptions (this can also be entered on the
   command line with the -j option, as with legacy mkfs.ogfs).

journal -- a description of a single journal, in one of the following formats:

journal  0  ext  /dev/journal0

   0 is the journal index, used to map a journal to a cluster member computer.
   ext declares that this journal is external.
   /dev/journal0 represents the /dev name of the journal device.

journal  1  int  256

   1 is the journal index, used to map a journal to a cluster member computer.
   int declares that this journal is internal.
   256 is the journal size, in MBytes.
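
A parser for the two "journal" entry formats might look like the sketch
below.  It is not the mkfs.ogfs code: struct journal_entry and
parse_journal_line() are hypothetical names, and the sketch assumes that a
device name must fit, with its terminating NUL, in the 48-byte ji_device
field.

```c
#include <stdio.h>
#include <string.h>

#define OGFS_DEVNAME_LEN	48

/* Hypothetical in-memory form of one "journal" entry. */
struct journal_entry {
	unsigned int index;		/* journal index */
	int external;			/* 1 = ext, 0 = int */
	char device[OGFS_DEVNAME_LEN];	/* set for external journals */
	unsigned int size_mb;		/* set for internal journals */
};

/* Parse one "journal" line.  Returns 0 on success, -1 if malformed. */
static int parse_journal_line(const char *line, struct journal_entry *je)
{
	char type[8];
	char arg[128];

	memset(je, 0, sizeof(*je));

	/* Field widths leave room for the terminating NUL. */
	if (sscanf(line, " journal %u %7s %127s", &je->index, type, arg) != 3)
		return -1;

	if (strcmp(type, "ext") == 0) {
		/* Reject names that would not fit in ji_device. */
		if (strlen(arg) >= OGFS_DEVNAME_LEN)
			return -1;
		je->external = 1;
		strcpy(je->device, arg);
		return 0;
	}
	if (strcmp(type, "int") == 0) {
		/* The third field is the journal size in MBytes. */
		if (sscanf(arg, "%u", &je->size_mb) != 1 || je->size_mb == 0)
			return -1;
		return 0;
	}
	return -1;	/* unknown journal type */
}
```

A "journals" line is rejected by this function rather than misread, because
the text following the literal "journal" is not a number.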



Filesystem:

The filesystem is responsible for accessing the journals appropriately.

During operation, the filesystem must determine whether a given block belongs
within the main filesystem, or within one of the external journals.  If
external, the filesystem must determine which device, and translate the
block number into a suitable block number within the device.  

During mount, the filesystem sets up several in-core values that support these
translation operations.  The mount sequence is driven from _ogfs_read_super(),
in src/fs/arch_linux_2_4/super_linux.c.


Filesystem mounting and unmounting:

When mounting, the filesystem reads all of the journal indexes, and determines
the total number of journals, both internal and external (src/fs/super.c,
ogfs_ji_update()).  It allocates an array of struct block_device pointers,
stores the array's address in the in-core superblock structure's sdp->sd_jdevs,
and fills each array element with either a valid pointer (for an external
journal device) or a NULL (for an internal journal).

In more detail, when ogfs_ji_update() encounters an external journal (i.e.
a journal with a non-zero value at ji_device[0]), it calls ogfs_get_jdev_info()
(src/fs/arch_linux_2_4/dio_arch.c).  Based on code stolen from the kernel's 
get_sb_bdev() (kernel's fs/super.c), ogfs_get_jdev_info() does the following:

1)  path_lookup() on the device name, to find inode for /dev/something.
2)  bd_acquire() on the inode, to find a struct block_device pointer.
3)  inserts the pointer into the array of journal device pointers (sd_jdevs[]).
4)  blkdev_get() on the block_device, to reserve it for ongoing use.
5)  path_release() to release the path obtained with path_lookup()

Continuing, ogfs_ji_update() compares the base block addresses of the 
external journals, and sets the lowest of these as the threshold for
deciding whether a given block is stored within the filesystem
(block number < threshold), or in one of the external journals
(block number >= threshold).  This block number is stored in the in-core
superblock structure, sdp->sd_jext_base.
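
The threshold search amounts to taking a minimum over the external journals.
The sketch below is a user-space model, not the ogfs_ji_update() code:
struct jindex_model and jext_base() are hypothetical, and returning
UINT64_MAX when there are no external journals is an assumption, since these
notes do not say what the threshold is in that case.

```c
#include <stddef.h>
#include <stdint.h>

#define OGFS_DEVNAME_LEN	48

/* Reduced stand-in for struct ogfs_jindex; only the fields used here. */
struct jindex_model {
	uint64_t ji_addr;			/* starting block of journal */
	char ji_device[OGFS_DEVNAME_LEN];	/* empty for internal journals */
};

/*
 * Return the lowest starting block among the external journals.
 * Blocks below the result belong to the filesystem device; blocks at
 * or above it belong to one of the external journals.
 * Assumption: UINT64_MAX is returned if no journal is external.
 */
static uint64_t jext_base(const struct jindex_model *ji, size_t njournals)
{
	uint64_t base = UINT64_MAX;
	size_t i;

	for (i = 0; i < njournals; i++) {
		if (ji[i].ji_device[0] == '\0')
			continue;	/* internal journal */
		if (ji[i].ji_addr < base)
			base = ji[i].ji_addr;
	}
	return base;
}
```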

_ogfs_read_super() fills in two other superblock values, once it has mounted
a locking protocol (and thereby knows which cluster member *this* machine is),
and has read in the journal info:  sdp->sd_my_jdesc and sdp->sd_my_jdev point
to the jindex and block_device, respectively, for *this* machine.
sdp->sd_my_jdev duplicates info within the sdp->sd_jdevs block_device array
for journals, but was added as a convenience to the block translation code
(see below).

When unmounting, for each external journal, the filesystem calls 
ogfs_put_jdev_info(), which does:

1)  blkdev_put() on the journal's block_device, to release it

And, of course, the filesystem de-allocs the sd_jdevs[] array.


Filesystem operation (block translation):

Discernment of filesystem vs. external journal blocks is built into the
lowest level of interface between the filesystem and the operating system,
just before the filesystem requests a buffer head (bh) from the OS.
There are three points where this occurs, each of which is in
src/fs/arch_linux_2_4/dio_arch.c:

1)  ogfs_dgetblk_arch(), the general purpose block "getter" routine
2)  try_readahead(), as the name implies, called from ogfs_dreread(), the
      general purpose read routine.
3)  ogfs_init_temp_bh(), which gets and inits a "temporary" bh used for
      writing journal data (apparently a duplicate of the file data).

Each of these routines compares the requested block number against the
sdp->sd_jext_base value.  If the block number is at or above this threshold
value, then it belongs in an external journal, and the routine calls
translate_block(), a static function within dio_arch.c.

translate_block() first checks *this* machine's journal index info at
sdp->sd_my_jdesc to determine if the requested block is within "my" journal.
If so, it copies the journal device's kdev_t into the passed "dev" parameter,
and subtracts the journal's base address from the "blkno" parameter, to
translate the "ogfs linear" address into a device-local block number.

If the block is not in "my" journal, translate_block() looks through the
list of journals sdp->sd_jindex[] to find the right journal, and fills in
the dev and blkno parameters.
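
The translation logic can be modelled in user space as follows.  This is a
sketch under stated assumptions, not the kernel code: the structures are
reduced stand-ins, an int device id replaces kdev_t, the journal length is
held in blocks (the real index stores it in segments), and map_block() is a
hypothetical wrapper standing in for the threshold check made by the three
callers.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical stand-ins for the in-core structures. */
struct jdesc_model {
	uint64_t ji_addr;	/* starting block in the linear space */
	uint64_t nblocks;	/* journal length in blocks (assumed field) */
	int dev;		/* device id; 0 marks an internal journal */
};

struct sb_model {
	uint64_t sd_jext_base;			/* lowest external block */
	const struct jdesc_model *sd_my_jdesc;	/* this machine's journal */
	const struct jdesc_model *sd_jindex;	/* all journals */
	size_t njournals;
};

/*
 * True if blkno lies within the external journal jd.  The subtraction
 * form avoids overflow: ji_addr + nblocks is 2^64 for the topmost one.
 */
static int in_ext_journal(const struct jdesc_model *jd, uint64_t blkno)
{
	return jd->dev != 0 && blkno >= jd->ji_addr &&
	       blkno - jd->ji_addr < jd->nblocks;
}

/*
 * Replace *dev and *blkno with the journal device and the device-local
 * block number.  Returns 0 on success, -1 if no journal owns the block.
 */
static int translate_block(const struct sb_model *sdp, int *dev,
			   uint64_t *blkno)
{
	const struct jdesc_model *jd = sdp->sd_my_jdesc;
	size_t i;

	/* Check this machine's own journal first, then search the rest. */
	if (jd == NULL || !in_ext_journal(jd, *blkno)) {
		jd = NULL;
		for (i = 0; i < sdp->njournals; i++) {
			if (in_ext_journal(&sdp->sd_jindex[i], *blkno)) {
				jd = &sdp->sd_jindex[i];
				break;
			}
		}
	}
	if (jd == NULL)
		return -1;

	*dev = jd->dev;
	*blkno -= jd->ji_addr;	/* linear address -> device-local block */
	return 0;
}

/* Caller-side check, as made before requesting a buffer head. */
static int map_block(const struct sb_model *sdp, int *dev, uint64_t *blkno)
{
	if (*blkno < sdp->sd_jext_base)
		return 0;	/* filesystem block: dev and blkno unchanged */
	return translate_block(sdp, dev, blkno);
}
```

With the two 1 GB journals from the earlier example, block 2^64 - 1 maps to
block 262,143 of the topmost journal device, and block 2^64 - 524,288 maps to
block 0 of the other.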



Filesystem implementation summary:

src/fs/super.c:

  ogfs_ji_update() -- Refreshes in-core info about journals.  Added:
    - alloc an array of kernel block_device pointers
    - loop to call ogfs_get_jdev_info() for each external journal
    - determine lowest base address, set as external journal threshold

  ogfs_clear_journals() -- Clears in-core info about journals.  Added:
    - hook to call ogfs_put_jdev_info() if a journal is external.

src/fs/arch_linux_2_4/dio_arch.c:

  ogfs_get_jdev_info() -- Finds a struct block_device corresponding to an
    external journal's /dev/something.  Reserves device for ongoing use.
    New function.

  ogfs_put_jdev_info() -- releases device from ongoing use.  New function.

  translate_block() -- translates external journal block, replacing dev
    and blkno.  New function.

  ogfs_dgetblk_arch() -- general purpose "get bh" function.  Added:
    - hook to call translate_block() if external journal
  
  try_readahead() -- tries to read ahead, called by ogfs_dreread().  Added:
    - hook to call translate_block() if external journal
  
  ogfs_init_temp_bh() -- gets and inits bh for journal data.  Added:
    - hook to call translate_block() if external journal