Pool_and_Utilities (Aug 29 2003)


		OpenGFS Pool and PTools

Copyright 2003 The OpenGFS Project.

Introduction
------------

This document contains details of the internals of the OpenGFS "pool" kernel
module, and describes how it works with some of the OpenGFS PTool utilities.
It also maps these and other utilities to ioctls supported by the pool module.

This document is aimed at developers, potential developers, students,
and anyone who wants to know about the low level details of the pool module.

This document is not intended as a user guide to OpenGFS. Look in the OpenGFS
HOWTO-generic for details of configuration and setting up OpenGFS.

At the time of this writing (March 2003), there is a desire to replace
the OpenGFS pool module with another mapping layer, namely the Device Mapper
(dm).
This document will serve as a checklist for pool functionality, as we ponder
the replacement.  See DESIGN-nopool for the results of the pondering.

Terminology
-----------

The terms "device", "storage device", "disk", "drive", and "partition" are
used somewhat interchangeably in this document.  Depending on context, they
may mean an actual piece of hardware, a portion of the storage area provided
by the hardware, or a Linux /dev node that provides access to the storage.  

In most cases, the /dev node meaning is appropriate.  Much of the OpenGFS
source code uses the term "device" to indicate a chunk of storage (a
"partition", which may or may not be an entire "disk", "drive", or "storage
device"), accessed via a /dev node.


Concepts Behind Pool
--------------------

When sharing common external storage devices among several computers in
a cluster, there is a very strong possibility that the computers may view
a given external device in different ways.  For example, computer node A
may see a given shared drive as "sdc", while node B sees it as "sdd", and
node C sees it as "sdb".

The pool kernel module, in conjunction with the "ptool" family of utility apps
(passemble, ptool, pgrow, pinfo; see man 8 pages for each utility), provides
an abstraction layer so that all computers in a cluster will view the shared
storage in a consistent way.

On each computer in an OpenGFS cluster, the "passemble" tool:

1).  Scans all drive partitions to discover OpenGFS pool information
     (a user-space sketch of this scan appears after this list).

     passemble reads the first sector of each disk partition listed in
     /proc/partitions, looking for OpenGFS drive pool partition headers/labels.
     These headers/labels must have been previously written to the drives by
     either the "ptool" or "pgrow" utility, when configuring OpenGFS pools
     (see OpenGFS HOWTO-generic).

     The label information includes, among other things, the name of the pool
     to which the drive belongs.  This name provides the key to the common
     view shared by the computers.

     Of course, several drives/partitions may belong to the same pool, and
     some drives/partitions may not belong to any OpenGFS pool at all.

2).  Creates "/dev/pool/xxxx" device nodes, one for each OpenGFS pool
     discovered ("xxxx" is the pool name discovered in step 1).

3).  Sets up the pool kernel module to map the abstract pool devices to the
     physical drive/partition devices (or perhaps to another layer of
     abstraction, if such a layer showed up in /proc/partitions).
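
To make this concrete, the scan in step 1) boils down to reading the first
512 bytes of each partition named in /proc/partitions and checking for the
pool label magic number (see Appendix A).  The user-space sketch below
illustrates the idea; it is *not* the passemble source, and the helper name,
the simple /dev/<name> construction, and the omission of on-disk byte-order
handling are all simplifying assumptions.

	/* Illustrative sketch only (not the passemble source):  check every
	 * partition listed in /proc/partitions for the 64-bit pool label
	 * magic number in its first sector. */
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define POOL_MAGIC 0x011670ULL		/* magic from Appendix A */

	static int has_pool_label(const char *devnode)
	{
		unsigned char sector[512];
		uint64_t magic;
		int fd = open(devnode, O_RDONLY);

		if (fd < 0)
			return 0;
		if (read(fd, sector, sizeof(sector)) != sizeof(sector)) {
			close(fd);
			return 0;
		}
		close(fd);
		memcpy(&magic, sector, sizeof(magic));	/* bytes 0-7 of label */
		return magic == POOL_MAGIC;		/* byte order ignored here */
	}

	int main(void)
	{
		char line[256], name[64], dev[80];
		unsigned int major, minor;
		unsigned long long blocks;
		FILE *fp = fopen("/proc/partitions", "r");

		if (fp == NULL)
			return 1;
		while (fgets(line, sizeof(line), fp)) {
			if (sscanf(line, "%u %u %llu %63s",
				   &major, &minor, &blocks, name) != 4)
				continue;	/* header or blank line */
			snprintf(dev, sizeof(dev), "/dev/%s", name);
			if (has_pool_label(dev))
				printf("pool label found on %s\n", dev);
		}
		fclose(fp);
		return 0;
	}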

A pool is viewed by the file system as a single device containing a contiguous
group of sectors.  A single pool can encompass anywhere from a small partition
on a single drive (e.g. a 4 MB OpenGFS cluster information partition; see
OpenGFS HOWTO-generic) up to a sizeable group of drives (e.g. a data storage
array).  The mapping can concatenate drives, or it can stripe data across
several drives (within a subpool, see below), which may provide higher
storage performance.

Subpools are groups of drives or partitions within a pool.  Each subpool
is viewed by the file system as a contiguous group of sectors within a pool,
beginning at a particular sector, and ending at the sector immediately
before the next subpool.  There are no gaps allowed between subpools, and
a single subpool cannot be scattered into non-contiguous pieces within
the pool, as viewed by the filesystem.  The physical view may be scattered,
however, with several drives in a subpool, either concatenated or striped.

The sector offset is the only method that the filesystem uses to access a
particular subpool.  The pool module, in turn, uses the sector offset to
map I/O to the particular device partitions that comprise the subpool
(see "pool Kernel Module" section below for more information).

Subpools can be used to (see opengfs/docs/pooltemplate):

1).  Subdivide a large number of drives into smaller striping groups.

2).  Isolate file system journals from file system data, by assigning,
     for example, one drive to a "ogfs_journal" subpool type, and assigning
     other drives to a "ogfs_data" subpool type.

     "ogfs_data" subpools contain data exclusively.  

     "ogfs_journal" subpools contain journals exclusively, and the filesystem
     requires that one journal subpool be created for *each* of the computer
     nodes in a cluster (e.g. create 3 journal subpools if you expect 3 
     computers to share the pool).

When making the filesystem, mkfs.ogfs normally queries the pool module to
find out about locations and types of subpools within the pool (after
passemble has been run, the pool module contains all this information in
kernel memory, much more convenient than having mkfs.ogfs scan all the 
disks for labels).  See more about this in the Pool Label Details section.

At this point, you may be wondering about the example data pool created by the
pool0.cf configuration file in the HOWTO-generic.  It breaks the rules,
creating only a single data subpool, and no journal subpools.  The example 
works, however, by calling mkfs.ogfs with two important options:

-i tells mkfs.ogfs to ignore (*not* query for) subpool layout information

-j 2 tells mkfs.ogfs to create 2 journal subpools

mkfs.ogfs then creates its own subpool layout, without regard to any physical
device mapping (i.e. it doesn't care where the subpools fall on physical
drives).  It sets up the filesystem to write journals to the self-created
journal subpools, and data to the self-created data subpools
(see opengfs/docs/ogfs-ondisk-layout).
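
For example, the data pool from the HOWTO could be formatted with a command
along these lines (illustrative only:  the pool name is made up, any other
required options are omitted, and man 8 mkfs.ogfs / the HOWTO-generic are
authoritative):

	mkfs.ogfs -i -j 2 /dev/pool/pool0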

This special method provides no ability to isolate journals onto particular
partitions, because mkfs.ogfs does not know (and does not care, in this case)
where the particular partitions are.

When writing journals into their own subpool, the filesystem uses the sector
offset as the only indication of where the journals belong within a pool.  
The pool module uses no knowledge of which subpools contain journal vs. 
data files when performing a mapping.  This is true in all cases, whether or 
not the -i and -j options are used for mkfs.ogfs.


Pool Label Details
------------------

If a drive partition is a member of a pool, it must contain a pool label in
the first sector (512 bytes) of the partition.  This label identifies the
pool and subpool to which the partition belongs.  Note that a single partition
(and it must be the entire partition) can belong to only one subpool within
only one pool.

The pool labels are written to storage devices by the "ptool" utility,
when initially configuring a pool, or the "pgrow" utility, when adding
subpools to an existing pool.

Both tools rely on input from a pool configuration file.  This file describes
the following for each pool (a structural sketch in C follows these lists):

1).  pool name
2).  number of subpools in pool

And for each subpool:

1).  subpool index (ID) (start from 0, increment by 1)
2).  stripe size (in 512-byte sectors) (0 for no striping)
3).  type (ogfs_data or ogfs_journal)
4).  number of devices in subpool

And for each device/partition:

1).  device index (ID) (start from 0, increment by 1)
2).  subpool to which you are assigning this device
3).  physical partition (e.g. /dev/sdb1) as seen by the computer running ptool
4).  DMEP weight for this device (see opengfs/docs/pooltemplate)
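
The hierarchy these files describe (pool -> subpools -> devices) can be
pictured with the following C-style sketch of a parsed configuration.  It is
for explanation only; the structure and field names are not taken from the
ptool source.

	/* Hypothetical in-memory picture of a parsed pool configuration. */
	struct cfg_device {
		int  dev_id;		/* device index within its subpool */
		int  subpool_id;	/* subpool this device is assigned to */
		char node[64];		/* e.g. "/dev/sdb1", as seen by ptool */
		int  dmep_weight;	/* DMEP weight (see pooltemplate) */
	};

	struct cfg_subpool {
		int  sp_id;		/* subpool index, starting from 0 */
		int  stripe;		/* stripe size in 512-byte sectors, 0 = none */
		char type[16];		/* "ogfs_data" or "ogfs_journal" */
		int  n_devices;		/* number of devices in this subpool */
	};

	struct cfg_pool {
		char name[256];			/* becomes /dev/pool/<name> */
		int  n_subpools;		/* number of subpools in this pool */
		struct cfg_subpool *subpools;
		struct cfg_device  *devices;
	};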

The subpool and device indexes control the order of the pool layout, as seen by
the filesystem.  Device 0 of subpool 0 will contain sector 0 of the pool.
The last device in the last subpool will contain the last sector of the pool.

Journals can be directed to a separate physical device by creating a journal
subpool and assigning the desired device to it.  There is no rule governing
(at least from the perspective of pool or the ptool utilities) the order
(determined by subpool index) in which journal subpools appear within a pool.
They may be first, last, or in the middle.

Most of the values (with the exception of the physical device) from the
configuration file are reflected directly in the pool label written to disk.
The ptool and pgrow utilities add the other fields (magic, version, pool ID,
and partition size, see below) to the label before writing to disk.

Pool labels contain the following information, in the following categories.
All values are 32-bit, unless otherwise indicated:

1)   General

     a).  pool magic number -- 64-bit, says this partition is a pool partition
     b).  pool version -- indicates label version

2)   Pool

     a).  pool ID number -- 64-bit unique pool identifier
     b).  pool name -- 256-bytes ASCII, this is the xxxx in /dev/pool/xxxx

3)   Subpool

     a).  total number of subpools within this pool
     b).  index number of this subpool within this pool
     c).  striping size (0 if no striping) within this subpool
     d).  subpool type (data or journal)

4)   Partition

     a).  total number of partitions within this subpool
     b).  index number of this partition within this subpool
     c).  partition size -- 64-bit # of (512-byte?) blocks in this partition

5)   DMEP

     a).  total number of DMEP devices in this subpool
     b).  index number of this DMEP device within this subpool
     c).  DMEP weight -- an indication of preference for using this device

Please refer to Appendix A for detailed information about the data structure
on disk.

Note that each pool label will be different, since each partition will
have a unique combination of subpool, partition, and perhaps DMEP information.

Labels are read by passemble (when initializing a cluster member computer),
pgrow (when growing a pool while in use), or by the pool module (when growing
a pool while in use, after another computer ran pgrow).

Once all the labels are gathered from all the storage devices, this information
is sufficient for "assembling" the entire pool into one contiguous mapping
of pool sector number to partition sector number.

This information is also sufficient to detect missing partitions, by checking
the subpool and partition index numbers for contiguity, and by checking to
see that the greatest index numbers correlate with the expected number of
subpools and devices.

Note that there is no information that self-names the physical device as
seen by the computer.  It is up to each computer to read the pool label
and associate that label with the physical device on which the label exists.

When growing pools, existing subpools must be left the way they are, and
growth is done only by adding subpools at the end of the pool.




Utility Apps -- ptool family
----------------------------

The ptool utilities manipulate two entities, and serve as a bridge between
them:

--  On-disk pool labels.  The labels provide persistent storage and the
    central (albeit distributed) repository for pool configuration.

--  Pool kernel module.  This driver provides mapping between the abstract
    pool devices and the physical storage devices.  A pool is considered
    "active" when this driver knows about it and is mapping it.

Note that *pool* configuration information is different from *cluster*
configuration information; pool information describes the *drives/partitions*
that can be shared by a cluster, while cluster information describes the
*computers* that are members of an OpenGFS cluster.  Cluster information is
written to disk by the "ogfsconf" utility, which is not a pool utility.

The ptool family includes the following (see man 8 pages for more information
on each one):

1).  ptool writes the partition labels to storage devices (and does not deal
     with the pool kernel module).  Only one computer needs to run ptool,
     when initially configuring a pool.

2).  passemble reads partition labels from storage devices, and sets up the
     pool kernel module to map pools to physical partitions (as described
     in "Concepts Behind Pool" in this document).  Each computer in a cluster
     must run passemble each time the computer reboots, to discover available
     pools.

3).  pgrow adds subpools to an active pool by writing partition labels to
     storage devices, *and* by setting up the pool kernel module to map the
     new subpools (note that pgrow is kind of a ptool and passemble all-in-one).
     Only one computer needs to run pgrow, when initially growing a pool.

     The kernel module mapping update occurs initially *only* on the computer
     that runs pgrow.  The pool drivers on each of the other computers
     "automagically" re-map themselves if/when I/O requests fall beyond their
     old capacity.  This is the only case in which the pool driver accesses
     the pool labels on the storage devices, to scan for and read the new pool
     labels.

     After growing a pool, one computer must also run ogfs_expand to make the
     filesystem on the new subpool devices.  (Do other member computers need
     to run a utility to tell their filesystems that things have grown?)

4).  pinfo queries the pool kernel module for currently active pool information
     (and does not read/write any pool labels on storage devices).

     Caution:  there is also a "pinfo" utility that is an info page browser:

     man 8 pinfo -- OpenGFS pinfo, default OpenGFS installation puts it in /sbin

     man 1 pinfo -- info doc browser, found as /usr/bin/pinfo in RedHat 8.0.


Note that none of the ptool utilities reads/writes disks via the pool kernel
module.  The utilities read/write partition headers directly from/to the
physical devices.

See elsewhere in this document for a list of pool ioctls and the utility
tools that use each one.


Utility Apps -- others
----------------------

The pool kernel module is used by a few other tools that are not members
of the ptool family.  See man 8 pages for more info on each one:

"dmep_conf" configures DMEP (Device Memory Export Protocol) segments of
certain types of shared storage devices for purposes of hardware-based locks.

"do_dmep" executes DMEP commands on the storage devices.  (No man page).

"mkfs.ogfs" makes an OpenGFS file system on a block device (e.g. a pool).

"ogfs_expand" expands an OpenGFS file system after a block device (e.g.
a pool) expands.

"ogfs_jadd" adds journals to an OpenGFS file system after a block device
(e.g. a pool) expands.

See elsewhere in this document for a list of pool ioctls and the utility
tools that use each one.


The "pool" Kernel Module
------------------------

The pool kernel module has three main purposes:

1).  Mapping blocks from /dev/pool devices to real devices.  In this mode,
pool is hooked into the kernel's block handling subsystem, and translates
block requests as they are being "made".

2).  Implementing ioctls for setup, control, statistics monitoring, etc.
In this mode, pool responds to ioctl calls from ptool and other utility apps.

3).  Sending DMEP (hardware-based file locking) commands to pool devices.
In this mode, pool responds to ioctl calls from the dmep utility apps.
Note that many (probably most) devices supported by OpenGFS do not need or
use this capability.

Note that the pool module has a facility for duplicating much of what is done
by the passemble utility (i.e. assembling pools).  However, this capability
is used only in a special situation, when a mapping request exceeds the
capacity of a given pool.  In this case, the pool module re-assembles the pool,
hoping to discover that the pool capacity has grown to a larger size.  This
is the method used to propagate knowledge of grown pools to cluster member
computers *other* than the computer that ran the pgrow utility (see info
on pgrow elsewhere in this doc).

Descriptions below relate to pool behavior with the 2.4 kernels.  Pool code
can be found in OpenGFS source:

opengfs/src/pool/
opengfs/src/pool/arch_linux_2_4/

Pertinent kernel code can be found in kernel source:

drivers/block/ll_rw_blk.c


** Mapping **

The pool module intercepts kernel block requests before they are placed
onto kernel request queues.  This interception is set up by the following
call in pool_init() (.../pool/arch_linux_2_4/pool_linux.c):

	blk_queue_make_request(BLK_DEFAULT_QUEUE(POOL_MAJOR),
			pool_make_request_fn);

This tells the kernel to use pool_make_request_fn() (also in pool_linux.c),
instead of the kernel's default __make_request(), when preparing a
buffer head for placement in the kernel request queue for the pool major
device.

The kernel calls the substituted function from within generic_make_request(),
which contains a loop to iteratively pass the buffer head between stacked
block drivers, to perform layers of device mapping.

pool_make_request_fn() does the following for each buffer head (bh); a
simplified code sketch follows this list:

1)  Checks pool status to make sure it is currently valid to access.

2)  Checks bh start/size to see if it extends beyond end of pool.
    If so, calls assemble_pools() (.../pool/common.c) to reassemble
    the pool in the hope that the pool has grown.

3)  Calls pool_blkmap() (.../pool/common.c) to do the actual mapping
    (linear or striped).

4)  Substitutes the newly mapped "real" device and sector number into
    the buffer head structure.

5)  Returns 1 to the kernel, to tell it to pass the bh to the next driver
    in the stack (e.g. the driver for the "real" disk drive).
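
A heavily simplified sketch of steps 3) through 5), in the shape of a 2.4
make_request function, is shown below.  It is not the pool_make_request_fn()
source; the lookup helper is hypothetical (compare the linear mapping sketch
earlier in this document), and the status and end-of-pool checks are only
indicated by comments.

	#include <linux/fs.h>		/* struct buffer_head (2.4) */
	#include <linux/blkdev.h>	/* request_queue_t (2.4) */

	/* Hypothetical helper:  resolve (pool device, pool sector) to the real
	 * device, writing the device-relative sector through dev_sector. */
	static kdev_t lookup_real_dev(kdev_t pool_dev, unsigned long pool_sector,
				      unsigned long *dev_sector);

	/* Sketch only -- not the actual pool_make_request_fn(). */
	static int sketch_make_request_fn(request_queue_t *q, int rw,
					  struct buffer_head *bh)
	{
		unsigned long dev_sector;

		/* steps 1) and 2):  pool status and end-of-pool checks,
		 * including the assemble_pools() retry, would go here */

		/* step 3):  perform the linear or striped mapping */
		/* step 4):  substitute real device and sector into the bh */
		bh->b_rdev = lookup_real_dev(bh->b_rdev, bh->b_rsector,
					     &dev_sector);
		bh->b_rsector = dev_sector;

		/* step 5):  return 1 so generic_make_request() passes the bh
		 * to the next driver in the stack */
		return 1;
	}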

With this mapping scheme, pool should never receive a request from a queue
to do actual I/O . . . all requests should be intercepted and mapped (by pool)
before they get onto the pool's queue.  Nonetheless, pool needs to register
a "dummy" I/O request handler function.  It does so in pool_init(), with
the following:

#define DEVICE_REQUEST do_pool_request

	blk_init_queue(BLK_DEFAULT_QUEUE(POOL_MAJOR), DEVICE_REQUEST);

do_pool_request() is the dummy function, which printks an error message if
called.
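
Such a dummy handler needs to do nothing more than complain.  A sketch (not
the actual do_pool_request() source) might look like this:

	/* Sketch:  with pool_make_request_fn() intercepting every buffer
	 * head, this request handler should never run. */
	static void do_pool_request_sketch(request_queue_t *q)
	{
		printk(KERN_ERR "pool: unexpected request on pool queue\n");
	}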


** Ioctls **

The pool module supports ioctls via the /dev/poolcomm device (pool minor # 0),
or via the /dev/pool/xxxx devices (minor numbers > 0).

The /dev/poolcomm device node is created by one of two means:

1)  The pool module creates it via devfs during module initialization
    (but only if devfs, which is not required, is compiled into your kernel).

2)  The "passemble" utility creates it each time it executes (whether or
    not it successfully assembles a pool).

The /dev/pool/xxxx nodes are created by "passemble" as it discovers and
successfully assembles pools.

The following lists all IOCTLs supported by the pool kernel module.  It
includes the call used within the kernel module to execute each ioctl,
and an attempt at an exhaustive list of tools that use each ioctl (this is
largely for evaluating portability to a new device mapper).  A short
user-space usage sketch follows the list:

POOL_MEMEXP:  Sends a DMEP command to DMEP hardware, using pool_memexp().
		Used by:  do_dmep, dmep_conf

POOL_CLEAR_STATS:  Clears pool statistics
		Used by:  pinfo

POOL_DUMP_TOTALSTATS:  Copies pool stats to user space
		Used by:  pinfo

BLKGETSIZE:   Copies pool capacity (in blocks) to user space.
		(note:  capacity is a 64-bit value internal to pool,
			but interface passes only 32-bits).
		Used by:  mkfs.ogfs, ogfs_expand, ogfs_jadd
		(Not used by:  initds, ptool, pgrow, which access real devices)

POOL_ADD:     Brings new pool to the attention of the pool module,
		using uadd_pool() -> add_pool().
		Used by:  passemble

POOL_REMOVE:  Removes pool from the attention of the pool module,
		using remove_pool().
		Used by:  passemble

POOL_GROW:    Grows pool already known by the pool module,
		using ugrow_pool() -> grow_pool().
		Used by:  pgrow

POOL_COUNT:   Copies number of pools, known by pool module, to user space,
		using count_pool().
		Used by:  passemble, ptool, pgrow, pinfo

POOL_LIST:    Copies list of pools, known by pool module, to user space,
		using list_pool().
		Used by:  passemble, ptool, pgrow, pinfo

POOL_SPLIST:  Copies list of subpools in a given pool to user space,
		using list_spool()
		Used by:  passemble, pgrow, pinfo, dmep_conf, do_dmep, mkfs.ogfs

POOL_DEVLIST: Copies list of devices in a given pool to user space,
		using list_pooldevs()
		Used by:  passemble, pgrow, pinfo, dmep_conf, do_dmep

POOL_IND_CACHE:  Syncs and flushes, using fsync() and invalidate_buffers().
		Used by:  nothing

POOL_DEBUG_ON:  Sets flags for enabling pool module debugging,
		using turn_on_pool_debug_flag().
		Used by:  ptool

POOL_DEBUG_OFF: Resets flags for enabling pool module debugging,
		using turn_off_pool_debug_flag().
		Used by:  ptool

HDIO_GETGEO:    No-op, printk.
		Used by:  nothing

default:        blk_ioctl(), the kernel's ioctl handler.
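
As an example of how a utility uses one of these ioctls from user space, a
size query against an assembled pool might look roughly like the sketch below
(BLKGETSIZE is the standard block device ioctl from <linux/fs.h>; as noted
above, it returns only a 32-bit sector count):

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* BLKGETSIZE */

	int main(int argc, char **argv)
	{
		unsigned long sectors;	/* 512-byte sectors, 32-bit interface */
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s /dev/pool/<name>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDONLY);
		if (fd < 0 || ioctl(fd, BLKGETSIZE, &sectors) < 0) {
			perror(argv[1]);
			return 1;
		}
		printf("%s: %lu sectors\n", argv[1], sectors);
		close(fd);
		return 0;
	}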



Porting to Device Mapper
------------------------

There is currently a desire to use Device Mapper (DM) as the kernel component
(replacing the pool module) to do the mapping between the /dev/pool/xxxx
abstract devices and the physical devices.  This should be do-able by one
of two methods:

1).  Porting ptools utilities to use DM

2).  Using other tools that currently sit on top of DM.

The preferred method is #2, of course!  We're looking into EVMS to provide
this functionality.  There may be other tools that could provide this as well.

Issues:

1).  Device mapper does not know how to "automagically" scan and read
     pool labels when growing.  We will need to provide this ability in
     user space.  (Does EVMS provide this capability?)

2).  Device mapper does not support sending DMEP commands to devices.
     (we may drop DMEP support entirely, or use another channel to send
     DMEP commands to DMEP SCSI hardware.  Source code for the OpenGFS DMEP
     utilities says that commands can be sent via sg, the SCSI generic driver).

3).  mkfs.ogfs uses the POOL_SPLIST (get subpool list) ioctl provided by
     the pool kernel module, but not provided by Device Mapper.
     (We'll need another way to provide subpool info to the filesystem).


APPENDIX

A. Structure Details
--------------------

This section describes the structures that the pool utilities write to disk.
Unless otherwise stated, all fields are unsigned integers of the given length.
(A C-style sketch of the pool label layout follows the diagram below.)

1).  Pool labels (pool_label_t)

  0   +-------------------+
      |	0x011670          | Magic number
  8   +-------------------+
      |	Pool id           | Unique pool identifier
  16  +-------------------+
      |	Pool name         | Name of pool (char[])
  272 +-------------------+
      |	Pool version      | Pool version
  276 +-------------------+
      |	Num. subpools     | Number of subpools in this pool
  280 +===================+
      |	Subpool number    | Subpool number within pool
  284 +-------------------+
      |	SP Num. data part.| Number of data partitions in this subpool
  288 +-------------------+
      |	SP Partition num. | Partition number within subpool
  292 +-------------------+
      |	Subpool type      | Data (ogfs_data) or journal (ogfs_journal)
  296 +-------------------+
      |	Num. part. blocks | Number of blocks in this partition
  304 +-------------------+
      | Striping blocks   | Stripe size within subpools in blocks (0 = off)
  308 +-------------------+
      |	SP Num. DMEP dev. | Number of DMEP devices in this subpool*
  312 +-------------------+
      |	SP DMEP dev. num. | DMEP device number within subpool*
  316 +-------------------+
      | SP DMEP weight    | Preference weight for using this device
  320 +-------------------+   (see opengfs/docs/pooltemplate)

  * If the number of DMEP devices is zero, then the DMEP device number field
    holds the id of a subpool that handles DMEP instead.
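
The same layout can be expressed as a C-style structure derived from the
offsets above.  This is a sketch:  the field names are descriptive rather
than copied from the OpenGFS headers, and on-disk byte order is not addressed.

	#include <stdint.h>

	/* Sketch of the 320-byte on-disk pool label, following the offsets
	 * in the diagram above. */
	struct pool_label_sketch {
		uint64_t magic;			/*   0: 0x011670 */
		uint64_t pool_id;		/*   8: unique pool identifier */
		char     pool_name[256];	/*  16: /dev/pool/<name> */
		uint32_t version;		/* 272: label version */
		uint32_t n_subpools;		/* 276: subpools in this pool */
		uint32_t sp_index;		/* 280: this subpool's number */
		uint32_t sp_n_parts;		/* 284: data partitions in subpool */
		uint32_t part_index;		/* 288: this partition's number */
		uint32_t sp_type;		/* 292: ogfs_data or ogfs_journal */
		uint64_t part_blocks;		/* 296: blocks in this partition */
		uint32_t stripe_blocks;		/* 304: stripe size, 0 = off */
		uint32_t sp_n_dmep;		/* 308: DMEP devices in subpool */
		uint32_t dmep_index;		/* 312: this DMEP device's number */
		uint32_t dmep_weight;		/* 316: DMEP preference weight */
	};					/* 320 bytes total */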