(May 29 2004)


                OpenGFS Locking Mechanism

Copyright 2003 The OpenGFS Project

Original authors, early 2003:
Stefan Domthera (sd)
Dominik Vogt (dv)

Updates since June, 2003:
Ben Cahill (bc), ben.m.cahill@intel.com

Introduction
------------

This document contains details of the OpenGFS locking system.  The first part
provides an overview of the software layers involved in the locking system.
The second part describes the design of the G-Lock locking layer, and the final
part explains the details of how the G-Locks are used in OpenGFS.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of shared file system locking in
OpenGFS.

This document is not intended as a user guide to OpenGFS.  Look in the OpenGFS
HOWTO-generic or HOWTO-nopool for details of configuring and setting up OpenGFS.

This document may contain inaccurate statements.  Please contact the author
(bc) if you see anything wrong or unclear.


Terminology
-----------

Throughout this document, the combination of a mounted lock module and, if
applicable, a lock storage facility (e.g. memexpd) is sometimes called the
"locking backend" for simplicity.

The OpenGFS filesystem and locking code contains many uses of terms such as
"get", "put", "hold", "acquire", "release", etc., that may be inconsistent
or vague.  I (bc) have tried to write enough detail to clearly explain the
true meaning of these terms, depending on the context.  Let me know if anything
is unclear or inaccurate.

"Machines", "computers", "nodes", "cluster members", and sometimes "clients",
mean pretty much the same thing:  a computer (machine) that is a compute node
within a cluster of computers sharing the filesystem storage device(s).  If
using the memexp locking protocol, the memexp module on the computer will be
a "client" of the memexpd lock storage server.

"dinodes" are the OGFS version of an "inode".  "inode" is often used to mean
"dinode", but may also mean a struct inode defined by the kernel's Virtual
File System (VFS).


Requirements
------------

In a distributed file system, at least three types of locking are required:

1). Inter-node locking must guarantee file system and data consistency when
    multiple computer nodes try to read or write the shared file system in
    parallel.  In OpenGFS, these mechanisms are implemented by various "lock
    modules" that link into the filesystem code with the help of the "lock
    harness".

2). The file system code must protect the file system and the data structures
    in memory from parallel access by multiple processes on the node.  The
    "G-Lock" software layer takes care of this protection, and also decides
    whether communication with other nodes or a central lock server, via the
    lock module, is necessary.

3). The file system structures in kernel memory must be protected from
    concurrent access by multiple CPUs (Symmetric Multi-Processing) on the
    same node.  Linux spinlocks and/or other mutual exclusion (mutex)
    methods are used to achieve this.
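
As a minimal sketch of level 3 (the names here are hypothetical, not actual
OpenGFS symbols), a Linux 2.4-style spinlock serializing CPU access to a
shared in-memory counter:

   #include <linux/spinlock.h>

   static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;  /* 2.4-era initializer */
   static int foo_counter;                           /* shared among CPUs */

   void foo_increment(void)
   {
           spin_lock(&foo_lock);    /* shut out other CPUs on this node */
           foo_counter++;
           spin_unlock(&foo_lock);  /* ... but not other nodes! */
   }

Levels 1 and 2 are the subject of the rest of this document.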


Overview
--------

The following gives an overview of the locking hierarchy.  The left side
shows lock module loading/unloading (registering/unregistering with the
lock harness) and lock protocol mounting, and the right side shows all
other operations:

   +--------------------------------------------------------------------------+
   |                       File System Module (ogfs.o)                        |
   |                                           +-------------------+ +--------+
   |                                           |superblock, inodes,| |        |
   |        sdp->sd_lockstruct                 |flocks, journals,  |-|        |
   |         /                                 |resource grps, etc.| | glops  |
   |        /                                  +---------|---------+ |        |
   |-------/-----+             +-----------+   +---------|---------+ |        |
   |   mount     |             |   misc.   |   |      G-Lock       |-|        |
   +-------------+-------------+-----------+---+-------------------+-+--------+
            |                        |                    |
   +--------------------------+      |                    |
   |Harness Module (harness.o)|      | others_may_mount   | lock & LVB
   +--------------------------+      | reset_expired      | operations
                              \      | unmount            |
                     register  \     |                    |
                     mount      \    |                    |
                     unregister  \   |                    |
                                  +--------------------------------------+
                                  |     Lock Module (e.g. memexp.o)      |
                                  +--------------------------------------+
                                                      |
                                  +--------------------------------------+
                                  +     Lock Storage (e.g. memexpd)      +
                                  +--------------------------------------+

                                     
  Loading, mounting, unloading <--|--> Ongoing filesystem operations, unmount

   Fig. 1 -- Software modules and layers involved in locking


The current locking system is a big clump of different software layers that,
in addition to locking, execute several tasks that are not locking, per se.
Historically, these tasks were clumped together because they all relate to
coordinating the behavior of the cluster member nodes, and the lock server
has served double duty as the cluster membership service.  

Enhancement:  It might be a good idea to split off the functionality unrelated
to locking into independent components.

The lock harness serves two simple purposes:

 - maintaining a list of available-to-mount lock modules
 - connecting a selected module to the filesystem at lock protocol mount time
     (one of the first things done during the filesystem mount).

After protocol mount time, the harness module's job is done.

The locking modules and lock storage facility take care of:

 - Managing and storing inter-node locks and lock value blocks (LVBs)
 - Lock expiration (lock request timeout) and deadlock detection
 - Heartbeat functionality (are other nodes alive and healthy?)
 - Fencing nodes, recovering locks, and triggering journal replay in case
      of a node failure

The G-Lock software layer is a part of the file system code.  It handles:

 - Coordinating and caching locks and LVBs among processes on *this* node
 - Communication with the locking backend (lock module) for inter-node locks
 - Executing glops when appropriate (see below)
 - Journal replay in case of a node failure

The glops (G-Lock Operations) layer is also part of filesystem code.  It
implements the filesystem-specific, architecture-specific, and protected-item-
specific operations that must occur after locking or before unlocking, such as:

 - Reading items from disk, or from another node via Lock Value Block (LVB),
      after locking a lock
 - Flushing items to disk, or to other nodes via LVB, before unlocking a lock
 - Invalidating kernel buffers, once flushed to disk, so a node can't read
      them while another node is changing their contents.

Each lock has a type-dependent glops attached to it.  This attachment point is
the key to porting the locking system to other environments, and/or to creating
different types of locks and defining their associated behavior.


2. Lock Harness and Lock Modules
--------------------------------

At filesystem mount time, the filesystem uses the lock harness to mount a
locking protocol (see "Lock Harness", below).  The harness' mount call,
lm_mount(), fills in the lm_lockstruct contained in the filesystem's incore
superblock structure as sdp->sd_lockstruct.  This exposes the chosen lock
module's services to the filesystem.

struct lm_lockstruct contains:

  ls_jid           - journal ID of *this* computer
  ls_first         - TRUE if *this* computer is the first to mount the protocol
  ls_lockspace     - ptr to protocol-specific lock module private data structure
  ls_ops           - ptr to struct lm_lockops, described below
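
A sketch of the structure (field types here are approximations; see
src/include/lm_interface.h for the authoritative definition):

   struct lm_lockstruct {
           unsigned int       ls_jid;        /* journal ID of this node */
           int                ls_first;      /* first to mount protocol? */
           lm_lockspace_t    *ls_lockspace;  /* module-private instance data */
           struct lm_lockops *ls_ops;        /* module's function table */
   };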

"ls_jid" indicates the journal ID that should be used for *this* computer.
It is currently the lock module's job to discover this journal ID.  The "memexp"
lock module does this by reading information from a cluster information device
(cidev), which is a small disk partition dedicated for that purpose.  The
"nolock" lock module provides either "0", or the value of a "jid=" entry (if
used) in the command line for mounting the filesystem.  This is some of the
functionality that might be split off from the locking module, although journal
assignment might, as an alternative possibility, be handled dynamically through
the use of locks.

"ls_first" tells the filesystem code, at mount time, that *this* is the first
machine in the cluster to mount the filesystem.  If TRUE, this machine
replays *all* journals for the whole cluster, before allowing other machines
to complete their lock protocol mounts (and therefore their filesystem mounts).
If FALSE, the filesystem replays only the journal for *this* computer.  See
discussion in Appendix G, under "Calls to ogfs_glock_num()", recovery.c,
ogfs_recover_journal().  This is also some functionality that might be split
off from the locking component, although this functionality could also be
handled through the use of locks.

"ls_lockspace" points to a private data structure contained in, and for use by,
the lock module itself.  The filesystem/glock code includes this pointer in
certain calls to the lock module, but never accesses the structure directly.
The private data structure is typically the lock module's incore "superblock"
(as called, perhaps inappropriately, by some of the code), i.e. its master data
structure.  This is not to be confused with the module's private per-lock data
structures.

"ls_ops" provides the big hook through which the filesystem code (including
G-Lock layer) accesses the lock module.  Every lock module must implement
"struct lm_lockops" (see src/include/lm_interface.h), which contains the
following fields.  Most (but not all) of these are function calls implemented
by the locking module, and called by filesystem code.  See ogfs-memexp document
for details on one implementation of lm_lockops{}:

data fields:

  proto_name       - unique protocol name for this module, e.g. "nolock"
                       or "memexp".
  local_fs         - set to TRUE by nolock, FALSE by other protocols.
                       Filesystem code checks this when mounting, to enable
                       "localcaching" and "localflocks" for more efficient
                       operation in non-cluster (nolock) environment.
                       See man page for ogfs_mount.
  list             - element of protocol list maintained by lock harness.


cluster/locking/journaling functions called by lock harness or filesystem:

  mount            - initialize, start the lock module's locking functionality.
                       Called by lock harness' lm_mount() when mounting a
                       protocol onto the file system.  This call will block
                       for non-first-to-mount machines, until the first-to-
                       mount machine has replayed all journals, and has called
                       others_may_mount().  This does not apply to the nolock
                       protocol, but must work this way for any clustered
                       protocol (e.g. memexp).
  others_may_mount - indicate to other nodes that they may mount the filesystem.
                       Called by filesystem's _ogfs_read_super(), the fs mount
                       function, after this first-to-mount machine in the
                       cluster has replayed *all* journals, thus making the
                       on-disk filesystem ready to use by all nodes.  Other
                       machines will block within the mount() call, until
                       others_may_mount() is called by the first-to-mount node.
  unmount          - stop, clean up the lock module's locking functionality.
                       Called by filesystem's ogfs_unmount_lockproto() when
                       unmounting lock protocol from file system.
  reset_exp        - reset expired client node (from EXPIRED to USED / NOTUSED).
                       Called from filesystem's journal subsystem's
                       ogfs_recover_journal(), after this node replays journal
                       of expired node.


locking/LVB (Lock Value Block) functions called by G-Lock layer:

  get_lock         - allocate and initialize a new lm_lock_t (lock module
                       per-lock private data) struct on this node.  Does *not*
                       look for pre-existing structure.  Does *not* access
                       lock storage, or make lock known to other nodes.
  put_lock         - de-allocate an lm_lock_t struct on this node, release
                       usage of (perhaps de-allocate) an attached LVB
                       (memexp internally calls memexp_unhold_lvb(), its own
                       implementation of unhold_lvb, see below).
                       Accesses lock storage only if LVB action is required.
  lock             - lock an inter-node lock (alloc lock storage buff if needed)
  unlock           - unlock an inter-node lock (de-alloc storage buff if posbl)
  reset            - reset an inter-node lock (unlock if locked)
  cancel           - cancel a request on an inter-node lock (ends retry loop)
  hold_lvb         - find an existing, or allocate and initialize a new,
                       Lock Value Block (LVB)
  unhold_lvb       - release usage of (perhaps de-allocate) an LVB
  sync_lvb         - synchronize LVB (make its contents visible to other nodes)

The call prototypes can be found in "src/include/lm_interface.h".

See the ogfs-memexp document for more detail on the use and implementation of
these calls.
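
To give a feel for its shape, here is a rough sketch of the operations table;
the types and argument lists below are guesses, so consult
src/include/lm_interface.h for the real prototypes:

   struct lm_lockops {
           char             *proto_name;  /* e.g. "memexp" or "nolock" */
           int               local_fs;    /* TRUE only for nolock */
           struct list_head  list;        /* hook into harness' proto list */

           /* cluster/locking/journaling functions */
           lm_lockspace_t *(*mount)(char *table, char *data,
                                    lm_callback_t cb, lm_fsdata_t *fsdata,
                                    unsigned int *jid, int *first);
           void (*others_may_mount)(lm_lockspace_t *ls);
           void (*unmount)(lm_lockspace_t *ls);
           void (*reset_exp)(lm_lockspace_t *ls, unsigned int jid);

           /* locking/LVB function pointers follow: get_lock, put_lock,
              lock, unlock, reset, cancel, hold_lvb, unhold_lvb, sync_lvb */
   };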


Lock Harness
------------

The lock harness is a fairly thin abstraction within the OpenGFS locking
hierarchy.  Simply put, it is a plug-in point for different locking protocols.
To the lock modules, the harness offers services for protocol registration
and unregistration:

lm_register_proto()   -- adds module's lm_lockops->list to harness' list of
                           available modules
lm_unregister_proto() -- removes module from harness' list of available modules

These are called by the lock modules' module_init() or module_exit() functions.
Typically, these calls are the *only* things done by module_init() (called when
the kernel loads the module) or module_exit() (when unloading).  There is no
need to initialize any of the module's locking functionality until the kernel
mounts the filesystem (and the filesystem in turn selects and mounts the
particular module/protocol).
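
A lock module's skeleton therefore boils down to something like this (a
sketch; it assumes lm_register_proto() takes the module's lm_lockops pointer
and returns 0 on success):

   static struct lm_lockops nolock_ops = {
           .proto_name = "nolock",
           .local_fs   = 1,
           /* ... mount, unmount, lock/LVB operations, etc. ... */
   };

   static int __init nolock_init(void)
   {
           /* only job at load time: become mountable via the harness */
           return lm_register_proto(&nolock_ops);
   }

   static void __exit nolock_exit(void)
   {
           lm_unregister_proto(&nolock_ops);
   }

   module_init(nolock_init);
   module_exit(nolock_exit);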


To the filesystem code, the harness offers a locking protocol mounting service:

lm_mount() -- initializes module's locking functionality (via lm_lockops mount),
              fills in an lm_lockstruct, exposing the lock module's lm_lockops
                functions and a private data pointer, and indicating journal ID
                and first-to-mount status.

The protocol is mounted onto a specific file system during the filesystem
mount.  You can use different locking protocols, and/or different "lockspaces"
for the same protocol, for different OpenGFS filesystems on the same computer.
"Lockspace" seems to be a combination of Cluster Information Device (cidev) and
protocol instance.  A computer node mounting 3 separate OpenGFS filesystems,
each with memexp protocol, would need 3 different cidevs to describe the
clusters, one for each filesystem.  This is enforced by code in the memexp
module's memexp_mount(), but is not enforced, nor used at all, by the nolock
module.  cidevs are called out by "locktable" when mounting the filesystem
(see man page for ogfs_mount), or "table_name" or "table" within harness and
memexp code.

The calling sequence when mounting is:

 _ogfs_read_super()       -- mount filesystem, src/fs/arch_*/super_linux.c
  ogfs_mount_lockproto()  -- select lock protocol, src/fs/locking.c
  lm_mount()              -- mount lock protocol, src/locking/harness/harness.c

_ogfs_read_super() mounts the filesystem, and is architecture-dependent code.

ogfs_mount_lockproto() determines which lock protocol to mount, either by mount
options set by the user (see man page for ogfs_mount), or by reading the
filesystem superblock (for values set by mkfs.ogfs, see man page for mkfs_ogfs).

lm_mount() calls a module's lm_lockops->mount() function to initialize the
module's locking functionality.  It supplies the following parameters in the
call:

table                 -- Cluster Information Device (cidev) for memexp
data                  -- "hostdata" for protocol. For memexp: node's IP address
cb                    -- Callback pointer for module to call fs' G-Lock layer.
fsdata                -- Private filesystem data (the incore superblock pointer,
                            sdp) to be attached to callbacks from module to
                            G-Lock layer.
&lockstruct->ls_jid   -- Journal ID for *this* computer, to be filled in by
                            module.
&lockstruct->ls_first -- First-to-mount status for *this* computer, to be
                            filled in by module.

"lockstruct" is actually sdp->sd_lockstruct, contained in the filesystem's
in-core superblock structure.  lm_mount() fills in two other members
of sdp->sd_lockstruct, so that the filesystem can access the newly mounted
locking module's capabilities:

->ls_lockspace -- value returned from module's lm_lockops->mount() call,
                     module-private data (typically a pointer to the module's
                     in-core "superblock" structure).

->ls_ops       -- pointer to module's lm_lockops structure
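
Putting the pieces together, the core of lm_mount() reduces to roughly the
following (a sketch under the same assumptions as the lm_lockops sketch
earlier; find_registered_proto() and the error codes are invented here):

   int lm_mount(char *proto_name, char *table_name, char *host_data,
                lm_callback_t cb, lm_fsdata_t *fsdata,
                struct lm_lockstruct *lockstruct)
   {
           struct lm_lockops *ops;

           ops = find_registered_proto(proto_name);  /* hypothetical lookup */
           if (!ops)
                   return -ENOENT;

           /* module initializes itself, reports jid and first-to-mount */
           lockstruct->ls_lockspace = ops->mount(table_name, host_data,
                                                 cb, fsdata,
                                                 &lockstruct->ls_jid,
                                                 &lockstruct->ls_first);
           if (!lockstruct->ls_lockspace)
                   return -EIO;

           lockstruct->ls_ops = ops;  /* fs can now reach module directly */
           return 0;
   }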


The protocol is unmounted during the file system unmount process.  The calling
sequence is:

 ogfs_put_super()         -- src/fs/arch_*/super_linux.c
 ogfs_unmount_lockproto() -- src/fs/locking.c
 sdp->sd_lockstruct.ls_ops->unmount()  -- the lock module's unmount function

Note that the lock harness is not involved here!  Its job was done after it
filled in the module information in sdp->sd_lockstruct.  After that point,
the filesystem can reach the module directly, without using the harness.

The following diagram gives a complete overview of lock module usage.
It covers all calls to the module from all parts of filesystem and harness code.
Some calls have no functionality for the nolock module (see discussion on
Lock Modules elsewhere in this document, or ogfs-memexp for more details
on how the memexp module works).

    +-------------------------------------+
    |   super_linux.c (fs mount/unmount)  |
    +-------------------------------------+
             |                      |
             | lock protocol        |others_may_mount
             | mount/unmount        |
             v                      |
    +---------------------------+   |   +---------+  +-----------+
    | locking.c (lock mnt/umnt) |   |   | glock.c |  | journal.c |
    +---------------------------+   |   +---------+  +-----------+
         |                |         |      |             |
         |mount           |unmount  |      |all          |reset_exp(ired
         v                |         |      |lock/LVB     |  node)
    +--------------+      |         |      |operations   |
    |  harness.c   |      |         |      |             |
    +--------------+      |         |      |             |
     ^            |       |         |      |             |
     |register    |mount  |         |      |             |
     |unregister  |       |         |      |             |
     |            v       v         v      v             v
    +--------------------------------------------------------+
    |         lock module (memexp, nolock, or stats)         |
    +--------------------------------------------------------+

    Fig. 2 -- complete register/unregister, and lm_lockops usage

Pertinent source files are:

src/fs/arch_*/super_linux.c (architecture dependent filesystem code)
src/fs/locking.c, glock.c, journal.c (architecture independent filesystem code)
src/locking/harness/harness.c (lock harness kernel module code)
src/locking/modules/*/* (lock module source code)


Lock Modules
------------

Lock modules are kernel modules that implement the inter-node locking beneath
the G-Lock layer (see below).  Each provides a distinct locking protocol that
can be used in OpenGFS:

locking/modules/memexp -- provides inter-node locking, for cluster use
locking/modules/nolock -- provides "fake" inter-node locks, not for cluster use
locking/modules/stats -- provides statistics, stacks on top of another protocol

The "memexp" protocol supports clustered operation, and is fairly sophisticated.
The memexp modules, one on each node, work with each other to keep track of
cluster membership, and which member nodes own which locks.

The memexp protocol relies on a central repository of lock data that is shared
among all nodes, but is completely separate from filesystem and journals.
The repository can be either one or more DMEP (Device Memory Export Protocol)
devices (e.g. certain SCSI drives), usually those in an OpenGFS pool (see
ogfs-pool doc), or a "fake" DMEP server, the memexpd server.

The memexpd server runs on one computer, and emulates DMEP operation, but
stores data in memory or local disk storage, rather than shared disk storage,
and communicates with cluster members via LAN rather than SCSI.  Most users
use the memexpd server, rather than DMEP devices.  Source code is in
locking/servers/memexp directory.

For lots more information on memexp, see the ogfs-memexp document.

The "nolock" protocol supports filesystem operations on a single node, and is
much simpler than the memexp protocol.  Many of the lm_lockops functions are
stubbed out.  There is no central lock storage, but the module does store
a structure for each lock locally in a hash table.

The "stats" protocol provides statistics (e.g. number of calls to each
lm_lockops function, current and peak values of numbers of locks on inodes,
metadata, etc., and lock latency statistics) for a protocol stacked below it
(the "lower" protocol).  It looks like stats are printk()ed when the module
is *unmounted* ... I haven't found any other reporting mechanism.

To mount the stats on top of memexp, try the following options when mounting
the filesystem (see man ogfs_mount):

lockproto=lockstats  locktable=memexp:/dev/pool/cidev

(cidev is the device containing the cluster information for memexp, e.g.
/dev/pool/cidev, if you are using pool ... see HOWTO-generic, or HOWTO-nopool).

Code for parsing the lower protocol (e.g. memexp) from the locktable option,
and mounting it is in src/locking/modules/stats/ops.c, stats_mount().
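
Conceptually the split is simple string surgery, something like this
(a hypothetical sketch, not the actual stats_mount() code):

   /* locktable option arrives as e.g. "memexp:/dev/pool/cidev" */
   char *colon = strchr(table, ':');

   if (!colon)
           return -EINVAL;     /* no lower protocol specified */
   *colon = '\0';
   lower_proto = table;        /* "memexp" */
   lower_table = colon + 1;    /* "/dev/pool/cidev" */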


3. G-Locks (global locks)
-------------------------

The G-Lock layer is an abstract locking mechanism that is used by the file
system layer.  It provides a service interface that conveniently supports:

  -- using one of several available locking protocols (lock modules)
  -- executing filesystem- and protected-entity-type-specific actions (glops)
        before or after acting on a lock.
  -- dependencies and parent/child relationships that the filesystem may wish
        to impose on locks

The G-Lock layer's interfaces include:

  -- G-Lock services, presented to the filesystem code
  -- Lock module interface, between G-Lock layer and lock module
          -- Lock and LVB commands to lock module
          -- Callback from lock module to request lock release, journal replay
  -- glops hook for installing filesystem- and type-specific actions on
        each lock

In theory, the G-Lock layer should be usable in any other software, too.  The
glops "socket" provides the opportunity to use G-Lock with other filesystems,
and define new lock types and associated actions. 

Enhancement:  Pull the G-Lock code from the file system sources and put it into
a separate module compiled as a library.


Lock instances
----------------

The G-Lock layer interfaces between the file system layer and the locking
backend.  The G-Lock layer decides whether the lock is already within the
node (perhaps owned by another process, perhaps unowned), or whether it needs
to get the lock from "outside", that is, from the inter-node locking protocol.
If going "outside", G-Lock uses the lock module (true inter-node locking for
memexp, or "fake" inter-node for the nolock protocol).

A lock lives in (at least) two instances:

1.  In the locking backend, outside of the file system.  This inter-node lock
    may get passed around between the cluster member nodes by way of a central
    lock storage facility (in the case of memexp) or perhaps other methods,
    e.g. passing between nodes directly via LAN (for OpenDLM, a distributed
    lock manager, if/when a locking module is developed/integrated for OpenDLM).

    The backend's lock implementation can vary for different protocol modules.
    There are several data types defined in src/include/lm_interface.h as
    "void", to support this variability.  This allows the G-Lock layer to
    pass these private structures around in a generic way, but not to actually
    access them:

    lm_lock_t       -- generic lock.  Identifies instance of a lock within 
                         the module.  Current implementations:
                         me_lock_t    -- for memexp
                         nolock_lock  -- for nolock
                         stats_lock   -- for stats

    lm_lockspace_t  -- generic lock "space".  Identifies instance of lock
                         module (there can be several instances of the module
                         on a given node, one instance for each OGFS filesystem
                         mounted on the node).  Typically, it is the lock
                         module's "superblock" structure.
                         Current implementations:
                         memexp_t     -- for memexp
                         nolock_space -- for nolock
                         stats_space  -- for stats

    On the opposite side of the interface, the lock module carries an ID for
    the filesystem it is mounted on.  Just as filesystem code never accesses
    lock-module-specific structures, the lock module never accesses this data:

    lm_fsdata_t     -- generic filesystem data.  Identifies instance of
                         filesystem (there can be several OGFS filesystems
                         using the same module on a given node), when module
                         does a callback to G-Lock layer.  OGFS sets
                         this to be the filesystem's incore superblock
                         structure, usually seen as "sdp" in fs code.
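
    In src/include/lm_interface.h, these opaque types are essentially plain
    void typedefs:

        typedef void lm_lockspace_t;  /* module instance ("superblock") */
        typedef void lm_lock_t;       /* module's per-lock private data */
        typedef void lm_fsdata_t;     /* fs' incore superblock ("sdp") */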

    The representation of a lock within a locking backend is significantly more
    primitive than the G-Lock layer's representation; the interface between
    G-Lock and locking modules exchanges only a few basic parameters for each
    lock, thus limiting the knowledge that a lock module can have about it:

    -- lockname (64-bit lock number and 32-bit lock type)
    -- lock state (unlocked/shared/deferred/exclusive)
    -- attached lock value block (LVB), if any, and PERManent status of LVB
    -- cancellation (request from G-Lock to end a lock retry loop)
    -- flags attached to lock request from G-Lock:
        -- TRY (do not block if lock request can't be immediately granted)
        -- NOEXP (no expiration; allows dead node's lock to be held by this node)
    -- release state (request from backend for G-Lock layer to release lock)

    Other data relating to the lock, within the backend, is private to the
    lock module, and is used for implementing the locking management of the
    specific lock protocol.  Note that there is no awareness by the backend
    of inter-lock dependencies, parent/child relationships, process ownership,
    recursive locking, lock caching, glops actions, filesystem transactions,
    all of which are handled by, and confined to, the G-Lock layer (see below).

2.  Within the filesystem module, as a struct ogfs_glock (ogfs_glock_t) in
    kernel memory of a given node.  The file system layer knows only about the
    ogfs_glock_t structure (and nothing about the representation of a lock
    within a locking module).  Within the node, the structure is protected by
    semaphore-like code (the GLF_LOCK bit and wait queues; see below).

    The G-Lock layer handles locks at a significantly more sophisticated
    level than does a lock module.  It includes support for inter-lock
    dependencies, parent/child relationships, process ownership, recursive
    locking, lock caching, glops actions, filesystem transactions, and more.

    This data type is defined independently of the locking protocol, with
    no variability in its definition.  For a detailed description of this
    data type, please refer to "G-Lock Structure" below.


G-Lock Cache, G-Lock Daemon
-----------------------------------

The G-Lock cache stores and organizes ogfs_glock_t structures on *this* computer
node.  The cache is implemented as a hash table whose buckets each hold 3 chains:

  perm     -- glocks are currently locked by a process on this node, are
              expected to be locked for a long time, and are locked at
              inter-node scope.

              Filesystem code may request this chain by using the GL_PERM flag
              in a call to ogfs_glock() or any of its wrappers.  Used for:
              OGFS_MOUNT_LOCK, OGFS_LIVE_LOCK, journal index, flocks,
              plocks (POSIX locks), and dinodes (when reading into inode cache).

              This chain allows searches for glocks to be more efficient.
              Some searches start with glock chains ("notheld" or "held") that
              are more likely to hold the search target, and leave the "perm"
              chain until last.  A search for an unlocked glock can skip the
              "perm" chain altogether.

              Glocks in this chain move immediately to the "held" chain when
              unlocked for the last time (recursive locking) by a process.

              Glocks in this chain have the GLF_HELD and GLF_PERM flags set.


  held     -- glocks are currently or recently locked by a process on this node,
              and are locked at inter-node scope.

              Glocks typically stay in this cache chain for 5 minutes after
              being unlocked for the last time (recursive locking) by a
              process.  This node retains the lock at inter-node scope, so
              the glock is ready to be quickly locked again by a process,
              without negotiating with the lock module.  After the 5 minute
              timeout, the glockd() cleanup daemon releases the inter-node
              lock, and moves the glock to the "notheld" chain.

              The inter-node lock may be released before the 5 minute timeout,
              by request of a NEED or DROPLOCKS callback from another node.
              When the inter-node lock is released, the glock moves to the
              "notheld" chain.

              Locks in this chain have the GLF_HELD flag set, GLF_PERM unset.

              !!!??? (dv) Holding locks may be harmful on systems that write
              data more often than they read it.  Should this be tuneable?


  notheld  -- glocks are not locked by any process on this node, and are
              *not* locked at inter-node scope.

              if gl_count == 1, some process has some interest in the glock,
              even though it is not locked (process could be getting ready
              to lock a glock, etc.).

              if gl_count == 0, no process has an interest in the lock,
              contents of lock structure are meaningless,
              and the structure is free to be re-used for another glock
              (see new_glock() in glock.c) or be de-allocated.

              Locks in this chain have the GLF_HELD and GLF_PERM flags unset.


Glock structures are first allocated and placed into the notheld cache via the
ogfs_get_glstruct() call.  For kernel-space code, glock structures are
allocated using the following kernel call:

/* allocate from the glock slab cache; GFP_NOFS keeps the allocator from
   recursing into the filesystem if it needs to reclaim memory */
kmem_cache_alloc(ogfs_glock_cachep, GFP_NOFS);

G-Locks are "promoted" from notheld -> held -> perm, and "demoted" from
perm -> held -> notheld, always one step at a time (never moving directly
between notheld and perm).  ogfs_glock() handles a GL_PERM request in two
stages, first putting the lock into "held", then bumping it to "perm".
ogfs_gunlock() moves it back down to "held" when a GL_PERM glock is unlocked.

ogfs_put_glstruct() decrements gl->gl_count, the reference/usage/access count
for code accessing the structure contents.  ogfs_put_glstruct() does nothing
more than that.  The system relies on periodic garbage collection, performed
by the G-Lock kernel daemon, ogfs_glockd(), to de-allocate these structures.
_ogfs_read_super() launches it during filesystem mount, and schedules it to
run once every 5 seconds.
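
In outline, the daemon is a simple periodic loop (a 2.4-style sketch; the
ogfs_sbd_t type name and the stop flag are assumptions, and the real
arch-specific code also deals with signals and clean shutdown):

   int ogfs_glockd(void *data)
   {
           ogfs_sbd_t *sdp = (ogfs_sbd_t *)data;  /* incore superblock */

           while (!sdp->sd_glockd_stop) {         /* hypothetical stop flag */
                   ogfs_glockd_scan(sdp);         /* garbage-collect glocks */
                   set_current_state(TASK_INTERRUPTIBLE);
                   schedule_timeout(5 * HZ);      /* sleep ~5 seconds */
           }
           return 0;
   }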

ogfs_glockd() is implemented in src/fs/arch_*/daemon.c, since the hooks to
the daemon scheduling mechanism are architecture-dependent.  However, the
real work is done by architecture-independent ogfs_glockd_scan(), in glock.c,
which calls:

   -- ogfs_pitch_inodes().  This is arch-independent, in glock.c.  It scans
      through the "held" and "notheld" glock cache chains, looking to destroy
      inactive inode structures that were under the protection of glocks that
      are no longer held by any process on this node.

      The glock cache scan happens every time ogfs_pitch_inodes() is called
      (typically every 5 seconds, when called from glockd daemon).  In
      addition, no more often than once every 60 seconds, ogfs_pitch_inodes()
      calls ogfs_drop_excess_inodes(), which cleans the *kernel's* directory
      cache and inode cache.  

      ogfs_drop_excess_inodes() is an arch-specific (kernel vs. user-space)
      routine, see src/fs/arch_linux_2_4.  It calls kernel functions:
      -- shrink_dcache_sb(), toss out unused(?) directories from kernel's dcache
      -- shrink_icache_sb(), toss out unused(?) inodes from the kernel's icache.
      Order is important, since a directory uses an inode.  Freeing a
      directory makes its inode unused, so it in turn can be freed.
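
      Given that ordering constraint, the routine is essentially (a sketch;
      the ogfs_sbd_t and sd_vfs names are assumptions):

          void ogfs_drop_excess_inodes(ogfs_sbd_t *sdp)
          {
                  struct super_block *sb = sdp->sd_vfs;  /* kernel's sb */

                  shrink_dcache_sb(sb);  /* free unused dentries first ... */
                  shrink_icache_sb(sb);  /* ... so their inodes can go too */
          }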

   -- scan_held_glocks().  This gets called once for every glock cache hash
      bucket (all 512 of them).  Scans the "held" cache chain for glocks which:
      a).  Have dependencies.  Mark these with GLF_DEPCHECK for a bit later.
      b).  Are no longer locked (i.e. gl_locked == 0) by any process on this
           node.  There are two possible situations in this case:

           1) (timeout threshold reached || GLF_DEMOTEME) && !GLF_STICKY
              In this situation, the glock is free to be dropped, so we drop
              it via drop_glock() (see section "Caching G-Locks, Callbacks").
              This releases this node's inter-node lock corresponding to the
              glock, and moves the glock structure into the "notheld" chain.

              A glock normally sits in the "held" chain for a while after
              all processes on this node have unlocked it.  While held, the
              lock does not change state with regard to the locking module
              (i.e. the inter-node lock status stays the same).  This keeps
              the glock ready for use should a process in this node need it
              again.  The timeout, however, allows the lock to be automatically
              dropped (i.e. this node gives up its inter-node lock), if it
              hasn't been recently used.

              The glock structure's gl_stamp member is used to remember when
              major changes of state occur to the glock.  G-Lock code marks
              the time when it:
              -- gets a glock structure via ogfs_get_glstruct()
              -- unlocks the lock via ogfs_gunlock()
              -- moves lock from "held" to "notheld" cache chain via
                    unhold_glock()
              The unlock and ogfs_get_glstruct() situations are the ones that
              apply here (get_glstruct() can find the glock in the held chain),
              and this seems to be the only place in which gl_stamp is ever
              tested for a timeout.

              The timeout is passed as a parameter to scan_held_glocks(),
              passed in turn from ogfs_glockd_scan(), passed in turn from:

              -- ogfs_glockd(), the glock cleanup daemon, with
                 timeout = sdp->sd_gl_heldtime.  This is set by
                 _ogfs_read_super(), the filesystem mount function,
                 to be 300 seconds.  So, normally, a no-longer-used lock
                 will stay in a node's glock "held" chain for 5 minutes.

              -- ogfs_glock_cb(), the lock module callback function to the
                 glock layer, for LM_CB_DROPLOCKS, with timeout = 0.  This
                 will cause any unheld lock to be dropped from the "held" cache
                 chain, regardless of how long it has been there.

              The GLF_DEMOTEME flag is used by *this* node (other nodes use the
              DROPLOCKS or the NEED callbacks) to remove a no-longer-used glock
              from the "held" cache before it times out.  OGFS code sets it in
              only one situation.  rm_bh_internal() (see fs/arch_*/dio_arch.c)
              sets it when it has removed the last buffer from the
              arch-specific list, attached to the glock, that is used for
              invalidating buffers (when releasing a lock?).

              The GLF_STICKY flag is used to keep a glock in the glock cache,
              even after it times out after 5 minutes of non-use by this node.
              OGFS uses it for only 3 locks that get used throughout an OGFS
              session:
              -- resource index inode
              -- journal index inode
              -- transaction

           2) Not timed out, or STICKY flag is set.  In this situation, we
              cannot drop the lock, but we check here for the GLF_DEPCHECK
              flag that was set earlier in the function.  If set, we sync to
              disk all data protected by the locks that are dependent on this
              lock, via sync_dependencies().

              Note that for the DROPLOCKS callback, the timeout is 0, so 
              a non-STICKY glock will always be dropped rather than having
              dependencies sync'd.
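
      Pulling the two situations together, the decision for an unlocked
      glock is roughly this (a sketch of the logic described above, not the
      literal source; timeout is in seconds):

          if ((time_after_eq(jiffies, gl->gl_stamp + timeout * HZ) ||
               test_bit(GLF_DEMOTEME, &gl->gl_flags)) &&
              !test_bit(GLF_STICKY, &gl->gl_flags)) {
                  /* situation 1: release inter-node lock, -> "notheld" */
                  drop_glock(gl);
          } else if (test_bit(GLF_DEPCHECK, &gl->gl_flags)) {
                  /* situation 2: cannot drop; sync dependents to disk */
                  sync_dependencies(gl);
          }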
              
After calling ogfs_pitch_inodes() and scan_held_glocks(), all excess inter-node
locks have been released to the cluster, and all corresponding glocks have been
moved to the "notheld" glock cache chain.  ogfs_glockd_scan() then:

   -- scans through "notheld" chain, looking for any glock with no access
      interest from any process (i.e. gl_count == 0).  It removes any such
      structures from the glock cache altogether, and calls release_glstruct()
      which calls:
        -- ls_ops->put_lock() to tell lock module to reclaim its private
           data structure attached to the glock.
        -- ogfs_free_glock() to de-allocate the glock structure via kernel's
           kmem_cache_free() call


The G-Lock cache needs cleaning up in a couple of other situations, as well,
but these are handled outside of the ogfs_glockd daemon, per se:

1. LM_CB_DROPLOCKS callback (see ogfs_glock_cb()) from lock module, when the
   lock module's lock storage becomes rather full (but before it becomes
   completely full).  See ogfs-memexp doc.  The callback asks the glock layer
   to free any
   glocks that are in the glock cache, but not actually being locked by any
   process.  The callback routine calls the following:

   -- ogfs_drop_excess_inodes(), to clean out kernel's directory cache and
      inode cache (as described above).

      Note that ogfs_drop_excess_inodes() is conditionally called by
      ogfs_pitch_inodes() (see directly below).  This is no more often than
      every 60 seconds, however, hard coded in ogfs_pitch_inodes().  Calling
      ogfs_drop_excess_inodes() explicitly from the DROPLOCKS callback
      immediately cleans up as many inodes as possible, without regard to
      how recently it has been done before.

   -- ogfs_pitch_inodes(), to clean out inactive inodes attached to glocks
      in the "held" and "notheld" cache chains (as described above).

      Enhancement:  We might be able to leave this call out of the callback
      code, since ogfs_glockd_scan() (see directly below) also calls it.

   -- ogfs_glockd_scan().  Also arch-independent, in glock.c.  This is
      the same function called by the glock daemon to clean up no-longer-held
      glocks from the glock cache (see above).

2. Filesystem unmount (called from super_linux.c, just prior to lock module
   unmount), using:

    -- ogfs_clear_gla()


Caching G-Locks, Callbacks
--------------------------

One performance-critical feature of the G-Lock layer is holding (or "caching")
locks.  After a node acquires an inter-node lock, a process takes ownership of
the glock, does the needed job, and then releases the glock internally; the
inter-node lock, however, is not released immediately to the locking backend
(unless another node has requested the lock).  Instead, the G-Lock layer
retains the glock in the G-Lock cache for a while (see "G-Lock Cache, G-Lock
Daemon").

In case some *other* node needs an incompatible lock (e.g. needs a shared lock,
when this node holds an exclusive lock, or needs an exclusive lock, when this
node holds a lock of any sort), the other node's locking backend calls *this*
node, via this node's backend, and thence via the ogfs_glock_cb() function in
glock.c, to ask *this* node to yield the lock.

  LM_CB_NEED_E    - need exclusive lock
  LM_CB_NEED_D    - need deferred lock
  LM_CB_NEED_S    - need shared lock

Note that there is no way for a node to ask the filesystem on another
node to suspend or "hurry up" a current operation (e.g. a write transaction or
a read operation) and give up a lock.  The operation continues until the
filesystem completes it and unlocks the lock at the normal time.

The "NEED" messages are simply advisory as to how the other (requesting) node
will use the lock, once the filesystem code on this node is done with it.
The NEED callbacks do the following (see ogfs_glock_cb() in src/fs/glock.c):

1).  Search "held" and "perm" glock cache chains for requested lock.
     -- If not found, this node doesn't hold the lock any more, simply
        return from call.  

        Caveat:  If, somehow, this node thinks it
        doesn't hold the lock, but lock storage *does* show this node as a
        holder, an infinite loop is created as the other node keeps
        requesting that this node release the lock.  This shouldn't happen,
        of course, but it actually does seem to happen occasionally.

        Enhancement:  Allow this node to repeat the lock release attempt,
        to eliminate the infinite loop.

     -- If found, continue to next step.

2).  Mark glock's gl_flags field with GLF_RELEXCL, GLF_RELDFRD, or GLF_RELSHRD,
     as appropriate, based on the request from the other node.  Note that
     the flag gets set regardless of whether any process has exclusive access
     to the glock structure (via the GLF_LOCK bit in gl_flags).  GLF_LOCK has
     nothing to do with lock state, but just means that some process is in
     the middle of manipulating the glock structure's contents at the moment.

3).  Try to get exclusive access to the structure via GLF_LOCK.
     -- If cannot, simply return from call (after decrementing gl_count field,
        which was incremented when the glock was found in the glock cache).
        We cannot try to release the lock while a process manipulates it.
     -- If can, continue to next step.

4).  Check gl_locked to see if any process on this node has the lock locked.
     -- If so, we cannot release the lock to another node.  We must wait until
        all processes on this node have unlocked the lock.  Simply return
        from call (after releasing exclusive access via GLF_LOCK, and
        decrementing gl_count field).
     -- If not, continue to next step.

5).  If no process (on this node) has the lock locked, we can immediately
     proceed to make the lock available to the other node, by releasing our
     node's hold on the lock:

     a).  Check *inter-node* lock state (as we see it in our local glock struct)
          to make sure it is indeed locked.
          -- if *not* locked, we figure that the other node should already
             have what it needs ... simply return from call (after releasing
             the exclusive access to the structure, and decrementing gl_count).

             Caveat:  If, somehow, this node thinks the lock is unlocked,
             but lock storage thinks it *is* locked, and shows this node as a
             holder, an infinite loop is created as the other node keeps
             requesting that this node release the lock.  This shouldn't happen,
             of course, but it actually does seem to happen occasionally.

             Enhancement:  Allow this node to repeat the lock release attempt,
             to eliminate the infinite loop.

          -- If locked (as expected), proceed to next step.

6).  Release this node's hold on the lock at inter-node scope.  This is done
     in one of two ways, depending on the NEED request:

     -- EXCLUSIVE calls drop_glock() function, to unlock this node's lock at
        inter-node scope, and remove this lock from this node's glock cache.
        drop_glock() does the following, some via cleanup_glock() and
        sync_dependencies(), and their calls to glops:

        a).  sync data associated with glock's dependent glocks, via
             gl_ops->go_sync(), to disk.
        b).  drop_glock() any of this glock's children glocks.  This includes
             syncing any of their associated data to disk, and that of their
             dependencies and children, etc.
        c).  sync glock's data to disk via gl_ops->go_release(), which also
             writes LVB info if glock is on a resource group.
        d).  call lock module, via ls_ops->unlock(), to unlock this node's hold
             on the inter-node lock.
        e).  move glock to "notheld" glock cache chain in this node.

     -- SHARED or DEFERRED calls xmote_glock() function, to change the
        inter-node lock state, and update the state in this node's glock
        cache.  xmote_glock() does the following, some via cleanup_glock()
        and sync_dependencies(), and their calls to glops:

        a).  Check if inter-node lock state (as we see it in our local glock
             structure) is already in the requested state.
             -- if so, we're done.  Simply return from the call.  (Caveat:
                does this have the same problems mentioned above about glock
                cache and lock storage disagreeing on state??).
             -- if not, proceed to next step.
        b).  call cleanup_glock() to sync dependents' and children's data to
             disk.  (same as steps a, b, c above).
        c).  call lock module, via ls_ops->lock() to change inter-node lock
             state to requested one.
        d).  update cached glock to reflect new status returned from lock
             module (including setting GLF_RELEXCL/DFRD/SHRD if locking module
             knows of another queued request(?)).
        e).  call gl_ops->acquire() to load fresh LVB data from locking module
             if needed.

If the callback function could not immediately satisfy the request of the other
node, the GLF_RELEXCL/DFRD/SHRD bits store the fact that another node wants the
lock.  When the filesystem unlocks a lock, the ogfs_gunlock() function checks
the following in the glock structure:

-- gl_count to see if any other processes have a hold on the lock.  If not, we
   can release the lock to another requesting node, if there is one.

-- gl_flags field for the GLF_RELEXCL/DFRD/SHRD bits.  If so, it calls either
   drop_glock() (for exclusive) or xmote_glock() (for deferred or shared).
   These are the same functions called by the callback, described above.
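
Schematically, that tail of ogfs_gunlock() amounts to the following (a sketch
of the checks just described; the exact xmote_glock() argument shape is a
guess):

   /* last process on this node has unlocked; honor any pending request */
   if (test_bit(GLF_RELEXCL, &gl->gl_flags))
           drop_glock(gl);                   /* other node wants exclusive */
   else if (test_bit(GLF_RELDFRD, &gl->gl_flags))
           xmote_glock(gl, LM_ST_DEFERRED);  /* demote, stay cached */
   else if (test_bit(GLF_RELSHRD, &gl->gl_flags))
           xmote_glock(gl, LM_ST_SHARED);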


G-Lock Structure
----------------

The following paragraphs describe each member of struct ogfs_glock.  One
such structure exists for each G-lock.


1.  G-Lock cache hash table

  struct list_head gl_list   -- Hash table hook
  unsigned int     gl_bucket -- Hash bucket that we inhabit

  See "G-Lock Cache", above.


2.  Lock name

  lm_lockname_t    gl_name   -- Unique "name" (but not a string!) for lock

  The lockname structure has two components:

    uint64         ln_number -- lock number
    unsigned int   ln_type   -- type of protected entity

  For most locks, the lock number is the block number (within the filesystem's
  64-bit linear block space, which can span many storage devices) of the
  protected entity, left shifted to be equivalent to a 512-byte sector.
  Details are in src/fs/glock.c, ogfs_blk2lockname().

  As an example, if we wanted to protect an inode at block 0x100, and we
  are using 4-kByte blocks, the lock number would be 0x0800 (0x100 << 3).

  I believe the block-to-sector conversion is for support of hardware-based
  DMEP protocols, which address the DMEP storage space in terms of 512-byte
  sectors.  This could turn out to be problematic in *very large* 64-bit
  filesystems, if they want to use the upper 3 bits of the 64-bit block
  number.
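
  A sketch of the conversion (the field and variable names are assumptions;
  with 4-kByte blocks, the block-size shift is 12, giving the left shift of
  3 seen in the example above):

    /* convert fs block number to a 512-byte-sector-based lock number */
    name->ln_number = blkno << (sdp->sd_sb.sb_bsize_shift - 9);
    name->ln_type   = lock_type;  /* e.g. LM_TYPE_INODE, see below */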

  There is a special lock for the disk-based superblock, defined in
  src/fs/ogfs_ondisk.h.  Note that this lock is not based on the block
  number (the superblock is *not* stored in block 0):

    OGFS_SB_LOCK      (0) -- protects superblock read accesses from fs upgrades

  In addition to the block-based number assignments, OpenGFS uses some
  special, non-disk lock numbers.  They are defined in src/fs/ogfs_ondisk.h
  (even though they don't show up on disk!):

    OGFS_MOUNT_LOCK   (0) -- allows only one node to mount at a time.
                               Note: same lock number as OGFS_SB_LOCK,
                               but different type, so different lock!
    OGFS_LIVE_LOCK    (1) -- protects??
    OGFS_TRANS_LOCK   (2) -- protects journal recovery from journal transactions
    OGFS_RENAME_LOCK  (3) -- protects file/directory renaming/moving operations

  See "Special Locks" below for more details.

  The lock type is determined by the glops attached to the ogfs_glock()
  call to request the lock.  See "glops", elsewhere in this document.  Lock
  types are defined in src/include/lm_interface.h:

    LM_TYPE_RESERVED     (0x00) -- not used by OpenGFS
    LM_TYPE_CIDBUF       (0x01) -- cluster information device, used by memexp
    LM_TYPE_MOUNT        (0x02) -- mount, used by memexp
    LM_TYPE_NONDISK      (0x03) -- special locks
    LM_TYPE_INODE        (0x04) -- inodes
    LM_TYPE_RGRP         (0x05) -- resource groups
    LM_TYPE_META         (0x06) -- metadata
    LM_TYPE_NOPEN        (0x07) -- n-open
    LM_TYPE_FLOCK        (0x08) -- Linux flock
    LM_TYPE_PLOCK        (0x09) -- POSIX file lock
    LM_TYPE_PLOCK_HEAD   (0x0A) -- POSIX file lock head
    LM_TYPE_LVB_MASK     (0x80) -- Lock Value Block, ORd with other type number

  Note that there is no lock type for individual data blocks.  The glock
  layer inserts individual data blocks into a list of protected blocks
  associated with each glock.  For example, a locked inode may have many
  data blocks attached to its glock.

  Since the lock name is dependent on *both* the lock number and the type,
  ogfs can request more than one unique lock (each of a different type) on
  the same filesystem block or static lock number.

  As an example, ogfs_createi() (create a new inode), locks two locks on
  the same lock number (current OpenGFS implementation sets
  inum.no_formal_ino = inum.no_addr), but different lock types/glops:

  -- (inum.no_formal_ino), in exclusive (0) mode, using ogfs_inode_glops
  -- (inum.no_addr), in shared (GL_SHARED) mode, using ogfs_nopen_glops

  Even though one of them is exclusive, they will both succeed, since they
  are, indeed, different locks.
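
  A lockname comparison must therefore test both fields; a helper would look
  like this (sketch):

    /* locknames are equal only if number AND type both match */
    static inline int lm_name_equal(lm_lockname_t *a, lm_lockname_t *b)
    {
            return (a->ln_number == b->ln_number) &&
                   (a->ln_type   == b->ln_type);
    }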


3.  Reference count

  atomic_t         gl_count  - Reference/usage count of ogfs_glock structure

  This represents a depth of reference/usage/access for code reading or
  writing the structure contents.  It does *not* represent anything regarding
  lock state, recursive locking, or exclusive access to a glock structure.

  ogfs_get_glstruct()  increments gl_count if structure found in glock cache,
                       or sets gl_count = 1 if new alloc (and does lots more!)
  ogfs_put_glstruct()  decrements gl_count (and does nothing more!)
  ogfs_hold_glstruct() increments gl_count (and does nothing more)

  gl_count > 0 keeps the glockd daemon from removing a glock from the
  "notheld" glock cache chain and de-allocating its structure.


4.  Flags

  unsigned long    gl_flags  - Flags

  These appear to be mostly (except for LOCK, SYNC, DIRTY, POISONED,
  RECOVERY(?)) for glock cache maintenance.

  GLF_HELD     - lock is held by a process (in "held" or "perm" glock cache)
                 set/reset only within glock.c by:
                 hold_glock(), unhold_glock()
  GLF_PERM     - lock is expected to be held for a long time (in "perm" cache)
                 set/reset only within glock.c by:
                 perm_glock(), unperm_glock()
  GLF_LOCK     - mutex for exclusive access to all glock structure fields.
                 set/reset only within glock.c by:
                 try_lock_on_glock(), lock_on_glock(), unlock_on_glock()
  GLF_SYNC     - sync data and metadata to disk when process releases lock
  GLF_DIRTY    - the incore data/metadata !!!??? has changed
  GLF_POISONED - transaction failed
  GLF_RELEXCL  - another computer node needs this lock in exclusive mode.
                 don't cache it (just drop it) when process releases it.
  GLF_RELDFRD  - another computer node needs this lock in deferred mode.
                 keep it cached in this node's "held" chain when process
                 releases it, in case this node needs it again.
  GLF_RELSHRD  - another computer node needs this lock in shared mode.
                 keep it cached in this node's "held" chain when process
                 releases it, in case this node needs it again.
  GLF_STICKY   - don't demote this glock.  Used only in glocks for riinode,
                 jiinode, and transaction.
  GLF_DEMOTEME - demote this glock.  Used by arch-specific code to indicate
                 that there are no more buffers covered by this glock.
  GLF_DEPCHECK - indicates that lock has dependencies.  Used only within
                 scan_held_glocks().
  GLF_RECOVERY - Set by ogfs_glock() when lock request has LM_FLAG_NOEXP.
                 Normally, ogfs_glock() resets this before returning.
                 In some error cases, though, it does not.


5.  Lock Structure Ownership (Locking by a process)

  long           gl_pid - Process ID of process, if any, that owns the struct,
                          0 if no owner, -1 if GL_DISOWN
  atomic_t       gl_locked - recursive count of process ownership
  spinlock_t     gl_head_lock - spinlock that covers above 2 fields (only)

  An audit shows that gl_pid is always covered by spinlock gl_head_lock.
  gl_locked is sometimes covered by GLF_LOCK (which covers *entire* struct)
  instead of gl_head_lock.

  Once a node has acquired a lock, it must prevent corruption of its protected
  resource (inode, block, etc.) by multiple processes on the node (which can
  have more than one CPU).  This protection is achieved through the concept of
  ownership of the ogfs_glock_t structure.
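
  In outline, taking ownership looks something like this (a sketch built from
  the fields above, not the actual code; GL_DISOWN is explained next):

    spin_lock(&gl->gl_head_lock);
    if (gl->gl_pid == 0 || gl->gl_pid == current->pid) {
            /* unowned, or a recursive request by the current owner */
            gl->gl_pid = (flags & GL_DISOWN) ? -1 : current->pid;
            atomic_inc(&gl->gl_locked);      /* recursion depth */
            spin_unlock(&gl->gl_head_lock);
            return 1;                        /* ownership granted */
    }
    spin_unlock(&gl->gl_head_lock);
    return 0;  /* owned by another process; wait on gl_wait_unlock */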

  Using the GL_DISOWN flag, the requesting process can ask that it not be
  recorded as the owner of the structure.  This effectively prevents the same
  process from further requesting recursive ownership of the structure, and
  allows other processes to unlock the lock (is this sharing, or not?
  !!!investigate how this is really used).

  Caveat:  There is no concept of a shared (read only) ownership of the
  structure within a node.  Thus, all read operations on the protected resource
  are serialised within the node.  !!!Investigate how much of a performance
  penalty this is.

  Caveat:  Because of a race condition between the request_glock_ownership() and
  request_glock_wait_or_abort() functions, requests for ownership can be
  processed out of order, i.e. a process that requests ownership later than
  another process may be granted ownership first.  !!!Investigate if this can
  cause deadlocks.  !!!Investigate if a simple semaphore could be used instead.

  Caveat:  A deadlock occurs if a process requests ownership with GL_DISOWN and
  later requests the same ownership again.  !!!Investigate if this can happen.


6.  Waiting for process' exclusive access to structure

  wait_queue_head_t gl_wait_lock - Wait queue for exclusive access to glock
    fields (see GLF_LOCK)

  gl_wait_lock is a wait queue used for inter-process (not inter-node)
  coordination.  This is used with the GLF_LOCK bit in gl_flags, which provides
  exclusive access to the fields of the glock structure, but does *not*
  indicate anything relating to lock state!
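
  This is the usual kernel pattern for a bit-based mutex; roughly what
  try_lock_on_glock(), lock_on_glock() and unlock_on_glock() are expected
  to look like (a sketch, not the verbatim glock.c source):

  #include <linux/sched.h>
  #include <asm/bitops.h>

  static inline int try_lock_on_glock(ogfs_glock_t *gl)
  {
          return !test_and_set_bit(GLF_LOCK, &gl->gl_flags);
  }

  static void lock_on_glock(ogfs_glock_t *gl)
  {
          /* sleep until we win the race for the GLF_LOCK bit */
          wait_event(gl->gl_wait_lock, try_lock_on_glock(gl));
  }

  static void unlock_on_glock(ogfs_glock_t *gl)
  {
          clear_bit(GLF_LOCK, &gl->gl_flags);
          wake_up(&gl->gl_wait_lock);
  }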


7.  Waiting for process' ownership of lock

  wait_queue_head_t gl_wait_unlock - Wait queue for glock to be unlocked
         by another process.

  This is internal to this node, and does not relate to the inter-node lock
  state, which must be locked if a process owns it, and will continue to be
  locked as the new process takes ownership.

  This has nothing to do with gl_wait_lock!  This is the wait queue for a
  process to wait until another process is done with the lock.


8.  Lock operations

  ogfs_glock_operations_t *gl_ops - Operations which get called at certain
         events over the lifetime of a glock (e.g. just after locking
         a lock, or just before unlocking one).

  See separate section on glops.


9.  Inter-node Lock State

  unsigned int      gl_state - The inter-node state of the lock

On each node, a lock can be in one of the following states:

LM_ST_UNLOCKED -- the node has not acquired the lock.

LM_ST_EXCLUSIVE -- the node has acquired the lock and no other node
may own or acquire the lock before it is released (write lock).

LM_ST_SHARED -- the node has acquired the lock and other nodes may own or
acquire it while this node owns it (read lock).

LM_ST_DEFERRED -- another shared mode, but cannot be shared with LM_ST_SHARED.
Note:  It is unclear to me if and how this mode is used.  If it is used,
the memexpd server seems to be the one to request that mode on its own,
without being told to do so by a node.

Currently, lock modes refer only to inter-node (not inter-process nor SMP)
locking.  Therefore, a node may own the lock and hold it in exclusive, shared,
or deferred state, even though no process on the node currently has the glock
locked.  The glock will be on the "held" glock cache chain in this situation.
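
The mode compatibility implied by the above can be summarized as follows.
The DEFERRED/DEFERRED entry is inferred from "another shared mode", and is
my (bc) assumption, not verified against the lock modules:

                           requested by another node:
  held on this node:       SHARED     DEFERRED     EXCLUSIVE
  LM_ST_SHARED              yes         no            no
  LM_ST_DEFERRED            no          yes           no
  LM_ST_EXCLUSIVE           no          no            no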


10. Lock Module Private data

  lm_lock_t       *gl_lock - Per-lock private data for the lock module

  The private data is never accessed by glock or filesystem layer code,
  but these layers may pass around the pointer for use by the lock module.
  This pointer is included in almost every call from the G-Lock layer to
  the lock module (usually seen as gl->gl_lock).  Not to be confused with
  module private data pointer saved as sdp->sd_lockstruct.ls_lockspace,
  which is not per-lock data, but rather module instance data.


11. Lock Relationships

  struct ogfs_glock  *gl_parent - This lock's parent lock (NULL if no parent)
  struct list_head    gl_parlist - Parent's list of children
  struct list_head    gl_children - List of children of this lock

  Locks may be attached to a parent when allocating a lock with
  ogfs_get_glstruct().  This fills the gl_parent member of this lock's
  glstruct, and adds this lock to the parent's gl_children list.

  Locks with identical gl_name values (i.e. identical lock number and type),
  but attached to different parents, are considered unique and separate locks.
  See find_glstruct().

  I (bc) haven't been able to find where this is used by OpenGFS.


12. Lock Value Blocks

  unsigned int        gl_lvb_count - Number of LVB references held on this glock
  lm_lvb_t           *gl_lvb - LVB descriptor (which points to data)

The inter-node lock has a data area that can be used to store global data and
communicate that data to other nodes that acquire the lock.  This "lock
value block" (LVB) currently has a size of 32 bytes.  The G-Lock layer provides
a function interface to attach and detach data to/from a lock's LVB.

LVBs are used with inter-node locks on resource groups, to pass resource usage
statistics from node to node, when exchanging locks (see "Locking Resource
Groups").  LVBs are also used for plocks (POSIX locks).


13. Version number

  uint64             gl_vn - Version number (incremented when cached data is
                             no longer valid)


14. Timestamp

  osi_clock_ticks_t  gl_stamp - Time of create or last unlock

  The glock structure's gl_stamp member is used to remember when
  major changes of state occur to the glock.  G-Lock code marks
  the time when it:
  -- gets a glock structure via ogfs_get_glstruct()
  -- unlocks the lock via ogfs_gunlock()
  -- moves lock from "held" to "notheld" cache chain via
        unhold_glock()


15. Protected Object

  void *gl_object - The object the glock is protecting


16. Transaction being built

/*  Modified under the glock  (i.e. gl_locked > 0)  */

  struct list_head gl_new_list - List of glocks in transaction being built
  struct list_head gl_new_bufs - List of buffers for this lock in transaction
                                 being built
  ogfs_trans_t     gl_trans     - The transaction being built


17. In-core Transaction

/*  Modified under the log lock  */

  struct list_head gl_incore_list - List of glocks in incore transaction
  struct list_head gl_incore_bufs - List of buffers for this lock in
                                    incore transaction
  ogfs_trans_t     gl_incore_tr   - The incore transaction


18. Dependent G-Locks

  atomic_t         gl_num_dep - The number of glocks that need to be synced
                                 before this one can be released
  struct list_head gl_depend  - The list of glocks that need to be synced
                                 before this one can be released

  OGFS uses this to make sure that all inodes (with all associated data pages
  and buffers) in a resource group are flushed to disk before the resource group
  can be released.  These fields are set by ogfs_add_gl_dependency(), which is
  called only from blklist.c functions:

  ogfs_blkfree()  -- free a piece of data
  ogfs_metafree() -- free a piece of metadata
  ogfs_difree()   -- free a dinode
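
  For illustration (the variable names and argument order are my guesses):

      /* ensure the inode's glock is synced before the rgrp's glock
         can be released to another node */
      ogfs_add_gl_dependency(rgd_gl, ip_gl);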


19. Architecture-specific (i.e. kernel 2.4 vs. user-space) data

  ogfs_glock_arch_t gl_arch - Architecture-specific data (struct shown below)

Kernel-space (src/fs/arch_linux_2_4) uses this for a list of filesystem buffers
associated with the glock, for the purpose of interacting with the kernel
buffer cache.  The list contains entries of type ogfs_bufdata_t, which is a
private data structure that filesystem code attaches to Linux kernel buffer
heads.

struct ogfs_glock_arch {
	struct list_head gl_bufs;	/* Buffer list for caching */
};
typedef struct ogfs_glock_arch ogfs_glock_arch_t;

User-space (src/fs/arch_user) defines ogfs_glock_arch as empty.



Expiring Locks and the Recovery Daemon
--------------------------------------

The lock module is responsible for detecting dead (expired) nodes.  The memexp
protocol does this with a heartbeat counter for each client node (see
ogfs-memexp for more info).  Note that there is no timeout on individual locks,
and no time restriction for how quickly a filesystem operation must complete.

Once a node is detected as "expired", each of the locks that it held in
shared (read) mode is freed, and each of the locks that it held in exclusive
(write) mode is marked as "expired".  This is done by another node (the
"cleaner" node, assigned by the lock module).  After freeing/marking,
recovery of the dead node's journal may be performed.

The ogfs_glock_cb() function provides an interface for the locking module
to inform the G-Lock layer, via the LM_CB_EXPIRED callback command, to replay
a dead node's journal.  When the callback occurs, ogfs_glock_cb() sets a bit
in sdp->sd_dirty_j, a bitmap that indicates which journal needs recovery, and
then wakes up the process of the journal recovery daemon, ogfs_recoverd().

The recovery daemon normally runs every 60 seconds, and normally finds, when
checking sdp->sd_dirty_j, that no journals need to be replayed.  The callback
is the only place where code sets a bit in sdp->sd_dirty_j, thus the callback
is the only method for triggering journal recovery of an expired node (is
there a need for the periodic daemon, then?).
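
In outline, the EXPIRED path through the callback reduces to something like
the sketch below (sd_dirty_j is treated here as a simple unsigned long
bitmap, and the recovery daemon's wait queue name is hypothetical):

  #include <asm/bitops.h>
  #include <linux/sched.h>

  /* Sketch of ogfs_glock_cb() handling of LM_CB_EXPIRED. */
  static void note_expired_node(ogfs_sbd_t *sdp, unsigned int jid)
  {
          set_bit(jid, &sdp->sd_dirty_j);   /* journal "jid" needs replay */
          wake_up(&sdp->sd_recoverd_wait);  /* kick ogfs_recoverd() early */
  }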

There is no need for more than one node to replay the dead node's journal.
The assignment to replay the journal (that is, the recipient(s) of the
LM_CB_EXPIRED callback) depends on the implementation of the locking backend.

When replaying a dead node's journal, the dead node's "expired" (i.e. exclusive
lock held by the dead node) journal lock is needed by the "cleaner" node to
write journal replay results to the filesystem.  The special flag LM_FLAG_NOEXP,
contained in a call to the backend's lock function, allows the backend to grant
the lock, even though the lock is "expired".  See comments in src/fs/glock.h.

LM_FLAG_NOEXP is also used during filesystem mount to obtain some special
locks that are absolutely needed at mount time, and which may be expired
due to the death of this or another node.

LM_FLAG_NOEXP is used to obtain the following locks:

during filesystem mount:
  OGFS_MOUNT_LOCK  -- exclusive lock owned only when mounting filesystem
                         (recovers lock from any node that died while mounting)
  OGFS_LIVE_LOCK   -- shared lock owned through lifetime of filesystem on this
                         node (recovers lock from any node that died while
                         mounting).
                         (since this lock is always shared, never exclusive,
                          would it ever be put in "expired" state?  might
                          depend on implementation of locking module?)
  journal lock     -- exclusive lock for *this* machine's journal, owned
                         through lifetime of filesystem (NOEXP needed only
                         if this node died and fs is being remounted).

during journal recovery for any node's journal, *this* or other:
  transaction lock -- exclusive lock when doing journal recovery,
                         keeps all other machines from writing to filesystem.
  journal lock     -- exclusive lock when doing journal recovery,
                         allows node to use and modify the journal.

Note that journal recovery is performed without regard to locks on any of the
recovered items.  ogfs_recover_journal() grabs only the journal and transaction
locks mentioned above, then calls replay_metadata(), which writes to the
filesystem without grabbing locks on anything it is writing.  This is why
it is important to stop all writes across the filesystem before doing a journal
replay.



G-Lock Interfaces
-----------------

The G-Lock layer defines a set of operations which an underlying locking
protocol must implement.  These were described in section 2, "Lock Harness
and Lock Modules".

The G-Lock layer also offers a set of services that can be used by the file
system, independent of the underlying architecture and mounted locking
protocol:

Basic lock functions:
  ogfs_get_glstruct  - locate a pre-existing glock struct in G-Lock cache,
                         *or* allocate a new one from kernel, init it,
                         link with parent glock (if parent is in call),
                         call lock module to allocate a per-lock private data
                         structure, attach private data to glock, and place
                         into "notheld" chain in glock cache.  Note that this
                         does not make this lock visible to other nodes, nor
                         does it fill in any current lock status.
  ogfs_put_glstruct  - decrement glock structure reference count (gl_count)
                         Note that this does *not* de-allocate the structure,
                         even if count decrements to 0.  This is *not* the
                         opposite of ogfs_get_glstruct.  De-alloc relies on
                         ogfs_glockd() daemon, which runs once every 5 seconds,
                         or LM_CB_DROPLOCKS callback from lock module,
                         to perform garbage collection.
  ogfs_hold_glstruct - increment glock structure reference count (gl_count)
  ogfs_glock         - lock a lock
  ogfs_gunlock       - unlock a lock

"Wrappers" for basic lock functions.  All except ogfs_glock_num() require
that glock structure has already been allocated via ogfs_get_glstruct():
  ogfs_glock_i       - lock an inode
  ogfs_gunlock_i     - unlock an inode
  ogfs_glock_rg      - lock a resource group
  ogfs_gunlock_rg    - unlock a resource group

  ogfs_glock_num     - lock a lock, given its number
  ogfs_gunlock_num   - unlock a lock, given its number
  ogfs_glock_m       - lock multiple locks, given a list
  ogfs_gunlock_m     - unlock multiple locks, given a list

LVB functions:
  ogfs_hold_lvb      - attach lock value block (LVB) to a glock
  ogfs_unhold_lvb    - detach lock value block (LVB) from a glock
  ogfs_sync_lvb      - sync a LVB (to lock storage, visible to other nodes)

Lock Dependency functions:
  ogfs_add_gl_dependency - make release ordering dep. between two glocks
  sync_dependencies      - sync out dependent locks (to lock storage? fs?)

Callback:
  ogfs_glock_cb          - callback used by lock modules

For prototypes and flags see "src/fs/glock.h".
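
Putting these together, a typical lifecycle as seen from filesystem code
looks roughly like the sketch below (argument lists are simplified, and the
glops symbol name is an assumption; see src/fs/glock.h for real prototypes):

  ogfs_glock_t *gl;

  /* find or allocate the glock structure for this lock number/type */
  gl = ogfs_get_glstruct(sdp, blkno, &ogfs_inode_glops, CREATE);
  if (gl == NULL)
          return -ENOMEM;

  ogfs_glock(gl, GL_SHARED);           /* inter-node + intra-node lock */

  /* ... read the protected object ... */

  ogfs_gunlock(gl, 0);                 /* unlock; glock stays cached */
  ogfs_put_glstruct(gl);               /* drop reference; glockd reaps later */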


glops
-----

Each G-Lock has a vector of functions ("operations") attached to it via gl_ops.
These functions handle all the interesting behavior of the filesystem and
journal that must occur just after getting a lock or just before letting
one go, such as:

Just after getting a lock:
-- reading items from disk
-- reading LVB contents (rgrp usage statistics), sent from old lock owner

Just before giving up a lock:
-- flushing items to disk, so another computer can read them
-- invalidating local buffers, so we don't try to read them
       while another computer is modifying their contents
-- filling LVB contents (rgrp usage statistics), for new lock owner to use

The operations are architecture-dependent (arch_user vs. arch_linux_2_4),
and depend on the type of the protected resource (inode, resource group, etc.).

The operations are called only from within glock.c, and are all (except for one)
implemented in arch_*/glops.c.

The operations are defined in "struct ogfs_glock_operations".  A
short description is given here, see src/fs/incore.h for details:

operations called by G-Lock layer:
  go_sync    - synchronise/flush dirty data (protected by a lock) to disk
               meta, rgrp:   sync glock's incore committed transaction logs
                             sync all glock's protected dirty data bufs to disk
               inode:        same, plus sync protected dirty *pages* to disk
               other types:  no action

  go_acquire - create a lock
               rgrp:         copy rgrp usage data from LVB (loaded from lock
                             storage, contains latest data from any node)
               other types:  no action

  go_release - release a glock to another node that needs it in exclusive mode.
               meta, inode:  sync glock's incore committed transactions
                             sync all glock's protected dirty data to disk
                             invalidate all glock's buffers/pages
               rgrp:         same, plus copy rgrp usage data to LVB (to
                                store and make visible to other nodes)
               other types:  no action

  go_lock    - if this is process' first (recursive) lock on this glock:
               inode:        read fresh copy of inode from disk
               rgrp:         read fresh copy of rgrp bitmap from disk
               other types:  no action

  go_unlock  - if this process is finished with this glock:
               inode:        copy OGFS dinode attributes to kernel's VFS inode
                                (so kernel can pass it to other process, and/or
                                write to disk)
               rgrp:         brelse() rgrp bitmap (so kernel can pass it to
                                another process, and/or write it to disk)
                             copy usage stats to LVB structure (so glock layer
                                can pass to other process or node)
               other types:  no action

  go_free    - free all buffers associated with a glock.
               used only when unmounting the lock protocol.
               meta, inode, rgrp:  free all buffers
               other types:  no action

data fields:
  go_type    - type of entity protected (e.g. inode, resource group, etc.)
  go_name    - human-readable string name for particular ogfs_glock_operations
               definition

Different implementations exist for different types of entities to be protected,
e.g. inode, resource group, etc.  Many of these types require only a few, or
none, of these operations, in which case the respective fields contain NULLs.
The go_type and go_name fields, however, are defined for each and every
ogfs_glock_operations implementation.
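
In outline, the vector looks like the sketch below; see src/fs/incore.h for
the authoritative definition (the argument lists here are my guesses):

struct ogfs_glock_operations {
	void (*go_sync)(ogfs_glock_t *gl, int flags);
	void (*go_acquire)(ogfs_glock_t *gl);
	void (*go_release)(ogfs_glock_t *gl);
	void (*go_lock)(ogfs_glock_t *gl, int flags);
	void (*go_unlock)(ogfs_glock_t *gl, int flags);
	void (*go_free)(ogfs_glock_t *gl);
	int   go_type;		/* e.g. LM_TYPE_INODE, LM_TYPE_RGRP */
	char *go_name;		/* human-readable name of this vector */
};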

When requesting a lock structure, filesystem code selects the
ogfs_glock_operations implementation to be attached to the lock, via the
parameter "glops" in the call to ogfs_get_glstruct().

See Appendix H for an inventory and analysis of glops calls.


4. Using G-Locks in OpenGFS
---------------------------

The file system code uses G-Locks to protect the various structures on the disk
from concurrent access by multiple nodes.  The protected ondisk objects can be
dinodes (including the resource group and journal index dinodes), resource
group headers, the superblock, buffer heads (blocks), and journals.  In
addition, glocks are used for non-disk locks for cluster coordination.

The following information describes the locking strategies for various types
of locks and protected entities.

Special Non-Disk Locks
----------------------

  In addition to the block-based number assignments, OpenGFS uses some
  special, non-disk lock numbers.  They are defined in src/fs/ogfs_ondisk.h
  (even though they don't show up on disk), and are all of LM_TYPE_NONDISK:

    OGFS_SB_LOCK      (0) -- protects superblock read accesses from fs upgrades
                             that would re-write the superblock.

    OGFS_MOUNT_LOCK   (0) -- allows only one node to be mounting the filesystem
                             at any time.  Locked in exclusive mode, with
                             nondisk glops, when mounting.  Unlocked when mount
                             is complete, allowing another node to go ahead
                             and mount.

    OGFS_LIVE_LOCK    (1) -- protects??  Locked in shared mode, with nondisk
                             glops, when mounting.  Unlocked when unmounting.
                             Indicates that at least one node has the
                             filesystem mounted.

    OGFS_TRANS_LOCK   (2) -- protects journal recovery operations from new
                             transactions.  Used in shared mode by transactions,
                             so many transactions may be created simultaneously.

                             Used in exclusive mode by ogfs_recover_journal(),
                             to force other nodes and processes to finish
                             current transactions before journal recovery
                             begins, and keep them from starting new
                             transactions until the recovery is complete.
                             This allows the recovery process to have exclusive
                             write access to the entire filesystem.  Note that
                             the recovery process does *not* bother to grab
                             locks for protected entities (inodes, etc.) that
                             it writes.

                             Always uses trans glops, and is the only lock
                             to do so.

                             The glock structure for this lock is allocated
                             during the filesystem mount, and stays attached
                             to the incore superblock structure as
                             sdp->sd_trans_gl.

    OGFS_RENAME_LOCK  (3) -- protects file/directory renaming/moving operations
                             from clobbering one another.  Always used in
                             exclusive mode.

                             The glock structure for this lock is allocated
                             during the filesystem mount, and stays attached
                             to the incore superblock structure as
                             sdp->sd_rename_gl.

Unique On-Disk Locks
--------------------

  Resource Group Index -- Protects filesystem reads of the resource group index
                          from filesystem expansion (or shrinking) operations.
                          Filesystem expansion cannot proceed until all nodes
                          unlock this lock, therefore all locks must be
                          temporary.

                          All filesystem accesses to the rindex, during the
                          normal course of filesystem operations, are read
                          accesses, protected in shared mode.  The lock is
                          on the resource index's dinode (LM_TYPE_INODE),
                          identified by its filesystem block number.  rindex
                          locking and unlocking is done by:

                          ogfs_get_riinode() -- initial read-in of rindex'
                               dinode (but not the rindex file itself),
                               during the filesystem mount sequence
                               (see _ogfs_read_super() in
                               src/fs/arch_linux_2_4/super_linux.c).  Attaches
                               OGFS dinode structure to superblock structure as
                               sdp->sd_riinode, and the inode's glock structure
                               is attached to that structure.

                               Unlocks lock before leaving function, but sets
                               GLF_STICKY bit so it will stay in glock cache.
                               This and ogfs_get_jiinode() are the only two
                               functions that set the GLF_STICKY bit.

                          ogfs_rindex_hold() -- makes sure we have latest
                               rindex file contents in-core.  Does *not*
                               unlock the lock unless error.  Called from
                               many places in code that need to access
                               resource groups.  THIS FUNCTION DETECTS
                               FILESYSTEM EXPANSION (or shrinkage).  See below.
                                
                          ogfs_rindex_release() -- unlocks the lock acquired by
                               ogfs_rindex_hold().  An ogfs_rindex_hold() /
                               ogfs_rindex_release() pair (often/always?)
                               surrounds a transaction.

                          If a user invokes the user-space
                          ogfs_expand utility (see man page for ogfs_expand,
                          and source in src/tools/ogfs_expand/main.c),
                          it writes new resource group headers out to the
                          new space on disk.  These are done outside of the
                          space that the filesystem knows about (yet), are
                          written using lseek() and write() calls to the
                          raw filesystem device, and require no locks.

                          Once done with resource groups, it writes a new
                          rindex, appending the descriptions of the new
                          resource groups to the current rindex file.
                          This is, of course, written to the filesystem proper
                          (i.e. not to the raw device, but rather to a file),
                          using an ioctl OGFS_JWRITE.  This ioctl (see
                          ogfs_jwrite_ioctl() in src/fs/ioctl.c) grabs an
                          exclusive lock on the rindex inode (using its
                          block # as the lock #), and also creates a journal
                          transaction around the write.  The exclusive lock
                          keeps the rindex write from proceeding until all
                          nodes have completed accessing resource groups.

                          Finally, the ioctl increments the version number of
                          the inode's glock, gl->gl_vn++.  This is what tells
                          ogfs_rindex_hold that the rindex has changed.  If
                          so, ogfs_rindex_hold reads the new rindex from disk.
                          

  Journal Index --        Protects filesystem reads of journal index from
                          journal addition (or removal) operations. Journal
                          addition/removal cannot proceed until all nodes
                          unlock this lock, therefore all locks must be
                          temporary.

                          Journal protection works just the same way as
                          resource index protection, with some name changes:

                          ogfs_get_jiinode() -- initial read-in of jindex inode
                          ogfs_jindex_hold() -- lock jindex, get latest data,
                               THIS FUNCTION DETECTS JOURNAL ADDITION/REMOVAL!
                          ogfs_jindex_release() -- unlock jindex lock

                          The user space utility that adds journals is
                          ogfs_jadd.  See man page for ogfs_jadd,
                          and source in src/tools/ogfs_jadd/main.c.
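
For illustration, the hold/release pattern around a transaction looks roughly
like this (a sketch only; the real prototypes may take different arguments):

  error = ogfs_rindex_hold(sdp);         /* shared lock on rindex dinode */
  if (error)
          return error;

  /* ... start a transaction, allocate/free blocks in resource groups ... */

  ogfs_rindex_release(sdp);              /* unlock the rindex dinode */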
                          

Locking Dinodes
---------------

Dinodes are locked in shared mode for read access, or in exclusive mode for
write access.  Since dinodes are involved in almost all file system operations,
they are locked quite often in either mode.

Locking Resource Groups
-----------------------

Files and dinodes are stored in "resource groups" on disk (see OGFS "Filesystem
On-Disk Layout").  A resource group is a large set of contiguous blocks that
are managed together.  The filesystem is divided into a number of equal sized
(except, perhaps, the first) resource groups.  Each resource group on disk has
a header that contains information about used and free blocks within the
resource group.

A file may be spread over a number of resource groups.  When a file or dinode
is manipulated, all resource groups that contain the file's data blocks or
meta data must be locked.

Since resource groups are spread evenly over the disks, reading them into
core memory each time they are accessed would incur a horrible performance
penalty.  However, only a few fields stored in a resource
group header ever change, namely the usage statistics.

To avoid re-reading this information from disk once the file
system is mounted, the statistics are stored in the lock value block (LVB) of
the ogfs_glock_t structure that protects the resource group.  When a node
modifies the resource group, it writes the new statistics to disk *and* into
the LVB of the G-Lock.  The next node that acquires the lock can read this
information from the G-Lock instead of reading the disk block.

Locking the Superblock
----------------------

The superblock is read and written only when the file system is mounted or
unmounted.  It is locked only at these times.

Locking Journals
----------------

The journals usually need not be protected because they are used by only one
node each.

!!!

Locking Buffer Heads
--------------------

!!!


Appendices
----------

Appendix A. G-Lock Call Flags

Filesystem calls to ogfs_glock(), and its various wrappers (e.g. ogfs_glock_i()),
may use the following flags.  If neither the GL_SHARED nor the GL_DEFERRED flag
is used, the request is for an exclusive lock:

  GL_SHARED   - the lock may be shared between processes / nodes
  GL_DEFERRED - special lock mode, different from SHARED or EXCLUSIVE, but not
		currently used by OpenGFS (so we don't know what it means!).
  GL_PERM     - lock will be held for a long time, and will reside in PERM cache
  GL_DISOWN   - disallow recursive locking, allow other process to unlock
  GL_SKIP     - skip "go_lock()" and "go_unlock()".  In particular, used for
		grabbing locks so LVBs are accessible, while skipping any
		disk reads or flushes/writes that might otherwise occur.
		Currently used only for resource group locks when doing
		statfs() whole-filesystem block usage statistics gathering
		operations, skipping time-consuming reads/writes of
		rgrp header and block usage bitmaps.

Filesystem calls to ogfs_gunlock(), and its various wrappers
(e.g. ogfs_gunlock_i()), may use the following flags:

  GL_SYNC     - all data and metadata protected by this lock shall be
                synced to disk before the lock is released
  GL_NOCACHE  - lock shall be dropped (*not* cached by the lock layer) after
                it has been unlocked
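
For illustration, a write path might combine these flags as follows
(argument lists simplified; a sketch, not actual OpenGFS source):

  ogfs_glock_i(ip, 0);                       /* exclusive (write) lock */
  /* ... modify the dinode ... */
  ogfs_gunlock_i(ip, GL_SYNC | GL_NOCACHE);  /* flush to disk, don't cache */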



Appendix B. G-Lock States

A G-Lock can be in any of the following states:

  LM_ST_UNLOCKED  - Unlocked
  LM_ST_EXCLUSIVE - Exclusive lock
  LM_ST_SHARED    - Shared lock
  LM_ST_DEFERRED  - Another shared lock mode which does not share with
                    the LM_ST_SHARED mode.


Appendix C. Callback Types

A lock module on a given node can use the "ogfs_glock_cb()" interface of
the node's G-Lock layer, to notify G-Lock about various situations.  In
most cases, these messages originate from other nodes.  The following types
of messages are specified so far:

  LM_CB_EXPIRED   - another node has expired (died), recover its journal
  LM_CB_NEED_E    - another node needs an exclusive lock
  LM_CB_NEED_D    - another node needs a deferred lock
  LM_CB_NEED_S    - another node needs a shared lock
  LM_CB_DROPLOCKS - drop unused cached glocks (used when lock storage is
                      getting full)

The "EXPIRED" callback is discussed in section "Expiring Locks and the
Recovery Daemon".

The "NEED" callbacks are discussed in section "Caching G-Locks, Callbacks".

The "DROPLOCKS" callback is discussed in section "G-Lock Cache, G-Lock Daemon".


Appendix D. Example "touch foo" (dv and bc)

The following sequence of calls to ogfs_glock() has been observed after creating
a new OpenGFS filesystem (i.e. running mkfs.ogfs), mounting it at /ogfs, then
running the following command:

$ touch /ogfs/foo

ogfs_glock gets called for the following lock numbers (in the order listed):

  20 20 19 17 21 21 2 21 20 21 2

Block 17 is the header of resource group 0 (block bitmap)
Block 19 is the resource group index inode.
Block 20 is the root directory inode.
Block 21 is the block with the new inode for foo.
Lock # 2 is the transaction lock (OGFS_TRANS_LOCK).


Appendix E. Flaws in the current design and implementation of locking
   and related code (dv)

* The locking code takes care of way too many things:
  - Inter-node locks
  - Inter-process locks
  - Caching locks
  - Deadlock detection
  - Watching node heartbeat
  - Stomithing dead nodes
  - Calling the journal recovery code

* All layers involved in the locking system are both active and passive
  (each calls, and is called by, its adjacent layers).

* Deadlock detection places the restriction that single read() or write()
  operations (or any other that uses locks) must be completed before the
  lock expires.  That limits the possible size of atomic transfers drastically
  and can cause problems on systems with poor response times.

* Resource group locking has deadlocks with transactions that span multiple
  resource groups.

* Due to potential deadlocks, any write() operation to a file must be served
  by a single resource group.  This further limits the flexibility of the file
  system and, I think, violates (which?) specs.  For example, on a 100 MB OGFS
  with ten resource groups, the largest possible chunk that can be written with
  a single write() call is 10 MB, or less than that if none of the resource
  groups is empty.

* Inter-node locks are needed too often.  A simple "touch /foo" on a pristine
  ogfs file system needs no fewer than eleven lock calls.

* The default block allocation policy is that each node uses a single resource
  group, selected at mount time based on the node's unique journal/cluster ID,
  so each node uses a different resource group.  This policy crams all
  directories and inodes (created by a single cluster node) into a
  single resource group, causing a catastrophic increase in lock collisions.

  Other policies (random, and round-robin selection of resource group) are
  available, but ignore the layout of the data on disk, possibly replacing
  delays caused by locking with delays caused by disk seeks.

* There is no concept of shared ownership of inter-process locks.  Sharing
  such locks, instead of serializing them, would enhance read performance.


Appendix F. Analysis of potential resource group deadlocks (dv)

Fact 1:
  Resource groups need to be locked only to allocate or deallocate blocks.  It
  is not necessary to lock the rg just to modify an inode or data block.

Fact 2:
  There potentially are deadlocks if two or more resource groups are locked in
  random order.

Fact 3:
  When a new directory entry is created, the hash table of the directory might
  grow, requiring allocation of additional blocks.

Fact 4:
  When any data or meta data is allocated, the resource groups are locked
  exclusively, one by one, until one with enough space is found.  This can
  cause lots of inter-node locks when the file system becomes full.

Now let us see how many *resource groups* are locked by various operations.

a) Modifying data blocks or inodes
   No rg locks required.

b) Allocating an inode (mkdir(), create(), link(), symlink())
   Creates a new directory entry in the parent directory.  In current code,
   if the directory grows (and thus needs new meta data blocks), the whole
   directory hash table, plus any other new blocks are moved to/allocated in the
   same resource group as the new inode.  This localization, once accomplished,
   minimizes the number of rgs that must be locked when accessing directory
   entries ...

   ... It may not seem like such a big limitation, but the current code
   tries to reserve enough space in that rg for the worst case of directory
   growth (hash table is created and immediately explodes to maximum size).  In
   other words:  in order to create a new inode, the target resource group must
   have about 1 MB of free data plus meta data blocks.

c) Deallocating an inode
   Locks the inode's rg to update the block bitmap.  Since ogfs never frees the
   space that has become unused in directories, the dir's rg is *not* locked.

d) Allocating file data / write() / writepage()
   Only one rg is locked.  A single ogfs_write() call never writes to more than
   a single resource group.  This is an unacceptable limitation of the write()
   system call.

e) Truncating a file (spanning multiple rgs)
   May need many rg locks.  Sorts them before locking them.

f) Removing a file or directory / unlink()
   Is done in two steps:

   1).  The directory entry is removed (no rg locks required, see above).

   2).  The inodes scheduled for removal are listed in the log, and their
        blocks are freed only after the transaction has been completed
        (i.e. flushed to disk?).

   This second stage needs to truncate each file (to 0 size) and remove its
   inode, sorting the corresponding rgs before locking them.  (This description
   may be a bit inaccurate).

g) Renaming plain files / rename()
   Needs one rg lock (see (f)).

h) Renaming directories / rename()
   Needs one rg lock (see (f)). In addition, another lock serializes directory
   renaming operations.

i) statfs()
   Locks all resource groups, one at a time, in order, while accumulating
   statistics from each group.

j) mmap() shared writable
   Would need many rg locks which could be ordered.  Not implemented since it
   would lock large parts of the file system for possibly long times.

k) flock()
   Does not need any rg locks.  It also prevents non-locking file access by
   other processes and is thus not POSIX conformant.

Summary

In the current code, rg deadlocks are not possible, at least not with the
above operations.  But the price one pays is high:

 - write() never writes more data than fits into the rg with
   the most free space.
 - Inodes can be created only in resource groups that have at least
   1 MB of free space.
 - Once allocated, empty directory blocks are never freed.
 - A directory hash table is never shrunk.
 - Meta data blocks are never converted back to data blocks.
 - When a directory hash table grows it is copied to the same rg
   as the new inode en bloc.
 - When a new directory leaf is allocated, it is created in the
   same rg as the new inode.  This has the potential to scatter
   the directory leaves all over the file system.


Appendix G.  Inventory of calls to ogfs_get_glstruct()

glock.c:

  ogfs_glock_num() -- Find (alloc if needed) and lock a lock, given number/type
     Number:  From caller
     Type:    From caller
     Parent:  From caller
     CREATE:  Yes

     Calls:   ogfs_get_glstruct()
              ogfs_lock()

  ogfs_gunlock_num() -- Find and unlock a lock, given number/type
     Number:  From caller
     Type:    From caller
     Parent:  From caller
     CREATE:  No

     Calls:   ogfs_get_glstruct()
              ogfs_unlock()
              ogfs_put_glstruct()

inode.c:

  ogfs_lookupi() -- look up a filename in a directory, return its inode
     Number:  inum.no_formal_ino
     Type:    inode
     Parent:  No
     CREATE:  Yes


plock.c:

  find_unused_plockgroup() -- find a value to use for the plock group
     Number:  bid, composed of journal id, plock group id, and lock id
     Type:    plock
     Parent:  No
     CREATE:  Yes

  hold_plockgroup() --
     Number:  bid, composed of journal id, plock group id, and lock id
     Type:    plock
     Parent:  No
     CREATE:  Yes

  load_plock_head() --
     Number:  ip->i_num.no_formal_ino
     Type:    inode
     Parent:  No
     CREATE:  Yes

  load_plock_jid_head() --
     Number:  bid, composed of journal id, plock group id, and lock id
     Type:    plock
     Parent:  No
     CREATE:  Yes

  load_other_plock_elems() --
     Number:  bid, composed of journal id, plock group id, and lock id
     Type:    plock
     Parent:  No
     CREATE:  Yes

  add_plock_elem() -- adds a plock to a journal id's chain
     Number:  bid, composed of journal id, plock group id, and lock id
     Type:    plock
     Parent:  No
     CREATE:  Yes

recovery.c:

  ogfs_get_log_header() -- read the log header for a given journal segment
     Number:  jdesc->ji_addr, block # of journal index
     Type:    meta
     Parent:  No
     CREATE:  No

     Calls:   ogfs_dread() to read the header block of the journal segment
              ogfs_put_glstruct() as soon as read is done

  foreach_descriptor() -- go through the active part of the log
     Number:  jdesc->ji_addr, block # of journal index
     Type:    meta
     Parent:  No
     CREATE:  No

     Calls:   ogfs_dread() to read a block of the journal
              ogfs_put_glstruct() as soon as read is done

  do_replay_local() -- replay a metadata block (when in local fs mode)
     Number:  jdesc->ji_addr, block # of journal index
     Type:    meta
     Parent:  No
     CREATE:  No

     Calls:   ogfs_get_glstruct() to get ptr to glstruct
              ogfs_put_glstruct() as soon as ptr is obtained

     Comments from code:  The lock should (already) be held, so we don't need
     to hold a count on the structure.  We do need its pointer, though.

  do_replay_multi() -- replay a metadata block (when in multi-host mode)
     Number:  jdesc->ji_addr, block # of journal index
     Type:    meta
     Parent:  No
     CREATE:  No

     Calls:   ogfs_get_glstruct() to get ptr to glstruct
              ogfs_put_glstruct() as soon as ptr is obtained

     Comments from code:  The lock should (already) be held, so we don't need
     to hold a count on the structure.  We do need its pointer, though.

  replay_metadata() -- replay a metadata block (when in multi-host mode)
     Number:  jdesc->ji_addr, block # of journal index
     Type:    meta
     Parent:  No
     CREATE:  No

     Calls:   ogfs_get_glstruct() to get ptr to glstruct
              ogfs_put_glstruct() as soon as ptr is obtained

     Comments from code:  The lock should (already) be held, so we don't need
     to hold a count on the structure.  We do need its pointer, though.

  clean_journal() -- mark a dirty journal as being clean
     Number:  jdesc->ji_addr, block # of journal index
     Type:    meta
     Parent:  No
     CREATE:  No

     Calls:   lots of things
              ogfs_put_glstruct() at end of function

  collect_nopen() -- called by foreach_descriptor to get nopen counts
     Number:  jdesc->ji_addr, block # of journal index
     Type:    meta
     Parent:  No
     CREATE:  No

     Calls:   ogfs_get_glstruct() to get ptr to glstruct
              ogfs_put_glstruct() as soon as ptr is obtained

     Comments from code:  The lock should (already) be held, so we don't need
     to hold a count on the structure.  We do need its pointer, though.


arch_linux_2_4/super_linux.c:

  _ogfs_read_super -- the filesystem mount function
     Number:  OGFS_TRANS_LOCK, the cluster-wide transaction lock
     Type:    trans
     Parent:  No
     CREATE:  Yes

  _ogfs_read_super -- the filesystem mount function
     Number:  OGFS_RENAME_LOCK, the cluster-wide file rename/move lock
     Type:    nondisk
     Parent:  No
     CREATE:  Yes



Appendix H.  Inventory of calls to ogfs_glock(), either direct, or indirect via:

glock.h:
ogfs_glock_i()   -- lock glock for this inode (given ptr to inode struct)
ogfs_glock_rg()  -- lock glock for this resource group (given ptr to rg struct)

glock.c
ogfs_glock_num() -- lock glock for this lock # (often, lock # = block #)
ogfs_glock_m()   -- lock multiple glocks (given list of glock ptrs)

All except ogfs_glock_num() require that glock structure has already been
allocated via ogfs_get_glstruct().

Calls to ogfs_glock() directly (excluding those from ogfs_glock_*()):
--------------------------------------------------------------------

inode.c:

  inode_dealloc()

  ogfs_lookupi()
     grabs a lock on the found inode, in shared (GL_SHARED) mode.

     There are two cases here:

     1. Found inode's lock name (block #) is < directory's lock name (block #):
          Unlock directory's inode (give opportunity for someone else to
             change the directory's knowledge of the inode's block location?)
          Lock (1st) found inode
          Lock directory's inode in shared (GL_SHARED) mode.
          Do another ogfs_dir_search() for the inode (same name)
          Compare 2nd found inode number with 1st found inode number
             If same, we've found what we're looking for
             If different, restart the search

     2. Found inode's lock name (block #) is > directory's lock name (block #):
          Lock (1st) found inode ... this is the one we're looking for

     Case 1 seems to be a situation in which the inode is moving from block
     to block, and the code is looking for the directory to be stable as to
     the inode's final location (?).

     Once the inode is found, the function reads the inode structure into core,
     using ogfs_get_istruct().

     ogfs_lookupi() has an option to release the locks on the directory and
     the found inode at the end of the function.

     Comments from Dominik:  I think the dir lock is released and reacquired
     after the file lock to keep acquiring locks in ascending order (deadlock
     prevention).  However, I'm not sure that nothing bad can happen between
     releasing and reacquiring the lock.

     See also ogfs_lookupi() discussed under "Calls to ogfs_glock_i()"


plock.c:

  find_unused_plockgroup()

  load_plock_head()

  load_plock_jid_head()

  load_other_plock_elems()

  load_my_plock_elems()

  add_plock_elem()

recovery.c:

  ogfs_recover_journal() ... replay a journal to recover consistent state
     grabs the transaction lock (sdp->sd_trans_gl) in exclusive (0) mode,
     with flags:

     LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
                        i.e. being recovered from a dead node.  See "Expiring
                        Locks and the Recovery Daemon".

     Grabbing this lock in exclusive mode prevents other nodes and processes
     from creating new transactions while the journal recovery is proceeding.

     This is the only function in which the transaction lock is grabbed in
     *exclusive* mode.  The lock is unlocked by this function as soon as the
     journal replay is complete.

     This is a special (non-disk) lock, ID #2:
     From src/fs/ondisk.h:  #define OGFS_TRANS_LOCK  (2).
     src/fs/arch_linux_2_4, _ogfs_read_super() assigns:
       sdp->sd_trans_gl = ogfs_get_glstruct(sdp, OGFS_TRANS_LOCK, ...);

trans.c:

  ogfs_trans_begin() ... begin a new transaction
     grabs the transaction lock (sdp->sd_trans_gl) in shared (read) mode
     (GL_SHARED).

     Grabbing this lock in shared mode allows other nodes and processes to
     create transactions simultaneously, unless and until a journal recovery
     occurs.  See comments above.

     This is the only function in which the transaction lock is grabbed in
     *shared* mode.  The lock is normally unlocked by ogfs_trans_end(),
     but will be unlocked by ogfs_trans_begin() itself if a failure occurs.

arch_linux_2_4/inode_linux.c:

  ogfs_rename() ... rename a file
     grabs the rename lock (sdp->sd_rename_gl) in exclusive (0) mode.

     Grabbing this lock in exclusive mode prevents other nodes and processes
     from doing any renaming while this renaming is proceeding.

     This is the only function in which the rename lock is grabbed at all.
     The lock is released by this function, once the renaming is complete.

     From src/fs/ondisk.h:  #define OGFS_RENAME_LOCK  (3).
     src/fs/arch_linux_2_4, _ogfs_read_super() assigns:
       sdp->sd_rename_gl = ogfs_get_glstruct(sdp, OGFS_RENAME_LOCK, ...);

     This function creates a complete transaction!

arch_linux_2_4/super_linux.c:

  _ogfs_read_super() ... set up in-core superblock, mount the fs
     grabs this node's journal lock (sdp->sd_my_jnl_gl) in exclusive (0) mode,
     called with disown flag (GL_DISOWN).

     The lock is then immediately unlocked!

     The journal lock is a lock on the first block of this node's journal,
     created and grabbed earlier, in exclusive mode, within the same function,
     using ogfs_glock_num().

     Comments in src/fs/glock.h say that GL_DISOWN tells the lock layer
     to disallow recursive locking, and allow a different process to
     unlock the lock.  So, it seems, this negates the exclusivity of the lock
     grabbed earlier in the function, while still holding a lock!?!

     The earlier exclusive lock is unlocked by ogfs_put_super(), when
     unmounting the filesystem.


Calls to ogfs_glock_i()
--------------------------------------------------------------------

blklist.c:

  ogfs_rindex_hold() -- locks resource index, makes sure we have latest info
     Resource :  resource group index inode (sdp->sd_riinode) block #
     Mode:       shared (GL_SHARED)
     Type:       inode
     Flags:      --
     gunlock:    only in case of error doing ogfs_rgrp_update()
     put_glstruct:  No
     unlock:     elsewhere, by ogfs_rindex_release(), via ogfs_gunlock_i()
                    (no put_glstruct)

     Calls:      If resource group info is out-of-date (i.e. filesystem has
                 been expanded or shrunk), calls ogfs_rgrp_update() to read
                 new info from disk.

     Called fm:  ogfs_inplace_reserve(), blklist.c *
                 do_strip(), bmap.c
                 leaf_free(), dir.c
                 dinode_dealloc(), inode.c
                 ogfs_stat_rgrp_ioctl(), ioctl.c
                 ogfs_reclaim_one_ioctl(), ioctl.c
                 ogfs_reclaim_all_ioctl(), ioctl.c
                 ogfs_setup_dameth(), super.c
                 ogfs_stat_ogfs(), super.c

     * calls ogfs_rindex_release() *only* in case of error.  Normally relies
       on ogfs_inplace_release() to do the unlock.  All other functions unlock
       before exiting.  None does a put_glstruct().

     Comments from code:  we keep this lock for a long time
     compared with other locks, since it is shared and very, very rarely
     accessed in exclusive mode.

     Comments (bc):  What do they mean by "long time", or by "we"?
     It looks to me that the lock is not very long-lived.

     This function is the one that detects that a filesystem has grown or
     shrunk!  Filesystem size change requires the addition or removal of
     resource groups, which in turn requires a change to the resource index.
     This function compares a version number held in the filesystem superblock
     structure (sdp->sd_riinode_vn) with a version number associated with
     the glock (gl->gl_vn) to detect a change in the resource index.

     This is the same lock grabbed by ogfs_get_riinode().

bmap.c:

  ogfs_truncate() ... change file size
     grabs a lock on the file's inode in exclusive (0) mode.
     Comments:  file size can grow, shrink, or stay the same.

glock.c:

  ogfs_glock_m()
     wrapper ... for each lock in the list, calls ogfs_glock_i() if the lock is
     for an inode.  Mode is determined by flags contained in the list.

inode.c:

  ogfs_lookupi() ... look up a filename in a directory
     grabs a lock on the directory inode, in shared mode (GL_SHARED).

     In a special case (that the found inode is located in a lower block # than
     the searched directory's inode), this function gives up the directory
     lock, then re-acquires it to try the search again.  Does the location
     relationship indicate that something else is messing with the directory??

     See discussion of ogfs_lookupi() under "Calls to ogfs_glock directly".


  ogfs_create_i() ... find requested inode, or create a new one
     grabs a lock on the directory inode, in exclusive (0) mode.

     If requested inode matches a name search of the directory, this function
     releases this exclusive lock before calling ogfs_lookupi(), which places
     a shared lock on the same directory.

     Comment from Dominik:  It would be good to be able to downgrade the lock
     from exclusive to shared, without first needing to unlock it entirely
     (i.e. keep a lock locked while transitioning from exclusive to shared).
     This feature is not currently provided in OGFS locking.

  ogfs_update_atime() ... update inode's atime, if needed
     grabs a lock on the inode, in exclusive (0) mode, if it needs to write
     an update to the inode.

     Called only from src/fs/arch_linux_2_4/file.c | inode_linux.c, mostly
     via OGFS_UPDATE_ATIME macro (conditional on whether fs was mounted
     noatime), but directly from ogfs_file_mmap() (also conditional, just
     doesn't use the macro).

     In all cases, the caller holds a shared lock on the inode, so
     ogfs_update_atime() must release that shared lock just before grabbing the
     exclusive lock, if it needs to write an update to the inode.

     ogfs_update_atime() returns with a lock held (either the original shared
     lock or the replacement exclusive lock).  So, the calling function is
     responsible for releasing the lock.  It doesn't matter if the lock is held
     as shared or exclusive at the time of release.

ioctl.c:

  ogfs_print_frag() ... print info about block locations for an inode
     grabs a lock on the inode, in shared (GL_SHARED) mode, to read it.

  ogfs_jread_ioctl() ... read from a journaled file, via ioctl
     grabs a lock on the file's inode, in shared (GL_SHARED) mode, to read
     the inode and file, using ogfs_readi().

  ogfs_jwrite_ioctl() ... write to a journaled file, via ioctl
     grabs a lock on the file's inode, in exclusive (0) mode, to write it.

     Lots of interesting stuff going on in this function, look again later!

super.c:

  ogfs_jindex_hold() ... grab a lock on the journal index (ondisk)
     grabs a lock on the ondisk journal index inode (sdp->sd_jiinode), in
     shared (GL_SHARED) mode.  Also uses LM_FLAG_TRY if caller does *not* want
     to wait for the lock if it is currently unavailable.

     Function compares version numbers of incore superblock's sdp->sd_jiinode_vn
     and inode's glock version # ip->i_gl->gl_vn.  If they are out of sync, then
     incore journal index is out-of-date relative to ondisk jindex(?).  To read
     new journal index into core, function calls ogfs_ji_update().

arch_linux_2_4/dcache.c:

  ogfs_drevalidate() ... validate lookup path from parent directory to inode
     grabs a lock on parent directory, in shared (GL_SHARED) mode.

     Function uses ogfs_dir_search(), and kernel's BKL (lock_kernel()).

arch_linux_2_4/file.c:

  ogfs_read() ... read bytes from a file
     grabs a lock on file's inode, in shared (GL_SHARED) mode.

     Function uses kernel's generic_file_read(), and BKL (lock_kernel()).

  ogfs_write() ... write bytes to a file
     grabs a lock on file's inode, in exclusive (0) mode.  Releases when done.

     Calls ogfs_inplace_reserve(), ogfs_inplace_release(), ogfs_trans_begin(),
     ogfs_trans_end(), ogfs_get_inode_buffer(), ogfs_trans_add_bh(),
     ogfs_dinode_out()

     Creates a complete transaction!

     Function uses kernel's generic_file_write_nolock(), brelse(),
     and BKL (lock_kernel()).

  ogfs_readdir() ... read directory entries from a directory
     grabs a lock on directory inode, in shared (GL_SHARED) mode.  Releases
     lock when done reading.

     Function uses ogfs_dir_read().

  ogfs_sync_file() ... sync file's dirty data to disk (across the cluster)
     grabs a lock on file's inode, in exclusive (0) mode.  Does *not*
     release the lock, unless there is an error.

     Interesting use of exclusive lock!  Function does no more than just grab
     the lock.  This forces any other cluster member (that might own the file
     lock) to flush data to disk.

     Function uses kernel's BKL (lock_kernel()).

     See also ogfs_irevalidate(), which uses a shared lock in a similar
     way for opposite reasons!

  ogfs_shared_nopage() ... support shared writeable mappings (mmap)
     grabs a lock on vm area's inode (area->vm_file->f_dentry->d_inode),
     in exclusive (0) mode.  Releases lock when done.

     Function uses kernel's filemap_nopage(), and BKL (lock_kernel()).

  ogfs_private_nopage() ... do safe locking on private mappings (mmap)
     grabs a lock on vm area's inode (area->vm_file->f_dentry->d_inode),
     in shared (GL_SHARED) mode.

     Function uses kernel's filemap_nopage(), and BKL (lock_kernel()).

  ogfs_file_mmap() ... memory map a file, no sharing
     grabs a lock on file's inode, (file->f_dentry->d_inode),
     in shared (GL_SHARED) mode.

     Shared lock is grabbed after mapping, before calling ogfs_update_atime()
     (see ogfs_update_atime(), above).

     Once the call returns to ogfs_file_mmap(), we release *a* lock ...
     it may be the shared one we originally grabbed, or the exclusive
     one that ogfs_update_atime() grabbed if it needed to write.


arch_linux_2_4/inode_linux.c:

  ogfs_set_attr() ... change attributes of an inode
     grabs a lock on the inode, in exclusive (0) mode, to write the inode.

     Creates an entire transaction (ogfs_trans_begin() to ogfs_trans_end())
     in a certain case.

  ogfs_irevalidate() ... check that inode hasn't changed (ondisk?)
     grabs a lock on the inode, in shared (GL_SHARED) mode, to read the inode.

     Interesting use of shared lock!  Function does no more than just grab
     the lock.  This forces the incore image of the inode to sync up with
     the disk, if it's not already in sync.

     Function uses kernel's BKL (lock_kernel()).

     See also ogfs_sync_file(), which uses an exclusive lock in a similar
     way for opposite reasons!

  ogfs_readlink() ... read the value of a symlink (and copy_to_user).
     grabs a lock on the link's inode, in shared (GL_SHARED) mode.

     Function calls OGFS_UPDATE_ATIME (see ogfs_update_atime(), above),
     and ogfs_get_inode_buffer().

     Function uses kernel's vfs_follow_link(), and BKL (lock_kernel()).

  ogfs_follow_link() ... follow a symbolic link (symlink)
     grabs a lock on the link's inode, in shared (GL_SHARED) mode.

     Function calls OGFS_UPDATE_ATIME (see ogfs_update_atime(), above),
     and ogfs_get_inode_buffer().

     Function uses kernel's BKL (lock_kernel()).

  ogfs_readpage() ... read a page of data for a file
     grabs a lock on the file's (?) inode (page->mapping->host), in shared
     (GL_SHARED) mode.

     Function calls stuffed_readpage() (src/fs/arch_linux_2_4/inode_linux.c).

     Function uses kernel's UnlockPage(), block_read_full_page(),
     and BKL (lock_kernel()).

  ogfs_writepage() ... write a complete page for a file
     grabs a lock on the file's (?) inode (page->mapping->host), in exclusive
     (0) mode.

     Calls ogfs_inplace_reserv(), ogfs_inplace_release(), ogfs_trans_begin(),
     ogfs_trans_end(), and either stuffed_writepage(), or
     block_write_full_page(), depending on whether file is "stuffed" in inode
     block.

     Creates a complete transaction!

     Function uses kernel's BKL (lock_kernel()).

  ogfs_bmap() ... block map
     grabs a lock on the file (?) (mapping->host), in shared (GL_SHARED) mode.

     Function uses kernel's generic_block_bmap(), and BKL (lock_kernel()).


Calls to ogfs_glock_rg()
--------------------------------------------------------------------

blklist.c:

  ogfs_rgrp_lvb_init() ... init the data of a resource group lock value block
     grabs 1 or 2 locks on the resource group:

     If !force, grab a lock in shared (GL_SHARED) mode, with GL_SKIP flag.
     GL_SKIP, used for both the lock and unlock phase, keeps the glops
     lock_rgrp() from reading or writing resource group header/bitmap data
     to or from disk ... all we need to get is the LVB data.

     For all cases (except error), grab an exclusive (0) lock.  Releases
     lock when done.

     Calls ogfs_rgrp_lvb_fill(), ogfs_rgrp_save_out(), ogfs_sync_lvb().
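
     A hedged sketch of the two-phase pattern (ogfs_gunlock_rg(), the
     flag encoding, and the rd_gl field are assumptions; GL_SHARED,
     GL_SKIP, and the other call names come from the text above):

        static int rgrp_lvb_init_sketch(ogfs_rgrpd_t *rgd, int force)
        {
                int error;

                if (!force) {
                        /* shared + GL_SKIP: skip rgrp header/bitmap
                           I/O in the glops; only the LVB is fetched */
                        error = ogfs_glock_rg(rgd, GL_SHARED | GL_SKIP);
                        if (error)
                                return error;
                        /* ... inspect the LVB, decide whether a
                           re-init is needed ... */
                        ogfs_gunlock_rg(rgd, GL_SKIP);
                }

                /* exclusive (0): rebuild the LVB from the ondisk rgrp */
                error = ogfs_glock_rg(rgd, 0);
                if (!error) {
                        ogfs_rgrp_lvb_fill(rgd);    /* stats -> LVB */
                        ogfs_sync_lvb(rgd->rd_gl);  /* LVB -> backend */
                        ogfs_gunlock_rg(rgd, 0);
                }
                return error;
        }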

  __get_best_rg_fit() ... find and lock the rg that best fits a reservation
     grabs locks on all(?) rgs, one at a time in ascending order, in
     exclusive (0) mode.  The loop accumulates locks, without releasing
     them unless/until a "best fit" rg is found, that is, an rg that can
     accommodate the complete reservation.

     Releases locks on all but the selected "best fit" rg.

     Called only by ogfs_inplace_reserve(), as the "plan C" last-resort
     method of reserving space.
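
     A hedged sketch of the accumulate-then-prune loop (the array form
     and the rg_fits() helper are hypothetical).  Locking the rgs in
     ascending order is what keeps two nodes that are both hunting for
     space from deadlocking against each other:

        static ogfs_rgrpd_t *best_fit_sketch(ogfs_rgrpd_t **rgs, int nrgs,
                                             unsigned int blks_needed)
        {
                ogfs_rgrpd_t *best = NULL;
                int i, locked = 0;

                for (i = 0; i < nrgs; i++) {          /* ascending order */
                        if (ogfs_glock_rg(rgs[i], 0)) /* exclusive */
                                break;
                        locked++;
                        if (rg_fits(rgs[i], blks_needed)) {
                                best = rgs[i];        /* full fit found */
                                break;
                        }
                        /* no fit yet: keep the lock and keep scanning */
                }
                for (i = 0; i < locked; i++)          /* prune the rest */
                        if (rgs[i] != best)
                                ogfs_gunlock_rg(rgs[i], 0);
                return best;
        }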

  ogfs_inplace_reserve() ... reserve space in the filesystem
     grabs locks on a series of rgs, in exclusive (0) mode.

     Calls ogfs_rgrpd_get_with_hint(), try_rgrp_fit(), __get_pure_metares(),
     __get_best_rg_fit().


bmap.c:

  do_strip() ... strip off a particular layer(?) of the file
     grabs locks on a list of rgs, in exclusive (0) mode.

     Calls ogfs_rindex_hold(), ogfs_rlist_add(), ogfs_rlist_sort(),
     ogfs_trans_begin(), ogfs_trans_end(), ogfs_get_inode_buffer(),
     ogfs_trans_add_bh(), ogfs_blkfree(), ogfs_metafree(), ogfs_dinode_out().

     Calls kernel's brelse().

     Creates a full transaction!

     Comments from Dominik:  I think "strips off a particular layer" refers
     to layers of indirect data blocks.  That is, when a file shrinks, the
     number of indirections may be reduced, too.  See "Filesystem On-Disk
     Layout" for info on indirect data blocks.

dir.c:

  leaf_free() ... deallocate a directory leaf
     grabs locks on a list of rgs, in exclusive (0) mode.

     Calls ogfs_rindex_hold(), ogfs_rlist_add(), ogfs_rlist_sort(),
     ogfs_trans_begin(), ogfs_trans_end(), ogfs_get_leaf(), ogfs_leaf_in(),
     ogfs_trans_add_bh(), ogfs_metafree(), ogfs_internal_write().

     Calls kernel's brelse().

     Creates a full transaction!


inode.c:

  dinode_dealloc() ... deallocate a dinode
     grabs a lock on the inode's resource group, in exclusive (0) mode.

     Calls ogfs_rindex_hold(),
     ogfs_trans_begin(), ogfs_trans_end(), ogfs_difree(),
     ogfs_trans_nopen_change(), ogfs_trans_add_gl().

     Creates a full transaction!


super.c:

  ogfs_stat_ogfs() ... do a statfs, adding up statistics from all rgrps.
     grabs a lock on each resource group in the filesystem, one by one,
     in shared (GL_SHARED) mode, and with GL_SKIP flag.  GL_SKIP skips any
     reads or writes of resource group data on disk ... all we need to use
     is the lock's LVB data.

     Releases each lock after adding (accumulating) stats for its rgrp.

     Calls ogfs_rindex_hold() and ogfs_rgrpd_get_with_hint().  If it thinks
     that a node crashed while writing the LVB, it also calls ogfs_lvb_init()
     to read rgrp statistics from disk and re-initialize the corrupt LVB.
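
     A hedged sketch of that loop (the list-traversal helpers and the
     rd_rg.rg_free field are hypothetical; GL_SHARED, GL_SKIP, and the
     LVB-only read come from the text above):

        static int stat_ogfs_sketch(ogfs_sbd_t *sdp, uint64_t *free_blks)
        {
                ogfs_rgrpd_t *rgd;
                int error;

                *free_blks = 0;
                for (rgd = first_rgrpd(sdp); rgd; rgd = next_rgrpd(rgd)) {
                        /* GL_SKIP: no bitmap I/O, just the LVB stats */
                        error = ogfs_glock_rg(rgd, GL_SHARED | GL_SKIP);
                        if (error)
                                return error;
                        *free_blks += rgd->rd_rg.rg_free;  /* from LVB */
                        ogfs_gunlock_rg(rgd, GL_SKIP);
                }
                return 0;
        }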



Calls to ogfs_glock_num()
ogfs_glock_num() embeds an ogfs_get_glstruct() within the call.
ogfs_gunlock_num() does the unlock, and embeds ogfs_put_glstruct() within the call.
--------------------------------------------------------------------

flock.c:

  ogfs_flock() -- acquire an flock on a file
     Resource :  ip->i_num.no_formal_ino
     Mode:       shared (GL_SHARED), if flock is shared (otherwise exclusive)
     Type:       flock
     Parent:     No
     Flags:      GL_PERM -- for all flocks
                 GL_DISOWN -- for all flocks
                 LM_FLAG_TRY -- if !wait
     gunlock_num:  only on error
     unlock:     in ogfs_funlock(), via ogfs_gunlock_num()

     See several interesting comments in this function.
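
     A hedged sketch of the flag assembly in the table above (the
     ogfs_glock_num() signature and the glops name are assumptions):

        static int flock_sketch(ogfs_inode_t *ip, int shared, int wait)
        {
                unsigned int flags = GL_PERM | GL_DISOWN;  /* always */

                if (shared)
                        flags |= GL_SHARED;    /* else exclusive (0) */
                if (!wait)
                        flags |= LM_FLAG_TRY;  /* fail instead of block */

                return ogfs_glock_num(ip->i_sbd, ip->i_num.no_formal_ino,
                                      &ogfs_flock_glops, flags);
        }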

inode.c:

  read_dinode() -- read an inode from disk into the incore OGFS inode cache
     Resource :  ip->i_num.no_formal_ino
     Mode:       shared (GL_SHARED)
     Type:       nopen
     Parent:     No
     Flags:      GL_PERM -- 
                 GL_DISOWN -- 
     gunlock_num:  see below
     put_glstruct:  Yes
     unlock:     in ??()

     Calls:  ogfs_copyin_dinode().

     In error situation, this function calls ogfs_gunlock() with GL_NOCACHE
     flag, since the lock will not be used in the future.  It then calls
     ogfs_put_glstruct() to decrement the usage count on the glock structure
     (that's *all* that ogfs_put_glstruct() does).

     In normal situation, this function just calls ogfs_put_glstruct(), to
     decrement the usage count.  It does not unlock the lock, since presumably
     something else wants to do something with the block, after it's been
     read in.

     Where/when would this lock be unlocked??
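
     A hedged sketch of the two exit paths (the function shape is an
     assumption; GL_NOCACHE and the call names come from the text above):

        static int read_dinode_tail_sketch(ogfs_glock_t *gl, int error)
        {
                if (error) {
                        /* lock won't be reused: drop it, don't cache it */
                        ogfs_gunlock(gl, GL_NOCACHE);
                        ogfs_put_glstruct(gl);  /* refcount only */
                        return error;
                }
                /* success: leave the lock held for whoever uses the
                   inode next; drop only our structure reference */
                ogfs_put_glstruct(gl);
                return 0;
        }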

  inode_dealloc() -- deallocate an inode
     Resource :  inum.no_formal_ino
     Mode:       exclusive
     Type:       inode
     Parent:     No
     Flags:      -- 
     gunlock:    Yes, before leaving function
     put_glstruct:  Yes, before leaving function

     Calls:
       ogfs_glock() to get nopen lock
       ogfs_get_istruct() to read inode (ip) from disk
       ogfs_gunlock() to unlock nopen lock
       ogfs_dir_exhash_free()  ???
       ogfs_shrink() to truncate the file to 0 size (deallocate data blocks)
       dinode_dealloc() to deallocate the inode block
       ogfs_put_istruct() to decrement usage count on ip istruct
       ogfs_destroy_istruct() to deallocate istruct from memory



  ogfs_dealloc_inodes() ... go through the list of inodes to be deallocated
     Resource :  inum.no_addr
     Mode:       exclusive
     Type:       nopen
     Parent:     No
     Flags:      LM_FLAG_TRY, see below 
     gunlock:    Yes, before leaving function
     put_glstruct:  Yes, before leaving function

     LM_FLAG_TRY -- if lock not immediately available, the function makes
       note of this as a "stuck" inode.  This keeps us from spinning if the
       list can't be totally purged.  (Why would an inode have a lock on it if
       it is de-allocatable?).

     Calls:

       ogfs_pitch_inodes() to throw away any inodes flagged to be discarded
       ogfs_nopen_find() to search sdp->sd_nopen_ic_list for a deallocatable
          inode.
       ogfs_gunlock() to unlock the inode (so following call can lock it
          in exclusive mode).
       inode_dealloc() to remove the inode and associated data
       ogfs_put_glstruct() to deallocate the lock structure

     Unlocks the lock and calls ogfs_put_glstruct() when done with each
     inode, as noted in Calls above.
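
     A hedged sketch of the per-inode body of the sweep (signatures and
     the inum type name are assumptions; LM_FLAG_TRY and the call names
     come from the notes above):

        /* caller iterates sdp->sd_nopen_ic_list and tallies "stuck" */
        static void dealloc_one_sketch(ogfs_sbd_t *sdp, ogfs_inum_t *inum,
                                       int *stuck)
        {
                /* exclusive (0) + TRY: if the lock is busy somewhere,
                   record a "stuck" inode instead of spinning on it */
                if (ogfs_glock_num(sdp, inum->no_addr,
                                   &ogfs_nopen_glops, LM_FLAG_TRY)) {
                        (*stuck)++;
                        return;
                }
                inode_dealloc(sdp, inum);      /* data + dinode */
                ogfs_gunlock_num(sdp, inum->no_addr);
        }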

  ogfs_createi() ... create a new inode
     grabs a lock on the inode, in two ways, one right after the other:

     Resource :  inum.no_formal_ino
     Mode:       exclusive
     Type:       inode
     Parent:     No
     Flags:      --
     gunlock:    No (where unlocked?), except after error
     put_glstruct:  Yes(!), before leaving function

     Resource :  inum.no_addr
     Mode:       shared (GL_SHARED)
     Type:       nopen
     Parent:     No
     Flags:      --
     gunlock:    Yes, before leaving function
     put_glstruct:  Yes, before leaving function

     Note:  Current implementation sets inum.no_formal_ino = inum.no_addr
     (see fs/inode.c pick_formal_ino()).  These two locks are differentiated
     only by their glops/type, since the lock number is the same!

     This function creates a complete transaction!
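
     A hedged sketch of the back-to-back grabs (signatures assumed); note
     that both calls pass the same number, so only the glops argument
     tells the two glocks apart:

        static int createi_locks_sketch(ogfs_sbd_t *sdp, ogfs_inum_t inum)
        {
                int error;

                /* exclusive (0) inode lock on the new inode number */
                error = ogfs_glock_num(sdp, inum.no_formal_ino,
                                       &ogfs_inode_glops, 0);
                if (error)
                        return error;
                /* shared nopen lock: the *same* number (no_addr ==
                   no_formal_ino today), told apart only by glops */
                return ogfs_glock_num(sdp, inum.no_addr,
                                      &ogfs_nopen_glops, GL_SHARED);
        }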


ioctl.c:

  ogfs_get_super() ... dump disk superblock into user-space buffer
     Resource :  OGFS_SB_LOCK (lock "0"), the cluster-wide superblock lock
     Mode:       shared (GL_SHARED)
     Type:       meta
     Parent:     No
     Flags:      --
     gunlock:    Yes, before leaving function
     put_glstruct:  Yes, before leaving function

     Calls:  ogfs_dread() to read the block from disk
             copy_to_user() (kernel) to copy into user-space buffer.

     Normally, you would think that the superblock should be static, so why
     lock it for a read?  To protect it against filesystem upgrades, as rare
     as they may be!  The only other place SB_LOCK is grabbed is in
     _ogfs_read_super(), when mounting.  See below.


recovery.c:

  ogfs_recover_journal() ... do a replay on a given journal
     Resource :  sdp->sd_jindex[jid]->ji_addr, requested journal's first block
     Mode:       exclusive
     Type:       meta
     Parent:     No
     Flags:      LM_FLAG_NOEXP, always
                 LM_FLAG_TRY, see below
     gunlock:    Yes, before leaving function
     put_glstruct:  Yes, before leaving function


     LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
                        i.e. being recovered from a dead node.  See "Expiring
                        Locks and the Recovery Daemon".

     LM_FLAG_TRY -- Conditionally used.  If this is *not* the first node to
        mount into the cluster, don't block when waiting for the lock.
        Instead, if the lock is not immediately available, print
        "OGFS:  Busy" to the console, *don't* replay the journal, and exit
        with a successful return code.

        This looks at a boolean member of the lock module structure,
        (sdp->sd_lockstruct.ls_first).  When a computer mounts a lock module,
        the module sets this value to TRUE to indicate that the computer is
        the first one in the cluster to mount the module.  The memexp protocol
        is accurate in this (it can check with the memexp lock server), but the
        nolock protocol unconditionally sets this value to TRUE (it has no
        server to check).  The stats protocol passes along the value set
        by the protocol which *it* mounted (stats is a stacking protocol).

        When mounting the filesystem, _ogfs_read_super() will replay *all*
        of the filesystem's journals if ls_first is TRUE, calling
        ogfs_recover_journal() once for each journal.  In this case,
        we must block when waiting for each journal lock (we *must* replay
        each journal before proceeding).

        Once all journals have been replayed, _ogfs_read_super() calls
        ogfs_others_may_mount() (allowing other nodes that are blocked
        within the protocol mount() call to proceed), and sets ls_first
        to FALSE.

        If ls_first is FALSE, _ogfs_read_super() will replay only its own
        journal.  In this case, we grab the lock with LM_FLAG_TRY.
        If we fail to get the lock, it just means some other computer is
        currently replaying the journal; there's no need for us to replay it,
        so we return with "success"!

        Note:  The lock could also be held if a computer is doing a filesystem
        upgrade, but my guess is that the sequence of events would make it
        impossible for an upgrade to happen at the same time that we're
        mounting the filesystem???
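
     A hedged sketch of the flag selection (the helper is hypothetical;
     LM_FLAG_NOEXP, LM_FLAG_TRY, and ls_first come from the text above):

        static unsigned int recover_flags_sketch(ogfs_sbd_t *sdp)
        {
                /* always willing to grab expired (dead-node) locks */
                unsigned int flags = LM_FLAG_NOEXP;

                /* not the first mounter: another node may already be
                   replaying this journal, so don't block -- a failed
                   TRY is treated as success */
                if (!sdp->sd_lockstruct.ls_first)
                        flags |= LM_FLAG_TRY;
                return flags;
        }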


super.c:

  ogfs_do_upgrade() -- upgrade a filesystem to a newer version
     Resource :  sdp->sd_jindex[jid]->ji_addr, each journal's first block
     Mode:       exclusive
     Type:       meta
     Parent:     No
     Flags:      LM_FLAG_TRY, see below
     gunlock:    Yes, after ogfs_find_jhead() for each journal
     put_glstruct:  Yes, after ogfs_find_jhead() for each journal

     This function checks, before upgrading a filesystem, to make sure that
     each and every journal in the filesystem is unmounted.  So, for each
     journal, it grabs a lock, calls ogfs_find_jhead(), and checks for the
     OGFS_LOG_HEAD_UNMOUNT flag.  This flag is present in a "shutdown"
     journal header, and indicates that the journal has been unmounted.
     (Does it mean that the journal is empty?).

     If it does not find the UNMOUNT flag in the current journal head, or
     if it can't immediately acquire the journal lock, the function stops
     and reports an error -EBUSY.


  ogfs_get_riinode() -- reads resource index inode from disk, inits incore image
     Resource :  resource index inode's block #
                    (sdp->sd_sb.sb_rindex_di.no_formal_ino)
     Mode:       shared (GL_SHARED)
     Type:       inode
     Parent:     No
     Flags:      GLF_STICKY -- applied by set_bit()
     gunlock:    Yes, before leaving function
     put_glstruct:  Yes, before leaving function

     Calls:      ogfs_get_istruct() to read inode from disk.

     Called fm:  filesystem mount function, _ogfs_read_super().

     Sets sdp->sd_riinode_vn = gl->gl_vn - 1.  Is this to force
     ogfs_rindex_hold() to read new resource index from disk?

     This is the same lock grabbed by ogfs_rindex_hold().


  ogfs_get_jiinode() -- reads journal index inode from disk, inits incore image
     Resource :  jindex inode (sdp->sd_sb.sb_jindex_di.no_formal_ino)
     Mode:       shared (GL_SHARED)
     Type:       inode
     Parent:     No
     Flags:      GLF_STICKY -- applied by set_bit()
     gunlock:    Yes, before leaving function
     put_glstruct:  Yes, before leaving function

     Calls:      ogfs_get_istruct() to read inode from disk.

     Called fm:  filesystem mount function, _ogfs_read_super().

     Sets sdp->sd_jiinode_vn = gl->gl_vn - 1.  Is this to force
     ogfs_jindex_hold() to read new journal index from disk?



arch_linux_2_4/file.c:

  ogfs_open_by_number() ... open a file by inode number
     Resource :  inum.no_formal_ino
     Mode:       shared (GL_SHARED)
     Type:       inode
     Parent:     No
     Flags:      --
     gunlock:    Yes, after ogfs_get_istruct()
     put_glstruct:  Yes, after ogfs_get_istruct()

     Calls:  ogfs_get_istruct() to read inode from disk.


arch_linux_2_4/super_linux.c:

  _ogfs_read_super() ... mount the filesystem
  1) grabs a lock on OGFS_MOUNT_LOCK (non-disk lock # 0), in exclusive (0)
     mode, using ogfs_nondisk_glops, with flags:

     GL_PERM --

     LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
                        i.e. being recovered from a dead node.  See "Expiring
                        Locks and the Recovery Daemon".


  2) grabs a lock on OGFS_LIVE_LOCK (non-disk lock # 1), in shared (GL_SHARED)
     mode, using ogfs_nondisk_glops, with flags:

     GL_PERM --

     GL_DISOWN --

     LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
                        i.e. being recovered from a dead node.  See "Expiring
                        Locks and the Recovery Daemon".

  3) grabs a lock on OGFS_SB_LOCK (meta-data lock # 0), in shared (GL_SHARED)
     mode, using ogfs_meta_glops.

     Uses exclusive mode if the mount argument calls for a filesystem upgrade.

  4) grabs a lock on the machine's journal (sdp->sd_my_jdesc.ji_addr),
     in exclusive (0) mode, using ogfs_meta_glops, with flags:

     GL_PERM --

     LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
                        i.e. being recovered from a dead node.  See "Expiring
                        Locks and the Recovery Daemon".


  5) grabs a lock on the root inode (sdp->sd_sb.sb_root_di.no_formal_ino),
     in shared (GL_SHARED) mode, using ogfs_inode_glops.


  ogfs_iget_for_nfs() ... get an inode based on its number
     grabs a lock on the inode (inum->no_formal_ino), in shared (GL_SHARED)
     mode, using ogfs_inode_glops.


Calls to ogfs_glock_m()
--------------------------------------------------------------------

arch_linux_2_4/inode_linux.c:

  ogfs_link() ... link to a file
     grabs 2 locks, both in exclusive (0) mode, using ogfs_inode_ops, on:

     1) inode of directory containing new link

     2) inode being linked

     The function has a local variable array of 2 ogfs_lockop_t structures
     that it zeroes, then fills the lo_ip fields with inode pointers for the
     two lock targets.  Exclusive mode is set by the zeroes, and the
     ogfs_inode_ops are selected by the fact that the lo_ip fields are used.
     See fs/glock.c ogfs_glock_m().
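
     A hedged sketch of that array setup (only ogfs_lockop_t, the lo_ip
     field, and ogfs_glock_m() come from the text; the exact signature is
     assumed):

        static int link_locks_sketch(ogfs_inode_t *dip, ogfs_inode_t *ip)
        {
                ogfs_lockop_t lops[2];

                memset(lops, 0, sizeof(lops)); /* zeroes => exclusive,
                                                  and ogfs_inode_ops is
                                                  chosen because lo_ip
                                                  is filled in */
                lops[0].lo_ip = dip;   /* directory gaining the entry */
                lops[1].lo_ip = ip;    /* inode being linked */
                return ogfs_glock_m(2, lops);
        }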

  ogfs_unlink() ... unlink a file
     See ogfs_link(), above.  The same locks are grabbed the same way.

  ogfs_rmdir() ... remove a directory
     See ogfs_link(), above.  Locks for directory and its parent are grabbed
     the same way.

  ogfs_rename() ... rename/move a file
     grabs up to 4 locks, all in exclusive (0) mode, using ogfs_inode_ops, on:

     1) inode of old parent directory

     2) inode of new parent directory

     3) inode of new name(?) (if pre-existing?)

     4) inode of old directory(?) (if moving a directory?)



Appendix I.  Inventory of calls to glops.

All calls to go_* are from glock.c.
All implementations (except ogfs_free_bh()) are in src/fs/arch_*/glops.c.
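
The six go_* entry points behave like a per-type vtable.  A hedged sketch
of how such an operations vector could look (the layout and field order
are assumptions; only the go_* names come from this appendix):

    typedef struct glockops_sketch {
            void (*go_sync)(ogfs_glock_t *gl);    /* flush dirty data   */
            void (*go_acquire)(ogfs_glock_t *gl); /* after inter-node
                                                     acquire            */
            void (*go_release)(ogfs_glock_t *gl); /* before handing the
                                                     lock to another
                                                     node               */
            int  (*go_lock)(ogfs_glock_t *gl);    /* process lock; the
                                                     only one returning
                                                     a value            */
            void (*go_unlock)(ogfs_glock_t *gl);  /* process unlock     */
            void (*go_free)(ogfs_glock_t *gl);    /* unmount cleanup    */
    } glockops_sketch_t;

    /* e.g. the inode type would wire sync_inode, release_inode,
       lock_inode, and unlock_inode into these slots, per the
       inventory below. */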

go_sync:

   Called from:

     sync_dependencies() -- sync out any locks dependent on this one

     ogfs_gunlock() -- unlock a glock

   Implementations:

     sync_meta() -- sync to disk all dirty data for a metadata glock
         used for types:  meta, rgrp
         calls:  test_bit()       -- GLF_DIRTY (any dirty data to flush?)
                 ogfs_log_flush() -- flush glock's incore committed transactions
                 ogfs_sync_bh()   -- flush all glock's buffers
                 clear_bit()      -- clear GLF_DIRTY, GLF_SYNC
         also called by:  release_meta(), release_rgrp()

     sync_inode() -- sync to disk all dirty data for an inode glock
         used for type:  inode
         calls:  test_bit()       -- GLF_DIRTY (any dirty data to flush?)
                 ogfs_log_flush() -- flush glock's incore committed transactions
                 ogfs_sync_page() -- flush all glock's pages
                 ogfs_sync_bh()   -- flush all glock's buffers
                 clear_bit()      -- clear GLF_DIRTY, GLF_SYNC
         also called by:  release_inode()

go_acquire:

   Called from:

     xmote_glock() -- promote a glock

   Implementations:

     acquire_rgrp() -- done after an rgrp lock is acquired
         used for type:  rgrp
         calls:  ogfs_rgrp_save_in() -- read rgrp data from glock's LVB

go_release:

   Called from:

     cleanup_glock() -- prepare (an exclusive?) glock to be released to
                        another node

   Implementations:

     release_meta()
         used for type:  meta
         calls:  sync_meta() -- sync to disk all dirty data assoc with glock
                 ogfs_inval_bh() -- invalidate all buffers assoc with glock

     release_inode()
         used for type:  inode
         calls:  ogfs_flush_meta_cache()
                 sync_inode() -- sync to disk all dirty data assoc with glock
                 ogfs_inval_pg() -- invalidate all pages assoc with glock
                 ogfs_inval_bh() -- invalidate all buffers assoc with glock

     release_rgrp() -- prepare an rgrp lock to be released
         used for type:  rgrp
         calls:  sync_meta() -- sync to disk all dirty data assoc with glock
                 ogfs_inval_bh() -- invalidate all buffers assoc with glock
                 ogfs_rgrp_save_out() -- write rgrp data out to glock's LVB

     release_trans() -- prepare *the* transaction lock to be released
         used for type:  transaction
         calls:  ogfs_log_flush() -- flush glock's incore committed transactions
                 fsync_no_super() -- (kernel) flush this fs' dirty buffers


go_lock:  -- get fresh copy of inode or rgrp bitmap from disk
             returns 0 on success, error code on failure of read
             (this is the only go_* function that returns anything)

   Called from:

     ogfs_glock()

   Implementations:

     lock_inode()
         used for type:  inode
         calls:  atomic_read() -- gl_locked, recursive cnt of process ownership
                 ogfs_copyin_dinode() -- get fresh copy of inode from disk


     lock_rgrp() -- done after an rgrp lock is locked by a process
         used for type:  rgrp
         calls:  atomic_read() -- gl_locked, recursive cnt of process ownership
                 ogfs_rgrp_read() -- get fresh copy of rgrp bitmap from disk

go_unlock:  -- copy inode attributes to VFS inode, or
               release rgrp bitmap blocks, copy rgrp stats to LVB struct

   Called from:

     ogfs_gunlock()

   Implementations:

     unlock_inode()
         used for type:  inode
         calls:  atomic_read() -- gl_locked, recursive cnt of process ownership
                 test_and_clear_bit() -- GLF_POISONED (gl_vn++ if so)
                 test_bit() -- GLF_DIRTY (have inode attributes changed?)
                 ogfs_inode_attr_in() -- copy attributes fm dinode -> VFS inode

     unlock_rgrp() -- prepare an rgrp lock to be unlocked by a process
         used for type:  rgrp
         calls:  atomic_read() -- gl_locked, recursive cnt of process ownership
                 ogfs_rgrp_relse() -- release (i.e. brelse()) rgrp bitmaps
                 test_and_clear_bit() -- GLF_POISONED (gl_vn++ if so)
                 test_bit() -- GLF_DIRTY (have rgrp usage stats changed?)
                 ogfs_rgrp_lvb_fill() -- copy rgrp usage stats to LVB struct

go_free:

   Called from:

     ogfs_clear_gla() -- clear all glocks before unmounting the lock protocol

   Implementation (in arch_*/dio_arch.c):

     ogfs_free_bh() -- free all buffers associated with a G-Lock
         used for types:  meta, inode, rgrp
         calls:  list_del() -- removes private bufdata from glock list
                 ogfs_put_glstruct() -- decrement reference count gl_count
                 ogfs_free_bufdata() -- kmem_cache_free private bufdata from
                                         ogfs_bufdata_cache




Appendix J.  Some info from earlier ogfs-internals doc

NOTE (bc):  Some of this information is no longer exactly accurate, but
provides interesting reading nonetheless.

	A Guide to the Internals of the OpenGFS File System

copyright 2001 The OpenGFS Project.
copyright 2000 Sistina Software Inc.

	The G-Lock Functions

The concept of a G-Lock is fundamental to OpenGFS. A G-Lock represents an
abstraction of some underlying locking protocol and is essential to
maintaining consistency in an OpenGFS filesystem. The G-Lock layer provides
the glue required between the abstract lock_harness code and the
filesystem operations. The lock_harness itself is the subject of
a separate document and not covered here.

The G-Locks are held in a hash table contained in the OpenGFS specific
portion of the superblock. Each hash chain has three separate lists
plus associated counters and a read/write lock. The lists are
associated with the state in which the G-Locks happen to be. The
not_held state is for locks which are not held by the client, but
the structure still exists (used to reduce the number of memory
allocations/deallocations). The held state is for locks which are
held by the client (although not in use by any processes). These
locks can be dropped immediately upon a request from another client
or upon memory pressure. The third state (perm) is used for locks
which are both locked and in use.

In order to release a G-Lock so that another client may access the data
which it protects, all the data which that G-Lock covers must be flushed
to disk.  Also, further accesses to the data on the client releasing the lock
must be prevented until such time as the client reacquires the lock.  Clients
can cache data even when they don't have the G-Lock which covers that
data provided they check the validity of the data the next time they
acquire the lock and reread it if it has changed on disk.

We use various techniques to improve the efficiency of the glock
layer. Read/write locks are used upon the G-Lock lists so that the
more usual lookup operations can occur in parallel with each other
and only write operations (moving G-Locks between lists or creating
or deleting them) need the exclusive lock. Also when a lock has
been locked, it is not unlocked until it has aged a certain number
of seconds. This is done to increase the chances of a future lock
request being able to reuse the lock instead of requiring a separate
locking operation. Of course, if another client requires the lock, it
must post a callback to the lock holding client to request it. This is
done by marking the lock with a special flag which causes it to unlock as
soon as the current operation has completed (or immediately if there is
no current operation).

There is a daemon function (glockd) which runs periodically to clear
the G-Lock cache of old entries. It does this in a two stage process.
The first stage of ogfs_glockd_scan is really a part of the inode
functions, and not part of the glock code, but it fits nicely here.
The first part of the glockd scanning function looks at all the held
locks and demotes any which have been held too long. The second part
deletes any G-Locks which have exceeded the time out for not_held
locks.
	see opengfs/src/fs/glock.c