memexp_Locking_Protocol (Aug 6 2003)



              ---  The memexp Locking Protocol ---

Author, July-Aug, 2003:
Ben Cahill (bc), ben.m.cahill@intel.com

Introduction
------------

This document contains details of the memexp locking protocol, including the
kernel-space memexp lock module, and the user-space memexpd lock storage
server.  This is one of a few different protocols available for use on
the OpenGFS filesystem.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of shared file system locking in
OpenGFS.

This document is not intended as a user guide to OpenGFS.  Look in the OpenGFS
HOWTO-generic or HOWTO-nopool for details of configuring and setting up OpenGFS.

This document may contain inaccurate statements.  Please contact the author
(bc) if you see anything wrong or unclear.


Quick Overview
--------------

Lock modules are kernel implementations of G-Locks (see ogfs-locking).  Each
provides a distinct locking protocol that can be used in OpenGFS:

locking/modules/memexp -- provides inter-node locking, for cluster use
locking/modules/nolock -- provides "fake" inter-node locks, not for cluster use
locking/modules/stats -- provides statistics, stacks on top of another protocol

The memexp protocol supports clustered operation, and is fairly sophisticated.
The memexp module on each cluster member computer node keeps track of cluster
membership, and which members own which locks.  It does this with the help of
a central lock storage facility, and with LAN communication between computer
nodes.

Whenever a client node needs a lock, it must:

1)  Load the lock from lock storage
2)  Check the lock's state
3)  Change the lock's state to indicate that it is held by this node
4)  Store the lock into lock storage

If step 2) doesn't indicate a current lock state that is compatible with
the new requested lock state, the client must keep retrying steps 1) and 2)
until it does.  Current code has tight retry loops, so this can generate a
lot of traffic between the client nodes and the central lock storage.

Step 4) also incorporates a guarantee for atomicity of a load/store pair, by
comparing a lock's sequence number sent by the client with the sequence number
in lock storage.  If the sequence number is different, some other node has
modified the lock, so the client must retry the entire load/store procedure.
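
Here is a minimal sketch of that load/check/modify/store cycle.  The type and
helper names (lockbuf_t, storage_load(), etc.) are hypothetical stand-ins, not
OpenGFS functions; the real module goes through the ll_memexp_ops interface
described below:

/* Minimal sketch, not the actual module code.  lockbuf_t and the four
   helpers are hypothetical stand-ins. */
typedef struct lockbuf lockbuf_t;
typedef unsigned long long uint64;

extern int  storage_load(lockbuf_t *lb, uint64 *seq);             /* step 1 */
extern int  compatible(lockbuf_t *lb, int requested_state);       /* step 2 */
extern void mark_held(lockbuf_t *lb, int my_cid, int new_state);  /* step 3 */
extern int  storage_store(lockbuf_t *lb, uint64 seq);             /* step 4 */

int acquire_lock(lockbuf_t *lb, int requested_state, int my_cid)
{
    uint64 seq;

    for (;;) {
        if (storage_load(lb, &seq) < 0)           /* 1) load               */
            return -1;
        if (!compatible(lb, requested_state))     /* 2) check state        */
            continue;                             /*    tight retry loop   */
        mark_held(lb, my_cid, requested_state);   /* 3) mark as held       */
        if (storage_store(lb, seq) == 0)          /* 4) store; the server  */
            return 0;                             /*    checks seq         */
        /* seq mismatch: another node modified the lock; retry everything */
    }
}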

The only communication between nodes is for "callbacks" that request one node
to release a hold on a lock needed by another node.


Cluster Info:
------------

One of the first things done by the memexp lock module (on each computer node
in the cluster) is to read the cluster information from the Cluster Information
Device (cidev).  This is done when mounting the memexp locking protocol, during
filesystem mount.  The cidev is never accessed after mounting.

The cidev is a storage device (or dedicated partition), separate from the
filesystem or any journals, and separate from lock storage, that contains
information on each potential member (each of which is a potential memexp
client) of the cluster. 

When setting up an OpenGFS filesystem, a user/admin uses the ogfsconf user-space
utility to write the cluster/client information to the cidev, based on a
configuration file edited by the user/admin.  See the man page for ogfsconf,
as well as HOWTO-generic or HOWTO-nopool, for more info.

The interchange between the cidev and the lock module, and matching/conversion
of client info parameters (e.g. translating an IP address into a client ID),
are all handled by cluster_info.c.  This file defines, and is the only file to
touch, the following structures:

struct node_info {
  uint16 jid;            -- Journal ID for a client node
  uint16 cid;            -- Client (or cluster?) ID for a client node
  uint32 ipaddr;         -- IP Address for a client node
  uint32 nstomith;       -- # of stomith methods for a client node
  struct node_info *next; -- next node_info_t in group list
};
typedef struct node_info node_info_t;

struct node_group {
  uint32 count;          -- # node_info_t entries in list of clients
  node_info_t *list;     -- the list of clients in the cluster
};
typedef struct node_group node_group_t;

Note that each client separately reads the cidev and saves a complete list of
potential clients.  It uses this (static) list for translations, for example to
look up the IP address of another client that *this* client needs to
communicate with.

As part of the memexp mounting and initialization, a client writes
information about *itself* into lock storage, filling a single CIDBUF (see
below).  This write is done by client_state_init() in client_state.c.


Lock Storage
------------

The memexp protocol relies on a central repository of data, either on a
DMEP (Device Memory Export Protocol) device (e.g. certain SCSI drives),
or a "fake" DMEP server, the memexpd server.  Most users use the memexpd
server.

The memexpd server emulates DMEP operation, but stores data in memory or
a local disk store (on the computer running memexpd) rather than on SCSI
devices that are members of an OpenGFS pool (?).  Note that this storage
is completely separate from the filesystem and its journals, and from the
cluster information device (cidev)!

The memexp modules on the cluster client machines communicate with memexpd,
and with each other, via LAN rather than SCSI (DMEP).

To select between DMEP or memexpd, the memexp locking module uses one of
the low level ll_memexp_ops{}:

  ll_dmep_scsi  -- for SCSI DMEP devices
  ll_dmep_tcp   -- for memexpd DMEP emulator
  ll_dmep_debug -- for debugging (stacks on top of *both* scsi and tcp!)

ll_memexp_ops{} consists of the following operations:

  load     -- load a memory buffer into lock module from lock storage
                 (alloc a lock storage buffer if necessary)
  store    -- store a memory buffer from lock module into lock storage
                 (must do a load before trying this)
  dump     -- dump (read) many buffers from lock storage, for scanning/searching
                 (used only during recovery operations)
  modepg   -- fetch size and # buffers in lock storage (for lock state mgr init)
  init     -- initialize a connection to storage (called when module mounts)
  release  -- un-initialize a connection to storage (called at module unmount)
  stomith  -- tell memexpd server to ignore/ban communication from IP address
                 of bad machine.  Not used for SCSI DMEP.  Does not execute the
                 stomith procedure (i.e. power switch, etc.) for a node; scope
                 is limited to communication between memexpd and clients.

The protocol requires that a client machine must load a buffer before trying
to store it.  The load procedure is the only one that can allocate space in
storage, when a new lock buffer is accessed by a client.  For more on this,
see src/locking/servers/memexpd/memexp.c, Melload() and Melstore().
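
As a rough sketch, based only on the operation list above and the load/store
parameters described later in this document, such an ops table has roughly the
shape below.  The real prototypes are in ll_memexp.h and differ in detail; the
signatures here are approximations:

struct ll_memexp_ops {
  int  (*load)(memexp_t *mep, lm_lockname_t *lockname,
               buf_load_t *bld, void *data, uint32 len);
  int  (*store)(memexp_t *mep, lm_lockname_t *lockname,
                buf_store_t *bst, void *data, uint32 len);
  int  (*dump)(memexp_t *mep, void *buf, uint32 len);      /* recovery scan  */
  int  (*modepg)(memexp_t *mep, uint32 *bufsize, uint32 *nbufs);
  int  (*init)(memexp_t *mep);                             /* at mount       */
  void (*release)(memexp_t *mep);                          /* at unmount     */
  int  (*stomith)(memexp_t *mep, uint32 ipaddr);           /* ban dead client*/
};

ll_dmep_scsi, ll_dmep_tcp, and ll_dmep_debug each provide one instance of this
table; memexp selects one of them at mount time (see Appendix A, step 5).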


memexpd lock buffers and other storage:
---------------------------------------

The central lock storage server is organized as a bunch of equal-sized buffers.
The default size of each buffer is 32 bytes, but this can be changed with the
-b command line option for memexpd (see man page for memexpd).

The data buffers take up most of the memory space that memexpd allocates for
itself.  The default memory size is 32 MBytes, but that can be changed with
the -m command line option (see man page for memexpd).

Internal to memexpd, the overall memory size is divided among the following
uses (see src/locking/servers/memexp/memexp.c, create_MTab() and
create_Meltable()).  memexpd is hard-coded to use about 2% of the memory for
the hash table.  The IP table and banned table are tiny.  This leaves the
bulk (about 98%) of memory for the element and data tables:

  hash table    -- maps buffer IDs (bid, actually the lockname, combination of
                     8-bit lock type and 64-bit lock number) to a physical
                     buffer number (pbn, storage location in memexpd data table)
  element table -- status, including links to Most-Recently-Used (MRU) table,
                     and hash table links for each data buffer (see below).
  data table    -- 32-byte (default size) data buffers for lock storage
  IP table      -- IP addresses of clients (*not* in order of OGFS cluster IDs)
  banned table  -- list of (dead node) clients isolated from memexpd
 
The element table contains (see src/locking/servers/memexp/memexp.h):

#define BIDSIZE 10
typedef struct MEelement_s {
  u32 Hnext;        /* index of the next hash element */
  u32 Hprev;        /* index of the prev hash element */
  u32 MRUn;         /* index of the next MRU element */
  u32 MRUp;         /* index of the prev MRU element */
  u64 seq;          /* sequence number, ++ each time change is made to buffer */
  u8 bid[BIDSIZE];  /* ~ lockname, combination of lock type and lock number */
  u8 InUse;         /* buffer in use, not empty */
} MEelement_t;

One element (35 bytes) and one data buffer (32 bytes) are required for each
lock or lock value block (LVB, see elsewhere).  Thus, 32 MBytes can store a bit
less than 1/2 million locks or LVBs.
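
A quick back-of-the-envelope check of that capacity claim, using the default
numbers above (standalone illustration only, not OpenGFS code):

#include <stdio.h>

int main(void)
{
    unsigned long mem      = 32UL * 1024 * 1024;  /* -m default: 32 MBytes      */
    unsigned long hash     = mem / 50;            /* ~2% goes to the hash table */
    unsigned long per_lock = 35 + 32;             /* one element + one data buf */

    printf("~%lu locks/LVBs\n", (mem - hash) / per_lock);
    /* prints roughly 490000 -- "a bit less than 1/2 million" */
    return 0;
}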


accessing storage buffers:
--------------------------

Loading a lock storage buffer is the only way for a node to find current
status of a lock; there is no mechanism to proactively notify a node that the
lock that it wants is now available (although callback messages provide a way
to request a lock from another node).  So, polling (repeated loading) is the
only way for a node to keep trying to grab a lock.  Conversely, storing is
the only way for a node to make a lock's new status available for other nodes
to see.

When loading lock buffers from the server into the memexp module, memexp calls
the ll_memexp_ops "load" call, with the following as parameters:

memexp_t      * mep       -- memexp module's "superblock".  See ll_memexp.h.
lm_lockname_t * lockname  -- As with G-Lock layer, lockname consists of:
                               ln_number -- lock number
                               ln_type   -- lock type (e.g. LM_TYPE_CIDBUF)
                               See opengfs/include/lm_interface.h.
buf_load_t    * ret       -- buffer for Status returned from lock server:
                               inuse -- is memexpd buffer in use (yes/no)?
                               full  -- indicates how full the lock storage is
                               seq   -- storage sequence number for memexpd buf
                               pbn   -- physical buffer number
                               See ll_memexp.h, struct buf_load_s.
void          * data      -- buffer for Data returned from lock server
uint32          len       -- Length of data (i.e. memexpd buffer size, not
                               necessarily the length of data structure we're
                               reading, although that data must fit within the
                               memexpd buffer size, of course)

A call to store a lock is very similar, but contains a struct buf_store_s
instead of a buf_load_s (see ll_memexp.h), that leaves out the "full" field.
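
Based only on the fields listed above, the two status structures have roughly
the following shape.  Field widths here are guesses; see ll_memexp.h,
struct buf_load_s and struct buf_store_s, for the real definitions:

struct buf_load_s {        /* status returned by a load */
  uint32 inuse;            /* is the memexpd buffer in use?            */
  uint32 full;             /* how full lock storage is                 */
  uint64 seq;              /* storage sequence number for this buffer  */
  uint64 pbn;              /* physical buffer number                   */
};

struct buf_store_s {       /* status sent with a store  */
  uint32 inuse;            /* 1 = store the data, 0 = free the buffer  */
  uint64 seq;              /* copied from the preceding load           */
  uint64 pbn;              /* copied from the preceding load           */
};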

The module uses the "inuse" field to tell memexpd whether to store the data
(inuse = 1), or free the buffer space (inuse = 0).  This is not the same as
the "inuse" field of a CIDBUF!  (When loading a buffer, though, memexp will
look at the "inuse" status first when determining the status of a given lock,
and may, for example, return a CID_NOT_USED status if it sees !bld.inuse).
See elsewhere in this doc for more info on CIDBUFs.

The module simply copies the "seq" and "pbn" fields from the buf_load_t
(usually seen as "bld" in the code) to the buf_store_t (usually seen as
"bst" in the code).  See ll_memexp.h, prepare_store().  When handling the
storage request from the memexp client node, the memexpd server compares these
values to its internal state:  

  pbn -- serves as a sanity check that client and server agree on which lock
           buffer to access.
  seq -- guarantees atomic access to a lock buffer.  A mismatch indicates that
           some other node has written to the buffer since this client loaded it.

If a client memexp module receives a Mel_Err_BadPbn, a Mel_Err_BadSeq, or
any other error from a store attempt, the client must re-load the lock, and
try again.

Dumps are similar to loads, but they read a lot of buffers at once.  These are
used during recovery or reset-the-cluster types of operations, not during
normal filesystem usage.


stored data (lock) types:
------------------------

There are several types of information stored in the central server, each no
larger than the common size (32 bytes, by default) of buffers in the server.
Types are defined in src/include/lm_interface.h, and are recognized throughout
the locking stack (from memexpd lock storage server to G-Lock layer) and
filesystem code:

  LM_TYPE_CIDBUF    (0x01) -- cluster member info, used by memexp only
  LM_TYPE_MOUNT     (0x02) -- mount, used by memexp only
  LM_TYPE_*         (0x03 - 0x0A) -- various types of locks (OGFS uses these)
  LM_TYPE_LVB_MASK  (0x80) -- Lock Value Block, ORd with another type number
                                to become the lock type (part of the "lock
                                name") for an LVB stored in lock storage.

memexp doesn't need to know the difference between the various types of
filesystem locks (0x03 - 0x0A), but it does need to know about the other types
(CIDBUF, MOUNT, and LVB).


LM_TYPE_CIDBUF:
---------------

This is what goes into a lock storage buffer for a lock of type CIDBUF
(see src/locking/modules/memexp/ll_memexp.h).  There is one CIDBUF stored
in lock storage for each and every cluster member node:

struct cidbuf_s {
  uint64 counter;      -- heartbeat counter; memexp increments this to prove
                          to other nodes that it's alive (heartbeat).
  uint16 cleaner;      -- the member node that's doing cleanup (journal
                          recovery, etc.) for the node represented by this
                          cidbuf.  Used only during recovery.
  unsigned char inuse; -- state of node, either unused, in-use, or dead and in
                          one of several recovery states:
                          CID_NOT_USED -- OGFS not mounted on this node
                          CID_USED     -- OGFS mounted on this node
                          CID_STOMITH  -- protecting fs and locks from dead node
                          CID_STOMITH_DONE -- node is ready for lock recovery
                          CID_EXPIRED_MARKING -- first step of lock recovery
                          CID_EXPIRED  -- journal recovery
                          CID_RESETING -- last step of lock recovery
  unsigned char dead;  -- local use only - TRUE if counter hasn't changed
};
typedef struct cidbuf_s cidbuf_t;

cidbuf_s shows cluster node membership status, as determined by the cluster
logic in memexp, for one node.  This structure shows the "aliveness" of the
node it represents, and also shows if another node is in the process of
recovering it after it expired (died).

Note that the node represented by this structure is *not* identified within
the structure; it is identified by its lock name (combination of lock number
and lock type) when loading from or storing into the lock store:

lockname.ln_number = cid;           -- Client ID
lockname.ln_type = LM_TYPE_CIDBUF;  -- Lock Type

The node that *is* identified within the structure is the "cleaner" node, if
any, that has taken over the node recovery after failure of the node
represented by the structure.

The heartbeat element is the only way that nodes detect the death of other
nodes.  There is no timeout on locks, per se.  However, each node must update
its own heartbeat counter periodically (HEARTBEAT_INTERVAL is defined by
memexp.h to be 2 seconds).  Counter updates are done by client_refresh(), in
client_state.c, called from lock_refresh() daemon in daemon.c.

Detection of dead nodes is done in reread_client_state(), and 
check_expired_clients() in client_state.c, called from node_monitor() daemon
in daemon.c.  Each node periodically checks *all* nodes' (including its own)
heartbeat counters.  If a node's counter has not changed since the last check,
that node is considered "dead".  The timeout for this detection is set by the
user when configuring the cluster, using ogfsconf (see man page for ogfsconf,
and HOWTO-generic or HOWTO-nopool).  As an example, a 30-second timeout causes
the node_monitor() daemon to run once every 30 seconds, thus it could take up
to almost a minute to determine that a node is dead.
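
The per-node check reduces to comparing the counter just re-read against the
counter seen on the previous scan.  A simplified sketch (names other than the
cidbuf_t fields are hypothetical):

struct client_scan {
  uint64 last_counter;     /* heartbeat counter seen on the previous scan */
  unsigned char dead;      /* local-use flag, as in cidbuf_t              */
};

/* new_counter is the "counter" field just re-read from the node's CIDBUF */
static void check_one_client(struct client_scan *c, uint64 new_counter)
{
    if (new_counter == c->last_counter) {
        c->dead = 1;             /* no heartbeat since the last scan; the
                                    first node to notice this initiates
                                    recovery (recover_expired_client())  */
    } else {
        c->last_counter = new_counter;
        c->dead = 0;
    }
}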

The first node to find another node dead is responsible for initiating the
recovery process.  Recovery is a multi-step process.
See recover_expired_client() in client_state.c.  In short, recovery:

1).  Performs STOMITH for the dead node.  See stomith_cid() in client_state.c.
     This is two-fold:
     a).  Ban the dead node from communication with memexpd lock server.
     b).  Execute a STOMITH procedure to isolate the dead node from the
            filesystem.  This may be a machine reset or power-down, or
            a "fencing" operation that blocks access to the filesystem
            storage device by means of a fibre channel switch or the like.

2).  Partially unholds locks that were held by the dead node.  See 
     lsm_mark_expired() and cleanup_lock() in lock_state.c:
     a).  Unholds shared locks (so they can be held in exclusive mode by other
            nodes).  These locks allowed the dead node to read filesystem
            data, so the dead node would not have modified (written) any
            filesystem data associated with these locks, so it is safe to
            release them immediately.  Frees (deallocs from lock storage) any
            shared locks that now have no holder.
     b).  Marks all locks that were held by the dead node in "exclusive" mode.
            These locks are marked as "expired".  They may not be held (see
            check_compat() in lock_state.c) by other nodes until after recovery
            of the dead node's journal.  These locks allowed the dead node to
            write filesystem data/metadata.  There is no telling if these
            operations completed, so it is not safe to release them until
            after journal recovery.

3).  Recover journal of dead node.  This step is performed in the filesystem
     layer of the code.  memexp requests this via the LM_CB_EXPIRED callback
     to the G-Lock layer, which in turn causes the journal recovery to occur
     (see ogfs-locking doc).

     Journal recovery  requires the use of at least one of
     the locks that are "expired" (notably, the dead node's journal lock).
     The G-Lock layer of the node performing recovery (the "cleaner" node) can
     request a lock, using  LM_FLAG_NOEXP, and force the lock to become
     available, even if the lock is "expired" (see check_compat() in
     lock_state.c ... LM_FLAG_NOEXP causes "force" to be TRUE).  The "expired"
     status of the lock is maintained throughout the journal recovery process,
     even though the "cleaner" node actively holds the lock.

     Once the journal recovery is complete, the filesystem layer notifies
     memexp, using the ls_ops->reset_exp() call (see src/fs/recover.c,
     ogfs_recover_journal()).  memexp can then continue to the next step.

4).  Free the "expired" locks.  This allows other nodes to use the locks
     that the dead node had held in "exclusive" mode.  Now that journal
     recovery is complete, this is a safe thing to do.

For more information on the "inuse" field, and stages of recovery, see
comments at the beginning of src/locking/modules/memexp/client_state.c, which
is the main (but not the only) file that touches the cidbuf_t data type.
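
As a compact summary of the four recovery steps above, here is a paraphrase in
code form.  The wrapper names and arguments are illustrative only, not the
real calls in client_state.c:

void recover_dead_node_sketch(memexp_t *mep, uint16 dead_cid)
{
    /* 1) STOMITH: ban the dead node from memexpd, then run its stomith
          method (reset, power-down, or fencing)                          */
    isolate_dead_node(mep, dead_cid);        /* cf. stomith_cid()         */

    /* 2) release the dead node's shared holds; mark its exclusive holds
          "expired" so no other node uses them yet                        */
    mark_expired_locks(mep, dead_cid);       /* cf. lsm_mark_expired()    */

    /* 3) ask the G-Lock layer (LM_CB_EXPIRED callback) to replay the dead
          node's journal; the filesystem calls ls_ops->reset_exp() when
          the replay is complete                                          */
    request_journal_recovery(mep, dead_cid);

    /* 4) after reset_exp():  free the "expired" locks so other nodes may
          hold them again                                                 */
    free_expired_locks(mep, dead_cid);
}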


LM_TYPE_MOUNT:
--------------

This is what goes into a lock storage buffer for a lock of type MOUNT
(see src/locking/modules/memexp/mount_state.c, which is the only file that
touches this data type).  Within an OGFS system using memexp, there are
actually two types of "MOUNT" locks.  One is used at the filesystem level, and
is LM_TYPE_NONDISK, a "normal" type of lock.  See ogfs-locking for more info.
The type described here, LM_TYPE_MOUNT, has a different format than a
"normal" lock, and is known only within memexp:

struct mlbuf_s {
  uint16 state;        -- state of cluster regarding first node to mount the
                          filesystem.  One of the following:
                          MOUNT_STATE_NULL  -- no node has touched MOUNT lock
                          MOUNT_STATE_INIT  -- first node has touched MOUNT lock
                          MOUNT_STATE_RESET -- a node is resetting the cluster
                          MOUNT_STATE_OK    -- other nodes may mount
  uint16 owner;        -- node that owns the mount lock (OWNER_NONE in
                          NULL or OK state)
};
typedef struct mlbuf_s mlbuf_t;

This special lock enables a node to discover if it is the first node in the
cluster to mount the filesystem.  If so, the node will replay all journals for
the filesystem, then set the mount lock state to OK, so other nodes may go
ahead and mount the filesystem (see _ogfs_read_super() in
src/fs/arch_linux_2_4/super_linux.c).

The algorithm for "first" discovery is implemented in check_mount_state(),
in src/locking/modules/memexp/mount_state.c.  Here's what it does when it
finds the mount lock in various states:

  NULL  -- No node has touched the lock before us.  We're FIRST!
           (The implementation actually tests the "inuse" field of the
           status, bld, from the load.  !bld.inuse means NULL state.)
           This node can go ahead and mount, *and* replay all journals.

  INIT  -- Some node (the FIRST node) has exclusive ownership of the mount lock.
           Three possibilities:

           1).  Another (alive) node was first.  We keep trying for the lock
                every 2 seconds, using kernel call schedule_timeout() to block
                and wait between retries.  When we get it, it should normally
                be in OK state, so we can go ahead and mount.

           2).  *This* node was first.  Somehow we're asking for the lock again,
                but, hey, we're *still* FIRST, so we can go ahead and mount,
                and replay all journals.

           3).  Another (dead) node was first.  We take over the lock, and
                reset the entire cluster.  Once done, we claim that we are
                FIRST, so we can go ahead and mount, and replay all journals.
          
  RESET -- Some node has exclusive ownership of the mount lock, and is resetting
           the entire cluster, erasing all locks and lock value blocks,
           and setting each cluster member's CIDBUF to ?? state.
           Two possibilities:

           1).  The owning node is alive.  We keep trying for the lock
                every 2 seconds.  When we get it, it should normally be in
                OK state, so we can go ahead and mount.

           2).  The owning node is dead.  *We* take over the lock, and
                reset the entire cluster.  Once done, we claim that we are
                FIRST, so we can go ahead and mount, and replay all journals.

  OK    -- Some other node was FIRST.  We still have some things to check
           before going ahead and mounting.  Three possibilities:

           1).  This node has not been mounted yet (state == CID_NOT_USED).
                This is normal case.  We can go ahead and mount, without
                needing to replay all journals.

           2).  The cluster is alive (at least one node is heartbeating).  But,
                this node has already been mounted(?).  Something is suspect;
                we keep trying for the lock every 2 seconds.  What happens
                when we get it?  Will state be CID_NOT_USED?  Is this
                incomplete code?

           3).  The cluster is dead.  We take over the lock, and
                reset the entire cluster.  Once done, we claim that we are
                FIRST, so we can go ahead and mount, and replay all journals.

Once the FIRST node replays all journals, it will call the lock module to
execute others_may_mount(), also implemented in mount_state.c.  This sets
the mount lock state to MOUNT_STATE_OK, and sets the mount lock owner to
OWNER_NONE (65535), allowing any other node to go ahead and mount.
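
Condensing the state descriptions above into code form, the first-to-mount
decision looks roughly like the sketch below.  Helper names are hypothetical;
the real logic is check_mount_state() in mount_state.c:

/* returns 1 if this node is FIRST (must replay all journals), 0 if not */
int first_to_mount_sketch(mlbuf_t *ml, int buffer_unused)
{
    for (;;) {
        if (buffer_unused)                  /* NULL state (!bld.inuse)   */
            return 1;                       /* we are FIRST              */

        switch (ml->state) {
        case MOUNT_STATE_INIT:
            if (owner_is_self(ml))          /* somehow asking again      */
                return 1;                   /* still FIRST               */
            /* fall through */
        case MOUNT_STATE_RESET:
            if (owner_is_dead(ml)) {        /* take over the lock        */
                reset_cluster();
                return 1;                   /* claim FIRST               */
            }
            break;                          /* alive owner: wait, retry  */
        case MOUNT_STATE_OK:
            if (self_not_yet_mounted())     /* CID_NOT_USED: normal case */
                return 0;
            if (cluster_dead()) {           /* no one is heartbeating    */
                reset_cluster();
                return 1;
            }
            break;                          /* suspect: keep retrying    */
        }
        wait_2_seconds_then_reload(ml, &buffer_unused);
    }
}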


Filesystem Locks (LM_TYPE_XXX):
-------------------------------

The information stored in the central server, for each lock, consists of
the following lsm_lock_s structure (see src/locking/modules/memexp/ll_memexp.h).
This is the lowest level of lock representation, and is handled only by low
level code in lock_state.c and lvbops.c:

/*  Lock structure stored in DMEP buffer  */

struct lsm_lock_s {
  uint8 state;               -- lock state, one of:
                                ME_ST_UNLOCKED
                                ME_ST_EXCLUSIVE
                                ME_ST_DEFERRED
                                ME_ST_SHARED
  uint8 conv:4,              -- requesting "conversion" of lock, one of:
                                NO_CONV
                                YES_CONV
        lvb_perm:2,          -- Lock Value Block (LVB) is "permanent" (yes/no)
        lvb_lock:2;          -- Lock Value Block is locked (yes/no)
  uint16 expired_holders;    -- # of expired clients holding this lock
  uint16 live_holders;       -- # of alive clients holding this lock
  uint16 lvb_holders;        -- # of clients holding this lock's LVB
  uint8 holders[MAX_CLIENTS / 8];    -- bitmap of clients holding this lock
  uint8 lvb[MAX_CLIENTS / 8];        -- bitmap of clients holding lock's LVB
  uint8 expired[MAX_CLIENTS / 8];    -- bitmap of expired clients holding lock
  uint8 conversion[MAX_CLIENTS / 8]; -- bitmap of clients requesting conversion
};
typedef struct lsm_lock_s lsm_lock_t;
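
The per-client bitmaps above are manipulated with small macros in ll_memexp.h
(e.g. clear_holder() and clear_conversion(), mentioned later in this document).
A generic sketch of what such bit operations look like (not the actual macros):

#define SET_BIT(map, cid)    ((map)[(cid) / 8] |=  (1 << ((cid) % 8)))
#define CLEAR_BIT(map, cid)  ((map)[(cid) / 8] &= ~(1 << ((cid) % 8)))
#define TEST_BIT(map, cid)   ((map)[(cid) / 8] &   (1 << ((cid) % 8)))

/* e.g. recording that client 5 now holds lock "lsm":
     SET_BIT(lsm->holders, 5);                                           */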


Lock Value Blocks (LVBs):
------------------------


LAN Comm:  Lock Storage and Inter-Node Callbacks
------------------------------------------------

LAN comm consists of two different paths:

-- Node to lock server
-- Node to node

Node-to-lock-server LAN is used only when the lock storage is implemented via
the memexpd lock server (node-to-lock-storage transfer is done via SCSI for
DMEP).  Node-to-node LAN is always used, regardless of lock storage
implementation, for requesting lock release from other nodes.

dmep_tcp.c handles the memexp lock module side of the communication between 
memexp and the memexpd lock storage server.  There are some interesting
comments at the beginning of dmep_tcp.c about the current state of development.
According to those comments, the module and server can handle only one lock at
a time (i.e. lock load/store is serialized).  The code author describes some
thoughts about supporting multiple pending requests.

dmep_scsi.c handles lock transfers to/from SCSI DMEP storage, if it is used.

callbacks.c supports callbacks between nodes.  send_mess() (callbacks.c)
sends a message to a single other node.  send_mess() is called in the following
situations:

kick_holders()   -- Notify each client holding a lock that we need it.
                    Note:  This uses the lock's holder bitmap to determine
                    which holders to notify.  kick_holders() issues one of
                    the following message types:
                    LM_CB_NEED_S     -- we need the lock in shared mode.
                    LM_CB_NEED_E     -- We need the lock in exclusive mode.
                    LM_CB_NEED_D     -- We need the lock in deferred mode.

drop_locks()     -- Notify all clients to drop excess locks (see DROPLOCKS).
                    Called when memexp notices that the lock storage is
                    getting full, drop_locks() issues the following message
                    type to *all* alive nodes, to request them to release any
                    locks which are no longer needed:
                    LM_CB_DROPLOCKS  -- drop some locks, if possible.
                      See process_results() and drop_locks() in lockops.c.
                      See also ogfs_glock_cb() in src/fs/glock.c.

memexp_mount()   -- Something failed, send unmount (?) message (to self??).
memexp_unmount() -- Send unmount (?) message (to self??).

A callback thread, memexp_callback(), is responsible for receiving callbacks
from other nodes.  When one is received, it simply forwards the callback to the
filesystem's G-Lock layer.  There is apparently no interest within the memexp
module in the callback's contents.
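
A sketch of the kick_holders() pattern described above:  walk the lock's
holder bitmap and send a "need" callback to each holder.  The loop structure
and the send_mess() arguments shown here are approximations only:

static void kick_holders_sketch(memexp_t *mep, lm_lockname_t *name,
                                uint8 *holders, unsigned int need_type,
                                uint16 my_cid)
{
    uint16 cid;

    for (cid = 0; cid < MAX_CLIENTS; cid++) {
        if (!(holders[cid / 8] & (1 << (cid % 8))))
            continue;                          /* not a holder of this lock */
        if (cid == my_cid)
            continue;                          /* no need to kick ourselves */
        /* need_type is LM_CB_NEED_S, _E, or _D */
        send_mess(mep, cid, need_type, name);  /* arguments approximate */
    }
}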



Registering with Lock Harness and Mounting
------------------------------------------

init_memexp() (main.c) is memexp's module_init function.  Its only job is to
register with the lock harness, via lm_register_proto() (harness.c).  When
registering, memexp passes a pointer to its struct lm_lockops memexp_ops{},
exposing its interface (see ogfs-locking).  The lock harness keeps this in
a table, and passes it to the filesystem if the filesystem mounts the memexp
protocol.

When mounting memexp as the lock protocol, the harness calls the memexp_ops{}
function "mount", implemented as memexp_mount() (moduleops.c).  This call
initializes memexp, and discovers the node's "first-to-mount" status and the
node's journal ID.

exit_memexp() (main.c) is memexp's module_exit function.  Its only job is to
unregister from the lock harness, via lm_unregister_proto() (harness.c).


LockOps and The Lock State Manager:
----------------------------------

The bulk of lock management is performed in lockops.c and lock_state.c.

upper level lock support:
------------------------

lockops.c provides the memexp_ops{} calls used by the filesystem's G-Lock
layer to query and change states of locks.  These calls are all implemented
in lockops.c:

memexp_get_lock -- allocate and init a me_lock_t structure, given lockname
memexp_put_lock -- free the me_lock_t, and unhold the attached LVB, if any
memexp_lock     -- lock a lock
memexp_unlock   -- unlock a lock
memexp_reset    -- reset a lock
memexp_cancel   -- cancel a lock request, so it won't get retried if it fails.

me_lock_t is the memexp-specific per-lock "private data" that attaches to a
filesystem G-Lock as an lm_lock_t type (see ogfs-locking).  This structure
identifies a lock stored, or soon to be stored, in lock storage.  It follows
along with a G-Lock as it is used in the filesystem layer, but it is never
accessed by any code outside of the memexp module.  Most calls from the G-Lock
layer to a memexp lock or LVB operation include an lm_lock_t as a parameter,
so memexp can identify and access the pertinent lock and/or LVB in lock storage.
An me_lock_t is defined in memexp.h as:

struct me_lock_s {
  lm_lockname_t lockname; -- 64-bit lock number and 32-bit lock type
  unsigned int cancel;    -- flag to cancel a lock request (end retry loop)
  lm_lvb_t *lvb;          -- pointer to Lock Value Block
  memexp_t *mep;          -- pointer to memexp "superblock" structure
};
typedef struct me_lock_s me_lock_t;

The memexp_lock and memexp_unlock commands provide the bulk of the memexp
functionality, as seen by the G-Lock layer.  Both cause state changes of locks
stored in lock storage.  Whenever a lock needs to be queried, created, or its
state changed, memexp loads it in from lock storage, manipulates it, and then
stores it right back into the lock storage.  There is no caching of locks
within the memexp module, but there may be many me_lock_t structures attached
to G-Locks cached in the G-Lock layer.

memexp_lock() attempts to change a lock's state to the state requested by the
G-Lock layer.  It calls the following functions:

calculate_action() -- determines a number (see memexp.h) representing the
                      "action", i.e. the transition from "old" to "new" state.
                      Some examples (14 are defined, including ME_ACT_NOP):
                      ME_ACT_U2E -- from unlocked to exclusive lock state
                      ME_ACT_D2S -- from deferred to shared lock state
                      ME_ACT_X2U -- from any state to unlocked state
lsm_lock()         -- makes one load/store attempt to change to the new state,
                      and handles any needed LVB operations.
process_results()  -- interprets results of lsm_lock() attempt.

lsm_lock() supports all lock and unlock operations for the locks used by the
G-Lock layer (see ogfs-locking).  That is, it coordinates state change and
Lock Value Block operations on all locks except for CIDBUF and MOUNT types,
which are used privately within the memexp module (see elsewhere in this
document).  

process_results() may indicate "success", "retry", or "error".  If retry is
indicated, it means the attempt failed, but try again.  memexp_lock()
automatically keeps trying unless and until the G-Lock layer calls
memexp_cancel() to cancel the memexp_lock() request.  Retries can happen in
three situations:

1)  While waiting for a conversion deadlock to be resolved
2)  While waiting for other nodes to release the lock
3)  After waiting for an expired lock to become unexpired and available

Retries are *not* attempted if the LM_FLAG_TRY flag is set in the lock request.
The retry loop may be aborted by a G-Lock call to memexp_cancel().
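
A skeleton of that retry behavior, much simplified from memexp_lock()
(lockops.c); the result_is_retry() test and the exact calculate_action() and
lsm_lock() arguments are placeholders, not the real signatures:

unsigned int memexp_lock_sketch(me_lock_t *mel, unsigned int cur_state,
                                unsigned int req_state, unsigned int flags)
{
    unsigned int action = calculate_action(cur_state, req_state);
    lsm_status_t lst;

    do {
        lsm_lock(mel, action, &lst);    /* one load/store attempt; fills lst */
        /* process_results() interprets lst: success, error, or retry; it is
           also where kick_holders()/drop_locks() callbacks are issued       */
    } while (result_is_retry(&lst) &&   /* placeholder for process_results() */
             !(flags & LM_FLAG_TRY) &&  /* LM_FLAG_TRY: one attempt only     */
             !mel->cancel);             /* set by memexp_cancel()            */

    return lst.state;                   /* new state (approximate)           */
}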

process_results() is the only source for callbacks to other nodes!  It calls
drop_locks() if it sees that the lock storage is becoming full, and calls
kick_holders() to notify nodes holding this lock to give it up.  See LAN
Comm section of this document.

memexp_unlock() behaves similarly to memexp_lock().  It first determines a
number expressing "old" and "new" states, then calls:

lsm_lock()         -- makes one load/store attempt to change to the new state,
                      and handles any needed LVB operations.

memexp_unlock() does not call process_results() or make retry attempts.  It
simply returns the result of lsm_lock() to the G-Lock layer.

memexp_reset() unlocks a lock, but does not provide any support for handling
LVBs (what happens if G-Lock layer resets a lock that has an LVB?).  It
calls lsm_lock(), just like memexp_unlock() does, but uses the X2U action.


LVBs and lower level lock support:
----------------------------------

As suggested above, lsm_lock() (lock_state.c) is the function called by the
memexp_ops{} functions for *any* lock state change, whether locking or
unlocking (don't let the name deceive you).

lsm_lock() sorts requests into those that require accessing an LVB, vs. those
that do not.  The G-Lock layer makes the decision about whether to attach
an LVB to a lock.  If one is attached, there are two rules for accessing LVBs:

1)  Any transition that *starts* in the *exclusive* state *stores* the lock's
    LVB into lock storage.  Exclusive state indicates that this node has written
    to the protected resource; the LVB contains updated information about the
    resource, needed by other nodes that read or write the resource.

    lsm_lock() calls lvb_write() (lvbops.c) for this, *before* changing lock
    state (to deferred, shared, or unlocked), that is, before giving up this
    node's write privilege.  If the new state is shared or deferred, this
    node can continue to use (read) the contents of the LVB that it just stored.

2)  Any *other* transition that *ends* in a *locked* state *loads* an LVB from
    lock storage.  This node will either read or write the protected resource,
    and needs the latest information about it.

    lsm_lock() calls lvb_read() (lvbops.c) for this, *after* changing lock state
    (to deferred, shared, or exclusive), that is, after acquiring this node's
    read privilege.

One interesting exception to these rules:  If the action is NOP (no operation;
the NOP action is used when waiting for an expired lock to become unexpired and
available for use), no LVB activity takes place, regardless of starting or
ending state (both of which are the same, of course).
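
The two rules reduce to a small decision around the state change.  A condensed
sketch (the helper names are hypothetical stand-ins for lvb_write()/lvb_read()
and the load/modify/store cycle):

static void lock_with_lvb_sketch(me_lock_t *mel, unsigned int old_state,
                                 unsigned int new_state, unsigned int action,
                                 int has_lvb)
{
    /* Rule 1: leaving EXCLUSIVE stores the LVB *before* the state change,
       so the updated value is published before write privilege is lost.  */
    if (has_lvb && action != ME_ACT_NOP && old_state == ME_ST_EXCLUSIVE)
        write_lvb(mel);                   /* cf. lvb_write() (lvbops.c)   */

    change_lock_state(mel, action);       /* the load/modify/store cycle  */

    /* Rule 2: any other transition ending in a held state loads the
       latest LVB *after* the new hold is in place.                       */
    if (has_lvb && action != ME_ACT_NOP &&
        old_state != ME_ST_EXCLUSIVE && new_state != ME_ST_UNLOCKED)
        read_lvb(mel);                    /* cf. lvb_read() (lvbops.c)    */
}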

lsm_lock() calls one of two functions to handle lock state, depending on
whether the action ends in a locked vs. unlocked state.  Both of these:

-- load a lock from lock storage,
-- verify that this node's hold on the lock is appropriate,
-- change the lock state and/or holder bitmap(s),
-- store the lock again
-- fill in an lsm_status_t for use within the module by process_results()
      (for lock) or by unlock() (for unlock).

lsm_unlock() -- Handles E2U, D2U, S2U, and X2U (from anything to unlocked):
                Sets lock's lvb_perm = 1, if E2U and lock's LVB is PERM
                   (keeps lock from being freed if its LVB is "permanent",
                    even if there are no nodes holding the lock).
                Clears any pending conversion request from this node on lock,
                    by using macro clear_conversion() (ll_memexp.h).
                Frees the lock storage buffer, if this node is the last holder,
                    by clearing bst.inuse.
                Clears this node from the lock's "held" bitmap,
                    by using macro clear_holder() (ll_memexp.h).
                
__lsm_lock() -- Handles all others, including NOP:
                Clears lock buffer contents if storage buffer was unused.
                If NOP, returns after simply loading the lock buffer.
                Checks compatibility of current state with requested state,
                    calling check_compat() (lock_state.c), which returns one
                    of 3 conditions:
                    1 -- Compatible, OK to proceed with lock state change:
                         - Lock buffer is unused, or
                         - From exclusive to deferred or shared, or
                         - Various lock states, AND no other node is requesting
                            a conversion (but we might be).
                    0 -- Not Compatible, current state not good for requested
                         action, but try a "conversion":
                         - Various lock states, AND no node at all is requesting
                            a conversion.
                   -1 -- Not Compatible, no conversion possible.  Cases:
                         - Lock is expired (owned by dead node), or
                         - Another node is requesting a conversion.
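
A paraphrase of that three-way result (not the real check_compat() code; the
helper tests are hypothetical):

static int check_compat_sketch(lsm_lock_t *lock, uint16 my_cid, int force)
{
    /* "force" is TRUE when the G-Lock layer passed LM_FLAG_NOEXP */
    if (lock->expired_holders && !force)
        return -1;                     /* expired: wait for recovery       */
    if (other_node_converting(lock, my_cid))
        return -1;                     /* someone else wants a conversion  */
    if (states_compatible(lock, my_cid))
        return 1;                      /* OK to proceed with state change  */
    return 0;                          /* not compatible; try a conversion */
}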


conversions:
-----------

retries:
-------

Both of the memexp module's low level lock state change functions, lsm_unlock()
and __lsm_lock(), load a lock from storage, change the lock state, then store
the lock back into storage.  Both functions have an infinite retry loop that
retries rapidly, and forever, if the store operation fails, but aborts
immediately if the load operation fails.  Unlike the higher
level retry loop, there is no hook by which this retry loop can be aborted
by higher level intervention; the only ways out are success of the store,
or failure of the load.

The memexpd (storage server) store handler, Melstore
(src/locking/servers/memexp/memexp.c) will fail for any of the following
reasons:

  Mel_Err_BadSize -- inappropriate size of data to be stored
  Mel_Err_NoBid   -- lock type/number not in hash table
  Mel_Err_BadPbn  -- storage location is invalid
  Mel_Err_BadSeq  -- another node has written to this buffer since this
                       node loaded it.

Any of these failures (except BadSize?) might be cured by a fresh load of the
lock buffer, which is exactly what lsm_unlock() and __lsm_lock() attempt to do.
If a load fails, either function will return an error code immediately to the
caller.  This breaks out of the low level retry loop, and puts control in the
hands of the (cancelable) retry logic in memexp_lock().  The memexpd (storage
server) load handler, Melload (src/locking/servers/memexp/memexp.c) will
fail for any of the following reasons:

  Mel_Err_BadSize    -- inappropriate size of data to be loaded
  Mel_Pbn_Space_Full -- lock storage is full, no room for new buffer

Results of an LSM lock operation, passed up from lsm_unlock() or __lsm_lock(),
to be analyzed by process_results():

struct lsm_status_s {
  uint8  success:1,                 /* successful load/store */
         state:2,                   /* new lock state */
         conv:1,                    /* conversion ?? */
         hvconv:1,                  /* have conversion ?? */
         extra:3;                   /* padding */
  uint8  full;                      /* 0-255 "meter" of lock storage fullness */
  uint16 live_holders;              /* # of alive clients holding lock */
  uint16 expired_holders;           /* # of dead clients holding lock */
  uint8  cid_list[MAX_CLIENTS / 8]; /* bitmap indicating live holders */
};
typedef struct lsm_status_s lsm_status_t;


LVB Operations:
--------------

lvbops.c provides the memexp_ops{} calls used by the filesystem's G-Lock
layer to query and change contents of LVBs.  Each and every LVB is attached
to a lock, so the lock must exist before creating an LVB to attach to it.
These calls are all implemented in lvbops.c:

memexp_hold_lvb   -- If LVB exists already (attached to the lock specified in
                       the call), simply return a pointer to the LVB's
                       descriptor (no load/store needed).
                     If not, allocate and init an lm_lvb_t (LVB descriptor)
                       structure and an lvb data buffer for our own use, and
                       attach the LVB to the lock (load lock from lock storage,
                       add this node to lock's lvb bitmap, store it again).
                       Load the LVB buffer from lock storage.  If not in use
                       by another node, zero-init it and store it again.
                       Copy LVB contents to our lvb data buffer.

memexp_unhold_lvb -- Load the lock from lock storage (to get LVB status).
                     If other nodes hold the LVB, or LVB is permanent:
                       simply remove this node from lock's lvb bitmap,
                       free our LVB descriptor and data buffer.
                     If not, we can free the LVB buffer (and maybe the
                       lock itself) in lock store:
                       Set lvb_lock on lock, store it (prevent races).
                       Load LVB, store it with bst.inuse=0 to free it.
                       Remove our LVB hold in lock, clear lvb_lock.
                       Store lock (with bst.inuse=0 if we can free it).
                       Free our LVB descriptor and data buffer.

memexp_sync_lvb   -- store node's current LVB contents into lock storage.
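
As an outline of the memexp_hold_lvb flow above (paraphrase only, with
hypothetical helper names):

lm_lvb_t *hold_lvb_sketch(me_lock_t *mel)
{
    if (mel->lvb)                          /* LVB already attached to lock */
        return mel->lvb;

    mel->lvb = alloc_lvb_descriptor();     /* local lm_lvb_t + data buffer */

    /* attach: load the lock, set our bit in its lvb bitmap, store it back */
    add_self_to_lvb_bitmap(mel);

    /* load the LVB buffer itself from lock storage; if no other node is
       using it yet, zero-initialize it and store it back, then copy its
       contents into our local data buffer                                 */
    load_or_create_lvb_buffer(mel);

    return mel->lvb;
}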



Appendix A:  memexp Lock Protocol Mount Sequence
------------------------------------------------

This summarizes the sequence of events found in memexp_mount() (moduleops.c),
called by the lock harness when mounting the memexp lock protocol on this node.

1)  Copies name of cidev ("table_name", for cluster member table?  also
    called "lock space").  Checks list of memexp instances on this node to
    make sure this cidev is not already in use by another instance.

2)  Allocates memory for memexp_t structure, the memexp "superblock" structure,
    the pointer to which is usually seen as "mep" in memexp code.
    Zeroes all, and initializes G-Lock callback pointer cb (note that this
    is *not* the inter-node callback socket, see step 4, below), and
    filesystem data pointer (identifies filesystem instance), fsdata.

3)  Opens the cidev.  Reads the info on all potential cluster members from the
    cidev via init_cluster_info() (cluster_info.c).  This also discovers
    *this* machine's identity (e.g. the journal id for this machine), and fills
    in some members of the mep structure.  Calls show_cluster_info()
    (cluster_info.c) to print (to console) info on all potential members.

4)  Opens callback socket for inter-machine callback LAN communications, using
    open_cb_socket() (callbacks.c).  Note that this is *not* the callback
    interface between the locking module and the G-Lock layer (see step 2,
    above, and ogfs-locking document), although all callbacks received through
    this socket are forwarded through that interface to the G-Lock layer.

5)  Determines the lock storage server (SCSI DMEP, TCP memexpd, or debug,
    which looks like it invokes both DMEP and memexpd).

6)  Initializes the connection to the lock storage server via call to
    ll_memexp_ops "init" function.

7)  Initializes Lock State Manager subsystem via lsm_init() (lock_state.c).

8)  Discovers our mount status (are we first to mount?), and blocks until
    we're clear to mount, via check_mount_state() (mount_state.c).
    Fills in *first parameter, to be returned to lock harness and filesystem.

9)  Initializes Client State Management, including writing our client info
    into lock storage, via client_state_init() (client_state.c).

10) Starts Lock Callback Thread, memexp_callback(), to support inter-machine
    LAN callbacks (has nothing to do with callbacks from lock module to G-Lock
    layer).

11) Starts Lock Refresh (heartbeat) Thread, lock_refresh() (daemon.c).  This
    thread periodically calls client_refresh (client_state.c) to increment
    the heartbeat counter stored in the CIDBUF for this client.

12) Starts Expiration Monitor Thread, node_monitor() (daemon.c).  This thread
    periodically calls reread_client_state() (client_state.c) to read the
    states (CIDBUFs) of all active clients from lock storage, and
    check_expired_clients() to check heartbeat counters for each one.

13) Adds this mep (memexp superblock) pointer to list of instances of memexp
    on this node.

14) Fills in *jid parameter, to be returned to lock harness and filesystem.



Appendix B:  "memexp:  asking for lock (4, 184) action 3 from 0"
-------------------------------------------------------------------------------

This very popular error message occurs in kick_holders() (lockops.c) when a
node is trying to acquire a lock from another node, and it's taking more than
one attempt to do so.

The example above is interesting.  It actually occurred, and indicates that
something is wrong, as you will see below.

The numbers in parentheses () are (lock type, lock number).  Combined, these
numbers are the lock "name".  Lock types are defined in
opengfs/src/include/lm_interface.h:

#define LM_TYPE_RESERVED       (0x00)
#define LM_TYPE_CIDBUF         (0x01)
#define LM_TYPE_MOUNT          (0x02)
#define LM_TYPE_NONDISK        (0x03)
#define LM_TYPE_INODE          (0x04)
#define LM_TYPE_RGRP           (0x05)
#define LM_TYPE_META           (0x06)
#define LM_TYPE_NOPEN          (0x07)
#define LM_TYPE_FLOCK          (0x08)
#define LM_TYPE_PLOCK          (0x09)
#define LM_TYPE_PLOCK_HEAD     (0x0A)
#define LM_TYPE_LVB_MASK       (0x80)

In the example (4, 184), this is a lock for an inode type, at sector 184.
Note that the lock number relates to a 512-byte *sector*, not a filesystem
block (OGFS glock.c converts fs blocks to sectors when creating the lock
number).  Typical block size of 4096 requires us to calculate:

block # = (184 / 8) = 23

So, we are trying to grab an inode lock, for the inode on block 23.

The "action" is the requested transition from one lock state to another.
The transitions are defined in opengfs/src/locking/modules/memexp/memexp.h:

#define ME_ACT_NOP              (0)
#define ME_ACT_U2E		(1)
#define ME_ACT_U2D		(2)
#define ME_ACT_U2S		(3)
#define ME_ACT_E2D		(4)
#define ME_ACT_E2S		(5)
#define ME_ACT_E2U		(6)
#define ME_ACT_D2E		(7)
#define ME_ACT_D2S		(8)
#define ME_ACT_D2U		(9)
#define ME_ACT_S2E		(10)
#define ME_ACT_S2D		(11)
#define ME_ACT_S2U		(12)
#define ME_ACT_X2U		(13)

The example's action, 3, is "unlocked to shared" according to table above.

The example's last number, 0, is the cluster ID of the node that currently
holds the lock.

Here's the example again:

memexp:  asking for lock (4, 184) action 3 from 0

So, this node is requesting a shared lock on the inode on block 23, from a
node that holds it in unlocked state.  Something is wrong with the scenario,
since an unlocked lock should have *no* holders; that's why the error message
appeared!
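
For decoding such a message by hand, here is a tiny standalone helper using
the example's numbers and the tables above (illustration only, not part of
OpenGFS):

#include <stdio.h>

int main(void)
{
    unsigned int type = 4, number = 184, action = 3, holder_cid = 0;
    unsigned int fs_block = number / 8;   /* 512-byte sectors -> 4096-byte blocks */

    printf("lock type %u (LM_TYPE_INODE), fs block %u,\n", type, fs_block);
    printf("action %u (ME_ACT_U2S), held by cid %u\n", action, holder_cid);
    return 0;
}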