OpenDLM_Design_and_Implementation (May 26 2004)



      Design of OpenDLM Lock Module for OpenGFS' G-lock  (V0.03)

Authors:
  Stanley Wang
  Ben Cahill

This document describes the implementation details of the "opendlm" lock module
that adapts OpenDLM for use as the lock manager for OpenGFS' g_lock layer.

You can find corresponding code within the OpenGFS source tree at:
src/locking/modules/opendlm/.

This lock module is based in part on the "memexp" lock module.  See
"ogfs-locking" and "ogfs-memexp" for more information on the g_lock layer
and memexp.


1. How to implement all "lm_lockops".
-------------------------------------

Fields in struct lm_lockops:

1.1  mount
----------

Do almost the same things as the memexp lock module's mount function:

1).  Calculate a 32-bit hash of the cluster information device
	(cidev) name (also known as table_name or lockspace name)
	to use as the ID of an ogfs instance.  This hash will be used
	as part of the name of each and every lock grabbed by this
	filesystem.

2).  Get cluster configuration info from OpenDLM (memexp reads the
	cluster information device "cidev" for this info, but we can
	get the info directly from OpenDLM, and not use cidev).
	Info includes total number of configured cluster member nodes
	(as configured in OpenDLM's /etc/dlm.conf file), and the node
	ID of *this* node.

3).  Initialize deadman locks.  There will be one deadman lock for
	each configured cluster member node (whether active or not).

4).  Return the lockspace structure, journal ID, and first-to-mount
	indication (see below) to OpenGFS.
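
For illustration, here is a minimal sketch of steps 1) and 4), assuming a
simple djb2-style string hash (any stable 32-bit hash of the table name
would do, as long as every node computes the same value) and using the
"struct opendlm" instance structure shown below under "lockspace":

/* Sketch only: hash the table name and derive the journal ID. */
static uint32 ogfs_opendlm_hash(const char *name)
{
        uint32 hash = 5381;             /* djb2-style string hash */
        unsigned char c;

        while ((c = *name++) != 0)
                hash = (hash << 5) + hash + c;  /* hash * 33 + c */
        return hash;
}

static void ogfs_opendlm_mount_ids(struct opendlm *dlm,
                                   const char *table_name,
                                   unsigned int odlm_node_id)
{
        /* 1) 32-bit hash of the table name becomes the lock namespace,
         *    embedded in the name of every lock this filesystem grabs */
        dlm->name_space = ogfs_opendlm_hash(table_name);

        /* 4) OpenDLM node IDs run 1..n; OGFS journal IDs run 0..n-1 */
        dlm->jid = odlm_node_id - 1;
}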

lockspace: 
	OpenDLM itself has only a single instance, shared between
	user space apps and kernel clients such as this lock module.  However,
	we need to support multiple instances of the opendlm *lock
	module*, one for each OpenGFS filesystem mounted on this node.

	The "lockspace" structure denotes a single lock module instance.
	This structure is private to the lock module, but gets passed
	around by the OGFS filesystem code, so that it obtains locks
	from the correct instance.

	Get the lockspace based on the cidev (because different ogfs
	instances use different cidevs).  Record lockspace and node
	information (such as the callback cb, fsdata, etc.) in a private
	data struct.

	Following is the definition of the private lockspace (instance)
	structure:

	struct opendlm {
		struct list_head        list;
		char                    table_name[256];
		uint32                  name_space;  /* hash value of table_name */
		lm_callback_t           cb;          /* callback into the g_lock layer */
		lm_fsdata_t             *fsdata;     /* opaque data passed back with cb */

		struct file             *cfile;
		unsigned int            jid;         /* journal ID (OpenDLM node ID - 1) */
		unsigned int            nodes_count; /* number of configured cluster nodes */
		struct opendlm_lock     *mount_lock; /* module-private mount lock */
		struct list_head        deadman;     /* deadman locks, one per node */
	};



ls_jid: 
	OpenGFS journal ID.  Corresponds to the OpenDLM node ID, but is
	offset by one (OpenDLM range 1 - n, OpenGFS range 0 - (n-1)).
	Get it from OpenDLM, and perform the offset.

ls_first: 
	Determined by using deadman locks.  Only the first node to mount
	this instance of the filesystem will succeed in immediately
	grabbing *all* deadman locks (one for each configured node)
	for this lock space, thus detecting that it is, indeed, the
	first-to-mount.

	The first-to-mount node grabs the mount lock, blocking all
	other nodes from mounting until OGFS calls "others_may_mount",
	after replaying all journals. 

	Other nodes grab the mount lock for only as long as it takes
	to determine that they are not first-to-mount.

	Note that this mount lock is private within the opendlm lock module.
	OpenGFS never sees this lock, even though it grabs a (different)
	mount lock itself.
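
Here is a rough sketch of the first-to-mount test.  try_lock_deadman() and
unlock_deadman() are hypothetical helpers that would wrap the OpenDLM
request for node N's deadman lock in EX mode with LKM_NOQUEUE set (fail
instead of waiting); the interaction with the module-private mount lock is
not shown:

/* Sketch: we are first-to-mount only if we can immediately grab the
 * deadman lock of *every* configured node. */
static int ogfs_opendlm_is_first(struct opendlm *dlm)
{
        unsigned int node;

        for (node = 1; node <= dlm->nodes_count; node++) {
                if (try_lock_deadman(dlm, node) != 0) {
                        /* Some other node already holds its own deadman
                         * lock, so it mounted before us.  Back out the
                         * deadman locks we did manage to grab. */
                        while (--node >= 1)
                                unlock_deadman(dlm, node);
                        return 0;
                }
        }
        return 1;       /* grabbed them all: first-to-mount */
}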


1.2  others_may_mount
---------------------

Release the mount lock.  Called only by first-to-mount node.


1.3  unmount
------------

Clean up all data structures. 


1.4  get_lock 
-------------

Allocate and initialize a private lock struct.  See section 2, below.

We do not allocate an LVB descriptor here, although there is a buffer for
LVB data included in the private lock struct's lksb member.  See hold_lvb().


1.5  put_lock
-------------

Clean and free a private lock struct.  The g_lock layer is responsible
for making sure that the lock struct is no longer needed (i.e. lock is
unlocked, and there is no hold on LVB, etc).

The g_lock layer calls this when:

-- re-using a free glock structure sitting in its not-held cache
-- periodically (every 1 minute?) cleaning up lock structures from its caches
-- cleaning up all locks just before unmounting.

Note:  While this should never be called before the unlock() call for this
lock, it may be called immediately after the unlock() call, before OpenDLM has
finished its processing (i.e. before the AST occurs).

Q:  Is there any need to keep our private lock struct around until the AST???
If so, then the AST will need to be responsible for de-allocating it ...

A:  The lock must be unlocked, and put_lock() must have been called, before
de-allocating it.  Therefore, we need a usage count that gets incremented in:

-- get_lock()
-- lock() (only for new locks, not lock conversions)

and decremented in:

-- put_lock()
-- unlock AST (i.e. only for unlocks, not lock conversions)

If either put_lock() or the unlock AST find the usage count to be zero, they
will be responsible for de-allocating our private lock structure and LVB
descriptor.
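
A minimal sketch of this usage-count rule follows.  It assumes an added
reference count field (here an atomic_t named "count") in the private lock
structure; the helper names are illustrative only:

/* Sketch: reference counting of the private lock struct. */
static void opendlm_lock_get(struct opendlm_lock *lp)
{
        atomic_inc(&lp->count);         /* get_lock(), or lock() of a new lock */
}

static void opendlm_lock_put(struct opendlm_lock *lp)
{
        /* put_lock(), or the unlock AST (not conversion ASTs) */
        if (atomic_dec_and_test(&lp->count)) {
                if (lp->lvb)
                        kfree(lp->lvb); /* LVB descriptor, if any */
                kfree(lp);              /* the private lock struct itself */
        }
}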


1.6  lock
---------

Do the lock operation.

	Lock state translation, OpenGFS -> OpenDLM:
		LM_ST_UNLOCKED	->	NL
		LM_ST_EXCLUSIVE	->	EX
		LM_ST_SHARED	->	PR
		LM_ST_DEFERRED	->	CW (supported, but not used by OpenGFS)

	Lock flag translation:
		LM_FLAG_TRY	->	LKM_NOQUEUE
		LM_FLAG_NOEXP	->	grant lock to OGFS during lock recovery

	Note: 	Shared locks are treated as non-persistent.
		Exclusive locks are treated as persistent (orphan).
		See section 2. OGFS Journal and ODLM Lock Recovery
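
The translation could look roughly as follows.  The LKM_* mode constant
names are assumed to follow OpenDLM's usual naming and should be checked
against the OpenDLM headers:

/* Sketch: OGFS lock state -> OpenDLM lock mode. */
static int ogfs_state_to_dlm_mode(unsigned int state)
{
        switch (state) {
        case LM_ST_UNLOCKED:    return LKM_NLMODE;      /* NL */
        case LM_ST_EXCLUSIVE:   return LKM_EXMODE;      /* EX */
        case LM_ST_SHARED:      return LKM_PRMODE;      /* PR */
        case LM_ST_DEFERRED:    return LKM_CWMODE;      /* CW */
        default:                return -1;              /* should not happen */
        }
}

/* Sketch: OGFS request flags -> OpenDLM request flags.  LM_FLAG_NOEXP is
 * handled inside the lock module (see section 2.3.3), not passed on. */
static int ogfs_flags_to_dlm_flags(unsigned int lm_flags)
{
        int dlm_flags = 0;

        if (lm_flags & LM_FLAG_TRY)
                dlm_flags |= LKM_NOQUEUE;       /* fail rather than queue */
        return dlm_flags;
}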

When locking, we must decide whether to convert a pre-existing lock, or
create a new one.

Increment usage count only when creating a new lock.

Some historical notes:  We considered using the LKM_ORPHAN flag to cause
EXclusive (write) locks to be treated as "persistent", that is, they stay
"held" (in effect) by a node even after that node dies.  However, OpenDLM
supports the ORPHAN behavior only for dead client applications, but not for
dead nodes.  See section 2, OGFS Journal and ODLM Lock Recovery, for
discussion of how we handle persistence of dead node's locks.


1.7  unlock
-----------

Do the unlock operation.

Because of the need to preserve LVB data within OpenDLM, we may need to convert
a lock to NULL mode, instead of completely unlocking it.  This will occur if
the client still has an interest in the LVB.  Unlocking can occur before the
client has lost interest in the LVB!

Interest in the LVB is indicated by the "lvb_hold" member of the lock private
data structure, which must be incremented by hold_lvb(), and decremented by
unhold_lvb().

Decrement the usage count in the unlock AST (but not when doing a conversion,
which will *not* use the unlock AST).
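
In outline, the unlock decision could look like this.  dlm_convert_to_null()
and dlm_unlock_request() are placeholders for the real OpenDLM conversion and
unlock calls, and lvb_hold is the LVB interest counter described under
hold_lvb()/unhold_lvb() (it is not shown in the section 3 struct listing):

/* Sketch: unlock, or down-convert to NL to preserve the LVB. */
static void ogfs_opendlm_unlock(struct opendlm_lock *lp)
{
        if (lp->lvb && lp->lvb_hold > 0) {
                /* client still cares about the LVB: keep the lock in NL
                 * mode so OpenDLM retains (and publishes) the LVB data */
                dlm_convert_to_null(lp);
        } else {
                /* no LVB interest: really unlock; the unlock AST will
                 * drop the usage count and may free the private struct */
                dlm_unlock_request(lp);
        }
}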


1.8  reset
----------

Called by g_lock when it encounters an error from us (lock module) when
locking or unlocking a lock.

Forcefully (use LKM_FORCE flag) do the unlock operation. 

Decrement the usage count in the unlock AST.


1.9  cancel
-----------

Cancel (use LKM_CANCEL flag) a lock/convert request.

Decrement the usage count when cancelling a new lock (but not when cancelling a
conversion ... HOW can we tell which it is??).


1.10  hold_lvb (lock value block)
--------------

If not already present, alloc an LVB descriptor and attach it to our private
lock structure.  Return a pointer to the LVB descriptor.  The LVB data will be
(but is not yet) contained in the lock status block (lksb) defined by OpenDLM,
and already allocated by get_lock() as part of the private lock structure.

The g_lock layer calls hold_lvb() after allocating a private lock structure
(via get_lock()), but before actually locking a lock (via lock()).
The client should not try to access (read or write) the LVB data before locking
the lock.  An audit shows that OpenGFS is well-behaved in this regard (but
there is no real protection against bad behavior).  The client can write LVB
data only if it holds the lock in EXclusive mode.

The memexp implementation of hold_lvb actually reads the LVB data from the
lock storage.  However, we cannot do this here, as OpenDLM makes the data
available to us only as part of a locking operation (which the client hasn't
asked for yet).  Fortunately, OpenGFS never tries to read or write LVB data
before grabbing the lock.  However, there is no check that would prevent
the client from doing so, since we simply return a pointer that allows the
client to access our (currently invalid) LVB buffer.

The LVB data will be updated to OpenDLM (to be visible to other cluster
members) only when we release our EXclusive lock, either by down-converting
or by totally unlocking.

The g_lock layer keeps locks in a cache, and doesn't release them until 5
minutes after their last use by the filesystem.  This may occur *before*
the filesystem code releases its hold on the LVB!  We must make sure that we
don't totally unlock the lock before the client releases its hold on the LVB
(via unhold_lvb()).  Our unlock() operation must detect this, and either:

-- unlock, if there is no LVB hold.

-- down-convert to a NULL lock, if there is an LVB hold.

hold_lvb() increments the lvb_hold value, to indicate our client's interest
in the LVB.  unhold_lvb() decrements the lvb_hold value.

NOTE:  If we down-convert to NULL mode, we will need to later unlock the
lock after client gives up interest in LVB.  unhold_lvb() should check for
NULL lock, and unlock it.

Specify LKM_VALBLK for locks with LVBs, as indicated by non-NULL lvb member
of private lock structure.

Note: The default size of an LVB in OpenDLM was 16 bytes, but we have changed
it (within OpenDLM) to 32 bytes, to accommodate the 32-byte OpenGFS LVB size.


1.11  unhold_lvb
----------------

The g_lock layer calls this when the client (filesystem) is no longer interested
in reading or writing the LVB data.

Unfortunately, this may occur before the client unlocks its EXclusive lock
(i.e. before we actually write our LVB data to OpenDLM).

The memexp implementation destroys its in-core LVB buffer and descriptor now.
However, we must keep these until we get a chance to update the LVB
data to OpenDLM.  Therefore, we will wait to destroy them in put_lock()
or during the final unlock AST.

If unhold_lvb() is called after the lock is unlocked, unlock() will have
down-converted the lock to NULL mode, to preserve the LVB data within OpenDLM.
If so, we must *unlock* the lock here!!

memexp supports a "permanent" LVB, that is, one that persists even though
no node holds a lock on the LVB's resource.  However, I haven't found anywhere
that OpenGFS or the g_lock layer requests a permanent LVB.
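
A sketch of the hold/unhold pairing follows.  The lvb_hold counter, the
lock_is_null() test, and dlm_unlock_request() are illustrative names, not
actual code; error handling and descriptor initialization (pointing the
descriptor at the LVB buffer inside the lksb) are omitted:

/* Sketch: track client interest in the LVB. */
static lm_lvb_t *ogfs_opendlm_hold_lvb(struct opendlm_lock *lp)
{
        if (!lp->lvb) {
                lp->lvb = kmalloc(sizeof(lm_lvb_t), GFP_KERNEL);
                if (!lp->lvb)
                        return NULL;
                memset(lp->lvb, 0, sizeof(lm_lvb_t));
                /* descriptor setup omitted from this sketch */
        }
        lp->lvb_hold++;
        return lp->lvb;
}

static void ogfs_opendlm_unhold_lvb(struct opendlm_lock *lp)
{
        if (--lp->lvb_hold == 0 && lock_is_null(lp)) {
                /* unlock() earlier down-converted to NL just to preserve
                 * the LVB; nothing needs it now, so really unlock */
                dlm_unlock_request(lp);
        }
}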


1.12  sync_lvb
--------------

The memexp implementation writes the LVB data to lock storage.  Unfortunately,
we cannot do that now, as OpenDLM allows updates of LVB data only when
unlocking or down-converting an EXclusive lock, which has not occurred yet
(the client must hold an EX lock when calling sync_lvb()).

No other node can read or write the LVB data until we have released our
EXclusive lock, however.  So, as long as we keep our in-core copy of LVB data
for our own reference, and update the OpenDLM LVB during unlock(), this should
be okay (i.e. other nodes will access the data that we wrote).


1.13  reset_exp
---------------

Called just after OpenGFS replays a journal (regardless of why the journal
was replayed, i.e. when mounting the filesystem or when recovering a dead node).

This call notifies the lock module that journal recovery is complete, so the
lock module can re-start the flow of locks to OpenGFS, and coordinate the
other nodes to do the same.  See section 2.3.4, When to Stop Stashing Locks?


1.14  Fields in struct lm_lockstruct
------------------------------------

*ls_jid
See previous section.

*ls_first
See previous section.

*ls_lockspace
See previous section.

*ls_ops
See previous section.


2.  OGFS Journal and ODLM Lock Recovery
---------------------------------------

2.1  OGFS Journal Recovery
--------------------------

OGFS journal recovery involves replaying the journal of a node that has died.
A surviving node plays back the journal, and writes to disk any metadata that
had been scheduled to be written to disk, but failed to happen because the
expired node died.

OGFS replays a journal when:

-- mounting the filesystem
-- another node dies while the filesystem is mounted

Normally, the replay at mount time should *not* cause any metadata writes,
because the metadata should have been flushed to disk, and the journal
emptied, when the filesystem was previously, successfully, unmounted.

It's likely, though, that node death will result in metadata writes to disk,
if the node was involved in writing (including moving, renaming, or removing)
items on the disk at its time of death.

In a multi-node cluster, another node will replay the journal of a dead node,
perhaps finishing before the dead node reboots and remounts the filesystem.
In a single-node system, the mount time replay serves as the node death replay,
since this is the first time that the one node gets access to the journal and
filesystem since time of death.


2.1.1.  Special locks used during OGFS Mount and Recovery
---------------------------------------------------------

OGFS assumes that EXclusive (write) locks held by dead nodes will persist after
the node's death, and continue to protect the object being written by the dead
node.  While this is the behavior of OGFS' legacy "memexp" lock protocol, this
is not true with OpenDLM, which cleans *all* locks owned by a dead node (even
if marked "ORPHANable").  The discussion below describes OGFS' expectations;
our job will be to simulate this behavior within the opendlm lock module.

Certain special locks are requested by OpenGFS with the NOEXP (no expired) flag.
This flag means that the lock should be granted to this node even if it is 
blocked by an orphan EX lock held by a dead ("expired") node.  This flag is
used only for locks used for recovery/mount operations:

OpenGFS Mount lock -- EX, *not* the same as the OpenDLM lock module Mount lock.
	If expired EX, the dead node was mounting the filesystem when it died.

OpenGFS Live lock -- a shared lock, so actually it would never need to persist
	after node death.  (and I haven't seen evidence of this protecting
	anything, anyway).

OpenGFS Transaction lock -- normally a shared lock, so multiple nodes can write
	data to disk simultaneously, but is grabbed EX by OpenGFS
	ogfs_recover_journal(), to keep any other nodes from writing to disk
	while a journal recovery is in progress.  If expired EX, the dead node
	was replaying its own or another node's journal when it died.

OpenGFS Journal locks -- EX, one lock for each journal.  This should always
	be expired EX when a node dies, because each node holds its own journal
	lock for the lifetime of the filesystem mount.

NOEXP overrides blocking by locks held by dead nodes, and allows the lock to be
granted to the requesting node.  NOEXP must *not* override locks held
legitimately by any (live) nodes.  That would violate the normal operation of
the locks.

The node doing journal recovery requires the transaction lock and the journal
lock for the recovering journal, both in EX mode.  No other nodes may write
to disk (by virtue of the transaction lock) while a journal recovery is going
on.  See OGFS function ogfs_recover_journal().


2.2  ODLM Lock Recovery
--------------------------

ODLM lock recovery involves cleaning out locks held by a node that died, while
retaining and re-distributing any valid information about the lockable resources
and locks held by surviving nodes.  Note that *no* locks owned by a dead node,
not even orphan locks, survive when a node dies.  Orphan locks survive only
client application deaths, not node deaths.

Lock recovery is controlled by the recovery state machine.  See the
"odlm-recovery" doc ("OpenDLM_Lock_Recovery") at:

http://opendlm.sourceforge.net/docs.php

The lock recovery process starts soon after receiving a message from the cluster
manager used by OpenDLM, indicating that a node has died.  During the recovery
process, OpenDLM does not service any new lock requests, but returns
DLM_DENIED_GRACE_PERIOD (the time spent in lock recovery is called the "grace
period"), so the requestor can try again.

The lock recovery process ends once all nodes have reached the final "barrier"
state, and have notified each other that they can all move back to "run"
mode.  As part of the last stage of recovery, the recovery process grants
locks that were blocked by the dead node.

The lock recovery process may restart if another node dies, so there can be more
starts than ends.  The recovery process is not recursive.  There is only one
recovery state machine, and only one state, for a given node.


2.3  Coordinating Journal and Lock Recovery
-------------------------------------------

We must protect the files that the dead node was writing, by not allowing
other nodes to grab locks on those files (either read/shared locks, since
they might read bad/stale information, or write/exclusive locks, since they
would clobber or be clobbered by the upcoming journal replay).

The protection must continue through the end of the journal replay.
Journal replay cannot start until OpenDLM has recovered, since journal replay
needs to grab a couple of important locks (transaction and journal locks),
which must be granted by OpenDLM, *after* lock recovery.  By that time,
OpenDLM will have "forgotten" all of the locks owned by the dead node, and
granted locks (those blocked by the dead node) to other waiting nodes.

So, the chain of events must be:

1)  Cluster manager detects dead node
2)  OpenDLM lock recovery mechanism begins
3)  "Something" protects OpenGFS
4)  OpenGFS lock module detects need for journal recovery, passes to OpenGFS
5)  OpenGFS requests EX transaction lock and dead node's journal lock
6)  OpenDLM finishes recovery, grants locks
7)  OpenGFS does journal recovery
8)  Things get back to normal


2.3.1.  "Something" -- Stashing granted locks
---------------------------------------------

The above requirements suggest that the lock module must intervene, accepting
the granted locks from OpenDLM during *lock* recovery, but *not* delivering
them to OpenGFS until after *journal* recovery, saving them on a list in the
meantime.  This simulates the effect of allowing the dead node's locks to
"persist", simply because OGFS on surviving nodes won't see that their lock
requests have been granted.  We'll call this non-delivery and saving on a
list "stashing".

Q:  Is there a possibility of deadlock here?? ... i.e. a node holds a shared
transaction lock, but can't finish its job (and release the transaction lock
so the node doing recovery can grab it EX) because it can't get, e.g. an inode
lock?? 

A:  No.  An audit of OpenGFS shows that the transaction lock never surrounds
the acquisition of another lock.  The inode lock would be obtained before the
transaction lock, so no deadlock would occur.  This is an important design
principle of OpenGFS.  Violating it could cause deadlocks, regardless of the
lock protocol.

To make life easy and safe, the lock module should stash *all* lock requests
that get granted between lock recovery start and journal recovery completion
(even though it would be acceptable to pass along some locks -- those that were
held in shared mode by the dead node).  This will effectively halt filesystem
activity across the cluster, until journal recovery completes.  An exception
for this will be made for locks requested "NOEXP".  See below.
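
One possible shape for this stashing logic, in the completion AST path where
the lock module would normally call back into the g_lock layer, is sketched
below.  The "stashing" flag, the stash list and its spinlock, the saved NOEXP
flag, and deliver_to_ogfs() (which stands in for the cb()/fsdata callback)
are all assumptions of this sketch:

/* Sketch: hold back granted locks during recovery. */
static void ogfs_opendlm_grant_ast(struct opendlm_lock *lp)
{
        struct opendlm *dlm = lp->dlm;

        spin_lock(&dlm->stash_lock);
        if (dlm->stashing && !lp->noexp) {
                /* journal recovery pending or in progress: hold this
                 * grant back from OpenGFS until reset_exp() */
                list_add_tail(&lp->stash_entry, &dlm->stash_list);
                spin_unlock(&dlm->stash_lock);
                return;
        }
        spin_unlock(&dlm->stash_lock);

        deliver_to_ogfs(dlm, lp);       /* normal case: pass grant to g_lock */
}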


2.3.2.  When to Start Stashing Locks?
-------------------------------------

So, how does the lock module determine when to start saving these granted
locks, holding them back from OpenGFS?  It should start when OpenDLM begins
its lock recovery.  Here are some possibilities for sensing when this occurs:

-- Lock module subscribes to membership service directly.  While this is a
   distinct possibility, it requires the lock module to be aware of which
   cluster manager is being used by OpenDLM.  This is a bit awkward, since
   different cluster managers have different interfaces, and is a duplication
   of code already within OpenDLM.

-- Lock module registers callback with OpenDLM to notify when recovery begins.
   Another distinct possibility, but requires an interface extension of
   OpenDLM to provide the callback hook, and coding effort within OpenDLM.

-- Lock module watches for DLM_NOTVALID status returned from lock requests.
   This requires the least work within OpenDLM, and is the method that has
   been used successfully in some long-standing applications that have used
   the DLM system for many years.

We've chosen to use DLM_NOTVALID.  This will require the lock module to use
Lock Value Blocks (LVBs) with each and every lock, whether OpenGFS uses the
LVB or not.  There is just a small overhead for this, since space for the LVB
data is a part of the status structure for each and every OpenDLM lock, anyway.

Caveat:  lock requests from the lock module must *not* use the LKM_INVVALBLK
flag.  This would create a false indication that the ODLM lock recovery
mechanism had started.  There's no need for OpenGFS nor the lock module to use
this flag, anyway, so this should not be a problem.

The OpenDLM lock recovery mechanism returns DLM_NOTVALID to any request for a
lock on a resource that *might have been* locked in EX mode by the dead node.
That is, it can indicate a "false positive" that a dead node held an EX lock:

-- false positive:  OpenDLM, during lock recovery, will mark a resource's LVB
	as invalid if it cannot prove to itself that a dead node, that was the
	resource master, did not own an EX lock on the resource.  It does this
	by looking for EX, PW, or CW locks owned by live nodes, which would
	have kept the dead node(s) from owning an EX or PW and writing to the
	LVB.  If it can't find a "live" EX/PW/CW lock, it marks the LVB
	invalid and clears the LVB data to all 0s (see OpenDLM's
	clear_value_block()), just to be safe, even though the dead node might
	not have actually owned an EX/PW.  See OpenDLM's clmr_lkvlb().

NOTE:  A false negative can also occur, but is not a problem, as long as
	once we start stashing locks, we continue stashing locks until
	journal recovery is complete:

-- false negative:  If the dead node was the directory node, resource master, and
	only owner of a lock, no other node would know about the resource,
	and all information about the lock would be lost.  So, a lock request
	from one of the surviving nodes would end up looking like a request on
	a new resource.  However, this situation can occur only *after* lock
	recovery; if the lock request is made before lock recovery, the
	requesting node would be aware of the directory node.  Either the
	request processing would complete (and the requesting node would
	know about the resource), or the request would be processed by the
	recovery process, resulting in DLM_NOTVALID.

So, while there seems to be some ambiguity in detecting, via DLM_NOTVALID,
whether a lock had been held by a dead node, it errs on the side of safety,
and can be a reliable indicator that lock recovery has started.

Therefore, as soon as the lock module sees a DLM_NOTVALID returned from a lock
request, it should start stashing granted locks, holding them in a list until
journal recovery completes.
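
In the same AST path, the trigger can be little more than a status check.
The lksb "status" field name is an assumption here; the real field name
should be taken from the OpenDLM headers:

/* Sketch: start stashing when OpenDLM reports DLM_NOTVALID, i.e. the
 * resource's LVB may have been lost with a dead node. */
static void ogfs_opendlm_check_notvalid(struct opendlm *dlm,
                                        struct opendlm_lock *lp)
{
        if (lp->lksb.status == DLM_NOTVALID) {
                spin_lock(&dlm->stash_lock);
                dlm->stashing = 1;      /* stays set until reset_exp() */
                spin_unlock(&dlm->stash_lock);
        }
}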

There are some granted-during-lock-recovery locks that might slip by, before
stashing begins.  These are locks that were held by the dead node in non-EX mode
(the lock module does not request any PW mode locks).  These locks would not
return DLM_NOTVALID status, and therefore would not trigger the stashing.
We don't care if these non-EX, i.e. shared, locks slip by; only *EX* locks
held by dead nodes need to "persist".

Since it will take some time to do the journal recovery, after the lock
recovery is complete, there is a possibility that OpenGFS could request, and
OpenDLM could grant, a request for a lock on a resource that was held by the
dead node in EX mode, but was forgotten about after recovery (see "false
positive" discussion).  To protect against this, stashing should continue once
it's started, and stash all locks, regardless of whether a lock's status is
DLM_NOTVALID or not, until journal recovery is complete.

It is possible that the dead node would have held no EX locks, or that
surviving nodes would not have requested locks on resources locked by the
dead node, in which case OpenDLM lock recovery would not grant any locks to
the surviving nodes.  No DLM_NOTVALIDs would be detected, and no node would
perform the journal recovery!  To cover this case, each node must request a
"deadman" lock for each of the other nodes (see ... ).


2.3.3.  Supporting NOEXP
------------------------

The NOEXP requests must be granted, as an exception to the "persistence"
behavior of a dead node's EX lock.  This means that NOEXP requests must *not*
be stashed; the lock module should pass granted NOEXP locks immediately to
OpenGFS.

This is safe; OpenDLM will not grant the lock request unless it is grantable,
so NOEXP will never override a lock that is held legitimately by a live node.


2.3.4.  When to Stop Stashing Locks?
------------------------------------

Locks may be once again passed to OpenGFS once journal recovery has occurred.
All nodes must wait until the recovery is complete, i.e. the node doing the
journal playback has finished.

This shall be coordinated through the use of deadman(?) locks, perhaps using
LVB data to communicate recovery status.  TBD.

The reset_exp call, from OpenGFS, tells the lock module when *this* node has
finished replaying a journal.  This should enable *this* node to deliver
the stashed locks to OpenGFS, and alert other nodes that they can do the
same (as long as the journal replayed by this node is the only one that
needs to be replayed! ... need to think about this).
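
A sketch of the corresponding flush, as reset_exp() might perform it on
*this* node, follows.  It reuses the hypothetical stash members from the
sketch in section 2.3.1; the cluster-wide coordination discussed above is
not shown:

/* Sketch: deliver stashed grants once journal recovery is done. */
static void ogfs_opendlm_reset_exp(struct opendlm *dlm)
{
        struct opendlm_lock *lp;

        spin_lock(&dlm->stash_lock);
        dlm->stashing = 0;              /* resume normal delivery */
        while (!list_empty(&dlm->stash_list)) {
                lp = list_entry(dlm->stash_list.next,
                                struct opendlm_lock, stash_entry);
                list_del(&lp->stash_entry);
                spin_unlock(&dlm->stash_lock);
                deliver_to_ogfs(dlm, lp);       /* pass held-back grant on */
                spin_lock(&dlm->stash_lock);
        }
        spin_unlock(&dlm->stash_lock);
}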


3. Private lock structure and lock name.
----------------------------------------

Following is the definition of the private lock structure manipulated by the
lock module, one structure for each lock:

struct opendlm_lock {
        struct opendlm_lockname lockname;
        struct lockstatus       lksb;	/* Lock status block for OpenDLM*/
        lm_lvb_t                *lvb;	/* Lock value block descriptor */
        struct opendlm          *dlm;	/* OpenDLM lock module instance */
        int                     done;	/* for the "sync" lock operations*/
};

Following is the definition of the OpenDLM lock name.  The most unique value,
the lock_num, is placed first in the structure, with the idea that searches
for a match of the name will fail earlier (be more compute efficient).
The magic number identifies the lock as being an OGFS lock (OpenDLM may
support other clients simultaneously with OGFS, and requires a unique name
across all clients in a cluster):

#define OGFS_DLM_MAGICLEN       4
#define OGFS_DLM_MAGIC          "OGFS"

struct opendlm_lockname {
        uint64                  lock_num;	/* OGFS lock number */
        uint32                  name_space;	/* lockspace hash */
        unsigned int            lock_type;	/* OGFS lock type */
        char                    magic[OGFS_DLM_MAGICLEN]; /* "OGFS" */
}__attribute__((packed));
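
For illustration, filling this name from an OGFS lock identity might look
like the sketch below.  The lm_lockname field names (ln_number, ln_type)
are assumed from the g_lock interface and should be checked against the
OpenGFS headers:

/* Sketch: build the OpenDLM lock name for one OGFS lock. */
static void ogfs_opendlm_make_name(struct opendlm *dlm,
                                   const struct lm_lockname *name,
                                   struct opendlm_lockname *out)
{
        out->lock_num   = name->ln_number;      /* most unique field first */
        out->name_space = dlm->name_space;      /* per-filesystem hash */
        out->lock_type  = name->ln_type;
        memcpy(out->magic, OGFS_DLM_MAGIC, OGFS_DLM_MAGICLEN);

        /* the packed struct is then handed to OpenDLM as an opaque
         * byte string: (char *)out, sizeof(*out) */
}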
