Design of OpenDLM Lock Module for OpenGFS' G-lock (V0.03)

Authors:  Stanley Wang, Ben Cahill

This document describes the implementation details of the "opendlm" lock
module that adapts OpenDLM for use as the lock manager for OpenGFS' g_lock
layer.  You can find corresponding code within the OpenGFS source tree at:

    src/locking/modules/opendlm/

This lock module is based in part on the "memexp" lock module.  See
"ogfs-locking" and "ogfs-memexp" for more information on the g_lock layer
and memexp.


1. How to implement all "lm_lockops".
-------------------------------------

Fields in struct lm_lockops:

1.1 mount
----------

Do almost the same things as the memexp lock module's mount function:

1). Calculate a 32-bit hash of the cluster information device (cidev) name
    (also known as table_name or lockspace name) to use as the ID of an
    ogfs instance.  This hash will be used as part of the name of each and
    every lock grabbed by this filesystem.

2). Get cluster configuration info from OpenDLM (memexp reads the cluster
    information device "cidev" for this info, but we can get the info
    directly from OpenDLM, and not use cidev).  Info includes the total
    number of configured cluster member nodes (as configured in OpenDLM's
    /etc/dlm.conf file), and the node ID of *this* node.

3). Initialize deadman locks.  There will be one deadman lock for each
    configured cluster member node (whether active or not).

4). Return the lockspace structure, journal ID, and first-to-mount
    indication (see below) to OpenGFS.

lockspace:

OpenDLM itself has only a single instance, shared between user space apps
and kernel clients such as ourselves.  However, we need to support multiple
instances of the opendlm *lock module*, one for each OpenGFS filesystem
mounted on this node.  The "lockspace" structure denotes a single lock
module instance.  This structure is private to the lock module, but gets
passed around by the OGFS filesystem code, so that it obtains locks from
the correct instance.

Get the lockspace based on the cidev (because different ogfs instances use
different cidevs).  Record lockspace and node information (such as the cb
callback, fsdata, etc.) in a private data struct.  Following is the
definition of the private lockspace (instance) structure:

    struct opendlm {
        struct list_head list;
        char table_name[256];
        uint32 name_space;                /* hash value of table_name */
        lm_callback_t cb;
        lm_fsdata_t *fsdata;
        struct file *cfile;
        unsigned int jid;
        unsigned int nodes_count;
        struct opendlm_lock *mount_lock;
        struct list_head deadman;
    };

ls_jid:

OpenGFS journal ID.  Corresponds to the OpenDLM node ID, but is offset by
one (OpenDLM range 1 - n, OpenGFS range 0 - (n-1)).  Get it from OpenDLM,
and perform the offset.

ls_first:

Determined by using deadman locks.  Only the first node to mount this
instance of the filesystem will succeed in immediately grabbing *all*
deadman locks (one for each configured node) for this lock space, thus
detecting that it is, indeed, the first-to-mount (see the sketch following
section 1.3).

The first-to-mount node grabs the mount lock, blocking all other nodes from
mounting until OGFS calls "others_may_mount", after replaying all journals.
Other nodes grab the mount lock for only as long as it takes to determine
that they are not first-to-mount.  Note that this mount lock is private
within the opendlm lock module.  OpenGFS never sees this lock, even though
it grabs a (different) mount lock itself.

1.2 others_may_mount
---------------------

Release the mount lock.  Called only by the first-to-mount node.

1.3 unmount
------------

Clean up all data structures.
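The first-to-mount detection described in section 1.1 might look roughly
like the sketch below.  The synchronous helpers odlm_lock_sync(),
odlm_try_lock_sync(), odlm_unlock_sync() and opendlm_deadman_lock(), and
the LKM_EXMODE macro name, are assumptions made for illustration only; they
are not the actual OpenDLM interfaces, and error handling is omitted.

    /*
     * Minimal sketch of first-to-mount detection (section 1.1).
     * Returns 1 if this node is first-to-mount, 0 if not, -1 on error.
     */
    static int opendlm_detect_first_mount(struct opendlm *dlm)
    {
        unsigned int i;
        int first = 1;

        /* Serialize mount-time probing behind the module-private mount lock. */
        if (odlm_lock_sync(dlm->mount_lock, LKM_EXMODE))
            return -1;

        /*
         * Try (non-blocking) to grab every deadman lock, one per configured
         * node, active or not.  Only the very first node to mount this
         * filesystem can get all of them at once.
         */
        for (i = 0; i < dlm->nodes_count; i++) {
            struct opendlm_lock *dman = opendlm_deadman_lock(dlm, i);

            if (odlm_try_lock_sync(dman, LKM_EXMODE)) {
                first = 0;    /* another node already holds one: not first */
                break;
            }
        }

        if (!first)
            /* Not first-to-mount: give up the mount lock right away. */
            odlm_unlock_sync(dlm->mount_lock);

        /*
         * If first, keep holding the mount lock; it is released later, in
         * others_may_mount(), after OpenGFS has replayed all journals.
         * (What to do with deadman locks probed successfully is left out
         * of this sketch.)
         */
        return first;
    }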
1.4 get_lock
-------------

Allocate and initialize a private lock struct.  See section 3, below.

We do not allocate an LVB descriptor here, although there is a buffer for
LVB data included in the private lock struct's lksb member.  See hold_lvb().

1.5 put_lock
-------------

Clean and free a private lock struct.  The g_lock layer is responsible for
making sure that the lock struct is no longer needed (i.e. the lock is
unlocked, and there is no hold on the LVB, etc.).  The g_lock layer calls
this when:

-- re-using a free glock structure sitting in its not-held cache
-- periodically (every 1 minute?) cleaning up lock structures from its caches
-- cleaning up all locks just before unmounting.

Note:  While this should never be called before the unlock() call for this
lock, it may be called immediately after the unlock() call, before OpenDLM
has finished its processing (i.e. before the AST occurs).

Q:  Is there any need to keep our private lock struct around until the
    AST???  If so, then the AST will need to be responsible for
    de-allocating it ...

A:  The lock must be unlocked, and put_lock() must have been called, before
    de-allocating it.  Therefore, we need a usage count that gets
    incremented in:

    -- get_lock()
    -- lock() (only for new locks, not lock conversions)

    and decremented in:

    -- put_lock()
    -- unlock AST (i.e. only for unlocks, not lock conversions)

    If either put_lock() or the unlock AST finds the usage count to be
    zero, it is responsible for de-allocating our private lock structure
    and LVB descriptor (see the sketch following section 1.9).

1.6 lock
---------

Do the lock operation.

Lock state translation, OpenGFS -> OpenDLM:

    LM_ST_UNLOCKED  -> NL
    LM_ST_EXCLUSIVE -> EX
    LM_ST_SHARED    -> PR
    LM_ST_DEFERRED  -> CW (supported, but not used by OpenGFS)

Lock flag translation:

    LM_FLAG_TRY   -> LKM_NOQUEUE
    LM_FLAG_NOEXP -> grant lock to OGFS during lock recovery

Note:  Shared locks are treated as non-persistent.  Exclusive locks are
treated as persistent (orphan).  See section 2, OGFS Journal and ODLM Lock
Recovery.

When locking, we must decide whether to convert a pre-existing lock, or
create a new one.  Increment the usage count only when creating a new lock.

Some historical notes:  We considered using the LKM_ORPHAN flag to cause
EXclusive (write) locks to be treated as "persistent", that is, they stay
"held" (in effect) by a node even after that node dies.  However, OpenDLM
supports the ORPHAN behavior only for dead client applications, not for
dead nodes.  See section 2, OGFS Journal and ODLM Lock Recovery, for
discussion of how we handle persistence of a dead node's locks.

1.7 unlock
-----------

Do the unlock operation.

Because of the need to preserve LVB data within OpenDLM, we may need to
convert a lock to NULL mode, instead of completely unlocking it.  This will
occur if the client still has an interest in the LVB.  Unlocking can occur
before the client has lost interest in the LVB!  Interest in the LVB is
indicated by the "lvb_hold" member of the lock private data structure,
which must be incremented by hold_lvb(), and decremented by unhold_lvb().

Decrement the usage count in the unlock AST (but not when doing a
conversion, which will *not* use the unlock AST).

1.8 reset
----------

Called by g_lock when it encounters an error from us (the lock module) when
locking or unlocking a lock.

Forcefully (use the LKM_FORCE flag) do the unlock operation.  Decrement the
usage count in the unlock AST.

1.9 cancel
-----------

Cancel (use the LKM_CANCEL flag) a lock/convert request.  Decrement the
usage count when cancelling a new lock (but not when cancelling a
conversion ... HOW can we tell which it is??).
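To make the mode translation (section 1.6) and the usage count rules
(sections 1.5 - 1.9) concrete, here is a minimal sketch.  Only the mapping
and the increment/decrement rules come from the sections above; the
LKM_*MODE macro names and the atomic_t "count" field in struct opendlm_lock
are assumptions for illustration.

    /*
     * Sketch of the OpenGFS -> OpenDLM lock state translation (section 1.6).
     */
    static int opendlm_mode(unsigned int lm_state)
    {
        switch (lm_state) {
        case LM_ST_UNLOCKED:  return LKM_NLMODE;  /* NL */
        case LM_ST_EXCLUSIVE: return LKM_EXMODE;  /* EX */
        case LM_ST_SHARED:    return LKM_PRMODE;  /* PR */
        case LM_ST_DEFERRED:  return LKM_CWMODE;  /* CW, unused by OpenGFS */
        default:              return -1;          /* caller treats as error */
        }
    }

    /*
     * Usage count rules:  incremented in get_lock() and in lock() (new
     * locks only), decremented in put_lock() and in the unlock AST.
     * Whoever drops the count to zero frees the private lock structure and
     * its LVB descriptor.  The atomic_t "count" field is an assumed
     * addition to struct opendlm_lock (section 3); kfree() needs
     * <linux/slab.h>.
     */
    static void opendlm_put_ref(struct opendlm_lock *lp)
    {
        if (atomic_dec_and_test(&lp->count)) {
            if (lp->lvb)
                kfree(lp->lvb);    /* LVB descriptor, if one was held */
            kfree(lp);
        }
    }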
1.10 hold_lvb (lock value block)
---------------------------------

If not already present, allocate an LVB descriptor and attach it to our
private lock structure.  Return a pointer to the LVB descriptor.  The LVB
data will be (but is not yet) contained in the lock status block (lksb)
defined by OpenDLM, and already allocated by get_lock() as part of the
private lock structure.

The g_lock layer calls hold_lvb() after allocating a private lock structure
(via get_lock()), but before actually locking a lock (via lock()).  The
client should not try to access (read or write) the LVB data before locking
the lock.  An audit shows that OpenGFS is well-behaved in this regard (but
there is no real protection against bad behavior).  The client can write
LVB data only if it holds the lock in EXclusive mode.

The memexp implementation of hold_lvb actually reads the LVB data from the
lock storage.  However, we cannot do this here, as OpenDLM makes the data
available to us only as part of a locking operation (which the client
hasn't asked for yet).  Fortunately, OpenGFS never tries to read or write
LVB data before grabbing the lock.  However, there is no check that would
prevent the client from doing so, since we simply return a pointer that
allows the client to access our (currently invalid) LVB buffer.

The LVB data will be updated to OpenDLM (to be visible to other cluster
members) only when we release our EXclusive lock, either by down-converting
or by totally unlocking.  The g_lock layer keeps locks in a cache, and
doesn't release them until 5 minutes after their last use by the
filesystem.  This may occur *before* the filesystem code releases its hold
on the LVB!  We must make sure that we don't totally unlock the lock before
the client releases its hold on the LVB (via unhold_lvb()).  Our unlock()
operation must detect this, and either:

-- unlock, if there is no LVB hold.
-- down-convert to a NULL lock, if there is an LVB hold.

hold_lvb() increments the lvb_hold value, to indicate our client's interest
in the LVB.  unhold_lvb() decrements the lvb_hold value.  (See the sketch
following section 1.11.)

NOTE:  If we down-convert to NULL mode, we will need to unlock the lock
later, after the client gives up interest in the LVB.  unhold_lvb() should
check for a NULL lock, and unlock it.

Specify LKM_VALBLK for locks with LVBs, as indicated by a non-NULL lvb
member of the private lock structure.

Note:  The default size of the LVB in OpenDLM was 16 bytes, but we have
changed it (within OpenDLM) to 32 bytes, to accommodate the 32-byte OpenGFS
LVB size.

1.11 unhold_lvb
----------------

The g_lock layer calls this when the client (filesystem) is no longer
interested in reading or writing the LVB data.  Unfortunately, this may
occur before the client unlocks its EXclusive lock (i.e. before we actually
write our LVB data to OpenDLM).  The memexp implementation destroys its
in-core LVB buffer and descriptor now.  However, we must keep these until
we get a chance to update the LVB data to OpenDLM.  Therefore, we will wait
to destroy them in put_lock() or during the final unlock AST.

If unhold_lvb() is called after the lock is unlocked, unlock() will have
down-converted the lock to NULL mode, to preserve the LVB data within
OpenDLM.  If so, we must *unlock* the lock here!!

memexp supports a "permanent" LVB, that is, one that persists even though
no node holds a lock on the LVB's resource.  However, I haven't found
anywhere that OpenGFS or the g_lock layer requests a permanent LVB.
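A minimal sketch of the unlock()/unhold_lvb() interplay described in
sections 1.7, 1.10 and 1.11.  The lvb_hold and cur_mode fields, and the
odlm_convert_sync()/odlm_unlock_sync() helpers, are illustrative
assumptions, not the real interfaces:

    static void opendlm_drop_lock(struct opendlm_lock *lp)
    {
        if (lp->lvb && lp->lvb_hold > 0) {
            /*
             * The client still holds the LVB:  down-convert to NL so that
             * OpenDLM keeps the value block, rather than unlocking
             * completely.
             */
            odlm_convert_sync(lp, LKM_NLMODE, LKM_VALBLK);
        } else {
            /* No LVB hold:  really unlock (usage count drops in the unlock AST). */
            odlm_unlock_sync(lp, LKM_VALBLK);
        }
    }

    static void opendlm_unhold_lvb(struct opendlm_lock *lp)
    {
        if (--lp->lvb_hold == 0 && lp->cur_mode == LKM_NLMODE) {
            /*
             * unlock() already down-converted to NL to preserve the LVB;
             * now that the client has lost interest, finish the unlock.
             */
            odlm_unlock_sync(lp, LKM_VALBLK);
        }
    }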
1.12 sync_lvb
--------------

The memexp implementation writes the LVB data to lock storage.
Unfortunately, we cannot do that now, as OpenDLM allows updates of LVB data
only when unlocking or down-converting an EXclusive lock, which has not
occurred yet (the client must hold an EX lock when calling sync_lvb()).

No other node can read or write the LVB data until we have released our
EXclusive lock, however.  So, as long as we keep our in-core copy of the
LVB data for our own reference, and update the OpenDLM LVB during unlock(),
this should be okay (i.e. other nodes will access the data that we wrote).

1.13 reset_exp
---------------

Called just after OpenGFS replays a journal (regardless of the reason the
journal was replayed, i.e. mounting the filesystem or recovering a dead
node).  This call notifies the lock module that journal recovery is
complete, so the lock module can re-start the flow of locks to OpenGFS, and
coordinate the other nodes to do the same.  See section 2.3.4, When to Stop
Stashing Locks?

1.14 Fields in struct lm_lockstruct
------------------------------------

    *ls_jid        See previous section.
    *ls_first      See previous section.
    *ls_lockspace  See previous section.
    *ls_ops        See previous section.


2. OGFS Journal and ODLM Lock Recovery
---------------------------------------

2.1 OGFS Journal Recovery
--------------------------

OGFS journal recovery involves replaying the journal of a node that has
died.  A surviving node plays back the journal, and writes to disk any
metadata that had been scheduled to be written to disk, but failed to be
written because the expired node died.

OGFS replays a journal when:

-- mounting the filesystem
-- another node dies while the filesystem is mounted

Normally, the replay at mount time should *not* cause any metadata writes,
because the metadata should have been flushed to disk, and the journal
emptied, when the filesystem was previously, successfully, unmounted.  It's
likely, though, that node death will result in metadata writes to disk, if
the node was involved in writing (including moving, renaming, or removing)
items on the disk at its time of death.

In a multi-node cluster, another node will replay the journal of a dead
node, perhaps finishing before the dead node reboots and remounts the
filesystem.  In a single-node system, the mount time replay serves as the
node death replay, since this is the first time that the one node gets
access to the journal and filesystem since the time of death.

2.1.1. Special locks used during OGFS Mount and Recovery
---------------------------------------------------------

OGFS assumes that EXclusive (write) locks held by dead nodes will persist
after the node's death, and continue to protect the object being written by
the dead node.  While this is the behavior of OGFS' legacy "memexp" lock
protocol, it is not true with OpenDLM, which cleans up *all* locks owned by
a dead node (even if marked "ORPHANable").  The discussion below describes
OGFS' expectations; our job will be to simulate this behavior within the
opendlm lock module.

Certain special locks are requested by OpenGFS with the NOEXP (no expired)
flag.  This flag means that the lock should be granted to this node even if
it is blocked by an orphan EX lock held by a dead ("expired") node.  This
flag is used only for locks used for recovery/mount operations:

OpenGFS Mount lock -- EX, *not* the same as the OpenDLM lock module Mount
    lock.  If expired EX, the dead node was mounting the filesystem when it
    died.

OpenGFS Live lock -- a shared lock, so actually it would never need to
    persist after node death (and I haven't seen evidence of this
    protecting anything, anyway).
OpenGFS Transaction lock -- normally a shared lock, so multiple nodes can
    write data to disk simultaneously, but it is grabbed EX by OpenGFS'
    ogfs_recover_journal(), to keep any other nodes from writing to disk
    while a journal recovery is in progress.  If expired EX, the dead node
    was replaying its own or another node's journal when it died.

OpenGFS Journal locks -- EX, one lock for each journal.  The dead node's
    journal lock will always be expired EX when the node dies, because each
    node holds its own journal lock for the lifetime of the filesystem
    mount.

NOEXP overrides blocking by locks held by dead nodes, and allows the lock
to be granted to the requesting node.  NOEXP must *not* override locks held
legitimately by any (live) nodes.  That would violate the normal operation
of the locks.

The node doing journal recovery requires the transaction lock and the
journal lock for the recovering journal, both in EX mode.  No other nodes
may write to disk (by virtue of the transaction lock) while a journal
recovery is going on.  See OGFS function ogfs_recover_journal().

2.2 ODLM Lock Recovery
-----------------------

ODLM lock recovery involves cleaning out locks held by a node that died,
while retaining and re-distributing any valid information about the
lockable resources and locks held by surviving nodes.  Note that *no* locks
owned by a dead node, not even orphan locks, survive when a node dies.
Orphan locks survive only client application deaths, not node deaths.

Lock recovery is controlled by the recovery state machine.  See the
"odlm-recovery" doc ("OpenDLM_Lock_Recovery") at:

    http://opendlm.sourceforge.net/docs.php

The lock recovery process starts soon after receiving a message from the
cluster manager used by OpenDLM, indicating that a node has died.  During
the recovery process, OpenDLM does not service any new lock requests, but
returns DLM_DENIED_GRACE_PERIOD (the time spent in lock recovery is called
the "grace period"), so the requestor can try again.

The lock recovery process ends once all nodes have reached the final
"barrier" state, and have notified each other that they can all move back
to "run" mode.  As part of the last stage of recovery, the recovery process
grants locks that were blocked by the dead node.

The lock recovery process may restart if another node dies, so there can be
more starts than ends.  The recovery process is not recursive.  There is
only one recovery state machine, and only one state, for a given node.
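Since OpenDLM simply denies requests with DLM_DENIED_GRACE_PERIOD while
lock recovery is in progress, the lock module's synchronous lock path
presumably just retries.  A minimal sketch, assuming a hypothetical
odlm_lock_sync() wrapper and an arbitrary back-off interval (needs
<linux/sched.h> for scheduling):

    static int opendlm_lock_retry(struct opendlm_lock *lp, int mode, int flags)
    {
        int status;

        for (;;) {
            status = odlm_lock_sync(lp, mode, flags);
            if (status != DLM_DENIED_GRACE_PERIOD)
                return status;

            /* Lock recovery is still in progress; wait a bit and retry. */
            set_current_state(TASK_INTERRUPTIBLE);
            schedule_timeout(HZ / 10);
        }
    }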
2.3 Coordinating Journal and Lock Recovery
-------------------------------------------

We must protect the files that the dead node was writing, by not allowing
other nodes to grab locks on those files (either read/shared locks, since
they might read bad/stale information, or write/exclusive locks, since they
would clobber or be clobbered by the upcoming journal replay).  The
protection must continue through the end of the journal replay.

Journal replay cannot start until OpenDLM has recovered, since journal
replay needs to grab a couple of important locks (transaction and journal
locks), which must be granted by OpenDLM, *after* lock recovery.  By that
time, OpenDLM will have "forgotten" all of the locks owned by the dead
node, and granted locks (those blocked by the dead node) to other waiting
nodes.

So, the chain of events must be:

1) Cluster manager detects dead node
2) OpenDLM lock recovery mechanism begins
3) "Something" protects OpenGFS
4) OpenGFS lock module detects need for journal recovery, passes to OpenGFS
5) OpenGFS requests EX transaction lock and dead node's journal lock
6) OpenDLM finishes recovery, grants locks
7) OpenGFS does journal recovery
8) Things get back to normal

2.3.1. "Something" -- Stashing granted locks
---------------------------------------------

The above requirements suggest that the lock module must intervene,
accepting the granted locks from OpenDLM during *lock* recovery, but *not*
delivering them to OpenGFS until after *journal* recovery, saving them on a
list in the meantime.  This simulates the effect of allowing the dead
node's locks to "persist", simply because OGFS on surviving nodes won't see
that their lock requests have been granted.  We'll call this non-delivery
and saving on a list "stashing" (a sketch of this appears after section
2.3.2).

Q:  Is there a possibility of deadlock here?? ... i.e. a node holds a
    shared transaction lock, but can't finish its job (and release the
    transaction lock so the node doing recovery can grab it EX) because it
    can't get, e.g. an inode lock??

A:  No.  An audit of OpenGFS shows that the transaction lock never
    surrounds the acquisition of another lock.  The inode lock would be
    obtained before the transaction lock, so no deadlock would occur.  This
    is an important design principle of OpenGFS.  Violating it could cause
    deadlocks, regardless of the lock protocol.

To make life easy and safe, the lock module should stash *all* lock
requests that get granted between lock recovery start and journal recovery
completion (even though it would be acceptable to pass along some locks --
those that were held in shared mode by the dead node).  This will
effectively halt filesystem activity across the cluster, until journal
recovery completes.  An exception to this will be made for locks requested
"NOEXP".  See below.

2.3.2. When to Start Stashing Locks?
-------------------------------------

So, how does the lock module determine when to start saving these granted
locks, holding them back from OpenGFS?  It should start when OpenDLM begins
its lock recovery.  Here are some possibilities for sensing when this
occurs:

-- Lock module subscribes to the membership service directly.  While this
   is a distinct possibility, it requires the lock module to be aware of
   which cluster manager is being used by OpenDLM.  This is a bit awkward,
   since different cluster managers have different interfaces, and it
   duplicates code already within OpenDLM.

-- Lock module registers a callback with OpenDLM to notify it when recovery
   begins.  Another distinct possibility, but it requires an interface
   extension of OpenDLM to provide the callback hook, and coding effort
   within OpenDLM.

-- Lock module watches for DLM_NOTVALID status returned from lock requests.
   This requires the least work within OpenDLM, and is the method that has
   been used successfully in some long-standing applications that have used
   the DLM system for many years.

We've chosen to use DLM_NOTVALID.  This will require the lock module to use
Lock Value Blocks (LVBs) with each and every lock, whether OpenGFS uses the
LVB or not.  There is just a small overhead for this, since space for the
LVB data is a part of the status structure for each and every OpenDLM lock,
anyway.

Caveat:  lock requests from the lock module must *not* use the
LKM_INVVALBLK flag.  This would create a false indication that the ODLM
lock recovery mechanism had started.  There's no need for OpenGFS nor the
lock module to use this flag, anyway, so this should not be a problem.
The OpenDLM lock recovery mechanism returns DLM_NOTVALID to any request for
a lock on a resource that *might have been* locked in EX mode by the dead
node.  That is, it can indicate a "false positive" that a dead node held an
EX lock:

-- false positive:  OpenDLM, during lock recovery, will mark a resource's
   LVB as invalid if it cannot prove to itself that a dead node, that was
   the resource master, did not own an EX lock on the resource.  It does
   this by looking for EX, PW, or CW locks owned by live nodes, which would
   have kept the dead node(s) from owning an EX or PW and writing to the
   LVB.  If it can't find a "live" EX/PW/CW lock, it marks the LVB invalid
   and clears the LVB data to all 0s (see OpenDLM's clear_value_block()),
   just to be safe, even though the dead node might not have actually owned
   an EX/PW.  See OpenDLM's clmr_lkvlb().

NOTE:  A false negative can also occur, but it is not a problem, as long as
once we start stashing locks, we continue stashing locks until journal
recovery is complete:

-- false negative:  If the dead node was the directory node, resource
   master, and only owner of a lock, no other node would know about the
   resource, and all information about the lock would be lost.  So, a lock
   request from one of the surviving nodes would end up looking like a
   request on a new resource.  However, this situation can occur only
   *after* lock recovery; if the lock request is made before lock recovery,
   the requesting node would be aware of the directory node.  Either the
   request processing would complete (and the requesting node would know
   about the resource), or the request would be processed by the recovery
   process, resulting in DLM_NOTVALID.

So, while there seems to be some ambiguity in detecting, via DLM_NOTVALID,
whether a lock had been held by a dead node, it errs on the side of safety,
and can be a reliable indicator that lock recovery has started.  Therefore,
as soon as the lock module sees a DLM_NOTVALID returned from a lock
request, it should start stashing granted locks, holding them in a list
until journal recovery completes.

There are some granted-during-lock-recovery locks that might slip by,
before stashing begins.  These are locks that were held by the dead node in
non-EX mode (the lock module does not request any PW mode locks).  These
locks would not return DLM_NOTVALID status, and therefore would not trigger
the stashing.  We don't care if these non-EX, i.e. shared, locks slip by;
only *EX* locks held by dead nodes need to "persist".

Since it will take some time to do the journal recovery, after the lock
recovery is complete, there is a possibility that OpenGFS could request,
and OpenDLM could grant, a lock on a resource that was held by the dead
node in EX mode, but was forgotten about after recovery (see the "false
negative" discussion above).  To protect against this, stashing should
continue once it has started, and stash all locks, regardless of whether a
lock's status is DLM_NOTVALID or not, until journal recovery is complete.

It is possible that the dead node would have held no EX locks, or that
surviving nodes would not have requested locks on resources locked by the
dead node, in which case OpenDLM lock recovery would not grant any locks to
the surviving nodes.  No DLM_NOTVALIDs would be detected, and no node would
perform the journal recovery!  To cover this case, each node must request a
"deadman" lock for each of the other nodes (see ... ).
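Pulling sections 2.3.1 and 2.3.2 together, the grant-completion (AST) path
might look roughly like the sketch below.  The "stashing" flag, the
stash_list/stash_entry list members, the lm_flags field, and the
opendlm_deliver_to_ogfs() helper are all assumptions for illustration; they
are not existing code (list handling needs <linux/list.h>):

    static void opendlm_grant_ast(struct opendlm_lock *lp, int status)
    {
        struct opendlm *dlm = lp->dlm;

        if (status == DLM_NOTVALID) {
            /*
             * A dead node may have held this resource EX, i.e. ODLM lock
             * recovery has started:  begin stashing grants.
             */
            dlm->stashing = 1;
        }

        if (dlm->stashing && !(lp->lm_flags & LM_FLAG_NOEXP)) {
            /*
             * Hold the grant back from OpenGFS until journal replay
             * completes (reset_exp, section 2.3.4).
             */
            list_add_tail(&lp->stash_entry, &dlm->stash_list);
            return;
        }

        /* Normal case, or a NOEXP request:  deliver the grant to the g_lock layer. */
        opendlm_deliver_to_ogfs(lp);    /* hypothetical: invokes dlm->cb */
    }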
2.3.3. Supporting NOEXP
------------------------

NOEXP requests must be granted, as an exception to the "persistence"
behavior of a dead node's EX lock.  This means that NOEXP requests must
*not* be stashed; the lock module should pass granted NOEXP locks
immediately to OpenGFS.  This is safe; OpenDLM will not grant the lock
request unless it is grantable, so NOEXP will never override a lock that is
held legitimately by a live node.

2.3.4. When to Stop Stashing Locks?
------------------------------------

Locks may once again be passed to OpenGFS once journal recovery has
occurred.  All nodes must wait until the recovery is complete, i.e. the
node doing the journal playback has finished.  This shall be coordinated
through the use of deadman(?) locks, perhaps using LVB data to communicate
recovery status.  TBD.

The reset_exp call, from OpenGFS, tells the lock module when *this* node
has finished replaying a journal.  This should enable *this* node to
deliver the stashed locks to OpenGFS, and alert other nodes that they can
do the same (as long as the journal replayed by this node is the only one
that needs to be replayed! ... need to think about this).


3. Private lock structure and lock name.
-----------------------------------------

Following is the definition of the private lock structure manipulated by
the lock module, one structure for each lock:

    struct opendlm_lock {
        struct opendlm_lockname lockname;
        struct lockstatus lksb;       /* Lock status block for OpenDLM */
        lm_lvb_t *lvb;                /* Lock value block descriptor */
        struct opendlm *dlm;          /* OpenDLM lock module instance */
        int done;                     /* for the "sync" lock operations */
    };

Following is the definition of the OpenDLM lock name.  The most distinctive
value, lock_num, is placed first in the structure, with the idea that
searches for a matching name will fail earlier (and thus be more
compute-efficient).  The magic number identifies the lock as being an OGFS
lock (OpenDLM may support other clients simultaneously with OGFS, and
requires a unique name across all clients in a cluster):

    #define OGFS_DLM_MAGICLEN 4
    #define OGFS_DLM_MAGIC    "OGFS"

    struct opendlm_lockname {
        uint64 lock_num;                   /* OGFS lock number */
        uint32 name_space;                 /* lockspace hash */
        unsigned int lock_type;            /* OGFS lock type */
        char magic[OGFS_DLM_MAGICLEN];     /* "OGFS" */
    } __attribute__((packed));
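As an illustration of how the structures above fit together, a lock name
might be filled in roughly as follows (a sketch only; the helper is
hypothetical, and memcpy() needs <linux/string.h>):

    static void opendlm_fill_name(struct opendlm *dlm,
                                  struct opendlm_lockname *name,
                                  unsigned int lock_type, uint64 lock_num)
    {
        name->lock_num   = lock_num;         /* most distinctive value first */
        name->name_space = dlm->name_space;  /* hash of table_name, set at mount */
        name->lock_type  = lock_type;        /* OGFS lock type */
        memcpy(name->magic, OGFS_DLM_MAGIC, OGFS_DLM_MAGICLEN);
    }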