OpenGFS Locking Mechanism

Copyright 2003 The OpenGFS Project

Original authors, early 2003:  Stefan Domthera (sd)
                               Dominik Vogt (dv)
Updates since June, 2003:      Ben Cahill (bc), ben.m.cahill@intel.com


Introduction
------------

This document contains details of the OpenGFS locking system.  The first
part provides an overview of the software layers involved in the locking
system.  The second part describes the design of the G-Lock locking
layer, and the final part explains the details of how the G-Locks are
used in OpenGFS.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of shared file system locking
in OpenGFS.

This document is not intended as a user guide to OpenGFS.  Look in the
OpenGFS HOWTO-generic or HOWTO-nopool for details of configuring and
setting up OpenGFS.

This document may contain inaccurate statements.  Please contact the
author (bc) if you see anything wrong or unclear.


Terminology
-----------

Throughout this document, the combination of a mounted lock module and,
if applicable, a lock storage facility (e.g. memexpd) is sometimes called
the "locking backend" for simplicity.

The OpenGFS filesystem and locking code contains many uses of terms such
as "get", "put", "hold", "acquire", "release", etc., that may be
inconsistent or vague.  I (bc) have tried to write enough detail to
clearly explain the true meaning of these terms, depending on the
context.  Let me know if anything is unclear or inaccurate.

"Machines", "computers", "nodes", "cluster members", and sometimes
"clients", mean pretty much the same thing:  a computer (machine) that is
a compute node within a cluster of computers sharing the filesystem
storage device(s).  If using the memexp locking protocol, the memexp
module on the computer will be a "client" of the memexpd lock storage
server.

"dinodes" are the OGFS version of an "inode".  "inode" is often used to
mean "dinode", but may also mean a struct inode defined by the kernel's
Virtual File System (VFS).


Requirements
------------

In a distributed file system, at least three types of locking are
required:

1). Inter-node locking must guarantee file system and data consistency
    when multiple computer nodes try to read or write the shared file
    system in parallel.  In OpenGFS, these mechanisms are implemented by
    various "lock modules" that link into the filesystem code with the
    help of the "lock harness".

2). The file system code must protect the file system and the data
    structures in memory from parallel access by multiple processes on
    the node.  The "G-Lock" software layer takes care of this protection,
    and also decides whether communication with other nodes or a central
    lock server, via the lock module, is necessary.

3). The file system structures in kernel memory must be protected from
    concurrent access by multiple CPUs (Symmetric Multi-Processing) on
    the same node.  Linux spinlocks and/or other mutual exclusion (mutex)
    methods are used to achieve this (see the sketch after this list).
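
As an illustration of type 3 only (SMP protection within a single node),
here is a minimal, hypothetical sketch using Linux spinlocks.  The
structure and function names are invented for illustration, and are not
part of the OpenGFS sources:

  #include <linux/spinlock.h>

  /* Hypothetical in-memory structure shared by all CPUs on one node.
   * The spinlock must be set up with spin_lock_init() before use. */
  struct shared_counts {
          spinlock_t   lock;      /* type-3 (SMP) protection         */
          unsigned int nreaders;  /* example field needing coverage  */
  };

  static void bump_readers(struct shared_counts *sc)
  {
          spin_lock(&sc->lock);   /* exclude other CPUs on this node */
          sc->nreaders++;
          spin_unlock(&sc->lock);
  }

Types 1 and 2 are layered above this kind of protection, and are the
subject of the rest of this document.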
Overview
--------

The following gives an overview of the locking hierarchy.  The left side
shows lock module loading/unloading (registering/unregistering with the
lock harness) and lock protocol mounting, and the right side shows all
other operations:

 +--------------------------------------------------------------------------+
 |                        File System Module (ogfs.o)                       |
 |                                                                          |
 |                            +-------------------+                         |
 |                            |superblock, inodes,|         +-------+       |
 |     sdp->sd_lockstruct     |flocks, journals,  |---------|       |       |
 |            |               |resource grps, etc.|         |       |       |
 |            |               +---------|---------+         | glops |       |
 |  +-------------+-----------+---------|---------+         |       |       |
 |  |    mount    |   misc.   |      G-Lock       |---------|       |       |
 +--+------|------+-----------+---------|---------+---------+-------+-------+
           |                            |
           | mount                      | others_may_mount
           v                            | reset_expired
 +--------------------------+           | unmount
 |Harness Module (harness.o)|           | lock & LVB
 +--------------------------+           | operations
      ^        |                        |
      | register / unregister           |
      |        | mount                  |
      |        v                        v
 +--------------------------------------------------------------------------+
 |                        Lock Module (e.g. memexp.o)                       |
 +--------------------------------------------------------------------------+
                                        |
 +--------------------------------------------------------------------------+
 |                        Lock Storage (e.g. memexpd)                       |
 +--------------------------------------------------------------------------+

  Loading, mounting, unloading  <--|-->  Ongoing filesystem operations,
                                         unmount

          Fig. 1 -- Software modules and layers involved in locking

The current locking system is a big clump of different software layers
that, in addition to locking, execute several tasks that are not locking,
per se.  Historically, these tasks were clumped together because they all
relate to coordinating the behavior of the cluster member nodes, and the
lock server has served double duty as the cluster membership service.

Enhancement:  It might be a good idea to split off the functionality
unrelated to locking into independent components.

The lock harness serves two simple purposes:

 -- maintaining a list of available-to-mount lock modules
 -- connecting a selected module to the filesystem at lock protocol mount
    time (one of the first things done during the filesystem mount)

After protocol mount time, the harness module's job is done.

The locking modules and lock storage facility take care of:

 -- Managing and storing inter-node locks and lock value blocks (LVBs)
 -- Lock expiration (lock request timeout) and deadlock detection
 -- Heartbeat functionality (are other nodes alive and healthy?)
 -- Fencing nodes, recovering locks, and triggering journal replay in
    case of a node failure

The G-Lock software layer is a part of the file system code.  It handles:

 -- Coordinating and caching locks and LVBs among processes on *this* node
 -- Communication with the locking backend (lock module) for inter-node
    locks
 -- Executing glops when appropriate (see below)
 -- Journal replay in case of a node failure

The glops (G-Lock Operations) layer is also part of filesystem code.  It
implements the filesystem-specific, architecture-specific, and
protected-item-specific operations that must occur after locking or
before unlocking, such as:

 -- Reading items from disk, or from another node via Lock Value Block
    (LVB), after locking a lock
 -- Flushing items to disk, or to other nodes via LVB, before unlocking a
    lock
 -- Invalidating kernel buffers, once flushed to disk, so a node can't
    read them while another node is changing their contents.

Each lock has a type-dependent glops attached to it.  This attachment
point is the key to porting the locking system to other environments,
and/or creating different types of locks, and defining their associated
behavior.
2. Lock Harness and Lock Modules
--------------------------------

At filesystem mount time, the filesystem uses the lock harness to mount a
locking protocol (see "Lock Harness", below).  The harness' mount call,
lm_mount(), fills in the lm_lockstruct contained in the filesystem's
incore superblock structure as sdp->sd_lockstruct.  This exposes the
chosen lock module's services to the filesystem.

struct lm_lockstruct contains:

  ls_jid       - journal ID of *this* computer
  ls_first     - TRUE if *this* computer is the first to mount the
                 protocol
  ls_lockspace - ptr to protocol-specific lock module private data
                 structure
  ls_ops       - ptr to struct lm_lockops, described below

"ls_jid" indicates the journal ID that should be used for *this*
computer.  It is currently the lock module's job to discover this journal
ID.  The "memexp" lock module does this by reading information from a
cluster information device (cidev), which is a small disk partition
dedicated for that purpose.  The "nolock" lock module provides either
"0", or the value of a "jid=" entry (if used) in the command line for
mounting the filesystem.  This is some of the functionality that might be
split off from the locking module, although journal assignment might, as
an alternative possibility, be handled dynamically through the use of
locks.

"ls_first" tells the filesystem code, at mount time, that *this* is the
first machine in the cluster to mount the filesystem.  If TRUE, this
machine replays *all* journals for the whole cluster, before allowing
other machines to complete their lock protocol mounts (and therefore
their filesystem mounts).  If FALSE, the filesystem replays only the
journal for *this* computer.  See discussion in Appendix G, under "Calls
to ogfs_glock_num()", recovery.c, ogfs_recover_journal().  This is also
some functionality that might be split off from the locking component,
although this functionality could also be handled through the use of
locks.

"ls_lockspace" points to a private data structure contained in, and for
use by, the lock module itself.  The filesystem/glock code includes this
pointer in certain calls to the lock module, but never accesses the
structure directly.  The private data structure is typically the lock
module's incore "superblock" (as called, perhaps inappropriately, by some
of the code), i.e. its master data structure.  Not to be confused with
the private module data structure relating to each lock.

"ls_ops" provides the big hook through which the filesystem code
(including the G-Lock layer) accesses the lock module.  Every lock module
must implement "struct lm_lockops" (see src/include/lm_interface.h),
which contains the following fields.  Most (but not all) of these are
function calls implemented by the locking module, and called by
filesystem code.  See the ogfs-memexp document for details on one
implementation of lm_lockops{}:

data fields:

  proto_name - unique protocol name for this module, e.g. "nolock" or
               "memexp".
  local_fs   - set to TRUE by nolock, FALSE by other protocols.
               Filesystem code checks this when mounting, to enable
               "localcaching" and "localflocks" for more efficient
               operation in a non-cluster (nolock) environment.  See man
               page for ogfs_mount.
  list       - element of protocol list maintained by lock harness.

cluster/locking/journaling functions called by lock harness or
filesystem:

  mount      - initialize and start the lock module's locking
               functionality.  Called by the lock harness' lm_mount()
               when mounting a protocol onto the file system.
               This call will block for non-first-to-mount machines,
               until the first-to-mount machine has replayed all
               journals, and has called others_may_mount().  This does
               not apply to the nolock protocol, but must work this way
               for any clustered protocol (e.g. memexp).

  others_may_mount - indicate to other nodes that they may mount the
               filesystem.  Called by the filesystem's
               _ogfs_read_super(), the fs mount function, after this
               first-to-mount machine in the cluster has replayed *all*
               journals, thus making the on-disk filesystem ready to use
               by all nodes.  Other machines will block within the
               mount() call, until others_may_mount() is called by the
               first-to-mount node.

  unmount    - stop and clean up the lock module's locking functionality.
               Called by the filesystem's ogfs_unmount_lockproto() when
               unmounting the lock protocol from the file system.

  reset_exp  - reset an expired client node (from EXPIRED to USED /
               NOTUSED).  Called from the filesystem's journal
               subsystem's ogfs_recover_journal(), after this node
               replays the journal of the expired node.

locking/LVB (Lock Value Block) functions called by G-Lock layer:

  get_lock   - allocate and initialize a new lm_lock_t (lock module
               per-lock private data) struct on this node.  Does *not*
               look for a pre-existing structure.  Does *not* access lock
               storage, or make the lock known to other nodes.
  put_lock   - de-allocate an lm_lock_t struct on this node, and release
               usage of (perhaps de-allocate) an attached LVB (memexp
               internally calls memexp_unhold_lvb(), its own
               implementation of unhold_lvb, see below).  Accesses lock
               storage only if LVB action is required.
  lock       - lock an inter-node lock (allocates a lock storage buffer
               if needed)
  unlock     - unlock an inter-node lock (de-allocates the storage buffer
               if possible)
  reset      - reset an inter-node lock (unlock it if locked)
  cancel     - cancel a request on an inter-node lock (ends a retry loop)
  hold_lvb   - find an existing, or allocate and initialize a new, Lock
               Value Block (LVB)
  unhold_lvb - release usage of (perhaps de-allocate) an LVB
  sync_lvb   - synchronize an LVB (make its contents visible to other
               nodes)

The call prototypes can be found in "src/include/lm_interface.h".  See
the ogfs-memexp document for more detail on the use and implementation of
these calls.


Lock Harness
------------

The lock harness is a fairly thin abstraction within the OpenGFS locking
hierarchy.  Simply put, it is a plug for different locking protocols.

To the lock modules, the harness offers services for protocol
registration and unregistration:

  lm_register_proto()   -- adds module's lm_lockops->list to harness'
                           list of available modules
  lm_unregister_proto() -- removes module from harness' list of available
                           modules

These are called by the lock modules' module_init() or module_exit()
functions.  Typically, these calls are the *only* things done by
module_init() (called when the kernel loads the module) or module_exit()
(when unloading).  There is no need to initialize any of the module's
locking functionality until the kernel mounts the filesystem (and the
filesystem in turn selects and mounts the particular module/protocol).

To the filesystem code, the harness offers a locking protocol mounting
service:

  lm_mount()            -- initializes the module's locking functionality
                           (via lm_lockops mount), fills in an
                           lm_lockstruct, exposing the lock module's
                           lm_lockops functions and a private data
                           pointer, and indicating journal ID and
                           first-to-mount status.

The protocol is mounted onto a specific file system during the filesystem
mount.
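
Pulling the pieces of the last two sections together, here is an
abbreviated, hedged sketch of the two interface structures.  The member
order, the callback type name, and the argument lists are guesses for
illustration; see src/include/lm_interface.h for the authoritative
definitions:

  #include <linux/list.h>

  /* Abbreviated sketch -- not a copy of lm_interface.h.  The name
   * lm_callback_t and all argument lists below are assumptions. */
  struct lm_lockops {
          char *proto_name;              /* e.g. "memexp", "nolock"  */
          int   local_fs;                /* TRUE only for nolock     */
          struct list_head list;         /* harness' protocol list   */

          int  (*mount) (char *table, char *data, lm_callback_t cb,
                         lm_fsdata_t *fsdata, lm_lockspace_t **space,
                         unsigned int *jid, int *first);
          void (*others_may_mount) (lm_lockspace_t *space);
          void (*unmount) (lm_lockspace_t *space);
          /* ... get_lock, put_lock, lock, unlock, reset, cancel,
           *     hold_lvb, unhold_lvb, sync_lvb, reset_exp ... */
  };

  struct lm_lockstruct {
          unsigned int       ls_jid;       /* this node's journal ID  */
          int                ls_first;     /* first to mount?         */
          lm_lockspace_t    *ls_lockspace; /* module-private instance */
          struct lm_lockops *ls_ops;       /* the hook table above    */
  };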
You can use different locking protocols, and/or different "lockspaces"
for the same protocol, for different OpenGFS filesystems on the same
computer.  "Lockspace" seems to be a combination of Cluster Information
Device (cidev) and protocol instance.  A computer node mounting 3
separate OpenGFS filesystems, each with the memexp protocol, would need 3
different cidevs to describe the clusters, one for each filesystem.  This
is enforced by code in the memexp module's memexp_mount(), but is not
enforced, nor used at all, by the nolock module.  cidevs are called out
by "locktable" when mounting the filesystem (see man page for
ogfs_mount), or "table_name" or "table" within harness and memexp code.

The calling sequence when mounting is:

  _ogfs_read_super()     -- mount filesystem, src/fs/arch_*/super_linux.c
  ogfs_mount_lockproto() -- select lock protocol, src/fs/locking.c
  lm_mount()             -- mount lock protocol,
                            src/locking/harness/harness.c

_ogfs_read_super() mounts the filesystem, and is architecture-dependent
code.  ogfs_mount_lockproto() determines which lock protocol to mount,
either by mount options set by the user (see man page for ogfs_mount), or
by reading the filesystem superblock (for values set by mkfs.ogfs, see
man page for mkfs_ogfs).

lm_mount() calls a module's lm_lockops->mount() function to initialize
the module's locking functionality.  It supplies the following parameters
in the call:

  table  -- Cluster Information Device (cidev) for memexp
  data   -- "hostdata" for protocol.  For memexp:  node's IP address
  cb     -- Callback pointer for module to call fs' G-Lock layer.
  fsdata -- Private filesystem data (the incore superblock pointer, sdp)
            to be attached to callbacks from module to G-Lock layer.
  &lockstruct->ls_jid   -- Journal ID for *this* computer, to be filled
                           in by module.
  &lockstruct->ls_first -- First-to-mount status for *this* computer, to
                           be filled in by module.

"lockstruct" is actually sdp->sd_lockstruct, contained in the
filesystem's in-core superblock structure.  lm_mount() fills in two other
members of sdp->sd_lockstruct, so that the filesystem can access the
newly mounted locking module's capabilities:

  ->ls_lockspace -- value returned from module's lm_lockops->mount()
                    call, module-private data (typically a pointer to the
                    module's in-core "superblock" structure).
  ->ls_ops       -- pointer to module's lm_lockops structure

The protocol is unmounted during the file system unmount process.  The
calling sequence is:

  ogfs_put_super()         -- src/fs/arch_*/super_linux.c
  ogfs_unmount_lockproto() -- src/fs/locking.c
  sdp->sd_lockstruct.ls_ops->unmount() -- the lock module's unmount
                                          function

Note that the lock harness is not involved here!  Its job was done after
it filled in the module information in sdp->sd_lockstruct.  After that
point, the filesystem can reach the module directly, without using the
harness.
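
The load/unload path, by contrast, goes only through the harness.  A
hedged sketch of the typical shape of a lock module's init/exit functions
(the protocol name, ops-table variable, and return-value handling are
invented; real modules are in src/locking/modules/*):

  #include <linux/module.h>
  #include <linux/init.h>

  static struct lm_lockops my_proto_ops = {
          .proto_name = "myproto",
          /* ... mount, unmount, lock, unlock, etc. ... */
  };

  static int __init my_proto_init(void)
  {
          /* Only registers with the harness; no locking state is set
           * up until the filesystem mounts this protocol.  Assumes
           * lm_register_proto() returns an errno-style int. */
          return lm_register_proto(&my_proto_ops);
  }

  static void __exit my_proto_exit(void)
  {
          lm_unregister_proto(&my_proto_ops);
  }

  module_init(my_proto_init);
  module_exit(my_proto_exit);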
The following use diagram gives a complete overview of using a lock
module.  It covers all calls to the module from all parts of filesystem
and harness code.  Some calls have no functionality for the nolock module
(see the discussion on Lock Modules elsewhere in this document, or
ogfs-memexp for more details on how the memexp module works).

 +-------------------------------------+
 |  super_linux.c (fs mount/unmount)   |
 +-------------------------------------+
     |                   |
     | lock protocol     | others_may_mount
     | mount/unmount     |
     v                   |
 +---------------------------+         +---------+       +-----------+
 | locking.c (lock mnt/umnt) |    |    | glock.c |       | journal.c |
 +---------------------------+    |    +---------+       +-----------+
     |mount   |unmount            |        |all              |reset_exp(ired
     v        |                   |        |lock/LVB         | node)
 +--------------+                 |        |operations       |
 |  harness.c   |                 |        |                 |
 +--------------+                 |        |                 |
     ^        |                   |        |                 |
     |register|mount              |        |                 |
     |unregister                  |        |                 |
     |        v                   v        v                 v
 +--------------------------------------------------------------------+
 |              lock module (memexp, nolock, or stats)                |
 +--------------------------------------------------------------------+

       Fig. 2 -- complete register/unregister, and lm_lockops usage

Pertinent source files are:

  src/fs/arch_*/super_linux.c   (architecture dependent filesystem code)
  src/fs/locking.c, glock.c, journal.c
                                (architecture independent filesystem code)
  src/locking/harness/harness.c (lock harness kernel module code)
  src/locking/modules/*/*       (lock module source code)


Lock Modules
------------

Lock modules are kernel implementations of G-Locks (see below).  Each
provides a distinct locking protocol that can be used in OpenGFS:

  locking/modules/memexp -- provides inter-node locking, for cluster use
  locking/modules/nolock -- provides "fake" inter-node locks, not for
                            cluster use
  locking/modules/stats  -- provides statistics, stacks on top of another
                            protocol

The "memexp" protocol supports clustered operation, and is fairly
sophisticated.  The memexp modules, one on each node, work with each
other to keep track of cluster membership, and of which member nodes own
which locks.  The memexp protocol relies on a central repository of lock
data that is shared among all nodes, but is completely separate from the
filesystem and journals.  The repository can be either one or more DMEP
(Device Memory Export Protocol) devices (e.g. certain SCSI drives),
usually those in an OpenGFS pool (see ogfs-pool doc), or a "fake" DMEP
server, the memexpd server.  The memexpd server runs on one computer, and
emulates DMEP operation, but stores data in memory or local disk storage,
rather than shared disk storage, and communicates with cluster members
via LAN rather than SCSI.  Most users use the memexpd server, rather than
DMEP devices.  Source code is in the locking/servers/memexp directory.
For lots more information on memexp, see the ogfs-memexp document.

The "nolock" protocol supports filesystem operations on a single node,
and is much simpler than the memexp protocol.  Many of the lm_lockops
functions are stubbed out.  There is no central lock storage, but the
module does store a structure for each lock locally in a hash table (see
the sketch after this section).

The "stats" protocol provides statistics (e.g. number of calls to each
lm_lockops function, current and peak values of numbers of locks on
inodes, metadata, etc., and lock latency statistics) for a protocol
stacked below it (the "lower" protocol).  It looks like stats are
printk()ed when the module is *unmounted* ... I haven't found any other
reporting mechanism.  To mount the stats on top of memexp, try the
following options when mounting the filesystem (see man ogfs_mount):

  lockproto=lockstats
  locktable=memexp:/dev/pool/cidev

(cidev is the device containing the cluster information for memexp, e.g.
/dev/pool/cidev, if you are using pool ... see HOWTO-generic, or
HOWTO-nopool).  Code for parsing the lower protocol (e.g. memexp) from
the locktable option, and mounting it, is in
src/locking/modules/stats/ops.c, stats_mount().
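
The nolock module's local hash table, mentioned above, is ordinary kernel
list bookkeeping.  A hedged sketch of the pattern (all names invented;
the real code is in src/locking/modules/nolock):

  #include <linux/list.h>
  #include <linux/types.h>

  #define NL_HASH_SIZE 64   /* power of two, so we can mask */

  /* Invented per-lock structure, standing in for nolock's real one. */
  struct nl_lock {
          struct list_head nl_list;    /* hash chain hook        */
          u64              nl_number;  /* lock number            */
          unsigned int     nl_type;    /* lock type              */
          char             nl_lvb[32]; /* local LVB storage      */
  };

  /* Hash heads; assume each was set up with INIT_LIST_HEAD(). */
  static struct list_head nl_hash[NL_HASH_SIZE];

  static struct nl_lock *nl_find(u64 number, unsigned int type)
  {
          struct list_head *tmp, *head;
          struct nl_lock *lp;

          head = &nl_hash[number & (NL_HASH_SIZE - 1)];
          list_for_each(tmp, head) {
                  lp = list_entry(tmp, struct nl_lock, nl_list);
                  if (lp->nl_number == number && lp->nl_type == type)
                          return lp;
          }
          return NULL;  /* caller would allocate a new structure */
  }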
3. G-Locks (global locks)
-------------------------

The G-Lock layer is an abstract locking mechanism that is used by the
file system layer.  It provides a service interface that conveniently
supports:

 -- using one of several available locking protocols (lock modules)
 -- executing filesystem- and protected-entity-type-specific actions
    (glops) before or after acting on a lock
 -- dependencies and parent/child relationships that the filesystem may
    wish to impose on locks

The G-Lock layer's interfaces include:

 -- G-Lock services, presented to the filesystem code
 -- Lock module interface, between G-Lock layer and lock module
    -- Lock and LVB commands to lock module
    -- Callback from lock module to request lock release, journal replay
 -- glops hook for installing filesystem- and type-specific actions on
    each lock

In theory, the G-Lock layer should be usable in any other software, too.
The glops "socket" provides the opportunity to use G-Lock with other
filesystems, and define new lock types and associated actions.

Enhancement:  Pull the G-Lock code from the file system sources and put
it into a separate module compiled as a library.


Lock instances
--------------

The G-Lock layer interfaces between the file system layer and the locking
backend.  The G-Lock layer decides whether the lock is already within the
node (perhaps owned by another process, perhaps unowned), or whether it
needs to get the lock from "outside", that is, from the inter-node
locking protocol.  If going "outside", G-Lock uses the lock module (true
inter-node locking for memexp, or "fake" inter-node for the nolock
protocol).

A lock lives in (at least) two instances:

1. In the locking backend, outside of the file system.

This inter-node lock may get passed around between the cluster member
nodes by way of a central lock storage facility (in the case of memexp)
or perhaps other methods, e.g. passing between nodes directly via LAN
(for OpenDLM, a distributed lock manager, if/when a locking module is
developed/integrated for OpenDLM).

The backend's lock implementation can vary for different protocol
modules.  There are several data types defined in
src/include/lm_interface.h as "void", to support this variability.  This
allows the G-Lock layer to pass these private structures around in a
generic way, but not to actually access them:

  lm_lock_t      -- generic lock.  Identifies instance of a lock within
                    the module.  Current implementations:
                      me_lock_t    -- for memexp
                      nolock_lock  -- for nolock
                      stats_lock   -- for stats
  lm_lockspace_t -- generic lock "space".  Identifies instance of lock
                    module (there can be several instances of the module
                    on a given node, one instance for each OGFS
                    filesystem mounted on the node).  Typically, it is
                    the lock module's "superblock" structure.  Current
                    implementations:
                      memexp_t     -- for memexp
                      nolock_space -- for nolock
                      stats_space  -- for stats

On the opposite side of the interface, the lock module carries an ID for
the filesystem it is mounted on.  Just as filesystem code never accesses
lock-module-specific structures, the lock module never accesses this
data:

  lm_fsdata_t    -- generic filesystem data.  Identifies instance of
                    filesystem (there can be several OGFS filesystems
                    using the same module on a given node), when module
                    does a callback to G-Lock layer.  OGFS sets this to
                    be the filesystem's incore superblock structure,
                    usually seen as "sdp" in fs code.
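
In C terms, the "defined as void" convention is simply opaque-pointer
typing:  each layer can store and forward the other layer's pointers, but
cannot dereference them.  A hedged sketch of the idea:

  /* Sketch of the opaque-type convention of src/include/lm_interface.h. */
  typedef void lm_lock_t;      /* module's per-lock private data       */
  typedef void lm_lockspace_t; /* module instance ("superblock") data  */
  typedef void lm_fsdata_t;    /* filesystem's handle; OGFS uses sdp   */

  /* Inside the memexp module (and only there), the handle can be cast
   * back to its real type: */
  static void memexp_style_example(lm_lock_t *handle)
  {
          /* me_lock_t *mlp = (me_lock_t *) handle; */
          /* ... the layout is invisible outside the module ... */
  }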
The representation of a lock within a locking backend is significantly
more primitive than the G-Lock layer's representation; the interface
between G-Lock and locking modules exchanges only a few basic parameters
for each lock, thus limiting the knowledge that a lock module can have
about it:

 -- lockname (64-bit lock number and 32-bit lock type)
 -- lock state (unlocked/shared/deferred/exclusive)
 -- attached lock value block (LVB), if any, and PERManent status of LVB
 -- cancellation (request from G-Lock to end a lock retry loop)
 -- flags attached to lock request from G-Lock:
    -- TRY (do not block if lock request can't be immediately granted)
    -- NOEXP no expiration (allows dead node's lock to be held by this
       node)
 -- release state (request from backend for G-Lock layer to release lock)

Other data relating to the lock, within the backend, is private to the
lock module, and is used for implementing the locking management of the
specific lock protocol.  Note that the backend has no awareness of
inter-lock dependencies, parent/child relationships, process ownership,
recursive locking, lock caching, glops actions, or filesystem
transactions, all of which are handled by, and confined to, the G-Lock
layer (see below).

2. Within the filesystem module, as a struct ogfs_glock (ogfs_glock_t) in
kernel memory of a given node.

The file system layer knows only about the ogfs_glock_t structure (and
nothing about the representation of a lock within a locking module).
Within the node, the structure is protected by some code similar to
semaphores.

The G-Lock layer handles locks at a significantly more sophisticated
level than does a lock module.  It includes support for inter-lock
dependencies, parent/child relationships, process ownership, recursive
locking, lock caching, glops actions, filesystem transactions, and more.
This data type is defined independently of the locking protocol, with no
variability in its definition.  For a detailed description of this data
type, please refer to "G-Lock Structure" below.


G-Lock Cache, G-Lock Daemon
---------------------------

The G-Lock cache stores and organizes ogfs_glock_t structures on *this*
computer node.  The cache is implemented as a hash table with 3 chains:

perm    -- glocks are currently locked by a process on this node, are
           expected to be locked for a long time, and are locked at
           inter-node scope.

           Filesystem code may request this chain by using the GL_PERM
           flag in a call to ogfs_glock() or any of its wrappers.  Used
           for:  OGFS_MOUNT_LOCK, OGFS_LIVE_LOCK, journal index, flocks,
           plocks (POSIX locks), and dinodes (when reading into inode
           cache).

           This chain allows searches for glocks to be more efficient.
           Some searches start with glock chains ("notheld" or "held")
           that are more likely to hold the search target, and leave the
           "perm" chain until last.  A search for an unlocked glock can
           skip the "perm" chain altogether.

           Glocks in this chain move immediately to the "held" chain when
           unlocked for the last time (recursive locking) by a process.
           Glocks in this chain have the GLF_HELD and GLF_PERM flags set.

held    -- glocks are currently or recently locked by a process on this
           node, and are locked at inter-node scope.

           Glocks typically stay in this cache chain for 5 minutes after
           being unlocked for the last time (recursive locking) by a
           process.  This node retains the lock at inter-node scope, so
           the glock is ready to be quickly locked again by a process,
           without negotiating with the lock module.
           After the 5 minute timeout, the glockd() cleanup daemon
           releases the inter-node lock, and moves the glock to the
           "notheld" chain.  The inter-node lock may be released before
           the 5 minute timeout, by request of a NEEDS or DROPLOCKS
           callback from another node.  When the inter-node lock is
           released, the glock moves to the "notheld" chain.

           Locks in this chain have the GLF_HELD flag set, GLF_PERM
           unset.

           !!!??? (dv) Holding locks may be harmful on systems that write
           data more often than they read it.  Should this be tuneable?

notheld -- glocks are not locked by any process on this node, and are
           *not* locked at inter-node scope.

           If gl_count == 1, some process has some interest in the glock,
           even though it is not locked (the process could be getting
           ready to lock a glock, etc.).

           If gl_count == 0, no process has an interest in the lock, the
           contents of the lock structure are meaningless, and the
           structure is free to be re-used for another glock (see
           new_glock() in glock.c) or to be de-allocated.

           Locks in this chain have the GLF_HELD and GLF_PERM flags
           unset.

Glock structures are first allocated and placed into the notheld cache
via the ogfs_get_glstruct() call.  For kernel-space code, glock
structures are allocated using the following kernel call:

  kmem_cache_alloc(ogfs_glock_cachep, GFP_NOFS);

G-Locks are "promoted" from notheld -> held -> perm, and "demoted" from
perm -> held -> notheld, always one step at a time (never moving directly
between notheld and perm).  ogfs_glock() handles a GL_PERM request in two
stages, first putting the lock into "held", then bumping it to "perm".
ogfs_gunlock() moves it back down to "held" when a GL_PERM glock is
unlocked.

ogfs_put_glstruct() decrements gl->gl_count, the reference/usage/access
count for code accessing the structure contents.  ogfs_put_glstruct()
does nothing more than that.  The system relies on periodic garbage
collection, performed by the G-Lock kernel daemon, ogfs_glockd(), to
de-allocate these structures.  _ogfs_read_super() launches it during
filesystem mount, and schedules it to run once every 5 seconds.

ogfs_glockd() is implemented in src/fs/arch_*/daemon.c, since the hooks
to the daemon scheduling mechanism are architecture-dependent.  However,
the real work is done by the architecture-independent ogfs_glockd_scan(),
in glock.c, which calls:

 -- ogfs_pitch_inodes().  This is arch-independent, in glock.c.  It scans
    through the "held" and "notheld" glock cache chains, looking to
    destroy inactive inode structures that were under the protection of
    glocks that are no longer held by any process on this node.  The
    glock cache scan happens every time ogfs_pitch_inodes() is called
    (typically every 5 seconds, when called from the glockd daemon).

    In addition, no more often than once every 60 seconds,
    ogfs_pitch_inodes() calls ogfs_drop_excess_inodes(), which cleans the
    *kernel's* directory cache and inode cache.
    ogfs_drop_excess_inodes() is an arch-specific (kernel vs. user-space)
    routine, see src/fs/arch_linux_2_4.  It calls kernel functions:

    -- shrink_dcache_sb(), to toss out unused(?) directories from the
       kernel's dcache
    -- shrink_icache_sb(), to toss out unused(?) inodes from the kernel's
       icache

    Order is important, since a directory uses an inode.  Freeing a
    directory makes its inode unused, so it in turn can be freed.

 -- scan_held_glocks().  This gets called once for every glock cache hash
    bucket (all 512 of them).  Scans the "held" cache chain for glocks
    which:

    a). Have dependencies.  Mark these with GLF_DEPCHECK for a bit later.

    b). Are no longer locked (i.e. gl_locked == 0) by any process on this
        node.  There are two possible situations in this case:
        1)  (timeout threshold reached || GLF_DEMOTEME) && !GLF_STICKY

            In this situation, the glock is free to be dropped, so we
            drop it via drop_glock() (see section "Caching G-Locks,
            Callbacks").  This releases this node's inter-node lock
            corresponding to the glock, and moves the glock structure
            into the "notheld" chain.

            A glock normally sits in the "held" chain for a while after
            all processes on this node have unlocked it.  While held, the
            lock does not change state with regard to the locking module
            (i.e. the inter-node lock status stays the same).  This keeps
            the glock ready for use should a process in this node need it
            again.  The timeout, however, allows the lock to be
            automatically dropped (i.e. this node gives up its inter-node
            lock), if it hasn't been recently used.

            The glock structure's gl_stamp member is used to remember
            when major changes of state occur to the glock.  G-Lock code
            marks the time when it:

            -- gets a glock structure via ogfs_get_glstruct()
            -- unlocks the lock via ogfs_gunlock()
            -- moves the lock from "held" to "notheld" cache chain via
               unhold_glock()

            The unlock and ogfs_get_glstruct() situations are the ones
            that apply here (ogfs_get_glstruct() can find the glock in
            the held chain), and this seems to be the only place in which
            gl_stamp is ever tested for a timeout.

            The timeout is passed as a parameter to scan_held_glocks(),
            passed in turn from ogfs_glockd_scan(), passed in turn from:

            -- ogfs_glockd(), the glock cleanup daemon, with timeout =
               sdp->sd_gl_heldtime.  This is set by _ogfs_read_super(),
               the filesystem mount function, to be 300 seconds.  So,
               normally, a no-longer-used lock will stay in a node's
               glock "held" chain for 5 minutes.

            -- ogfs_glock_cb(), the lock module callback function to the
               glock layer, for LM_CB_DROPLOCKS, with timeout = 0.  This
               will cause any unheld lock to be dropped from the "held"
               cache chain, regardless of how long it has been there.

            The GLF_DEMOTEME flag is used by *this* node (other nodes use
            the DROPLOCKS or the NEED callbacks) to remove a
            no-longer-used glock from the "held" cache before it times
            out.  OGFS code sets it in only one situation.
            rm_bh_internal() (see fs/arch_*/dio_arch.c) sets it when it
            has removed the last buffer from the arch-specific list,
            attached to the glock, that is used for invalidating buffers
            (when releasing a lock?).

            The GLF_STICKY flag is used to keep a glock in the glock
            cache, even after it times out after 5 minutes of non-use by
            this node.  OGFS uses it for only 3 locks that get used
            throughout an OGFS session:

            -- resource index inode
            -- journal index inode
            -- transaction

        2)  Not timed out, or the STICKY flag is set.

            In this situation, we cannot drop the lock, but we check here
            for the GLF_DEPCHECK flag that was set earlier in the
            function.  If set, we sync to disk all data protected by the
            locks that are dependent on this lock, via
            sync_dependencies().  Note that for the DROPLOCKS callback,
            the timeout is 0, so a non-STICKY glock will always be
            dropped rather than having its dependencies sync'd.

After calling ogfs_pitch_inodes() and scan_held_glocks(), all excess
inter-node locks have been released to the cluster, and all corresponding
glocks have been moved to the "notheld" glock cache chain.
ogfs_glockd_scan() then:

 -- scans through the "notheld" chain, looking for any glock with no
    access interest from any process (i.e. gl_count == 0).
    It removes any such structures from the glock cache altogether, and
    calls release_glstruct(), which calls:

    -- ls_ops->put_lock(), to tell the lock module to reclaim its private
       data structure attached to the glock
    -- ogfs_free_glock(), to de-allocate the glock structure via the
       kernel's kmem_cache_free() call

The G-Lock cache needs cleaning up in a couple of other situations, as
well, but these are handled outside of the ogfs_glockd daemon, per se:

1.  LM_CB_DROPLOCKS callback (see ogfs_glock_cb()) from the lock module,
    when the lock module's lock storage becomes rather full (but before
    it becomes completely full).  See the ogfs-memexp doc.  The callback
    asks the glock layer to free any glocks that are in the glock cache,
    but not actually being locked by any process.  The callback routine
    calls the following:

    -- ogfs_drop_excess_inodes(), to clean out the kernel's directory
       cache and inode cache (as described above).

       Note that ogfs_drop_excess_inodes() is conditionally called by
       ogfs_pitch_inodes() (see directly below).  This is no more often
       than every 60 seconds, however, hard coded in ogfs_pitch_inodes().
       Calling ogfs_drop_excess_inodes() explicitly from the DROPLOCKS
       callback immediately cleans up as many inodes as possible, without
       regard to how recently it has been done before.

    -- ogfs_pitch_inodes(), to clean out inactive inodes attached to
       glocks in the "held" and "notheld" cache chains (as described
       above).

       Enhancement:  We might be able to leave this call out of the
       callback code, since ogfs_glockd_scan() (see directly below) also
       calls it.

    -- ogfs_glockd_scan().  Also arch-independent, in glock.c.  This is
       the same function called by the glock daemon to clean up
       no-longer-held glocks from the glock cache (see above).

2.  Filesystem unmount (called from super_linux.c, just prior to lock
    module unmount), using:

    -- ogfs_clear_gla()


Caching G-Locks, Callbacks
--------------------------

One performance-critical feature of the G-Lock layer is holding (or
"caching") locks.  After a node has acquired an inter-node lock, and a
process has taken ownership of the glock, done the needed job, then
released the glock internally, the inter-node lock is not released
immediately to the locking backend (unless there is a request for the
lock from another node).  Instead, the G-Lock layer retains the glock in
the G-Lock cache for a while (see "G-Lock Cache, G-Lock Daemon").

In case some *other* node needs an incompatible lock (e.g. needs a shared
lock, when this node holds an exclusive lock, or needs an exclusive lock,
when this node holds a lock of any sort), the other node's locking
backend calls *this* node, via this node's backend, and thence via the
ogfs_glock_cb() function in glock.c, to ask *this* node to yield the
lock:

  LM_CB_NEED_E - need exclusive lock
  LM_CB_NEED_D - need deferred lock
  LM_CB_NEED_S - need shared lock

Note that there is no way for a node to request of the filesystem on
another node to suspend or "hurry up" a current operation (e.g. a write
transaction or a read operation) and give up a lock.  The operation
continues until the filesystem completes it and unlocks the lock at the
normal time.  The "NEED" messages are simply advisory as to how the other
(requesting) node will use the lock, once the filesystem code on this
node is done with it.

The NEED callbacks do the following (see ogfs_glock_cb() in
src/fs/glock.c):

1). Search the "held" and "perm" glock cache chains for the requested
    lock.

    -- If not found, this node doesn't hold the lock any more; simply
       return from the call.
       Caveat:  If, somehow, this node thinks it doesn't hold the lock,
       but lock storage *does* show this node as a holder, an infinite
       loop is created as the other node keeps requesting that this node
       release the lock.  This shouldn't happen, of course, but it
       actually does seem to happen occasionally.

       Enhancement:  Allow this node to repeat the lock release attempt,
       to eliminate the infinite loop.

    -- If found, continue to the next step.

2). Mark the glock's gl_flags field with GLF_RELEXCL, GLF_RELDFRD, or
    GLF_RELSHRD, as appropriate, based on the request from the other
    node.  Note that the flag gets set regardless of whether any process
    has exclusive access to the glock structure (via the GLF_LOCK bit in
    gl_flags).  GLF_LOCK has nothing to do with lock state, but just
    means that some process is in the middle of manipulating the glock
    structure's contents at the moment.

3). Try to get exclusive access to the structure via GLF_LOCK.

    -- If we cannot, simply return from the call (after decrementing the
       gl_count field, which was incremented when the glock was found in
       the glock cache).  We cannot try to release the lock while a
       process manipulates it.
    -- If we can, continue to the next step.

4). Check gl_locked to see if any process on this node has the lock
    locked.

    -- If so, we cannot release the lock to another node.  We must wait
       until all processes on this node have unlocked the lock.  Simply
       return from the call (after releasing exclusive access via
       GLF_LOCK, and decrementing the gl_count field).
    -- If not, continue to the next step.

5). If no process (on this node) has the lock locked, we can immediately
    proceed to make the lock available to the other node, by releasing
    our node's hold on the lock:

    a). Check the *inter-node* lock state (as we see it in our local
        glock struct) to make sure it is indeed locked.

        -- If *not* locked, we figure that the other node should already
           have what it needs ... simply return from the call (after
           releasing the exclusive access to the structure, and
           decrementing gl_count).

           Caveat:  If, somehow, this node thinks the lock is unlocked,
           but lock storage thinks it *is* locked, and shows this node as
           a holder, an infinite loop is created as the other node keeps
           requesting that this node release the lock.  This shouldn't
           happen, of course, but it actually does seem to happen
           occasionally.

           Enhancement:  Allow this node to repeat the lock release
           attempt, to eliminate the infinite loop.

        -- If locked (as expected), proceed to the next step.

6). Release this node's hold on the lock at inter-node scope.  This is
    done in one of two ways, depending on the NEED request:

    -- EXCLUSIVE calls the drop_glock() function, to unlock this node's
       lock at inter-node scope, and remove this lock from this node's
       glock cache.  drop_glock() does the following, some via
       cleanup_glock() and sync_dependencies(), and their calls to glops:

       a). sync data associated with the glock's dependent glocks, via
           gl_ops->go_sync(), to disk.
       b). drop_glock() any of this glock's children glocks.  This
           includes syncing any of their associated data to disk, and
           that of their dependencies and children, etc.
       c). sync the glock's data to disk via gl_ops->go_release(), which
           also writes LVB info if the glock is on a resource group.
       d). call the lock module, via ls_ops->unlock(), to unlock this
           node's hold on the inter-node lock.
       e). move the glock to the "notheld" glock cache chain in this
           node.

    -- SHARED or DEFERRED calls the xmote_glock() function, to change the
       inter-node lock state, and update the state in this node's glock
       cache.
       xmote_glock() does the following, some via cleanup_glock() and
       sync_dependencies(), and their calls to glops:

       a). Check if the inter-node lock state (as we see it in our local
           glock structure) is already in the requested state.
           -- If so, we're done.  Simply return from the call.  (Caveat:
              does this have the same problems mentioned above about the
              glock cache and lock storage disagreeing on state??)
           -- If not, proceed to the next step.
       b). call cleanup_glock() to sync dependents' and children's data
           to disk (same as steps a, b, c above).
       c). call the lock module, via ls_ops->lock(), to change the
           inter-node lock state to the requested one.
       d). update the cached glock to reflect the new status returned
           from the lock module (including setting GLF_RELEXCL/DFRD/SHRD
           if the locking module knows of another queued request(?)).
       e). call gl_ops->acquire() to load fresh LVB data from the locking
           module if needed.

If the callback function could not immediately satisfy the request of the
other node, the GLF_RELEXCL/DFRD/SHRD bits store the fact that another
node wants the lock.  When the filesystem unlocks a lock, the
ogfs_gunlock() function checks the following in the glock structure:

 -- gl_count, to see if any other processes have a hold on the lock.  If
    not, we can release the lock to another requesting node, if there is
    one.
 -- the gl_flags field, for the GLF_RELEXCL/DFRD/SHRD bits.  If set, it
    calls either drop_glock() (for exclusive) or xmote_glock() (for
    deferred or shared).  These are the same functions called by the
    callback, described above.


G-Lock Structure
----------------

The following paragraphs describe each member of struct ogfs_glock.  One
such structure exists for each G-Lock.

1. G-Lock cache hash table

  struct list_head gl_list   -- Hash table hook
  unsigned int     gl_bucket -- Hash bucket that we inhabit

See "G-Lock Cache", above.

2. Lock name

  lm_lockname_t gl_name -- Unique "name" (but not a string!) for lock

The lockname structure has two components:

  uint64       ln_number -- lock number
  unsigned int ln_type   -- type of protected entity

For most locks, the lock number is the block number (within the
filesystem's 64-bit linear block space, which can span many storage
devices) of the protected entity, left shifted to be equivalent to a
512-byte sector.  Details are in src/fs/glock.c, ogfs_blk2lockname() (a
sketch of the conversion appears after the special-lock list below).  As
an example, if we wanted to protect an inode at block 0x100, and we are
using 4-kByte blocks, the lock number would be 0x0800 (0x100 << 3).  I
believe the block-to-sector conversion is for support of hardware-based
DMEP protocols, which address the DMEP storage space in terms of 512-byte
sectors.  This could turn out to be problematic in *very large* 64-bit
filesystems, if they want to use the upper 3 bits of the 64-bit block
number.

There is a special lock for the disk-based superblock, defined in
src/fs/ogfs_ondisk.h.  Note that this lock is not based on the block
number (the superblock is *not* stored in block 0):

  OGFS_SB_LOCK (0)     -- protects superblock read accesses from fs
                          upgrades

In addition to the block-based number assignments, OpenGFS uses some
special, non-disk lock numbers.  They are defined in src/fs/ogfs_ondisk.h
(even though they don't show up on disk!):

  OGFS_MOUNT_LOCK (0)  -- allows only one node to mount at a time.
                          Note:  same lock number as OGFS_SB_LOCK, but
                          different type, so a different lock!
  OGFS_LIVE_LOCK (1)   -- protects??
  OGFS_TRANS_LOCK (2)  -- protects journal recovery from journal
                          transactions
  OGFS_RENAME_LOCK (3) -- protects file/directory renaming/moving
                          operations

See "Special Locks" below for more details.
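
Here is the block-number-to-lock-number conversion sketched as code.
This is a hedged illustration of the arithmetic described above, not a
copy of ogfs_blk2lockname(), whose actual signature may differ:

  /* Hedged sketch of the conversion in src/fs/glock.c,
   * ogfs_blk2lockname().  bsize_shift is log2 of the filesystem block
   * size (e.g. 12 for 4-kByte blocks); 9 is log2 of a 512-byte sector. */
  static void blk2lockname_sketch(uint64 block, unsigned int bsize_shift,
                                  unsigned int type, lm_lockname_t *name)
  {
          name->ln_number = block << (bsize_shift - 9);
          name->ln_type   = type;          /* e.g. LM_TYPE_INODE */
  }

  /* With 4-kByte blocks, block 0x100 becomes lock number
   * 0x100 << (12 - 9) = 0x800, matching the example above. */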
The lock type is determined by the glops attached to the ogfs_glock()
call used to request the lock.  See "glops", elsewhere in this document.
Lock types are defined in src/include/lm_interface.h:

  LM_TYPE_RESERVED   (0x00) -- not used by OpenGFS
  LM_TYPE_CIDBUF     (0x01) -- cluster information device, used by memexp
  LM_TYPE_MOUNT      (0x02) -- mount, used by memexp
  LM_TYPE_NONDISK    (0x03) -- special locks
  LM_TYPE_INODE      (0x04) -- inodes
  LM_TYPE_RGRP       (0x05) -- resource groups
  LM_TYPE_META       (0x06) -- metadata
  LM_TYPE_NOPEN      (0x07) -- n-open
  LM_TYPE_FLOCK      (0x08) -- Linux flock
  LM_TYPE_PLOCK      (0x09) -- POSIX file lock
  LM_TYPE_PLOCK_HEAD (0x0A) -- POSIX file lock head
  LM_TYPE_LVB_MASK   (0x80) -- Lock Value Block, ORed with other type
                               number

Note that there is no lock type for individual data blocks.  The glock
layer inserts individual data blocks into a list of protected blocks
associated with each glock.  For example, a locked inode may have many
data blocks attached to its glock.

Since the lock name is dependent on *both* the lock number and the type,
ogfs can request more than one unique lock (each of a different type) on
the same filesystem block or static lock number.  As an example,
ogfs_createi() (create a new inode) locks two locks on the same lock
number (the current OpenGFS implementation sets inum.no_formal_ino =
inum.no_addr), but with different lock types/glops:

  -- (inum.no_formal_ino), in exclusive (0) mode, using ogfs_inode_glops
  -- (inum.no_addr), in shared (GL_SHARED) mode, using ogfs_nopen_glops

Even though one of them is exclusive, they will both succeed, since they
are, indeed, different locks.

3. Reference count

  atomic_t gl_count -- Reference/usage count of ogfs_glock structure

This represents a depth of reference/usage/access for code reading or
writing the structure contents.  It does *not* represent anything
regarding lock state, recursive locking, or exclusive access to a glock
structure.

  ogfs_get_glstruct()  increments gl_count if the structure is found in
                       the glock cache, or sets gl_count = 1 if newly
                       allocated (and does lots more!)
  ogfs_put_glstruct()  decrements gl_count (and does nothing more!)
  ogfs_hold_glstruct() increments gl_count (and does nothing more)

gl_count > 0 keeps the glockd daemon from removing a glock from the
"notheld" glock cache chain and de-allocating its structure.

4. Flags

  unsigned long gl_flags -- Flags

These appear to be mostly (except for LOCK, SYNC, DIRTY, POISONED,
RECOVERY(?)) for glock cache maintenance.

  GLF_HELD     - lock is held by a process (in "held" or "perm" glock
                 cache).  Set/reset only within glock.c by:
                 hold_glock(), unhold_glock()
  GLF_PERM     - lock is expected to be held for a long time (in "perm"
                 cache).  Set/reset only within glock.c by:
                 perm_glock(), unperm_glock()
  GLF_LOCK     - mutex for exclusive access to all glock structure
                 fields.  Set/reset only within glock.c by:
                 try_lock_on_glock(), lock_on_glock(), unlock_on_glock()
  GLF_SYNC     - sync data and metadata to disk when process releases
                 lock
  GLF_DIRTY    - the incore data/metadata !!!??? has changed
  GLF_POISONED - transaction failed
  GLF_RELEXCL  - another computer node needs this lock in exclusive mode.
                 Don't cache it (just drop it) when a process releases
                 it.
  GLF_RELDFRD  - another computer node needs this lock in deferred mode.
                 Keep it cached in this node's "held" chain when a
                 process releases it, in case this node needs it again.
  GLF_RELSHRD  - another computer node needs this lock in shared mode.
                 Keep it cached in this node's "held" chain when a
                 process releases it, in case this node needs it again.
  GLF_STICKY   - don't demote this glock.
                 Used only in glocks for the riinode, jiinode, and
                 transaction.
  GLF_DEMOTEME - demote this glock.  Used by arch-specific code to
                 indicate that there are no more buffers covered by this
                 glock.
  GLF_DEPCHECK - indicates that the lock has dependencies.  Used only
                 within scan_held_glocks().
  GLF_RECOVERY - set by ogfs_glock() when a lock request has
                 LM_FLAG_NOEXP.  Normally, ogfs_glock() resets this
                 before returning.  In some error cases, though, it does
                 not.

5. Lock Structure Ownership (Locking by a process)

  long       gl_pid       -- Process ID of the process, if any, that owns
                             the struct; NULL if no owner, -1 if
                             GL_DISOWN
  atomic_t   gl_locked    -- recursive count of process ownership
  spinlock_t gl_head_lock -- spinlock that covers the above 2 fields
                             (only)

An audit shows that gl_pid is always covered by spinlock gl_head_lock.
gl_locked is sometimes covered by GLF_LOCK (which covers the *entire*
struct) instead of gl_head_lock.

Once a node has acquired a lock, it must prevent corruption of its
protected resource (inode, block, etc.) by multiple processes on the node
(which can have more than one CPU).  This protection is achieved through
the concept of ownership of the ogfs_glock_t structure.

The requesting process can ask that it not be recorded as the owner of
the structure, using the GL_DISOWN flag.  This effectively prevents the
same process from further requesting recursive ownership of the
structure, and allows other processes to unlock the lock (is this
sharing, or not? !!!investigate how this is really used).

Caveat:  There is no concept of a shared (read only) ownership of the
structure within a node.  Thus, all read operations on the protected
resource are serialised within the node.  !!!Investigate how much of a
performance penalty this is.

Caveat:  Because of a race condition between the
request_glock_ownership() and request_glock_wait_or_abort() functions,
requests for ownership can be processed out of order, i.e. a process that
requests ownership later than another process may be granted ownership
first.  !!!Investigate if this can cause deadlocks.  !!!Investigate if a
simple semaphore could be used instead.

Caveat:  A deadlock occurs if a process requests ownership with GL_DISOWN
and later requests the same ownership again.  !!!Investigate if this can
happen.

6. Waiting for process' exclusive access to structure

  wait_queue_head_t gl_wait_lock -- Wait queue for exclusive access to
                                    glock fields (see GLF_LOCK)

gl_wait_lock is a wait queue used for inter-process (not inter-node)
coordination.  This is used with the GLF_LOCK bit in gl_flags, which
provides exclusive access to the fields of the glock structure, but does
*not* indicate anything relating to lock state!

7. Waiting for process' ownership of lock

  wait_queue_head_t gl_wait_unlock -- Wait queue for glock to be unlocked
                                      by another process

This is internal to this node, and does not relate to the inter-node lock
state, which must be locked if a process owns it, and will continue to be
locked as the new process takes ownership.  This has nothing to do with
gl_wait_lock!  This is the wait queue for a process to wait until another
process is done with the lock.

8. Lock operations

  ogfs_glock_operations_t *gl_ops -- Operations which get called at
                                     certain events over the lifetime of
                                     a glock (e.g. just after locking a
                                     lock, or just before unlocking one)

See the separate section on glops.

9. Inter-node Lock State

  unsigned int gl_state -- The inter-node state of the lock

On each node, a lock can be in one of the following states:

  LM_ST_UNLOCKED  -- the node has not acquired the lock.
  LM_ST_EXCLUSIVE -- the node has acquired the lock, and no other node
                     may own or acquire the lock before it is released
                     (write lock).
  LM_ST_SHARED    -- the node has acquired the lock, and other nodes may
                     own or acquire it while this node owns it (read
                     lock).
  LM_ST_DEFERRED  -- another shared mode, but cannot be shared with
                     LM_ST_SHARED.  Note:  It is unclear to me if and how
                     this mode is used.  If it is used, the memexpd
                     server seems to be the one to request that mode on
                     its own, without being told to do so by a node.

Currently, lock modes refer only to inter-node (not inter-process nor
SMP) locking.  Therefore, a node may own the lock and hold it in
exclusive, shared, or deferred state, even though no process on the node
currently has the glock locked.  The glock will be on the "held" glock
cache chain in this situation.

10. Lock Module Private Data

  lm_lock_t *gl_lock -- Per-lock private data for the lock module

The private data is never accessed by glock or filesystem layer code, but
these layers may pass around the pointer for use by the lock module.
This pointer is included in almost every call from the G-Lock layer to
the lock module (usually seen as gl->gl_lock).  Not to be confused with
the module private data pointer saved as sdp->sd_lockstruct.ls_lockspace,
which is not per-lock data, but rather module instance data.

11. Lock Relationships

  struct ogfs_glock *gl_parent   -- This lock's parent lock (NULL if no
                                    parent)
  struct list_head   gl_parlist  -- Parent's list of children
  struct list_head   gl_children -- List of children of this lock

Locks may be attached to a parent when allocating a lock with
ogfs_get_glstruct().  This fills the gl_parent member of this lock's
glstruct, and adds this lock to the parent's gl_children list.  Locks
with identical gl_name values (i.e. identical lock number and type), but
attached to different parents, are considered unique and separate locks.
See find_glstruct().  I (bc) haven't been able to find where this is used
by OpenGFS.

12. Lock Value Blocks

  unsigned int gl_lvb_count -- Number of LVB references held on this
                               glock
  lm_lvb_t    *gl_lvb       -- LVB descriptor (which points to data)

The inter-node lock has a data area that can be used to store global data
and communicate that data to other nodes that acquire the lock.  This
"lock value block" (LVB) currently has a size of 32 bytes.  The G-Lock
layer provides a function interface to attach and detach data to/from a
lock's LVB.

LVBs are used with inter-node locks on resource groups, to pass resource
usage statistics from node to node, when exchanging locks (see "Locking
Resource Groups").  LVBs are also used for plocks (POSIX locks).

13. Version number

  uint64 gl_vn -- Version number (incremented when the cache is not valid
                  any more)

14. Timestamp

  osi_clock_ticks_t gl_stamp -- Time of creation or last unlock

The glock structure's gl_stamp member is used to remember when major
changes of state occur to the glock.  G-Lock code marks the time when it:

 -- gets a glock structure via ogfs_get_glstruct()
 -- unlocks the lock via ogfs_gunlock()
 -- moves the lock from "held" to "notheld" cache chain via
    unhold_glock()

15. Protected Object

  void *gl_object -- The object the glock is protecting

16. Transaction being built

  /* Modified under the glock (i.e. gl_locked > 0) */
  struct list_head gl_new_list -- List of glocks in transaction being
                                  built
  struct list_head gl_new_bufs -- List of buffers for this lock in
                                  transaction being built
  ogfs_trans_t     gl_trans_t  -- The transaction being built
17. In-core Transaction

  /* Modified under the log lock */
  struct list_head gl_incore_list -- List of glocks in incore transaction
  struct list_head gl_incore_bufs -- List of buffers for this lock in
                                     incore transaction
  ogfs_trans_t     gl_incore_tr   -- The incore transaction

18. Dependent G-Locks

  atomic_t         gl_num_dep -- The number of glocks that need to be
                                 synced before this one can be released
  struct list_head gl_depend  -- The list of glocks that need to be
                                 synced before this one can be released

OGFS uses this to make sure that all inodes (with all associated data
pages and buffers) in a resource group are flushed to disk before the
resource group can be released.  These fields are set by
ogfs_add_gl_dependency(), which is called only from blklist.c functions:

  ogfs_blkfree()  -- free a piece of data
  ogfs_metafree() -- free a piece of metadata
  ogfs_difree()   -- free a dinode

19. Architecture-specific (i.e. kernel 2.4 vs. user-space) data

  ogfs_glock_arch_t gl_arch -- Pointer to struct ogfs_glock_arch

Kernel-space (src/fs/arch_linux_2_4) uses this for a list of filesystem
buffers associated with the glock, for the purpose of interacting with
the kernel buffer cache.  The list contains entries of type
ogfs_bufdata_t, which is a private data structure that filesystem code
attaches to Linux kernel buffer heads.

  struct ogfs_glock_arch {
          struct list_head gl_bufs;  /* Buffer list for caching */
  };
  typedef struct ogfs_glock_arch ogfs_glock_arch_t;

User-space (src/fs/arch_user) defines ogfs_glock_arch as empty.


Expiring Locks and the Recovery Daemon
--------------------------------------

The lock module is responsible for detecting dead (expired) nodes.  The
memexp protocol does this with a heartbeat counter for each client node
(see ogfs-memexp for more info).  Note that there is no timeout on
individual locks, and no time restriction on how quickly a filesystem
operation must complete.

Once a node is detected as "expired", each of the locks that it held in
shared (read) mode is freed, and each of the locks that it held in
exclusive (write) mode is marked as "expired".  This is done by another
node (the "cleaner" node, assigned by the lock module).  After
freeing/marking, recovery of the dead node's journal may be performed.

The ogfs_glock_cb() function provides an interface for the locking module
to inform the G-Lock layer, via the LM_CB_EXPIRED callback command, to
replay a dead node's journal.  When the callback occurs, ogfs_glock_cb()
sets a bit in sdp->sd_dirty_j, a bitmap that indicates which journal
needs recovery, and then wakes up the process of the journal recovery
daemon, ogfs_recoverd().  The recovery daemon normally runs every 60
seconds, and normally finds, when checking sdp->sd_dirty_j, that no
journals need to be replayed.  The callback is the only place where code
sets a bit in sdp->sd_dirty_j, thus the callback is the only method for
triggering journal recovery of an expired node (is there a need for the
periodic daemon, then?).

There is no need for more than one node to replay the dead node's
journal.  The assignment to replay the journal (that is, the recipient(s)
of the LM_CB_EXPIRED callback) depends on the implementation of the
locking backend.

When replaying a dead node's journal, the dead node's "expired" (i.e.
exclusive lock held by the dead node) journal lock is needed by the
"cleaner" node to write journal replay results to the filesystem.  The
special flag LM_FLAG_NOEXP, contained in a call to the backend's lock
function, allows the backend to grant the lock, even though the lock is
"expired".
See comments in src/fs/glock.h.

LM_FLAG_NOEXP is also used during filesystem mount to obtain some special
locks that are absolutely needed at mount time, and which may be expired due
to the death of this or another node.  LM_FLAG_NOEXP is used to obtain the
following locks:

during filesystem mount:

OGFS_MOUNT_LOCK -- exclusive lock owned only when mounting filesystem
     (recovers lock from any node that died while mounting)
OGFS_LIVE_LOCK -- shared lock owned through lifetime of filesystem on this
     node (recovers lock from any node that died while mounting).  (since
     this lock is always shared, never exclusive, would it ever be put in
     "expired" state?  might depend on implementation of locking module?)
journal lock -- exclusive lock for *this* machine's journal, owned through
     lifetime of filesystem (NOEXP needed only if this node died and fs is
     being remounted).

during journal recovery for any node's journal, *this* or other:

transaction lock -- exclusive lock when doing journal recovery, keeps all
     other machines from writing to filesystem.
journal lock -- exclusive lock when doing journal recovery, allows node to
     use and modify the journal.

Note that journal recovery is performed without regard to locks on any of the
recovered items.  ogfs_recover_journal() grabs only the journal and
transaction locks mentioned above, then calls replay_metadata(), which writes
to the filesystem without grabbing locks on anything it is writing.  This is
why it is important to stop all writes across the filesystem before doing a
journal replay.

G-Lock Interfaces
-----------------

The G-Lock layer defines a set of operations which an underlying locking
protocol must implement.  These were described in section 2, "Lock Harness
and Lock Modules".  The G-Lock layer also offers a set of services that can
be used by the file system, independent of the underlying architecture and
mounted locking protocol:

Basic lock functions:

ogfs_get_glstruct - locate a pre-existing glock struct in G-Lock cache, *or*
     allocate a new one from kernel, init it, link with parent glock (if
     parent is in call), call lock module to allocate a per-lock private data
     structure, attach private data to glock, and place into "notheld" chain
     in glock cache.  Note that this does not make this lock visible to other
     nodes, nor does it fill in any current lock status.
ogfs_put_glstruct - decrement process ownership count (gl_count).  Note that
     this does *not* de-allocate the structure, even if the count decrements
     to 0.  This is *not* the opposite of ogfs_get_glstruct.  De-alloc relies
     on the ogfs_glockd() daemon, which runs once every 5 seconds, or on the
     LM_CB_DROPLOCKS callback from the lock module, to perform garbage
     collection.
ogfs_hold_glstruct - increment process ownership count (gl_count)
ogfs_glock - lock a lock
ogfs_gunlock - unlock a lock
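As a rough illustration of how these primitives compose -- the parameter
lists below are assumptions (see src/fs/glock.h for the real prototypes),
and error handling is omitted:

    /* Hypothetical caller of the basic G-Lock interface. */
    void example_read(ogfs_sbd_t *sdp, uint64 blkno)
    {
            ogfs_glock_t *gl;

            /* find or create the glock struct; no inter-node activity yet */
            gl = ogfs_get_glstruct(sdp, blkno, &ogfs_inode_glops,
                                   NULL /* no parent */, TRUE /* create */);

            ogfs_glock(gl, GL_SHARED);  /* acquire lock (may talk to lock module) */
            /* ... read the protected on-disk object ... */
            ogfs_gunlock(gl, 0);        /* unlock; glock stays cached on a chain */

            ogfs_put_glstruct(gl);      /* drop gl_count; ogfs_glockd() frees later */
    }

"Wrappers" for basic lock functions.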
All except ogfs_glock_num() require that the glock structure has already been
allocated via ogfs_get_glstruct():

ogfs_glock_i - lock an inode
ogfs_gunlock_i - unlock an inode
ogfs_glock_rg - lock a resource group
ogfs_gunlock_rg - unlock a resource group
ogfs_glock_num - lock a lock, given its number
ogfs_gunlock_num - unlock a lock, given its number
ogfs_glock_m - lock multiple locks, given a list
ogfs_gunlock_m - unlock multiple locks, given a list

LVB functions:

ogfs_hold_lvb - attach lock value block (LVB) to a glock
ogfs_unhold_lvb - detach lock value block (LVB) from a glock
ogfs_sync_lvb - sync an LVB (to lock storage, visible to other nodes)

Lock Dependency functions:

ogfs_add_gl_dependency - make release ordering dep. between two glocks
sync_dependencies - sync out dependent locks (to lock storage? fs?)

Callback:

ogfs_glock_cb - callback used by lock modules

For prototypes and flags see "src/fs/glock.h".

glops
-----

Each G-Lock has a vector of functions ("operations") attached to it via
gl_ops.  These functions handle all the interesting behavior of the
filesystem and journal that must occur just after getting a lock or just
before letting one go, such as:

Just after getting a lock:
-- reading items from disk
-- reading LVB contents (rgrp usage statistics), sent from old lock owner

Just before giving up a lock:
-- flushing items to disk, so another computer can read them
-- invalidating local buffers, so we don't try to read them while another
   computer is modifying their contents
-- filling LVB contents (rgrp usage statistics), for new lock owner to use

The operations are architecture-dependent (arch_user vs. arch_linux_2_4), and
are type-dependent on the protected resource (inode, resource group, etc.).
The operations are called only from within glock.c, and are all (except for
one) implemented in arch_*/glops.c.  The operations are defined in "struct
ogfs_glock_operations".  A short description is given here, see
src/fs/incore.h for details:

operations called by G-Lock layer:

go_sync - synchronise/flush dirty data (protected by a lock) to disk
     meta, rgrp:  sync glock's incore committed transaction logs
                  sync all glock's protected dirty data bufs to disk
     inode:       same, plus sync protected dirty *pages* to disk
     other types: no action
go_acquire - called just after a lock is acquired
     rgrp:        copy rgrp usage data from LVB (loaded from lock storage,
                  contains latest data from any node)
     other types: no action
go_release - release a glock to another node that needs it in exclusive mode.
     meta, inode: sync glock's incore committed transactions
                  sync all glock's protected dirty data to disk
                  invalidate all glock's buffers/pages
     rgrp:        same, plus copy rgrp usage data to LVB (to store and make
                  visible to other nodes)
     other types: no action
go_lock - if this is process' first (recursive) lock on this glock:
     inode:       read fresh copy of inode from disk
     rgrp:        read fresh copy of rgrp bitmap from disk
     other types: no action
go_unlock - if this process is finished with this glock:
     inode:       copy OGFS dinode attributes to kernel's VFS inode (so
                  kernel can pass it to other process, and/or write to disk)
     rgrp:        brelse() rgrp bitmap (so kernel can pass it to another
                  process, and/or write it to disk)
                  copy usage stats to LVB structure (so glock layer can pass
                  to other process or node)
     other types: no action
go_free - free all buffers associated with a glock.  Used only when
     unmounting the lock protocol.
     meta, inode, rgrp: free all buffers
     other types: no action

data fields:

go_type - type of entity protected (e.g. inode, resource group, etc.)
go_name - human-readable string name for particular ogfs_glock_operations
     definition

Different implementations exist for different types of entities to be
protected, e.g. inode, resource group, etc.  Many of these types require only
a few, or none, of these operations, in which case the respective fields
contain NULLs.  The go_type and go_name fields, however, are defined for each
and every ogfs_glock_operations implementation.

When requesting a lock structure, filesystem code selects the
ogfs_glock_operations implementation to be attached to the lock, via the
parameter "glops" in the call to ogfs_get_glstruct().  See Appendix H for an
inventory and analysis of glops calls.

4. Using G-Locks in OpenGFS
---------------------------

The file system code uses G-Locks to protect the various structures on the
disk from concurrent access by multiple nodes.  The protected ondisk objects
can be dinodes (including the resource group and journal index dinodes),
resource group headers, the superblock, buffer heads (blocks), and journals.
In addition, glocks are used for non-disk locks for cluster coordination.
The following information describes the locking strategies for various types
of locks and protected entities.

Special Non-Disk Locks
----------------------

In addition to the block-based number assignments, OpenGFS uses some special,
non-disk lock numbers.  They are defined in src/fs/ogfs_ondisk.h (even though
they don't show up on disk), and are all of LM_TYPE_NONDISK:

OGFS_SB_LOCK (0) -- protects superblock read accesses from fs upgrades that
     would re-write the superblock.

OGFS_MOUNT_LOCK (0) -- allows only one node to be mounting the filesystem at
     any time.  Locked in exclusive mode, with nondisk glops, when mounting.
     Unlocked when mount is complete, allowing another node to go ahead and
     mount.

OGFS_LIVE_LOCK (1) -- protects??  Locked in shared mode, with nondisk glops,
     when mounting.  Unlocked when unmounting.  Indicates that at least one
     node has the filesystem mounted.

OGFS_TRANS_LOCK (2) -- protects journal recovery operations from new
     transactions.  Used in shared mode by transactions, so many transactions
     may be created simultaneously.  Used in exclusive mode by
     ogfs_recover_journal(), to force other nodes and processes to finish
     current transactions before journal recovery begins, and to keep them
     from starting new transactions until the recovery is complete.  This
     allows the recovery process to have exclusive write access to the entire
     filesystem.  Note that the recovery process does *not* bother to grab
     locks for protected entities (inodes, etc.) that it writes.  Always uses
     trans glops, and is the only lock to do so.  The glock structure for
     this lock is allocated during the filesystem mount, and stays attached
     to the incore superblock structure as sdp->sd_trans_gl.

OGFS_RENAME_LOCK (3) -- protects file/directory renaming/moving operations
     from clobbering one another.  Always used in exclusive mode.  The glock
     structure for this lock is allocated during the filesystem mount, and
     stays attached to the incore superblock structure as sdp->sd_rename_gl.

Unique On-Disk Locks
--------------------

Resource Group Index -- Protects filesystem reads of the resource group index
from filesystem expansion (or shrinking) operations.  Filesystem expansion
cannot proceed until all nodes unlock this lock, therefore all locks must be
temporary.  All filesystem accesses to the rindex, during the normal course
of filesystem operations, are read accesses, protected in shared mode.
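The expansion-detection mechanism described in the paragraphs below amounts
to a version-number comparison.  Roughly (a sketch only -- the names
sdp->sd_riinode_vn, gl->gl_vn and ogfs_rgrp_update() are taken from this
document, but this is not the actual ogfs_rindex_hold() code):

    /* Sketch of the staleness check performed by ogfs_rindex_hold(). */
    ogfs_glock(rindex_gl, GL_SHARED);          /* shared lock on rindex dinode */
    if (sdp->sd_riinode_vn != rindex_gl->gl_vn) {
            ogfs_rgrp_update(sdp);             /* fs grew/shrank: re-read rindex */
            sdp->sd_riinode_vn = rindex_gl->gl_vn;
    }
    /* lock stays held; ogfs_rindex_release() unlocks it later */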
The lock is on the resource index's dinode (LM_TYPE_INODE), identified by its
filesystem block number.  rindex locking and unlocking is done by:

ogfs_get_riinode() -- initial read-in of the rindex's dinode (but not the
     rindex file itself), during the filesystem mount sequence (see
     _ogfs_read_super() in src/fs/arch_linux_2_4/super_linux.c).  Attaches
     OGFS dinode structure to superblock structure as sdp->sd_riinode, and
     the inode's glock structure is attached to that structure.  Unlocks lock
     before leaving function, but sets GLF_STICKY bit so it will stay in
     glock cache.  This and ogfs_get_jiinode() are the only two functions
     that set the GLF_STICKY bit.

ogfs_rindex_hold() -- makes sure we have latest rindex file contents in-core.
     Does *not* unlock the lock unless error.  Called from many places in
     code that need to access resource groups.  THIS FUNCTION DETECTS
     FILESYSTEM EXPANSION (or shrinkage).  See below.

ogfs_rindex_release() -- unlocks lock asserted by ogfs_rindex_hold().

An ogfs_rindex_hold() / ogfs_rindex_release() pair (often/always?) surrounds
a transaction.

If a user invokes the user-space ogfs_expand utility (see man page for
ogfs_expand, and source in src/tools/ogfs_expand/main.c), it writes new
resource group headers out to the new space on disk.  These writes are done
outside of the space that the filesystem knows about (yet), are written using
lseek() and write() calls to the raw filesystem device, and require no locks.
Once done with resource groups, it writes a new rindex, appending the
descriptions of the new resource groups to the current rindex file.  This is,
of course, written to the filesystem proper (i.e. not to the raw device, but
rather to a file), using an ioctl OGFS_JWRITE.  This ioctl (see
ogfs_jwrite_ioctl() in src/fs/ioctl.c) grabs an exclusive lock on the rindex
inode (using its block # as the lock #), and also creates a journal
transaction around the write.  The exclusive lock keeps the rindex write from
proceeding until all nodes have completed accessing resource groups.
Finally, the ioctl increments the version number of the inode's glock,
gl->gl_vn++.  This is what tells ogfs_rindex_hold() that the rindex has
changed.  If so, ogfs_rindex_hold() reads the new rindex from disk.

Journal Index -- Protects filesystem reads of the journal index from journal
addition (or removal) operations.  Journal addition/removal cannot proceed
until all nodes unlock this lock, therefore all locks must be temporary.
Journal protection works just the same way as resource index protection, with
some name changes:

ogfs_get_jiinode() -- initial read-in of jindex inode
ogfs_jindex_hold() -- lock jindex, get latest data, THIS FUNCTION DETECTS
     JOURNAL ADDITION/REMOVAL!
ogfs_rindex_release() -- unlock jindex lock

The user space utility that adds journals is ogfs_jadd.  See man page for
ogfs_jadd, and source in src/tools/ogfs_jadd/main.c.

Locking Dinodes
---------------

Dinodes are locked in shared mode for read access, or in exclusive mode for
write access.  Since dinodes are involved in almost all file system
operations, they are locked quite often in either mode.

Locking Resource Groups
-----------------------

Files and dinodes are stored in "resource groups" on disk (see OGFS
"Filesystem On-Disk Layout").  A resource group is a large set of contiguous
blocks that are managed together.  The filesystem is divided into a number of
equal-sized (except, perhaps, the first) resource groups.  Each resource
group on disk has a header that contains information about used and free
blocks within the resource group.
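As the next paragraphs explain, only the usage statistics in that header ever
change once the filesystem is mounted, and they travel between nodes in the
lock's LVB rather than through disk reads.  A sketch of that round trip (the
LVB layout and the rgd argument are assumptions; ogfs_rgrp_save_out() and
ogfs_rgrp_save_in() appear in Appendix I):

    /* Hypothetical layout of the 32-byte rgrp LVB; the real fields are not
     * spelled out in this document. */
    struct rgrp_lvb {
            uint32 rl_free;        /* free data blocks (illustrative) */
            uint32 rl_freemeta;    /* free metadata blocks (illustrative) */
            /* ... up to 32 bytes total ... */
    };

    /* Releasing node, in release_rgrp() (go_release):
     *     ogfs_rgrp_save_out(rgd);  -- copy usage stats into the glock's LVB
     *
     * Acquiring node, in acquire_rgrp() (go_acquire):
     *     ogfs_rgrp_save_in(rgd);   -- copy usage stats back out of the LVB,
     *                                  avoiding a disk read of the rgrp header
     */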
A file may be spread over a number of resource groups.  When a file or dinode
is manipulated, all resource groups that contain the file's data blocks or
meta data must be locked.  Since resource groups are spread evenly over the
disks, reading them into core memory each time they are accessed would incur
a horrible performance penalty.  However, only a few fields of a resource
group header ever change, namely the usage statistics.

To get around the need to ever read this information from disk once the file
system is mounted, the statistics are stored in the lock value block (LVB) of
the ogfs_glock_t structure that protects the resource group.  When a node
modifies the resource group, it writes the new statistics to disk *and* into
the LVB of the G-Lock.  The next node that acquires the lock can read this
information from the G-Lock instead of reading the disk block.

Locking the Superblock
----------------------

The superblock is read and written only when the file system is mounted or
unmounted.  It is locked only at these times.

Locking Journals
----------------

The journals usually need not be protected because they are used by only one
node each.  !!!

Locking Buffer Heads
--------------------

!!!

Appendices
----------

Appendix A.  G-Lock Call Flags

Filesystem calls to ogfs_glock(), and its various wrappers (e.g.
ogfs_glock_i()), may use the following flags.  If the GL_SHARED or
GL_DEFERRED flag is not used, then the request is for an exclusive lock:

GL_SHARED - the lock may be shared between processes / nodes
GL_DEFERRED - special lock mode, different from SHARED or EXCLUSIVE, but not
     currently used by OpenGFS (so we don't know what it means!).
GL_PERM - lock will be held for a long time, and will reside in PERM cache
GL_DISOWN - disallow recursive locking, allow other process to unlock
GL_SKIP - skip "go_lock()" and "go_unlock()".  In particular, used for
     grabbing locks so LVBs are accessible, while skipping any disk reads or
     flushes/writes that might otherwise occur.  Currently used only for
     resource group locks when doing statfs() whole-filesystem block usage
     statistics gathering operations, skipping time-consuming reads/writes of
     rgrp header and block usage bitmaps.

Filesystem calls to ogfs_gunlock(), and its various wrappers (e.g.
ogfs_gunlock_i()), may use the following flags:

GL_SYNC - all data and metadata protected by this lock shall be synced to
     disk before the lock is released
GL_NOCACHE - lock shall be dropped (*not* cached by the lock layer) after it
     has been unlocked

Appendix B.  G-Lock States

A G-Lock can be in any of the following states:

LM_ST_UNLOCKED - Unlocked
LM_ST_EXCLUSIVE - Exclusive lock
LM_ST_SHARED - Shared lock
LM_ST_DEFERRED - Another shared lock mode which does not share with the
     LM_ST_SHARED mode.

Appendix C.  Callback Types

A lock module on a given node can use the "ogfs_glock_cb()" interface of the
node's G-Lock layer, to notify G-Lock about various situations.  In most
cases, these messages originate from other nodes.  The following types of
messages are specified so far:

LM_CB_EXPIRED - another node has expired (died), recover its journal
LM_CB_NEED_E - another node needs an exclusive lock
LM_CB_NEED_D - another node needs a deferred lock
LM_CB_NEED_S - another node needs a shared lock
LM_CB_DROPLOCKS - drop unused cached glocks (used when lock storage is
     getting full)

The "EXPIRED" callback is discussed in section "Expiring Locks and the
Recovery Daemon".  The "NEED" callbacks are discussed in section "Caching
G-Locks, Callbacks".
The "DROPLOCKS" callback is discussed in section "G-Lock Cache, G-Lock Daemon". Appendix D. Example "touch foo" (dv and bc) The following sequence of calls to ogfs_glock() has been observed after creating a new OpenGFS filesystem (i.e. running mkfs.ogfs), mounting it at /ogfs, then running the following command: $ touch /ogfs/foo ogfs_glock gets called for the following lock numbers (in the order listed): 20 20 19 17 21 21 2 21 20 21 2 Block 17 is the header of resource group 0 (block bitmap) Block 19 is the resource group index inode. Block 20 is the root directory inode. Block 21 is the block with the new inode for foo. Lock # 2 is the transaction lock (OGFS_TRANS_LOCK). Appendix E. Flaws in the current design and implementation of locking and related to locking (dv) * The locking code takes care of way too many things: - Inter-node locks - Inter-process locks - Caching locks - Deadlock detection - Watching node heartbeat - Stomithing dead nodes - Calling the journal recovery code * All layers involved in the locking system are both active and passive (calling and being called by each adjacent other layer). * Deadlock detection places the restriction that single read() or write() operations (or any other that uses locks) must be completed before the lock expires. That limits the possible size of atomic transfers drastically and can cause problems on systems with poor response times. * Resource group locking has deadlocks with transactions that span multiple resource groups. * Due to potential deadlocks, any write() operation to a file must be served by a single resource group. This further limits the flexibility of the file system and, I think, violates (which?) specs. For example, on a 100 MB OGFS with ten resource groups, the largest possible chunk that can be written with a single write() call is 10 MB, or less than that if none of the resource groups is empty. * Inter-node locks are needed too often. A simple "touch /foo" on a pristine ogfs file system needs no less than eleven locks. * The default block allocation policy is that each node uses a single resource group, selected at mount time based on the node's unique journal/cluster ID, so each node uses a different resource group. This policy crams all directories and inodes (created by a single cluster node) into a single resource group, causing a catastrophic increase in lock collisions. Other policies (random, and round-robin selection of resource group) are available, but ignore the layout of the data on disk, possibly replacing delays caused by locking with delays caused by disk seeks. * There is no concept of shared ownership of inter-process locks. Sharing such locks, instead of serializing them, would enhance read performance. Appendix F. Analysis of potential resource group deadlocks (dv) Fact 1: Resource groups need to be locked only to allocate or deallocate blocks. It is not necessary to lock the rg just to modify an inode or data block. Fact 2: There potentially are deadlocks if two or more resource groups are locked in random order. Fact3: When a new directory entry is created, the hash table of the directory might grow, requiring allocation of additional blocks. Fact 4: When any data or meta data is allocated, the resource groups are locked exclusively, one by one, until one with enough space is found. This can cause lots of inter node locks when the file system becomes full. Now let us see how many *resource groups* are locked by various operations. a) Modifying data blocks or inodes No rg locks required. 
b) Allocating an inode (mkdir(), create(), link(), symlink())

   Creates a new directory entry in the parent directory.  In current code,
   if the directory grows (and thus needs new meta data blocks), the whole
   directory hash table, plus any other new blocks, are moved to/allocated in
   the same resource group as the new inode.  This localization, once
   accomplished, minimizes the number of rgs that must be locked when
   accessing directory entries ...

   ... It may not seem like such a big limitation, but the current code tries
   to reserve enough space in that rg for the worst case of directory growth
   (hash table is created and immediately explodes to maximum size).  In
   other words: in order to create a new inode, the target resource group
   must have about 1 MB of free data plus meta data blocks.

c) Deallocating an inode

   Locks the inode's rg to update the block bitmap.  Since ogfs never frees
   the space that has become unused in directories, the dir's rg is *not*
   locked.

d) Allocating file data / write() / writepage()

   Only one rg is locked.  A single ogfs_write() call never writes to more
   than a single resource group.  This is an unacceptable limitation of the
   write() system call.

e) Truncating a file (spanning multiple rgs)

   May need many rg locks.  Sorts them before locking them.

f) Removing a file or directory / unlink()

   Is done in two steps:
   1). The directory entry is removed (no rg locks required, see above).
   2). The inodes scheduled for removal are listed in the log, and their
       blocks are freed only after the transaction has been completed (i.e.
       flushed to disk?).  This second stage needs to truncate each file (to
       0 size) and remove its inode, sorting the corresponding rgs before
       locking them.  (This description may be a bit inaccurate).

g) Renaming plain files / rename()

   Needs one rg lock (see (f)).

h) Renaming directories / rename()

   Needs one rg lock (see (f)).  In addition, another lock serializes
   directory renaming operations.

i) statfs()

   Locks all resource groups, one at a time, in order, while accumulating
   statistics from each group.

j) mmap() shared writable

   Would need many rg locks, which could be ordered.  Not implemented, since
   it would lock large parts of the file system for possibly long times.

k) flock()

   Does not need any rg locks.  It also prevents non-locking file access by
   other processes, and is thus not POSIX conformant.

Summary

In the current code, rg deadlocks are not possible, at least not with the
above operations.  But the price one pays is high:

- write() never writes more data than fits into the rg with the most free
  space.
- Inodes can be created only in resource groups that have at least 1 MB of
  free space.
- Once allocated, empty directory blocks are never freed.
- A directory hash table is never shrunk.
- Meta data blocks are never converted back to data blocks.
- When a directory hash table grows, it is copied to the same rg as the new
  inode en bloc.
- When a new directory leaf is allocated, it is created in the same rg as the
  new inode.  This has the potential to scatter the directory leaves all over
  the file system.
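The deadlock avoidance used by operations (e) and (f) above -- sort the
resource groups, then lock them in ascending order -- might look roughly like
this (ogfs_rlist_sort() and ogfs_glock_rg() appear in Appendix H; the rlist
member names rl_count and rl_rgd are assumptions):

    /* Sketch of ordered rg locking: a global lock order prevents deadlock. */
    unsigned int i;

    ogfs_rlist_sort(&rlist);               /* sort rgs by ascending block # */
    for (i = 0; i < rlist.rl_count; i++)
            ogfs_glock_rg(rlist.rl_rgd[i], 0 /* exclusive */);

Appendix G.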
Inventory of calls to ogfs_get_glstruct()

glock.c:

ogfs_glock_num() -- Find (alloc if needed) and lock a lock, given number/type
   Number: From caller
   Type: From caller
   Parent: From caller
   CREATE: Yes
   Calls: ogfs_get_glstruct()
          ogfs_glock()
ogfs_gunlock_num() -- Find and unlock a lock, given number/type
   Number: From caller
   Type: From caller
   Parent: From caller
   CREATE: No
   Calls: ogfs_get_glstruct()
          ogfs_gunlock()
          ogfs_put_glstruct()

inode.c:

ogfs_lookupi() -- look up a filename in a directory, return its inode
   Number: inum.no_formal_ino
   Type: inode
   Parent: No
   CREATE: Yes

plock.c:

find_unused_plockgroup() -- find a value to use for the plock group
   Number: bid, composed of journal id, plock group id, and lock id
   Type: plock
   Parent: No
   CREATE: Yes
hold_plockgroup() --
   Number: bid, composed of journal id, plock group id, and lock id
   Type: plock
   Parent: No
   CREATE: Yes
load_plock_head() --
   Number: ip->i_num.no_formal_ino
   Type: inode
   Parent: No
   CREATE: Yes
load_plock_jid_head() --
   Number: bid, composed of journal id, plock group id, and lock id
   Type: plock
   Parent: No
   CREATE: Yes
load_other_plock_elems() --
   Number: bid, composed of journal id, plock group id, and lock id
   Type: plock
   Parent: No
   CREATE: Yes
add_plock_elem() -- adds a plock to a journal id's chain
   Number: bid, composed of journal id, plock group id, and lock id
   Type: plock
   Parent: No
   CREATE: Yes

recovery.c:

ogfs_get_log_header() -- read the log header for a given journal segment
   Number: jdesc->ji_addr, block # of journal index
   Type: meta
   Parent: No
   CREATE: No
   Calls: ogfs_dread() to read the header block of the journal segment
          ogfs_put_glstruct() as soon as read is done
foreach_descriptor() -- go through the active part of the log
   Number: jdesc->ji_addr, block # of journal index
   Type: meta
   Parent: No
   CREATE: No
   Calls: ogfs_dread() to read a block of the journal
          ogfs_put_glstruct() as soon as read is done
do_replay_local() -- replay a metadata block (when in local fs mode)
   Number: jdesc->ji_addr, block # of journal index
   Type: meta
   Parent: No
   CREATE: No
   Calls: ogfs_get_glstruct() to get ptr to glstruct
          ogfs_put_glstruct() as soon as ptr is obtained
   Comments from code: The lock should (already) be held, so we don't need to
   hold a count on the structure.  We do need its pointer, though.
do_replay_multi() -- replay a metadata block (when in multi-host mode)
   Number: jdesc->ji_addr, block # of journal index
   Type: meta
   Parent: No
   CREATE: No
   Calls: ogfs_get_glstruct() to get ptr to glstruct
          ogfs_put_glstruct() as soon as ptr is obtained
   Comments from code: The lock should (already) be held, so we don't need to
   hold a count on the structure.  We do need its pointer, though.
replay_metadata() -- replay a metadata block (when in multi-host mode)
   Number: jdesc->ji_addr, block # of journal index
   Type: meta
   Parent: No
   CREATE: No
   Calls: ogfs_get_glstruct() to get ptr to glstruct
          ogfs_put_glstruct() as soon as ptr is obtained
   Comments from code: The lock should (already) be held, so we don't need to
   hold a count on the structure.  We do need its pointer, though.
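The recurring pattern in the do_replay_local()/do_replay_multi()/
replay_metadata() entries above -- fetch the glstruct only to borrow its
pointer, then drop the count at once -- would look roughly like this (the
argument list of ogfs_get_glstruct() is an assumption):

    /* Sketch only.  The journal lock is already held, so the glstruct cannot
     * go away; we only need its pointer, not an extra reference. */
    ogfs_glock_t *gl;

    gl = ogfs_get_glstruct(sdp, jdesc->ji_addr, &ogfs_meta_glops,
                           NULL, FALSE /* CREATE: No */);
    ogfs_put_glstruct(gl);   /* drop the count right away; ptr stays usable */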
clean_journal() -- mark a dirty journal as being clean Number: jdesc->ji_addr, block # of journal index Type: meta Parent: No CREATE: No Calls: lots of things ogfs_put_glstruct() at end of function collect_nopen() -- called by foreach_descriptor to get nopen counts Number: jdesc->ji_addr, block # of journal index Type: meta Parent: No CREATE: No Calls: ogfs_get_glstruct() to get ptr to glstruct ogfs_put_glstruct() as soon as ptr is obtained Comments from code: The lock should (already) be held, so we don't need to hold a count on the structure. We do need its pointer, though. arch_linux_2_4/super_linux.c: _ogfs_read_super -- the filesystem mount function Number: OGFS_TRANS_LOCK, the cluster-wide transaction lock Type: trans Parent: No CREATE: Yes _ogfs_read_super -- the filesystem mount function Number: OGFS_RENAME_LOCK, the cluster-wide file rename/move lock Type: nondisk Parent: No CREATE: Yes Appendix H. Inventory of calls to ogfs_glock(), either direct, or indirect via: glock.h: ogfs_glock_i() -- lock glock for this inode (given ptr to inode struct) ogfs_glock_rg() -- lock glock for this resource group (given ptr to rg struct) glock.c ogfs_glock_num() -- lock glock for this lock # (often, lock # = block #) ogfs_glock_m() -- lock multiple glocks (given list of glock ptrs) All except ogfs_glock_num() require that glock structure has already been allocated via ogfs_get_glstruct(). Calls to ogfs_glock() directly (excluding those from ogfs_glock_*()): -------------------------------------------------------------------- inode.c: inode_dealloc() ogfs_lookupi() grabs a lock on the found inode, in shared (GL_SHARED) mode. There are two cases here: 1. Found inode's lock name (block #) is < directory's lock name (block #): Unlock directory's inode (give opportunity for someone else to change the directory's knowledge of the inode's block location?) Lock (1st) found inode Lock directory's inode in shared (GL_SHARED) mode. Do another ogfs_dir_search() for the inode (same name) Compare 2nd found inode number with 1st found inode number If same, we've found what we're looking for If different, restart the search 2. Found inode's lock name (block #) is > directory's lock name (block #): Lock (1st) found inode ... this is the one we're looking for Case 1 seems to be a situation in which the inode is moving from block to block, and the code is looking for the directory to be stable as to the inode's final location (?). Once the inode is found, the function reads the inode structure into core, using ogfs_get_istruct(). ogfs_lookupi() has an option to release the locks on the directory and the found inode at the end of the function. Comments from Dominik: I think the dir lock is released and reacquired after the file lock to keep acquiring locks in ascending order (deadlock prevention). However, I'm not sure that nothing bad can happen between releasing and reacquiring the lock. See also ogfs_lookupi() discussed under "Calls to ogfs_glocki()" plock.c: find_unused_plockgroup() load_plock_head() load_plock_jid_head() load_other_plock_elems() load_my_plock_elems() add_plock_elem() recovery.c: ogfs_recover_journal() ... replay a journal to recover consistent state grabs the transaction lock (sdp->sd_trans_gl) in exclusive (0) mode, with flags: LM_FLAG_NOEXP -- Always used. Grab this lock even if it is "expired", i.e. being recovered from a dead node. See "Expiring Locks and the Recovery Daemon". 
Grabbing this lock in exclusive mode prevents other nodes and processes from creating new transactions while the journal recovery is proceeding. This is the only function in which the transaction lock is grabbed in *exclusive* mode. The lock is unlocked by this function as soon as the journal replay is complete. This is a special (non-disk) lock, ID #2: From src/fs/ondisk.h: #define OGFS_TRANS_LOCK (2). src/fs/arch_linux_2_4, _ogfs_read_super() assigns: sdp->sd_trans_gl = ogfs_get_glstruct(sdp, OGFS_TRANS_LOCK, ...); trans.c: ogfs_trans_begin() ... begin a new transaction grabs the transaction lock (sdp->sd_trans_gl) in shared (read) mode (GL_SHARED). Grabbing this lock in shared mode allows other nodes and processes to create transactions simultaneously, unless and until a journal recovery occurs. See comments above. This is the only function in which the transaction lock is grabbed in *shared* mode. The lock is normally unlocked by ogfs_trans_end(), but will be unlocked by ogfs_trans_begin() itself if a failure occurs. arch_linux_2_4/inode_linux.c: ogfs_rename() ... rename a file grabs the rename lock (sdp->sd_rename_gl) in exclusive (0) mode. Grabbing this lock in exclusive mode prevents other nodes and processes from doing any renaming while this renaming is proceeding. This is the only function in which the rename lock is grabbed at all. The lock is released by this function, once the renaming is complete. From src/fs/ondisk.h: #define OGFS_RENAME_LOCK (3). src/fs/arch_linux_2_4, _ogfs_read_super() assigns: sdp->sd_rename_gl = ogfs_get_glstruct(sdp, OGFS_RENAME_LOCK, ...); This function creates a complete transaction! arch_linux_2_4/super_linux.c: _ogfs_read_super() ... set up in-core superblock, mount the fs grabs this node's journal lock (sdp->sd_my_jnl_gl) in exclusive (0) mode, called with disown flag (GL_DISOWN). The lock is then immediately unlocked! The journal lock is a lock on the first block of this node's journal, created and grabbed earlier, in exclusive mode, within the same function, using ogfs_glock_num(). Comments in src/fs/glock.h say that GL_DISOWN tells the lock layer to disallow recursive locking, and allow a different process to unlock the lock. So, it seems, this negates the exclusivity of the lock grabbed earlier in the function, while still holding a lock!?! The earlier exclusive lock is unlocked by ogfs_put_super(), when unmounting the filesystem. Calls to ogfs_glock_i() -------------------------------------------------------------------- blklist.c: ogfs_rindex_hold() -- locks resource index, makes sure we have latest info Resource : resource group index inode (sdp->sd_riinode) block # Mode: shared (GL_SHARED) Type: inode Flags: -- gunlock: only in case of error doing ogfs_rgrp_update() put_glstruct: No unlock: elsewhere, by ogfs_rindex_release(), via ogfs_gunlock_i() (no put_glstruct) Calls: If resource group info is out-of-date (i.e. filesystem has been expanded or shrunk), calls ogfs_rgrp_update() to read new info from disk. Called fm: ogfs_inplace_reserve(), blklist.c * do_strip(), bmap.c leaf_free(), dir.c dinode_dealloc(), inode.c ogfs_stat_rgrp_ioctl(), ioctl.c ogfs_reclaim_one_ioctl(), ioctl.c ogfs_reclaim_all_ioctl(), ioctl.c ogfs_setup_dameth(), super.c ogfs_stat_ogfs(), super.c * calls ogfs_rindex_release() *only* in case of error. Normally relies on ogfs_inplace_release() to do the unlock. All other functions unlock before exiting. None does a put_glstruct(). 
Comments from code: we keep this lock for a long time compared with other
   locks, since it is shared and very, very rarely accessed in exclusive
   mode.
   Comments (bc): What do they mean by "long time", or by "we"?  It looks to
   me that the lock is not very long-lived.
   This function is the one that detects that a filesystem has grown or
   shrunk!  Filesystem size change requires the addition or removal of
   resource groups, which in turn requires a change to the resource index.
   This function compares a version number held in the filesystem superblock
   structure (sdp->sd_riinode_vn) with a version number associated with the
   glock (gl->gl_vn) to detect a change in the resource index.
   This is the same lock grabbed by ogfs_get_riinode().

bmap.c:

ogfs_truncate() ... change file size
   grabs a lock on the file's inode in exclusive (0) mode.
   Comments: file size can grow, shrink, or stay the same.

glock.c:

ogfs_glock_m() wrapper ... for each lock in the list, calls ogfs_glock_i() if
   the lock is for an inode.  Mode is determined by flags contained in the
   list.

inode.c:

ogfs_lookupi() ... look up a filename in a directory
   grabs a lock on the directory inode, in shared mode (GL_SHARED).  In a
   special case (that the found inode is located in a lower block # than the
   searched directory's inode), this function gives up the directory lock,
   then re-acquires it to try the search again.  Does the location
   relationship indicate that something else is messing with the directory??
   See discussion of ogfs_lookupi() under "Calls to ogfs_glock() directly".

ogfs_createi() ... find requested inode, or create a new one
   grabs a lock on the directory inode, in exclusive (0) mode.  If requested
   inode matches a name search of the directory, this function releases this
   exclusive lock before calling ogfs_lookupi(), which places a shared lock
   on the same directory.
   Comment from Dominik: It would be good to be able to downgrade the lock
   from exclusive to shared, without first needing to unlock it entirely
   (i.e. keep a lock locked while transitioning from exclusive to shared).
   This feature is not currently provided in OGFS locking.

ogfs_update_atime() ... update inode's atime, if needed
   grabs a lock on the inode, in exclusive (0) mode, if it needs to write an
   update to the inode.  Called only from
   src/fs/arch_linux_2_4/file.c | inode_linux.c, mostly via OGFS_UPDATE_ATIME
   macro (conditional on whether fs was mounted noatime), but directly from
   ogfs_file_mmap() (also conditional, just doesn't use the macro).  In all
   cases, the caller holds a shared lock on the inode, so ogfs_update_atime()
   must release that shared lock just before grabbing the exclusive lock, if
   it needs to write an update to the inode.  ogfs_update_atime() returns
   with a lock held (either the original shared lock or the replacement
   exclusive lock).  So, the calling function is responsible for releasing
   the lock.  It doesn't matter if the lock is held as shared or exclusive at
   the time of release.

ioctl.c:

ogfs_print_frag() ... print info about block locations for an inode
   grabs a lock on the inode, in shared (GL_SHARED) mode, to read it.

ogfs_jread_ioctl() ... read from a journaled file, via ioctl
   grabs a lock on the file's inode, in shared (GL_SHARED) mode, to read the
   inode and file, using ogfs_readi().

ogfs_jwrite_ioctl() ... write to a journaled file, via ioctl
   grabs a lock on the file's inode, in exclusive (0) mode, to write it.
   Lots of interesting stuff going on in this function, look again later!

super.c:

ogfs_jindex_hold() ...
grab a lock on the journal index (ondisk)
   grabs a lock on the ondisk journal index inode (sdp->sd_jiinode), in
   shared (GL_SHARED) mode.  Also uses LM_FLAG_TRY if caller does *not* want
   to wait for the lock if it is currently unavailable.  Function compares
   version numbers of incore superblock's sdp->sd_jiinode_vn and inode's
   glock version # ip->i_gl->gl_vn.  If they are out of sync, then the incore
   journal index is out-of-date relative to the ondisk jindex(?).  To read
   the new journal index into core, the function calls ogfs_ji_update().

arch_linux_2_4/dcache.c:

ogfs_drevalidate() ... validate lookup path from parent directory to inode
   grabs a lock on parent directory, in shared (GL_SHARED) mode.
   Function uses ogfs_dir_search(), and kernel's BKL (lock_kernel()).

arch_linux_2_4/file.c:

ogfs_read() ... read bytes from a file
   grabs a lock on file's inode, in shared (GL_SHARED) mode.
   Function uses kernel's generic_file_read(), and BKL (lock_kernel()).

ogfs_write() ... write bytes to a file
   grabs a lock on file's inode, in exclusive (0) mode.  Releases when done.
   Calls ogfs_inplace_reserv(), ogfs_inplace_release(), ogfs_trans_begin(),
   ogfs_trans_end(), ogfs_get_inode_buffer(), ogfs_trans_add_bh(),
   ogfs_dinode_out().  Creates a complete transaction!
   Function uses kernel's generic_file_write_nolock(), brelse(), and BKL
   (lock_kernel()).

ogfs_readdir() ... read directory entries from a directory
   grabs a lock on directory inode, in shared (GL_SHARED) mode.  Releases
   lock when done reading.  Function uses ogfs_dir_read().

ogfs_sync_file() ... sync file's dirty data to disk (across the cluster)
   grabs a lock on file's inode, in exclusive (0) mode.  Does *not* release
   the lock, unless there is an error.  Interesting use of exclusive lock!
   Function does no more than just grab the lock.  This forces any other
   cluster member (that might own the file lock) to flush data to disk.
   Function uses kernel's BKL (lock_kernel()).  See also ogfs_irevalidate(),
   which uses a shared lock in a similar way for opposite reasons!

ogfs_shared_nopage() ... support shared writeable mappings (mmap)
   grabs a lock on vm area's inode (area->vm_file->f_dentry->d_inode), in
   exclusive (0) mode.  Releases lock when done.
   Function uses kernel's filemap_nopage(), and BKL (lock_kernel()).

ogfs_private_nopage() ... do safe locking on private mappings (mmap)
   grabs a lock on vm area's inode (area->vm_file->f_dentry->d_inode), in
   shared (GL_SHARED) mode.
   Function uses kernel's filemap_nopage(), and BKL (lock_kernel()).

ogfs_file_mmap() ... memory map a file, no sharing
   grabs a lock on file's inode (file->f_dentry->d_inode), in shared
   (GL_SHARED) mode.  Shared lock is grabbed after mapping, before calling
   ogfs_update_atime() (see ogfs_update_atime(), above).  Once the call
   returns to ogfs_file_mmap(), we release *a* lock ... it may be the shared
   one we originally grabbed, or the exclusive one that ogfs_update_atime()
   grabbed if it needed to write.

arch_linux_2_4/inode_linux.c:

ogfs_set_attr() ... change attributes of an inode
   grabs a lock on the inode, in exclusive (0) mode, to write the inode.
   Creates an entire transaction (ogfs_trans_begin() to ogfs_trans_end()) in
   a certain case.

ogfs_irevalidate() ... check that inode hasn't changed (ondisk?)
   grabs a lock on the inode, in shared (GL_SHARED) mode, to read the inode.
   Interesting use of shared lock!  Function does no more than just grab the
   lock.  This forces the incore image of the inode to sync up with the disk,
   if it's not already in sync.
   Function uses kernel's BKL (lock_kernel()).
See also ogfs_sync_file(), which uses an exclusive lock in a similar way
   for opposite reasons!

ogfs_readlink() ... read the value of a symlink (and copy_to_user).
   grabs a lock on the link's inode, in shared (GL_SHARED) mode.
   Function calls OGFS_UPDATE_ATIME (see ogfs_update_atime(), above), and
   ogfs_get_inode_buffer().
   Function uses kernel's vfs_follow_link(), and BKL (lock_kernel()).

ogfs_follow_link() ... follow a symbolic link (symlink)
   grabs a lock on the link's inode, in shared (GL_SHARED) mode.
   Function calls OGFS_UPDATE_ATIME (see ogfs_update_atime(), above), and
   ogfs_get_inode_buffer().
   Function uses kernel's BKL (lock_kernel()).

ogfs_readpage() ... read a page of data for a file
   grabs a lock on the file's (?) inode (page->mapping->host), in shared
   (GL_SHARED) mode.
   Function calls stuffed_readpage() (src/fs/arch_linux_2_4/inode_linux.c).
   Function uses kernel's UnlockPage(), block_read_full_page(), and BKL
   (lock_kernel()).

ogfs_writepage() ... write a complete page for a file
   grabs a lock on the file's (?) inode (page->mapping->host), in exclusive
   (0) mode.
   Calls ogfs_inplace_reserv(), ogfs_inplace_release(), ogfs_trans_begin(),
   ogfs_trans_end(), and either stuffed_writepage(), or
   block_write_full_page(), depending on whether file is "stuffed" in inode
   block.  Creates a complete transaction!
   Function uses kernel's BKL (lock_kernel()).

ogfs_bmap() ... block map
   grabs a lock on the file (?) (mapping->host), in shared (GL_SHARED) mode.
   Function uses kernel's generic_block_bmap(), and BKL (lock_kernel()).

Calls to ogfs_glock_rg()
--------------------------------------------------------------------

blklist.c:

ogfs_rgrp_lvb_init() ... init the data of a resource group lock value block
   grabs 1 or 2 locks on the resource group:
   If !force, grab a lock in shared (GL_SHARED) mode, with GL_SKIP flag.
   GL_SKIP, used for both the lock and unlock phase, keeps the glops
   lock_rgrp() from reading or writing resource group header/bitmap data to
   or from disk ... all we need to get is the LVB data.
   For all cases (except error), grab an exclusive (0) lock.  Releases lock
   when done.
   Calls ogfs_rgrp_lvb_fill(), ogfs_rgrp_save_out(), ogfs_sync_lvb().

__get_best_rg_fit() ... find and lock the rg that best fits a reservation
   grabs locks on all(?) rgs, one at a time in ascending order, in exclusive
   (0) mode.  The loop accumulates locks, without releasing them unless/until
   a "best fit" rg is found, that is, an rg that can accommodate the complete
   reservation.  Releases locks on all but the selected "best fit" rg.
   Called only by ogfs_inplace_reserve(), as the "plan C" last resort method
   of reserving space for something.

ogfs_inplace_reserve() ... reserve space in the filesystem
   grabs locks on a series of rgs, in exclusive (0) mode.
   Calls ogfs_rgrpd_get_with_hint(), try_rgrp_fit(), __get_pure_metares(),
   __get_best_rg_fit().

bmap.c:

do_strip() ... strip off a particular layer(?) of the file
   grabs locks on a list of rgs, in exclusive (0) mode.
   Calls ogfs_rindex_hold(), ogfs_rlist_add(), ogfs_rlist_sort(),
   ogfs_trans_begin(), ogfs_trans_end(), ogfs_get_inode_buffer(),
   ogfs_trans_add_bh(), ogfs_blkfree(), ogfs_metafree(), ogfs_dinode_out().
   Calls kernel's brelse().  Creates a full transaction!
   Comments from Dominik: I think "strips off a particular layer" refers to
   layers of indirect data blocks.  That is, when a file shrinks, the number
   of indirections may be reduced, too.  See "Filesystem On-Disk Layout" for
   info on indirect data blocks.

dir.c:

leaf_free() ...
deallocate a directory leaf grabs locks on a list of rgs, in exclusive (0) mode. Calls ogfs_rindex_hold(), ogfs_rlist_add(), ogfs_rlist_sort(), ogfs_trans_begin(), ogfs_trans_end(), ogfs_get_leaf(), ogfs_leaf_in(), ogfs_trans_add_bh(), ogfs_metafree(), ogfs_internal_write(). Calls kernel's brelse(). Creates a full transaction! inode.c: dinode_dealloc() ... deallocate a dinode grabs a lock on the inode's resource group, in exclusive (0) mode. Calls ogfs_rindex_hold(), ogfs_trans_begin(), ogfs_trans_end(), ogfs_difree(), ogfs_trans_nopen_change(), ogfs_trans_add_gl(). Creates a full transaction! super.c: ogfs_stat_ogfs() ... do a statfs, adding up statistics from all rgrps. grabs a lock on each resource group in the filesystem, one by one, in shared (GL_SHARED) mode, and with GL_SKIP flag. GL_SKIP skips any reads or writes of resource group data on disk ... all we need to use is the lock's LVB data. Releases each lock after adding (accumulating) stats for its rgrp. Calls ogfs_rindex_hold(), ogfs_rgrpd_get_with_hint(), ogfs_lvb_init(), if it thinks that a node crashed when writing the LVB, to read rgrp statistics from disk, and re-initialize the corrupt LVB. Calls to ogfs_glock_num() ogfs_glock_num() embeds an ogfs_get_glstruct() within the call. ogfs_gunlock_num() does the unlock, and embeds ogfs_put_glstruct() within call. -------------------------------------------------------------------- flock.c: ogfs_flock() -- acquire an flock on a file Resource : ip->i_num.no_formal_ino Mode: shared (GL_SHARED), if flock is shared (otherwise exclusive) Type: flock Parent: No Flags: GL_PERM -- for all flocks GL_DISOWN -- for all flocks LM_FLAG_TRY -- if !wait gunlock_num: only on error unlock: in ogfs_funlock(), via ogfs_gunlock_num() See several interesting comments in this function. inode.c: read_dinode() -- read an inode from disk into the incore OGFS inode cache Resource : ip->i_num.no_formal_ino Mode: shared (GL_SHARED) Type: nopen Parent: No Flags: GL_PERM -- GL_DISOWN -- gunlock_num: see below put_glstruct: Yes unlock: in ??() Calls: ogfs_copyin_dinode(). In error situation, this function calls ogfs_gunlock() with GL_NOCACHE flag, since the lock will not be used in the future. It then calls ogfs_put_glstruct() to decrement the usage count on the glock structure (that's *all* that ogfs_put_glstruct() does). In normal situation, this function just calls ogfs_put_glstruct(), to decrement the usage count. It does not unlock the lock, since presumably something else wants to do something with the block, after it's been read in. Where/when would this lock be unlocked?? inode_dealloc() -- deallocate an inode Resource : inum.no_formal_ino Mode: exclusive Type: inode Parent: No Flags: -- gunlock: Yes, before leaving function put_glstruct: Yes, before leaving function Calls: ogfs_glock() to get nopen lock ogfs_get_istruct() to read inode (ip) from disk ogfs_gunlock() to unlock nopen lock ogfs_dir_exhash_free() ??? ogfs_shrink() to truncate the file to 0 size (deallocate data blocks) dinode_dealloc() to deallocate the inode block ogfs_put_istruct() to decrement usage count on ip istruct ogfs_destroy_istruct() to deallocate istruct from memory ogfs_dealloc_inodes() ... go through the list of inodes to be deallocated Resource : inum.no_addr Mode: exclusive Type: nopen Parent: No Flags: LM_FLAG_TRY, see below gunlock: Yes, before leaving function put_glstruct: Yes, before leaving function LM_FLAG_TRY -- if lock not immediately available, the function makes note of this as a "stuck" inode. 
This keeps us from spinning if the list can't be totally purged.  (Why
   would an inode have a lock on it if it is de-allocatable?).
   Calls: ogfs_pitch_inodes() to throw away any inodes flagged to be
             discarded
          ogfs_nopen_find() to search sdp->sd_nopen_ic_list for a
             deallocatable inode.
          ogfs_gunlock() to unlock the inode (so following call can lock it
             in exclusive mode).
          inode_dealloc() to remove the inode and associated data
          ogfs_put_glstruct() to deallocate the lock structure
   Unlocks the lock, and calls ogfs_put_glstruct() when done with each inode,
   as noted in Calls above.

ogfs_createi() ... create a new inode
   grabs a lock on the inode, in two ways, one right after the other:
   Resource : inum.no_formal_ino
   Mode: exclusive
   Type: inode
   Parent: No
   Flags: --
   gunlock: No (where unlocked?), except after error
   put_glstruct: Yes(!), before leaving function

   Resource : inum.no_addr
   Mode: shared (GL_SHARED)
   Type: nopen
   Parent: No
   Flags: --
   gunlock: Yes, before leaving function
   put_glstruct: Yes, before leaving function

   Note: Current implementation sets inum.no_formal_ino = inum.no_addr (see
   fs/inode.c pick_formal_ino()).  These two locks are differentiated only by
   their glops/type, since the lock number is the same!
   This function creates a complete transaction!

ioctl.c:

ogfs_get_super() ... dump disk superblock into user-space buffer
   Resource : OGFS_SB_LOCK (lock "0"), the cluster-wide superblock lock
   Mode: shared (GL_SHARED)
   Type: meta
   Parent: No
   Flags: --
   gunlock: Yes, before leaving function
   put_glstruct: Yes, before leaving function
   Calls: ogfs_dread() to read the block from disk
          copy_to_user() (kernel) to copy into user-space buffer.
   Normally, you would think that the superblock should be static, so why
   lock it for a read?  To protect it against filesystem upgrades, as rare as
   they may be!  The only other place SB_LOCK is grabbed is in
   _ogfs_read_super(), when mounting.  See below.

recovery.c:

ogfs_recover_journal() ... do a replay on a given journal
   Resource : sdp->sd_jindex[jid]->ji_addr, requested journal's first block
   Mode: exclusive
   Type: meta
   Parent: No
   Flags: LM_FLAG_NOEXP, always
          LM_FLAG_TRY, see below
   gunlock: Yes, before leaving function
   put_glstruct: Yes, before leaving function
   LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
   i.e. being recovered from a dead node.  See "Expiring Locks and the
   Recovery Daemon".
   LM_FLAG_TRY -- Conditionally used.  If this is *not* the first node to
   mount into the cluster, don't block when waiting for the lock.  Instead,
   if the lock is not immediately available, print "OGFS: Busy" to the
   console, *don't* replay the journal, and exit with a successful return
   code.
   This looks at a boolean member of the lock module structure,
   (sdp->sd_lockstruct.ls_first).  When a computer mounts a lock module, the
   module sets this value to TRUE to indicate that the computer is the first
   one in the cluster to mount the module.  The memexp protocol is accurate
   in this (it can check with the memexp lock server), but the nolock
   protocol unconditionally sets this value to TRUE (it has no server to
   check).  The stats protocol passes along the value set by the protocol
   which *it* mounted (stats is a stacking protocol).
   When mounting the filesystem, _ogfs_read_super() will replay *all* of the
   filesystem's journals if ls_first is TRUE, calling ogfs_recover_journal()
   once for each journal.  In this case, we must block when waiting for each
   journal lock (we *must* replay each journal before proceeding).
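Condensed into a sketch (this anticipates the next paragraph; the real
parameter list of ogfs_recover_journal(), and the sd_journals/sd_my_jid
names, are assumptions):

    /* Sketch of _ogfs_read_super()'s journal replay policy. */
    if (sdp->sd_lockstruct.ls_first) {
            /* first mounter: replay every journal, blocking on each lock */
            for (jid = 0; jid < sdp->sd_journals; jid++)
                    ogfs_recover_journal(sdp, jid, 0 /* no TRY: block */);
            ogfs_others_may_mount(sdp);  /* let other nodes' mounts proceed */
            sdp->sd_lockstruct.ls_first = FALSE;
    } else {
            /* later mounter: replay only our own journal; if the lock is
             * busy, someone else is replaying it -- treat as success */
            ogfs_recover_journal(sdp, sdp->sd_my_jid, LM_FLAG_TRY);
    }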
Once all journals have been replayed, _ogfs_read_super() calls ogfs_others_may_mount() (allowing other nodes that are blocked within the protocol mount() call to proceed), and sets ls_first to FALSE. If ls_first is FALSE, _ogfs_read_super() will replay only its own journal. In this case, we grab the lock with LM_FLAG_TRY. If we fail to get the lock, it just means some other computer is currently replaying the journal; there's no need for us to replay it, so we return with "success"! Note: The lock could also be held if a computer is doing a filesystem upgrade, but my guess is that the sequence of events would make it impossible for an upgrade to happen at the same time that we're mounting the filesystem??? super.c: ogfs_do_upgrade() -- upgrade a filesystem to a newer version Resource : sdp->sd_jindex[jid]->ji_addr, each journal's first block Mode: exclusive Type: meta Parent: No Flags: LM_FLAG_TRY, see below gunlock: Yes, after ogfs_find_jhead() for each journal put_glstruct: Yes, after ogfs_find_jhead() for each journal This function checks, before upgrading a filesystem, to make sure that each and every journal in the filesystem is unmounted. So, for each journal, it grabs a lock, calls ogfs_find_jhead(), and checks for the OGFS_LOG_HEAD_UNMOUNT flag. This flag is present in a "shutdown" journal header, and indicates that the journal has been unmounted. (Does it mean that the journal is empty?). If it does not find the UNMOUNT flag in the current journal head, or if it can't immediately acquire the journal lock, the function stops and reports an error -EBUSY. ogfs_get_riinode() -- reads resource index inode from disk, inits incore image Resource : resource index inode's block # (sdp->sd_sb.sb_rindex_di.no_formal_ino) Mode: shared (GL_SHARED) Type: inode Parent: No Flags: GLF_STICKY -- applied by set_bit() gunlock: Yes, before leaving function put_glstruct: Yes, before leaving function Calls: ogfs_get_istruct() to read inode from disk. Called fm: filesystem mount function, _ogfs_read_super(). Sets sdp->sd_riinode_vn = gl->gl_vn - 1. Is this to force ogfs_rindex_hold() to read new resource index from disk? This is the same lock grabbed by ogfs_rindex_hold(). ogfs_get_jiinode() -- reads journal index inode from disk, inits incore image Resource : jindex inode (sdp->sd_sb.sb_jindex_di.no_formal_ino) Mode: shared (GL_SHARED) Type: inode Parent: No Flags: GLF_STICKY -- applied by set_bit() gunlock: Yes, before leaving function put_glstruct: Yes, before leaving function Calls: ogfs_get_istruct() to read inode from disk. Called fm: filesystem mount function, _ogfs_read_super(). Sets sdp->sd_jiinode_vn = gl->gl_vn - 1. Is this to force ogfs_jindex_hold() to read new journal index from disk? arch_linux_2_4/file.c: ogfs_open_by_number() ... open a file by inode number grabs a lock on the inode (inum.no_formal_ino), in shared (GL_SHARED) mode. Resource : inum.no_formal_ino Mode: shared (GL_SHARED) Type: inode Parent: No Flags: -- gunlock: Yes, after ogfs_get_istruct() put_glstruct: Yes, after ogfs_get_istruct() Calls: ogfs_get_istruct() to read inode from disk. arch_linux_2_4/super_linux.c: _ogfs_read_super() ... mount the filesystem 1) grabs a lock on OGFS_MOUNT_LOCK (non-disk lock # 0), in exclusive (0) mode, using ogfs_nondisk_glops, with flags: GL_PERM -- LM_FLAG_NOEXP -- Always used. Grab this lock even if it is "expired", i.e. being recovered from a dead node. See "Expiring Locks and the Recovery Daemon". 
2) grabs a lock on OGFS_LIVE_LOCK (non-disk lock # 1), in shared (GL_SHARED)
   mode, using ogfs_nondisk_glops, with flags:
   GL_PERM --
   GL_DISOWN --
   LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
   i.e. being recovered from a dead node.  See "Expiring Locks and the
   Recovery Daemon".

3) grabs a lock on OGFS_SB_LOCK (meta-data lock # 0), in shared (GL_SHARED)
   mode, using ogfs_meta_glops.  Uses exclusive mode if mount argument calls
   for filesystem upgrade.

4) grabs a lock on the machine's journal (sdp->sd_my_jdesc.ji_addr), in
   exclusive (0) mode, using ogfs_meta_glops, with flags:
   GL_PERM --
   LM_FLAG_NOEXP -- Always used.  Grab this lock even if it is "expired",
   i.e. being recovered from a dead node.  See "Expiring Locks and the
   Recovery Daemon".

5) grabs a lock on the root inode (sdp->sd_sb.sb_root_di.no_formal_ino), in
   shared (GL_SHARED) mode, using ogfs_inode_glops.

ogfs_iget_for_nfs() ... get an inode based on its number
   grabs a lock on the inode (inum->no_formal_ino), in shared (GL_SHARED)
   mode, using ogfs_inode_glops.

Calls to ogfs_glock_m()
--------------------------------------------------------------------

arch_linux_2_4/inode_linux.c:

ogfs_link() ... link to a file
   grabs 2 locks, both in exclusive (0) mode, using ogfs_inode_ops, on:
   1) inode of directory containing new link
   2) inode being linked
   The function has a local variable array of 2 ogfs_lockop_t structures that
   it zeroes, then fills the lo_ip fields with inode pointers for the two
   lock targets.  Exclusive mode is set by the zeroes, and the ogfs_inode_ops
   are selected by the fact that the lo_ip fields are used.  See fs/glock.c
   ogfs_glock_m().

ogfs_unlink() ... unlink a file
   See ogfs_link(), above.  The same locks are grabbed the same way.

ogfs_rmdir() ... remove a directory
   See ogfs_link(), above.  Locks for directory and its parent are grabbed
   the same way.

ogfs_rename() ... rename/move a file
   grabs up to 4 locks, all in exclusive (0) mode, using ogfs_inode_ops, on:
   1) inode of old parent directory
   2) inode of new parent directory
   3) inode of new name(?) (if pre-existing?)
   4) inode of old directory(?) (if moving a directory?)

Appendix I.  Inventory of calls to glops.

All calls to go_* are from glock.c.  All implementations (except
ogfs_free_bh()) are in src/fs/arch_*/glops.c.

go_sync:
   Called from: sync_dependencies() -- sync out any locks dependent on this
                   one
                ogfs_gunlock() -- unlock a glock
   Implementations:
   sync_meta() -- sync to disk all dirty data for a metadata glock
      used for types: meta, rgrp
      calls: test_bit() -- GLF_DIRTY (any dirty data to flush?)
             ogfs_log_flush() -- flush glock's incore committed transactions
             ogfs_sync_bh() -- flush all glock's buffers
             clear_bit() -- clear GLF_DIRTY, GLF_SYNC
      also called by: release_meta(), release_rgrp()
   sync_inode() -- sync to disk all dirty data for an inode glock
      used for type: inode
      calls: test_bit() -- GLF_DIRTY (any dirty data to flush?)
             ogfs_log_flush() -- flush glock's incore committed transactions
             ogfs_sync_page() -- flush all glock's pages
             ogfs_sync_bh() -- flush all glock's buffers
             clear_bit() -- clear GLF_DIRTY, GLF_SYNC
      also called by: release_inode()

go_acquire:
   Called from: xmote_glock() -- promote a glock
   Implementations:
   acquire_rgrp() -- done after an rgrp lock is acquired
      used for type: rgrp
      calls: ogfs_rgrp_save_in() -- read rgrp data from glock's LVB

go_release:
   Called from: cleanup_glock() -- prepare (an exclusive?)
Appendix I. Inventory of calls to glops.

All calls to go_* are from glock.c.  All implementations (except
ogfs_free_bh()) are in src/fs/arch_*/glops.c.

go_sync:
  Called from:
    sync_dependencies() -- sync out any locks dependent on this one
    ogfs_gunlock() -- unlock a glock
  Implementations:
    sync_meta() -- sync to disk all dirty data for a metadata glock
      used for types: meta, rgrp
      calls:
        test_bit() -- GLF_DIRTY (any dirty data to flush?)
        ogfs_log_flush() -- flush glock's incore committed transactions
        ogfs_sync_bh() -- flush all glock's buffers
        clear_bit() -- clear GLF_DIRTY, GLF_SYNC
      also called by: release_meta(), release_rgrp()
    sync_inode() -- sync to disk all dirty data for an inode glock
      used for type: inode
      calls:
        test_bit() -- GLF_DIRTY (any dirty data to flush?)
        ogfs_log_flush() -- flush glock's incore committed transactions
        ogfs_sync_page() -- flush all glock's pages
        ogfs_sync_bh() -- flush all glock's buffers
        clear_bit() -- clear GLF_DIRTY, GLF_SYNC
      also called by: release_inode()

go_acquire:
  Called from:
    xmote_glock() -- promote a glock
  Implementations:
    acquire_rgrp() -- done after an rgrp lock is acquired
      used for type: rgrp
      calls:
        ogfs_rgrp_save_in() -- read rgrp data from glock's LVB

go_release:
  Called from:
    cleanup_glock() -- prepare (an exclusive?) glock to be released to
      another node
  Implementations:
    release_meta()
      used for type: meta
      calls:
        sync_meta() -- sync to disk all dirty data assoc with glock
        ogfs_inval_bh() -- invalidate all buffers assoc with glock
    release_inode()
      used for type: inode
      calls:
        ogfs_flush_meta_cache()
        sync_inode() -- sync to disk all dirty data assoc with glock
        ogfs_inval_pg() -- invalidate all pages assoc with glock
        ogfs_inval_bh() -- invalidate all buffers assoc with glock
    release_rgrp() -- prepare an rgrp lock to be released
      used for type: rgrp
      calls:
        sync_meta() -- sync to disk all dirty data assoc with glock
        ogfs_inval_bh() -- invalidate all buffers assoc with glock
        ogfs_rgrp_save_out() -- write rgrp data out to glock's LVB
    release_trans() -- prepare *the* transaction lock to be released
      used for type: transaction
      calls:
        ogfs_log_flush() -- flush glock's incore committed transactions
        fsync_no_super() -- (kernel) flush this fs' dirty buffers

go_lock: -- get fresh copy of inode or rgrp bitmap from disk
  Returns 0 on success, error code on failure of read (this is the only
  go_* function that returns anything).
  Called from:
    ogfs_glock()
  Implementations:
    lock_inode()
      used for type: inode
      calls:
        atomic_read() -- gl_locked, recursive cnt of process ownership
        ogfs_copyin_dinode() -- get fresh copy of inode from disk
    lock_rgrp() -- done after an rgrp lock is locked by a process
      used for type: rgrp
      calls:
        atomic_read() -- gl_locked, recursive cnt of process ownership
        ogfs_rgrp_read() -- get fresh copy of rgrp bitmap from disk

go_unlock: -- copy inode attributes to VFS inode, or release rgrp bitmap
  blocks and copy rgrp stats to LVB struct
  Called from:
    ogfs_gunlock()
  Implementations:
    unlock_inode()
      used for type: inode
      calls:
        atomic_read() -- gl_locked, recursive cnt of process ownership
        test_and_clear_bit() -- GLF_POISONED (gl_vn++ if so)
        test_bit() -- GLF_DIRTY (have inode attributes changed?)
        ogfs_inode_attr_in() -- copy attributes fm dinode -> VFS inode
    unlock_rgrp() -- prepare an rgrp lock to be unlocked by a process
      used for type: rgrp
      calls:
        atomic_read() -- gl_locked, recursive cnt of process ownership
        ogfs_rgrp_relse() -- release (i.e. brelse()) rgrp bitmaps
        test_and_clear_bit() -- GLF_POISONED (gl_vn++ if so)
        test_bit() -- GLF_DIRTY (have rgrp usage stats changed?)
        ogfs_rgrp_lvb_fill() -- copy rgrp usage stats to LVB struct

go_free:
  Called from:
    ogfs_clear_gla() -- clear all glocks before unmounting the lock protocol
  Implementation (in arch_*/dio_arch.c):
    ogfs_free_bh() -- free all buffers associated with a G-Lock
      used for types: meta, inode, rgrp
      calls:
        list_del() -- removes private bufdata from glock list
        ogfs_put_glstruct() -- decrement reference count gl_count
        ogfs_free_bufdata() -- kmem_cache_free private bufdata from
          ogfs_bufdata_cache
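The inventory above amounts to a per-lock-type vector of function pointers.
The sketch below shows the shape that listing implies; the struct name and
exact field layout are guesses (the real definitions live in the filesystem
source), but the function names are the implementations inventoried above:

  /*
   * Sketch of the per-type operations vector implied by the inventory
   * above.  Struct name and layout are guesses; the function names are
   * the implementations listed in this appendix.
   */
  typedef struct ogfs_glock ogfs_glock_t;     /* incore glock (simplified) */

  /* implementations inventoried above (prototypes assumed) */
  extern void sync_meta(ogfs_glock_t *gl);
  extern void acquire_rgrp(ogfs_glock_t *gl);
  extern void release_rgrp(ogfs_glock_t *gl);
  extern int  lock_rgrp(ogfs_glock_t *gl);
  extern void unlock_rgrp(ogfs_glock_t *gl);
  extern void ogfs_free_bh(ogfs_glock_t *gl);

  typedef struct ogfs_glops_sketch {
          void (*go_sync)(ogfs_glock_t *gl);    /* flush dirty data */
          void (*go_acquire)(ogfs_glock_t *gl); /* after inter-node grant */
          void (*go_release)(ogfs_glock_t *gl); /* before inter-node release */
          int  (*go_lock)(ogfs_glock_t *gl);    /* after a process locks;
                                                   the only go_* that returns */
          void (*go_unlock)(ogfs_glock_t *gl);  /* before a process unlocks */
          void (*go_free)(ogfs_glock_t *gl);    /* before glock is freed */
  } ogfs_glops_sketch_t;

  /* e.g. the rgrp lock type would wire up: */
  static ogfs_glops_sketch_t ogfs_rgrp_glops_sketch = {
          .go_sync    = sync_meta,
          .go_acquire = acquire_rgrp,
          .go_release = release_rgrp,
          .go_lock    = lock_rgrp,
          .go_unlock  = unlock_rgrp,
          .go_free    = ogfs_free_bh,
  };

Attaching one such vector to each glock when it is created is what keeps the
locking core type-independent: glock.c simply invokes the attached go_*()
operations without knowing whether the lock protects an inode, an rgrp, or
the transaction lock.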
Appendix J. Some info from the earlier ogfs-internals doc.

NOTE (bc): Some of this information is no longer exactly accurate, but it
provides interesting reading nonetheless.

  A Guide to the Internals of the OpenGFS File System
  Copyright 2001 The OpenGFS Project.
  Copyright 2000 Sistina Software Inc.

The G-Lock Functions

The concept of a G-Lock is fundamental to OpenGFS.  A G-Lock represents an
abstraction of some underlying locking protocol and is essential to
maintaining consistency in an OpenGFS filesystem.  The G-Lock layer
provides the glue required between the abstract lock_harness code and the
filesystem operations.  The lock_harness itself is the subject of a
separate document and is not covered here.

The G-Locks are held in a hash table contained in the OpenGFS specific
portion of the superblock.  Each hash chain has three separate lists, plus
associated counters and a read/write lock.  The lists correspond to the
states in which the G-Locks happen to be.  The not_held state is for locks
which are not held by the client, but whose structures still exist (this is
used to reduce the number of memory allocations/deallocations).  The held
state is for locks which are held by the client, although not in use by any
processes; these locks can be dropped immediately upon a request from
another client or upon memory pressure.  The third state (perm) is for
locks which are both held and in use.

In order to release a G-Lock so that another client may access the data
which it protects, all the data which that G-Lock covers must be flushed to
disk.  Further accesses to the data on the client releasing the lock must
also be prevented until the client next acquires the lock.  Clients can
cache data even when they don't hold the G-Lock which covers that data,
provided they check the validity of the data the next time they acquire the
lock, and reread it if it has changed on disk.

We use various techniques to improve the efficiency of the glock layer.
Read/write locks are used on the G-Lock lists so that the more common
lookup operations can occur in parallel with each other; only write
operations (moving G-Locks between lists, or creating or deleting them)
need the exclusive lock.  Also, once a lock has been locked, it is not
unlocked until it has aged a certain number of seconds.  This is done to
increase the chances of a future lock request being able to reuse the lock,
instead of requiring a separate locking operation.  Of course, if another
client requires the lock, it must post a callback to the client holding the
lock to request it.  This is done by marking the lock with a special flag
which causes it to unlock as soon as the current operation has completed
(or immediately, if there is no current operation).

There is a daemon function (glockd) which runs periodically to clear the
G-Lock cache of old entries.  It does this in a two stage process.  The
first stage of ogfs_glockd_scan is really a part of the inode functions,
and not part of the glock code, but it fits nicely here.  The first part of
the glockd scanning function looks at all the held locks and demotes any
which have been held too long.  The second part deletes any G-Locks which
have exceeded the timeout for not_held locks.

see opengfs/src/fs/glock.c
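As a rough illustration of the two stage scan just described, here is a
minimal sketch.  Every name in it (the chain layout, the age limits, and
the demote_to_not_held()/free_glock_struct() helpers) is an illustrative
stand-in, not a real OpenGFS identifier; see glock.c for the actual code:

  /*
   * Minimal sketch of the two stage glockd scan described above.  All
   * names here are illustrative stand-ins, not real OpenGFS identifiers.
   */
  #include <linux/list.h>
  #include <linux/spinlock.h>

  #define HELD_AGE_LIMIT     (5 * HZ)     /* values illustrative only */
  #define NOT_HELD_AGE_LIMIT (30 * HZ)

  struct glock_sketch {
          struct list_head list;          /* membership in one chain list */
          unsigned long    stamp;         /* jiffies at last use */
  };

  struct glock_chain_sketch {
          rwlock_t         lock;          /* per-chain read/write lock */
          struct list_head held;          /* held by this node, but idle */
          struct list_head not_held;      /* struct cached, lock released */
          struct list_head perm;          /* held and in use by a process */
  };

  /* hypothetical helpers: drop the inter-node lock / free the structure */
  extern void demote_to_not_held(struct glock_chain_sketch *c,
                                 struct glock_sketch *gl);
  extern void free_glock_struct(struct glock_chain_sketch *c,
                                struct glock_sketch *gl);

  static void glockd_scan_sketch(struct glock_chain_sketch *chain,
                                 unsigned long now)
  {
          struct list_head *pos, *next;

          /* moving glocks between lists is a write operation */
          write_lock(&chain->lock);

          /* Stage 1: demote held locks that have been idle too long,
           * moving them onto the not_held list. */
          list_for_each_safe(pos, next, &chain->held) {
                  struct glock_sketch *gl =
                          list_entry(pos, struct glock_sketch, list);
                  if (now - gl->stamp > HELD_AGE_LIMIT)
                          demote_to_not_held(chain, gl);
          }

          /* Stage 2: free not_held structures that have timed out. */
          list_for_each_safe(pos, next, &chain->not_held) {
                  struct glock_sketch *gl =
                          list_entry(pos, struct glock_sketch, list);
                  if (now - gl->stamp > NOT_HELD_AGE_LIMIT)
                          free_glock_struct(chain, gl);
          }

          write_unlock(&chain->lock);
  }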