(May 29 2004)
OpenGFS Locking Mechanism
Copyright 2003 The OpenGFS Project
Original authors, early 2003:
Stefan Domthera (sd)
Dominik Vogt (dv)
Updates since June, 2003:
Ben Cahill (bc), ben.m.cahill@intel.com
Introduction
------------
This document contains details of the OpenGFS locking system. The first part
provides an overview of the software layers involved in the locking system.
The second part describes the design of the G-Lock locking layer, and the final
part explains the details of how the G-Locks are used in OpenGFS.
This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of shared file system locking in
OpenGFS.
This document is not intended as a user guide to OpenGFS. Look in the OpenGFS
HOWTO-generic or HOWTO-nopool for details of configuring and setting up OpenGFS.
This document may contain inaccurate statements. Please contact the author
(bc) if you see anything wrong or unclear.
Terminology
-----------
Throughout this document, the combination of a mounted lock module and, if
applicable, a lock storage facility (e.g. memexpd) is called the "locking
backend" for simplicity.
The OpenGFS filesystem and locking code contains many uses of terms such as
"get", "put", "hold", "acquire", "release", etc., that may be inconsistent
or vague. I (bc) have tried to write enough detail to clearly explain the
true meaning of these terms, depending on the context. Let me know if anything
is unclear or inaccurate.
"Machines", "computers", "nodes", "cluster members", and sometimes "clients",
mean pretty much the same thing: a computer (machine) that is a compute node
within a cluster of computers sharing the filesystem storage device(s). If
using the memexp locking protocol, the memexp module on the computer will be
a "client" of the memexpd lock storage server.
"dinodes" are the OGFS version of an "inode". "inode" is often used to mean
"dinode", but may also mean a struct inode defined by the kernel's Virtual
File System (VFS).
Requirements
------------
In a distributed file system, at least three types of locking are required:
1). Inter-node locking must guarantee file system and data consistency when
multiple computer nodes try to read or write the shared file system in
parallel. In OpenGFS, these mechanisms are implemented by various "lock
modules" that link into the filesystem code with the help of the "lock
harness".
2). The file system code must protect the file system and the data structures
in memory from parallel access by multiple processes on the node. The
"G-Lock" software layer takes care of this protection, and also decides
whether communication with other nodes or a central lock server, via the
lock module, is necessary.
3). The file system structures in kernel memory must be protected from
concurrent access by multiple CPUs (Symmetric Multi-Processing) on the
same node. Linux spinlocks and/or other mutual exclusion (mutex)
methods are used to achieve this.
Overview
--------
The following gives an overview of the locking hierarchy. The left side
shows lock module loading/unloading (registering/unregistering with the
lock harness) and lock protocol mounting, and the right side shows all
other operations:
+--------------------------------------------------------------------------+
| File System Module (ogfs.o) |
| +-------------------+ +--------+
| |superblock, inodes,| | |
| sdp->sd_lockstruct |flocks, journals, |-| |
| / |resource grps, etc.| | glops |
| / +---------|---------+ | |
|-------/-----+ +-----------+ +---------|---------+ | |
| mount | | misc. | | G-Lock |-| |
+-------------+-------------+-----------+---+-------------------+-+--------+
| | |
+--------------------------+ | |
|Harness Module (harness.o)| | others_may_mount | lock & LVB
+--------------------------+ | reset_expired | operations
\ | unmount |
register \ | |
mount \ | |
unregister \ | |
+--------------------------------------+
| Lock Module (e.g. memexp.o) |
+--------------------------------------+
|
+--------------------------------------+
+ Lock Storage (e.g. memexpd) +
+--------------------------------------+
Loading, mounting, unloading <--|--> Ongoing filesystem operations, unmount
Fig. 1 -- Software modules and layers involved in locking
The current locking system is a big clump of different software layers that,
in addition to locking, execute several tasks that are not locking, per se.
Historically, these tasks were clumped together because they all relate to
coordinating the behavior of the cluster member nodes, and the lock server
has served double duty as the cluster membership service.
Enhancement: It might be a good idea to split off the functionality unrelated
to locking into independent components.
The lock harness serves two simple purposes:
- maintaining a list of available-to-mount lock modules
- connecting a selected module to the filesystem at lock protocol mount time
(one of the first things done during the filesystem mount).
After protocol mount time, the harness module's job is done.
The locking modules and lock storage facility take care of:
- Managing and storing inter-node locks and lock value blocks (LVBs)
- Lock expiration (lock request timeout) and deadlock detection
- Heartbeat functionality (are other nodes alive and healthy?)
- Fencing nodes, recovering locks, and triggering journal replay in case
of a node failure
The G-Lock software layer is a part of the file system code. It handles:
- Coordinating and cacheing locks and LVBs among processes on *this* node
- Communication with the locking backend (lock module) for inter-node locks
- Executing glops when appropriate (see below)
- Journal replay in case of a node failure
The glops (G-Lock Operations) layer is also part of filesystem code. It
implements the filesystem-specific, architecture-specific, and protected-item-
specific operations that must occur after locking or before unlocking, such as:
- Reading items from disk, or from another node via Lock Value Block (LVB),
after locking a lock
- Flushing items to disk, or to other nodes via LVB, before unlocking a lock
- Invalidating kernel buffers, once flushed to disk, so a node can't read
them while another node is changing their contents.
Each lock has a type-dependent glops attached to it. This attachment point is
the key to porting the locking system to other environments, creating
different types of locks, and defining their associated behavior.
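As an illustration of the attachment point, a glops table is essentially a
small set of function pointers that the G-Lock layer invokes at the right
moments. The member names below are hypothetical; see the glops code in the
filesystem sources for the real table:

```c
#include <stddef.h>

struct ogfs_glock;   /* fs-side lock structure, defined elsewhere */

/* Hypothetical glops table; member names are illustrative only. */
struct ogfs_glock_operations {
	/* refresh the protected item after the lock is acquired */
	void (*go_acquire)(struct ogfs_glock *gl);
	/* flush the protected item to disk/LVB before the lock is released */
	void (*go_release)(struct ogfs_glock *gl);
	/* invalidate cached buffers so stale data cannot be re-read */
	void (*go_invalidate)(struct ogfs_glock *gl);
};

/* Trivial demonstration hook, counting its invocations. */
static int demo_hook_calls;
static void demo_hook(struct ogfs_glock *gl)
{
	(void)gl;
	demo_hook_calls++;
}

/* How the G-Lock layer might dispatch: every hook is optional. */
static int run_acquire_hook(const struct ogfs_glock_operations *glops,
                            struct ogfs_glock *gl)
{
	if (glops && glops->go_acquire) {
		glops->go_acquire(gl);
		return 1;   /* hook ran */
	}
	return 0;           /* nothing to do for this lock type */
}
```

Each lock type supplies its own table, which is what makes the behavior
per-protected-item configurable.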
2. Lock Harness and Lock Modules
--------------------------------
At filesystem mount time, the filesystem uses the lock harness to mount a
locking protocol (see "Lock Harness", below). The harness' mount call,
lm_mount(), fills in the lm_lockstruct contained in the filesystem's incore
superblock structure as sdp->sd_lockstruct. This exposes the chosen lock
module's services to the filesystem.
struct lm_lockstruct contains:
ls_jid - journal ID of *this* computer
ls_first - TRUE if *this* computer is the first to mount the protocol
ls_lockspace - ptr to protocol-specific lock module private data structure
ls_ops - ptr to struct lm_lockops, described below
"ls_jid" indicates the journal ID that should be used for *this* computer.
It is currently the lock module's job to discover this journal ID. The "memexp"
lock module does this by reading information from a cluster information device
(cidev), which is a small disk partition dedicated for that purpose. The
"nolock" lock module provides either "0", or the value of a "jid=" entry (if
used) in the command line for mounting the filesystem. This is some of the
functionality that might be split off from the locking module, although journal
assignment might, as an alternative possibility, be handled dynamically through
the use of locks.
"ls_first" tells the filesystem code, at mount time, that *this* is the first
machine in the cluster to mount the filesystem. If TRUE, this machine
replays *all* journals for the whole cluster, before allowing other machines
to complete their lock protocol mounts (and therefore their filesystem mounts).
If FALSE, the filesystem replays only the journal for *this* computer. See
discussion in Appendix G, under "Calls to ogfs_glock_num()", recovery.c,
ogfs_recover_journal(). This is also some functionality that might be split
off from the locking component, although this functionality could also be
handled through the use of locks.
"ls_lockspace" points to a private data structure contained in, and for use by,
the lock module itself. The filesystem/glock code includes this pointer in
certain calls to the lock module, but never accesses the structure directly.
The private data structure is typically the lock module's incore "superblock"
(as called, perhaps inappropriately, by some of the code), i.e. its master data
structure. Not to be confused with the module's private per-lock data
structures (lm_lock_t, see below).
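Putting the four fields together, the structure looks roughly like this (a
sketch only; the authoritative definition is in src/include/lm_interface.h,
and the exact types may differ):

```c
#include <stdint.h>

/* Sketch of struct lm_lockstruct; exact types may differ from the header. */
typedef void lm_lockspace_t;     /* opaque lock-module private data */
struct lm_lockops;               /* the module's operations table   */

struct lm_lockstruct {
	uint32_t           ls_jid;        /* journal ID of *this* node         */
	int                ls_first;      /* TRUE if first to mount the proto  */
	lm_lockspace_t    *ls_lockspace;  /* module's "superblock" structure   */
	struct lm_lockops *ls_ops;        /* entry points into the lock module */
};
```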
"ls_ops" provides the big hook through which the filesystem code (including
G-Lock layer) accesses the lock module. Every lock module must implement
"struct lm_lockops" (see src/include/lm_interface.h) that contains the
following fields. Most (but not all) of these are function calls implemented
by the locking module, and called by filesystem code. See ogfs-memexp document
for details on one implementation of lm_lockops{}:
data fields:
proto_name - unique protocol name for this module, e.g. "nolock"
or "memexp".
local_fs - set to TRUE by nolock, FALSE by other protocols.
Filesystem code checks this when mounting, to enable
"localcaching" and "localflocks" for more efficient
operation in non-cluster (nolock) environment.
See man page for ogfs_mount.
list - element of protocol list maintained by lock harness.
cluster/locking/journaling functions called by lock harness or filesystem:
mount - initialize, start the lock module's locking functionality.
Called by lock harness' lm_mount() when mounting a
protocol onto the file system. This call will block
for non-first-to-mount machines, until the first-to-
mount machine has replayed all journals, and has called
others_may_mount(). This does not apply to the nolock
protocol, but must work this way for any clustered
protocol (e.g. memexp).
others_may_mount - indicate to other nodes that they may mount the filesystem.
Called by filesystem's _ogfs_read_super(), the fs mount
function, after this first-to-mount machine in the
cluster has replayed *all* journals, thus making the
on-disk filesystem ready to use by all nodes. Other
machines will block within the mount() call, until
others_may_mount() is called by the first-to-mount node.
unmount - stop, clean up the lock module's locking functionality.
Called by filesystem's ogfs_unmount_lockproto() when
unmounting lock protocol from file system.
reset_exp - reset expired client node (from EXPIRED to USED / NOTUSED).
Called from filesystem's journal subsystem's
ogfs_recover_journal(), after this node replays journal
of expired node.
locking/LVB (Lock Value Block) functions called by G-Lock layer:
get_lock - allocate and initialize a new lm_lock_t (lock module
per-lock private data) struct on this node. Does *not*
look for pre-existing structure. Does *not* access
lock storage, or make lock known to other nodes.
put_lock - de-allocate an lm_lock_t struct on this node, release
usage of (perhaps de-allocate) an attached LVB
(memexp internally calls memexp_unhold_lvb(), its own
implementation of unhold_lvb, see below).
Accesses lock storage only if LVB action is required.
lock - lock an inter-node lock (allocate lock storage buffer if needed)
unlock - unlock an inter-node lock (de-allocate storage buffer if possible)
reset - reset an inter-node lock (unlock if locked)
cancel - cancel a request on an inter-node lock (ends retry loop)
hold_lvb - find an existing, or allocate and initialize a new,
Lock Value Block (LVB)
unhold_lvb - release usage of (perhaps de_allocate) an LVB
sync_lvb - synchronize LVB (make its contents visible to other nodes)
The call prototypes can be found in "src/include/lm_interface.h".
See the ogfs-memexp document for more detail on the use and implementation of
these calls.
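As a rough sketch, the operations table might look like the following. The
field names come from the list above, but every argument list here is
simplified or guessed; the authoritative prototypes are in
src/include/lm_interface.h:

```c
#include <stdint.h>

/* Sketch of struct lm_lockops; argument lists are simplified. */
typedef void lm_lockspace_t;     /* module-private "superblock"     */
typedef void lm_lock_t;          /* module-private per-lock data    */

struct lm_lockops {
	const char *proto_name;          /* e.g. "nolock", "memexp" */
	int         local_fs;            /* TRUE only for nolock    */
	/* list: element of the harness' protocol list (omitted here) */

	/* cluster/locking/journaling entry points */
	lm_lockspace_t *(*mount)(void);  /* args simplified */
	void (*others_may_mount)(lm_lockspace_t *ls);
	void (*unmount)(lm_lockspace_t *ls);
	void (*reset_exp)(lm_lockspace_t *ls, uint32_t jid);

	/* locking/LVB entry points used by the G-Lock layer */
	lm_lock_t *(*get_lock)(lm_lockspace_t *ls);
	void (*put_lock)(lm_lock_t *lock);
	int  (*lock)(lm_lock_t *lock, unsigned int state);
	void (*unlock)(lm_lock_t *lock);
	void (*reset)(lm_lock_t *lock);
	void (*cancel)(lm_lock_t *lock);
	int  (*hold_lvb)(lm_lock_t *lock);
	void (*unhold_lvb)(lm_lock_t *lock);
	void (*sync_lvb)(lm_lock_t *lock);
};
```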
Lock Harness
------------
The lock harness is a fairly thin abstraction within the OpenGFS locking
hierarchy. Simply put, it is a plug for different locking protocols.
To the lock modules, the harness offers services for protocol registration
and unregistration:
lm_register_proto() -- adds module's lm_lockops->list to harness' list of
available modules
lm_unregister_proto() -- removes module from harness' list of available modules
These are called by the lock modules' module_init() or module_exit() functions.
Typically, these calls are the *only* things done by module_init() (called when
the kernel loads the module) or module_exit() (when unloading). There is no
need to initialize any of the module's locking functionality until the kernel
mounts the filesystem (and the filesystem in turn selects and mounts the
particular module/protocol).
To the filesystem code, the harness offers a locking protocol mounting service:
lm_mount() -- initializes module's locking functionality (via lm_lockops mount),
fills in an lm_lockstruct, exposing the lock module's lm_lockops
functions and a private data pointer, and indicating journal ID
and first-to-mount status.
The protocol is mounted onto a specific file system during the filesystem
mount. You can use different locking protocols, and/or different "lockspaces"
for the same protocol, for different OpenGFS filesystems on the same computer.
"Lockspace" seems to be a combination of Cluster Information Device (cidev) and
protocol instance. A computer node mounting 3 separate OpenGFS filesystems,
each with memexp protocol, would need 3 different cidevs to describe the
clusters, one for each filesystem. This is enforced by code in the memexp
module's memexp_mount(), but is not enforced, nor used at all, by the nolock
module. cidevs are called out by "locktable" when mounting the filesystem
(see man page for ogfs_mount), or "table_name" or "table" within harness and
memexp code.
The calling sequence when mounting is:
_ogfs_read_super() -- mount filesystem, src/fs/arch_*/super_linux.c
ogfs_mount_lockproto() -- select lock protocol, src/fs/locking.c
lm_mount() -- mount lock protocol, src/locking/harness/harness.c
_ogfs_read_super() mounts the filesystem, and is architecture-dependent code.
ogfs_mount_lockproto() determines which lock protocol to mount, either by mount
options set by the user (see man page for ogfs_mount), or by reading the
filesystem superblock (for values set by mkfs.ogfs, see man page for mkfs_ogfs).
lm_mount() calls a module's lm_lockops->mount() function to initialize the
module's locking functionality. It supplies the following parameters in the
call:
table -- Cluster Information Device (cidev) for memexp
data -- "hostdata" for protocol. For memexp: node's IP address
cb -- Callback pointer for module to call fs' G-Lock layer.
fsdata -- Private filesystem data (the incore superblock pointer,
sdp) to be attached to callbacks from module to
G-Lock layer.
&lockstruct->ls_jid -- Journal ID for *this* computer, to be filled in by
module.
&lockstruct->ls_first -- First-to-mount status for *this* computer, to be
filled in by module.
"lockstruct" is actually sdp->sd_lockstruct, contained in the filesystem's
in-core superblock structure. lm_mount() fills in two other members
of sdp->sd_lockstruct, so that the filesystem can access the newly mounted
locking module's capabilities:
->ls_lockspace -- value returned from module's lm_lockops->mount() call,
module-private data (typically a pointer to the module's
in-core "superblock" structure).
->ls_ops -- pointer to module's lm_lockops structure
The protocol is unmounted during the file system unmount process. The calling
sequence is:
ogfs_put_super() -- src/fs/arch_*/super_linux.c
ogfs_unmount_lockproto() -- src/fs/locking.c
sdp->sd_lockstruct.ls_ops->unmount() -- the lock module's unmount function
Note that the lock harness is not involved here! Its job was done after it
filled in the module information in sdp->sd_lockstruct. After that point,
the filesystem can reach the module directly, without using the harness.
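The register/mount division of labor above can be modeled in a few lines.
This is a toy userspace model, not the real harness code (which lives in
src/locking/harness/harness.c); names and signatures are simplified:

```c
#include <string.h>
#include <stddef.h>

/* Toy model of the harness' two jobs: keep a list of registered
 * protocols, and find one by name at lock protocol mount time. */
struct proto {
	const char   *name;       /* e.g. "nolock", "memexp" */
	struct proto *next;       /* harness' protocol list  */
};

static struct proto *proto_list;  /* all registered lock modules */

/* Called from a lock module's module_init(). */
static void lm_register_proto(struct proto *p)
{
	p->next = proto_list;
	proto_list = p;
}

/* Called from a lock module's module_exit(). */
static void lm_unregister_proto(struct proto *p)
{
	struct proto **pp;
	for (pp = &proto_list; *pp; pp = &(*pp)->next) {
		if (*pp == p) {
			*pp = p->next;
			return;
		}
	}
}

/* The heart of lm_mount(): find the protocol the filesystem asked for.
 * The real function then calls the module's ->mount() and fills in
 * sdp->sd_lockstruct. */
static struct proto *lm_find_proto(const char *name)
{
	struct proto *p;
	for (p = proto_list; p; p = p->next)
		if (strcmp(p->name, name) == 0)
			return p;
	return NULL;
}
```

After the lookup and fill-in succeed, the harness drops out of the picture,
exactly as described above.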
The following use diagram gives a complete overview of using a lock module.
It covers all calls to the module from all parts of filesystem and harness code.
Some calls have no functionality for the nolock module (see discussion on
Lock Modules elsewhere in this document, or ogfs-memexp for more details
on how the memexp module works).
+-------------------------------------+
| super_linux.c (fs mount/unmount) |
+-------------------------------------+
| |
| lock protocol |others_may_mount
| mount/unmount |
v |
+---------------------------+ | +---------+ +-----------+
| locking.c (lock mnt/umnt) | | | glock.c | | journal.c |
+---------------------------+ | +---------+ +-----------+
| | | | |
|mount |unmount | |all |reset_exp(ired
v | | |lock/LVB | node)
+--------------+ | | |operations |
| harness.c | | | | |
+--------------+ | | | |
^ | | | | |
|register |mount | | | |
|unregister | | | | |
| v v v v v
+--------------------------------------------------------+
| lock module (memexp, nolock, or stats) |
+--------------------------------------------------------+
Fig. 2 -- complete register/unregister, and lm_lockops usage
Pertinent source files are:
src/fs/arch_*/super_linux.c (architecture dependent filesystem code)
src/fs/locking.c, glock.c, journal.c (architecture independent filesystem code)
src/locking/harness/harness.c (lock harness kernel module code)
src/locking/modules/*/* (lock module source code)
Lock Modules
------------
Lock modules are kernel implementations of G-Locks (see below). Each provides
a distinct locking protocol that can be used in OpenGFS:
locking/modules/memexp -- provides inter-node locking, for cluster use
locking/modules/nolock -- provides "fake" inter-node locks, not for cluster use
locking/modules/stats -- provides statistics, stacks on top of another protocol
The "memexp" protocol supports clustered operation, and is fairly sophisticated.
The memexp modules, one on each node, work with each other to keep track of
cluster membership, and which member nodes own which locks.
The memexp protocol relies on a central repository of lock data that is shared
among all nodes, but is completely separate from filesystem and journals.
The repository can be either one or more DMEP (Device Memory Export Protocol)
devices (e.g. certain SCSI drives), usually those in an OpenGFS pool (see
ogfs-pool doc), or a "fake" DMEP server, the memexpd server.
The memexpd server runs on one computer, and emulates DMEP operation, but
stores data in memory or local disk storage, rather than shared disk storage,
and communicates with cluster members via LAN rather than SCSI. Most users
use the memexpd server, rather than DMEP devices. Source code is in
locking/servers/memexp directory.
For lots more information on memexp, see the ogfs-memexp document.
The "nolock" protocol supports filesystem operations on a single node, and is
much simpler than the memexp protocol. Many of the lm_lockops functions are
stubbed out. There is no central lock storage, but the module does store
a structure for each lock locally in a hash table.
The "stats" protocol provides statistics (e.g. number of calls to each
lm_lockops function, current and peak values of numbers of locks on inodes,
metadata, etc., and lock latency statistics) for a protocol stacked below it
(the "lower" protocol). It looks like stats are printk()ed when the module
is *unmounted* ... I haven't found any other reporting mechanism.
To mount the stats on top of memexp, try the following options when mounting
the filesystem (see man ogfs_mount):
lockproto=lockstats locktable=memexp:/dev/pool/cidev
(cidev is the device containing the cluster information for memexp, e.g.
/dev/pool/cidev, if you are using pool ... see HOWTO-generic, or HOWTO-nopool).
Code for parsing the lower protocol (e.g. memexp) from the locktable option,
and mounting it is in src/locking/modules/stats/ops.c, stats_mount().
3. G-Locks (global locks)
-------------------------
The G-Lock layer is an abstract locking mechanism that is used by the file
system layer. It provides a service interface that conveniently supports:
-- using one of several available locking protocols (lock modules)
-- executing filesystem- and protected-entity-type-specific actions (glops)
before or after acting on a lock.
-- dependencies and parent/child relationships that the filesystem may wish
to impose on locks
The G-Lock layer's interfaces include:
-- G-Lock services, presented to the filesystem code
-- Lock module interface, between G-Lock layer and lock module
-- Lock and LVB commands to lock module
-- Callback from lock module to request lock release, journal replay
-- glops hook for installing filesystem- and type-specific actions on
each lock
In theory, the G-Lock layer should be usable in any other software, too. The
glops "socket" provides the opportunity to use G-Lock with other filesystems,
and define new lock types and associated actions.
Enhancement: Pull the G-Lock code from the file system sources and put it into
a separate module compiled as a library.
Lock instances
--------------
The G-Lock layer interfaces between the file system layer and the locking
backend. The G-Lock layer decides whether the lock is already within the
node (perhaps owned by another process, perhaps unowned), or whether it needs
to get the lock from "outside", that is, from the inter-node locking protocol.
If going "outside", G-Lock uses the lock module (true inter-node locking for
memexp, or "fake" inter-node for the nolock protocol).
A lock lives in (at least) two instances:
1. In the locking backend, outside of the file system. This inter-node lock
may get passed around between the cluster member nodes by way of a central
lock storage facility (in the case of memexp) or perhaps other methods,
e.g. passing between nodes directly via LAN (for OpenDLM, a distributed
lock manager, if/when a locking module is developed/integrated for OpenDLM).
The backend's lock implementation can vary for different protocol modules.
There are several data types defined in src/include/lm_interface.h as
"void", to support this variability. This allows the G-Lock layer to
pass these private structures around in a generic way, but not to actually
access them:
lm_lock_t -- generic lock. Identifies instance of a lock within
the module. Current implementations:
me_lock_t -- for memexp
nolock_lock -- for nolock
stats_lock -- for stats
lm_lockspace_t -- generic lock "space". Identifies instance of lock
module (there can be several instances of the module
on a given node, one instance for each OGFS filesystem
mounted on the node). Typically, it is the lock
module's "superblock" structure.
Current implementations:
memexp_t -- for memexp
nolock_space -- for nolock
stats_space -- for stats
On the opposite side of the interface, the lock module carries an ID for
the filesystem it is mounted on. Just as filesystem code never accesses
lock-module-specific structures, the lock module never accesses this data:
lm_fsdata_t -- generic filesystem data. Identifies instance of
filesystem (there can be several OGFS filesystems
using the same module on a given node), when module
does a callback to G-Lock layer. OGFS sets
this to be the filesystem's incore superblock
structure, usually seen as "sdp" in fs code.
The representation of a lock within a locking backend is significantly more
primitive than the G-Lock layer's representation; the interface between
G-Lock and locking modules exchanges only a few basic parameters for each
lock, thus limiting the knowledge that a lock module can have about it:
-- lockname (64-bit lock number and 32-bit lock type)
-- lock state (unlocked/shared/deferred/exclusive)
-- attached lock value block (LVB), if any, and PERManent status of LVB
-- cancellation (request from G-Lock to end a lock retry loop)
-- flags attached to lock request from G-Lock:
-- TRY (do not block if lock request can't be immediately granted)
-- NOEXP no expiration (allows dead node's lock to be held by this node)
-- release state (request from backend for G-Lock layer to release lock)
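The short parameter list above can be summarized in a few declarations. The
names below follow the lm_interface naming style but are illustrative, not
copied from the header:

```c
#include <stdint.h>

/* Sketch of the few per-lock parameters exchanged across the
 * G-Lock <-> lock module interface; names are illustrative. */
struct lm_lockname {
	uint64_t ln_number;   /* 64-bit lock number */
	uint32_t ln_type;     /* 32-bit lock type   */
};

/* lock states */
enum {
	LM_ST_UNLOCKED,
	LM_ST_SHARED,
	LM_ST_DEFERRED,
	LM_ST_EXCLUSIVE,
};

/* request flags from the G-Lock layer */
#define LM_FLAG_TRY    0x1  /* don't block if not immediately grantable */
#define LM_FLAG_NOEXP  0x2  /* allow holding a dead node's lock         */
```

Everything else about a lock (dependencies, ownership, cacheing, glops) stays
on the G-Lock side of this interface.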
Other data relating to the lock, within the backend, is private to the
lock module, and is used for implementing the locking management of the
specific lock protocol. Note that there is no awareness by the backend
of inter-lock dependencies, parent/child relationships, process ownership,
recursive locking, lock cacheing, glops actions, filesystem transactions,
all of which are handled by, and confined to, the G-Lock layer (see below).
2. Within the filesystem module, as a struct ogfs_glock (ogfs_glock_t) in
kernel memory of a given node. The file system layer knows only about the
ogfs_glock_t structure (and nothing about the representation of a lock
within a locking module). Within the node, the structure is protected by
some code similar to semaphores.
The G-Lock layer handles locks at a significantly more sophisticated
level than does a lock module. It includes support for inter-lock
dependencies, parent/child relationships, process ownership, recursive
locking, lock cacheing, glops actions, filesystem transactions, and more.
This data type is defined independently of the locking protocol, with
no variability in its definition. For a detailed description of this
data type, please refer to "G-Lock Structure" below.
G-Lock Cache, G-Lock Daemon
---------------------------
The G-Lock cache stores and organizes ogfs_glock_t structures on *this* computer
node. The cache is implemented as a hash table with 3 chains:
perm -- glocks are currently locked by a process on this node, are
expected to be locked for a long time, and are locked at
inter-node scope.
Filesystem code may request this chain by using the GL_PERM flag
in a call to ogfs_glock() or any of its wrappers. Used for:
OGFS_MOUNT_LOCK, OGFS_LIVE_LOCK, journal index, flocks,
plocks (POSIX locks), and dinodes (when reading into inode cache).
This chain allows searches for glocks to be more efficient.
Some searches start with glock chains ("notheld" or "held") that
are more likely to hold the search target, and leave the "perm"
chain until last. A search for an unlocked glock can skip the
"perm" chain altogether.
Glocks in this chain move immediately to the "held" chain when
unlocked for the last time (recursive locking) by a process.
Glocks in this chain have the GLF_HELD and GLF_PERM flags set.
held -- glocks are currently or recently locked by a process on this node,
and are locked at inter-node scope.
Glocks typically stay in this cache chain for 5 minutes after
being unlocked for the last time (recursive locking) by a
process. This node retains the lock at inter-node scope, so
the glock is ready to be quickly locked again by a process,
without negotiating with the lock module. After the 5 minute
timeout, the glockd() cleanup daemon releases the inter-node
lock, and moves the glock to the "notheld" chain.
The inter-node lock may be released before the 5 minute timeout,
by request of a NEEDS or DROPLOCKS callback from another node.
When the inter-node lock is released, the glock moves to the
"notheld" chain.
Locks in this chain have the GLF_HELD flag set, GLF_PERM unset.
!!!??? (dv) Holding locks may be harmful on systems that write
data more often than they read it. Should this be tuneable?
notheld -- glocks are not locked by any process on this node, and are
*not* locked at inter-node scope.
if gl_count == 1, some process has some interest in the glock,
even though it is not locked (process could be getting ready
to lock a glock, etc.).
if gl_count == 0, no process has an interest in the lock,
contents of lock structure are meaningless,
and the structure is free to be re-used for another glock
(see new_glock() in glock.c) or be de-allocated.
Locks in this chain have the GLF_HELD and GLF_PERM flags unset.
Glock structures are first allocated and placed into the notheld cache via the
ogfs_get_glstruct() call. For kernel-space code, glock structures are
allocated using the following kernel call:
kmem_cache_alloc(ogfs_glock_cachep, GFP_NOFS);
G-Locks are "promoted" from notheld -> held -> perm, and "demoted" from
perm -> held -> notheld, always one step at a time (never moving directly
between notheld and perm). ogfs_glock() handles a GL_PERM request in two
stages, first putting the lock into "held", then bumping it to "perm".
ogfs_gunlock() moves it back down to "held" when a GL_PERM glock is unlocked.
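The chain membership rules and one-step promote/demote moves described above
can be sketched as follows (flag and chain names follow the text; the real
logic is spread through glock.c):

```c
/* Sketch of the glock cache chain rules; names follow the document. */
#define GLF_HELD  0x1
#define GLF_PERM  0x2

enum chain { CHAIN_NOTHELD, CHAIN_HELD, CHAIN_PERM };

/* Which chain a glock belongs in, given its flags:
 * perm: GLF_HELD and GLF_PERM set; held: GLF_HELD only; notheld: neither. */
static enum chain glock_chain(unsigned int flags)
{
	if (flags & GLF_PERM)
		return CHAIN_PERM;     /* GLF_HELD is also set here */
	if (flags & GLF_HELD)
		return CHAIN_HELD;
	return CHAIN_NOTHELD;
}

/* Promotion and demotion always move one step at a time; a glock never
 * jumps directly between "notheld" and "perm". */
static enum chain promote(enum chain c)
{
	return (c == CHAIN_PERM) ? CHAIN_PERM : c + 1;
}

static enum chain demote(enum chain c)
{
	return (c == CHAIN_NOTHELD) ? CHAIN_NOTHELD : c - 1;
}
```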
ogfs_put_glstruct() decrements gl->gl_count, the reference/usage/access count
for code accessing the structure contents. ogfs_put_glstruct() does nothing
more than that. The system relies on periodic garbage collection, performed
by the G-Lock kernel daemon, ogfs_glockd(), to de-allocate these structures.
_ogfs_read_super() launches it during filesystem mount, and schedules it to
run once every 5 seconds.
ogfs_glockd() is implemented in src/fs/arch_*/daemon.c, since the hooks to
the daemon scheduling mechanism are architecture-dependent. However, the
real work is done by architecture-independent ogfs_glockd_scan(), in glock.c,
which calls:
-- ogfs_pitch_inodes(). This is arch-independent, in glock.c. It scans
through the "held" and "notheld" glock cache chains, looking to destroy
inactive inode structures that were under the protection of glocks that
are no longer held by any process on this node.
The glock cache scan happens every time ogfs_pitch_inodes() is called
(typically every 5 seconds, when called from glockd daemon). In
addition, no more often than once every 60 seconds, ogfs_pitch_inodes()
calls ogfs_drop_excess_inodes(), which cleans the *kernel's* directory
cache and inode cache.
ogfs_drop_excess_inodes() is an arch-specific (kernel vs. user-space)
routine, see src/fs/arch_linux_2_4. It calls kernel functions:
-- shrink_dcache_sb(), toss out unused(?) directories from kernel's dcache
-- shrink_icache_sb(), toss out unused(?) inodes from the kernel's icache.
Order is important, since a directory uses an inode. Freeing a
directory makes its inode unused, so it in turn can be freed.
-- scan_held_glocks(). This gets called once for every glock cache hash
bucket (all 512 of them). Scans the "held" cache chain for glocks which:
a). Have dependencies. Mark these with GLF_DEPCHECK for a bit later.
b). Are no longer locked (i.e. gl_locked == 0) by any process on this
node. There are two possible situations in this case:
1) (timeout threshold reached || GLF_DEMOTEME) && !GLF_STICKY
In this situation, the glock is free to be dropped, so we drop
it via drop_glock() (see section "Caching G-Locks, Callbacks").
This releases this node's inter-node lock corresponding to the
glock, and moves the glock structure into the "notheld" chain.
A glock normally sits in the "held" chain for a while after
all processes on this node have unlocked it. While held, the
lock does not change state with regard to the locking module
(i.e. the inter-node lock status stays the same). This keeps
the glock ready for use should a process in this node need it
again. The timeout, however, allows the lock to be automatically
dropped (i.e. this node gives up its inter-node lock), if it
hasn't been recently used.
The glock structure's gl_stamp member is used to remember when
major changes of state occur to the glock. G-Lock code marks
the time when it:
-- gets a glock structure via ogfs_get_glstruct()
-- unlocks the lock via ogfs_gunlock()
-- moves lock from "held" to "notheld" cache chain via
unhold_glock()
The unlock and ogfs_get_glstruct() situations are the ones that
apply here (get_glstruct() can find the glock in the held chain),
and this seems to be the only place in which gl_stamp is ever
tested for a timeout.
The timeout is passed as a parameter to scan_held_glocks(),
passed in turn from ogfs_glockd_scan(), passed in turn from:
-- ogfs_glockd(), the glock cleanup daemon, with
timeout = sdp->sd_gl_heldtime. This is set by
_ogfs_read_super(), the filesystem mount function,
to be 300 seconds. So, normally, a no-longer-used lock
will stay in a node's glock "held" chain for 5 minutes.
-- ogfs_glock_cb(), the lock module callback function to the
glock layer, for LM_CB_DROPLOCKS, with timeout = 0. This
will cause any unheld lock to be dropped from the "held" cache
chain, regardless of how long it has been there.
The GLF_DEMOTEME flag is used by *this* node (other nodes use the
DROPLOCKS or the NEED callbacks) to remove a no-longer-used glock
from the "held" cache before it times out. OGFS code sets it in
only one situation. rm_bh_internal() (see fs/arch_*/dio_arch.c)
sets it when it has removed the last buffer from the
arch-specific list, attached to the glock, that is used for
invalidating buffers (when releasing a lock?).
The GLF_STICKY flag is used to keep a glock in the glock cache,
even after it times out after 5 minutes of non-use by this node.
OGFS uses it for only 3 locks that get used throughout an OGFS
session:
-- resource index inode
-- journal index inode
-- transaction
2) Not timed out, or STICKY flag is set. In this situation, we
cannot drop the lock, but we check here for the GLF_DEPCHECK
flag that was set earlier in the function. If set, we sync to
disk all data protected by the locks that are dependent on this
lock, via sync_dependencies().
Note that for the DROPLOCKS callback, the timeout is 0, so
a non-STICKY glock will always be dropped rather than having
dependencies sync'd.
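The drop decision described in case 1) above can be sketched as follows. This is a simplified illustration, not the actual OGFS code; the function name should_drop() and the flag values are invented for the example.

```c
#include <time.h>

/* Simplified sketch (not the actual OGFS code) of the decision
 * scan_held_glocks() makes for each unlocked glock on the "held"
 * chain.  Flag values are invented for illustration. */
#define GLF_STICKY   0x01
#define GLF_DEMOTEME 0x02

/* Returns nonzero if the glock should be dropped, i.e. the inter-node
 * lock released and the glock moved to the "notheld" chain. */
int should_drop(unsigned long gl_flags, time_t gl_stamp,
                time_t now, time_t timeout)
{
    int timed_out = (now - gl_stamp) >= timeout;

    /* (timeout threshold reached || GLF_DEMOTEME) && !GLF_STICKY */
    return (timed_out || (gl_flags & GLF_DEMOTEME))
           && !(gl_flags & GLF_STICKY);
}
```

Note how the LM_CB_DROPLOCKS case (timeout = 0) makes timed_out always true, so every non-STICKY unlocked glock is dropped.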
After calling ogfs_pitch_inodes() and scan_held_glocks(), all excess inter-node
locks have been released to the cluster, and all corresponding glocks have been
moved to the "notheld" glock cache chain. ogfs_glockd_scan() then:
-- scans through "notheld" chain, looking for any glock with no access
interest from any process (i.e. gl_count == 0). It removes any such
structures from the glock cache altogether, and calls release_glstruct()
which calls:
-- ls_ops->put_lock() to tell lock module to reclaim its private
data structure attached to the glock.
-- ogfs_free_glock() to de-allocate the glock structure via kernel's
kmem_cache_free() call
The G-Lock cache needs cleaning up in a couple of other situations, as well,
but these are handled outside of the ogfs_glockd daemon, per se:
1. LM_CB_DROPLOCKS callback (see ogfs_glock_cb()) from lock module, when the
lock module's lock storage becomes nearly (but not yet completely) full.
See ogfs-memexp doc. The callback asks the glock layer to free any
glocks that are in the glock cache, but not actually being locked by any
process. The callback routine calls the following:
-- ogfs_drop_excess_inodes(), to clean out kernel's directory cache and
inode cache (as described above).
Note that ogfs_drop_excess_inodes() is conditionally called by
ogfs_pitch_inodes() (see directly below), but no more often than every
60 seconds (an interval hard-coded in ogfs_pitch_inodes()). Calling
ogfs_drop_excess_inodes() explicitly from the DROPLOCKS callback
immediately cleans up as many inodes as possible, without regard to
how recently it has been done before.
-- ogfs_pitch_inodes(), to clean out inactive inodes attached to glocks
in the "held" and "notheld" cache chains (as described above).
Enhancement: We might be able to leave this call out of the callback
code, since ogfs_glockd_scan() (see directly below) also calls it.
-- ogfs_glockd_scan(). Also arch-independent, in glock.c. This is
the same function called by the glock daemon to clean up no-longer-held
glocks from the glock cache (see above).
2. Filesystem unmount (called from super_linux.c, just prior to lock module
unmount), using:
-- ogfs_clear_gla()
Caching G-Locks, Callbacks
--------------------------
One performance-critical feature of the G-Lock layer is holding (or "caching")
locks. After a node has acquired an inter-node lock, and a process has taken
ownership of the glock, done the needed job, and released the glock internally,
the inter-node lock is not released immediately to the locking backend (unless
there is a request for the lock from another node). Instead, the G-Lock layer
retains the glock in the G-Lock cache for a while (see "G-Lock Cache, G-Lock
Daemon").
In case some *other* node needs an incompatible lock (e.g. needs a shared lock,
when this node holds an exclusive lock, or needs an exclusive lock, when this
node holds a lock of any sort), the other node's locking backend calls *this*
node, via this node's backend, and thence via the ogfs_glock_cb() function in
glock.c, to ask *this* node to yield the lock.
LM_CB_NEED_E - need exclusive lock
LM_CB_NEED_D - need deferred lock
LM_CB_NEED_S - need shared lock
Note that there is no way for a node to ask the filesystem on another
node to suspend or "hurry up" a current operation (e.g. a write transaction or
a read operation) and give up a lock. The operation continues until the
filesystem completes it and unlocks the lock at the normal time.
The "NEED" messages are simply advisory as to how the other (requesting) node
will use the lock, once the filesystem code on this node is done with it.
The NEED callbacks do the following (see ogfs_glock_cb() in src/fs/glock.c):
1). Search "held" and "perm" glock cache chains for requested lock.
-- If not found, this node doesn't hold the lock any more, simply
return from call.
Caveat: If, somehow, this node thinks it
doesn't hold the lock, but lock storage *does* show this node as a
holder, there is an infinite loop created as the other node keeps
requesting that this node release the lock. This shouldn't happen,
of course, but it actually does seem to happen occasionally.
Enhancement: Allow this node to repeat the lock release attempt,
to eliminate the infinite loop.
-- If found, continue to next step.
2). Mark glock's gl_flags field with GLF_RELEXCL, GLF_RELDFRD, or GLF_RELSHRD,
as appropriate, based on the request from the other node. Note that
the flag gets set regardless of whether any process has exclusive access
to the glock structure (via the GLF_LOCK bit in gl_flags). GLF_LOCK has
nothing to do with lock state, but just means that some process is in
the middle of manipulating the glock structure's contents at the moment.
3). Try to get exclusive access to the structure via GLF_LOCK.
-- If cannot, simply return from call (after decrementing gl_count field,
which was incremented when the glock was found in the glock cache).
We cannot try to release the lock while a process manipulates it.
-- If can, continue to next step.
4). Check gl_locked to see if any process on this node has the lock locked.
-- If so, we cannot release the lock to another node. We must wait until
all processes on this node have unlocked the lock. Simply return
from call (after releasing exclusive access via GLF_LOCK, and
decrementing gl_count field).
-- If not, continue to next step.
5). If no process (on this node) has the lock locked, we can immediately
proceed to make the lock available to the other node, by releasing our
node's hold on the lock:
a). Check *inter-node* lock state (as we see it in our local glock struct)
to make sure it is indeed locked.
-- if *not* locked, we figure that the other node should already
have what it needs ... simply return from call (after releasing
the exclusive access to the structure, and decrementing gl_count).
Caveat: If, somehow, this node thinks the lock is unlocked,
but lock storage thinks it *is* locked, and shows this node as a
holder, there is an infinite loop created as the other node keeps
requesting that this node release the lock. This shouldn't happen,
of course, but it actually does seem to happen occasionally.
Enhancement: Allow this node to repeat the lock release attempt,
to eliminate the infinite loop.
-- If locked (as expected), proceed to next step.
6). Release this node's hold on the lock at inter-node scope. This is done
in one of two ways, depending on the NEED request:
-- EXCLUSIVE calls drop_glock() function, to unlock this node's lock at
inter-node scope, and remove this lock from this node's glock cache.
drop_glock() does the following, some via cleanup_glock() and
sync_dependencies(), and their calls to glops:
a). sync data associated with glock's dependent glocks, via
gl_ops->go_sync(), to disk.
b). drop_glock() any of this glock's child glocks. This includes
syncing any of their associated data to disk, and that of their
dependencies and children, etc.
c). sync glock's data to disk via gl_ops->go_release(), which also
writes LVB info if glock is on a resource group.
d). call lock module, via ls_ops->unlock(), to unlock this node's hold
on the inter-node lock.
e). move glock to "notheld" glock cache chain in this node.
-- SHARED or DEFERRED calls xmote_glock() function, to change inter-
node lock state, and update the state in this node's glock cache.
xmote_glock() does the following, some via cleanup_glock() and
sync_dependencies(), and their calls to glops:
a). Check if inter-node lock state (as we see it in our local glock
structure) is already in the requested state.
-- if so, we're done. Simply return from the call. (Caveat:
does this have the same problems mentioned above about glock
cache and lock storage disagreeing on state??).
-- if not, proceed to next step.
b). call cleanup_glock() to sync dependents' and children's data to
disk. (same as steps a, b, c above).
c). call lock module, via ls_ops->lock() to change inter-node lock
state to requested one.
d). update cached glock to reflect new status returned from lock
module (including setting GLF_RELEXCL/DFRD/SHRD if locking module
knows of another queued request(?)).
e). call gl_ops->acquire() to load fresh LVB data from locking module
if needed.
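The callback dispatch in steps 2) and 6) above boils down to a small mapping from the NEED request to a GLF_REL* flag and a release action. The sketch below is illustrative only; the enum values, the flag values, and the helper names need_flag()/need_action() are invented and do not match the real headers.

```c
/* Illustrative mapping of the NEED callbacks to the GLF_REL* flag set
 * in step 2) and the release action taken in step 6). */
enum need   { NEED_E, NEED_D, NEED_S };        /* LM_CB_NEED_* */
enum action { ACT_DROP, ACT_XMOTE };

#define GLF_RELEXCL 0x01
#define GLF_RELDFRD 0x02
#define GLF_RELSHRD 0x04

unsigned long need_flag(enum need n)
{
    switch (n) {
    case NEED_E: return GLF_RELEXCL;
    case NEED_D: return GLF_RELDFRD;
    default:     return GLF_RELSHRD;
    }
}

/* An EXCLUSIVE request drops the glock entirely (drop_glock());
 * SHARED/DEFERRED only demote it in place (xmote_glock()). */
enum action need_action(enum need n)
{
    return n == NEED_E ? ACT_DROP : ACT_XMOTE;
}
```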
If the callback function could not immediately satisfy the request of the other
node, the GLF_RELEXCL/DFRD/SHRD bits store the fact that another node wants the
lock. When the filesystem unlocks a lock, the ogfs_gunlock() function checks
the following in the glock structure:
-- gl_count to see if any other processes have a hold on the lock. If not, we
can release the lock to another requesting node, if there is one.
-- gl_flags field for the GLF_RELEXCL/DFRD/SHRD bits. If so, it calls either
drop_glock() (for exclusive) or xmote_glock() (for deferred or shared).
These are the same functions called by the callback, described above.
G-Lock Structure
----------------
The following paragraphs describe each member of struct ogfs_glock. One
such structure exists for each G-lock.
1. G-Lock cache hash table
struct list_head gl_list -- Hash table hook
unsigned int gl_bucket -- Hash bucket that we inhabit
See "G-Lock Cache", above.
2. Lock name
lm_lockname_t gl_name -- Unique "name" (but not a string!) for lock
The lockname structure has two components:
uint64 ln_number -- lock number
unsigned int ln_type -- type of protected entity
For most locks, the lock number is the block number (within the filesystem's
64-bit linear block space, which can span many storage devices) of the
protected entity, left shifted to be equivalent to a 512-byte sector.
Details are in src/fs/glock.c, ogfs_blk2lockname().
As an example, if we wanted to protect an inode at block 0x100, and we
are using 4-kByte blocks, the lock number would be 0x0800 (0x100 << 3).
I believe the block-to-sector conversion is for support of hardware-based
DMEP protocols, which address the DMEP storage space in terms of 512-byte
sectors. This could turn out to be problematic in *very large* 64-bit
filesystems, if they want to use the upper 3 bits of the 64-bit block
number.
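A minimal sketch of the block-to-sector conversion, assuming the shift is derived from the ratio of block size to 512-byte sectors (see ogfs_blk2lockname() in src/fs/glock.c for the real code; the function name blk2lockno() is invented for this example):

```c
#include <stdint.h>

/* Sketch of the block-to-sector conversion described above.  The lock
 * number is the 64-bit block number, shifted left so that it counts
 * 512-byte sectors instead of filesystem blocks. */
uint64_t blk2lockno(uint64_t blkno, unsigned int block_size)
{
    unsigned int shift = 0;

    /* how many doublings of 512 bytes fit in one filesystem block? */
    while ((512u << shift) < block_size)
        shift++;
    return blkno << shift;
}
```

With 4-kByte blocks this reproduces the example above: block 0x100 becomes lock number 0x0800.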
There is a special lock for the disk-based superblock, defined in
src/fs/ogfs_ondisk.h. Note that this lock is not based on the block
number (the superblock is *not* stored in block 0):
OGFS_SB_LOCK (0) -- protects superblock read accesses from fs upgrades
In addition to the block-based number assignments, OpenGFS uses some
special, non-disk lock numbers. They are defined in src/fs/ogfs_ondisk.h
(even though they don't show up on disk!):
OGFS_MOUNT_LOCK (0) -- allows only one node to mount at a time.
Note: same lock number as OGFS_SB_LOCK,
but different type, so different lock!
OGFS_LIVE_LOCK (1) -- protects??
OGFS_TRANS_LOCK (2) -- protects journal recovery from journal transactions
OGFS_RENAME_LOCK (3) -- protects file/directory renaming/moving operations
See "Special Locks" below for more details.
The lock type is determined by the glops attached to the ogfs_glock()
call to request the lock. See "glops", elsewhere in this document. Lock
types are defined in src/include/lm_interface.h:
LM_TYPE_RESERVED (0x00) -- not used by OpenGFS
LM_TYPE_CIDBUF (0x01) -- cluster information device, used by memexp
LM_TYPE_MOUNT (0x02) -- mount, used by memexp
LM_TYPE_NONDISK (0x03) -- special locks
LM_TYPE_INODE (0x04) -- inodes
LM_TYPE_RGRP (0x05) -- resource groups
LM_TYPE_META (0x06) -- metadata
LM_TYPE_NOPEN (0x07) -- n-open
LM_TYPE_FLOCK (0x08) -- Linux flock
LM_TYPE_PLOCK (0x09) -- POSIX file lock
LM_TYPE_PLOCK_HEAD (0x0A) -- POSIX file lock head
LM_TYPE_LVB_MASK (0x80) -- Lock Value Block, ORd with other type number
Note that there is no lock type for individual data blocks. The glock
layer inserts individual data blocks into a list of protected blocks
associated with each glock. For example, a locked inode may have many
data blocks attached to its glock.
Since the lock name is dependent on *both* the lock number and the type,
ogfs can request more than one unique lock (each of a different type) on
the same filesystem block or static lock number.
As an example, ogfs_createi() (create a new inode), locks two locks on
the same lock number (current OpenGFS implementation sets
inum.no_formal_ino = inum.no_addr), but different lock types/glops:
-- (inum.no_formal_ino), in exclusive (0) mode, using ogfs_inode_glops
-- (inum.no_addr), in shared (GL_SHARED) mode, using ogfs_nopen_glops
Even though one of them is exclusive, they will both succeed, since they
are, indeed, different locks.
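The point about lock identity can be shown with a simplified stand-in for lm_lockname_t. The struct mirrors the two components listed above; lockname_equal() is an invented helper (the real comparison happens in find_glstruct()):

```c
#include <stdint.h>

/* Simplified stand-in for lm_lockname_t, to show that lock identity
 * depends on *both* the number and the type. */
#define LM_TYPE_INODE 0x04
#define LM_TYPE_NOPEN 0x07

typedef struct {
    uint64_t     ln_number;
    unsigned int ln_type;
} lockname_t;

int lockname_equal(lockname_t a, lockname_t b)
{
    return a.ln_number == b.ln_number && a.ln_type == b.ln_type;
}
```

So the two locks taken by ogfs_createi() on the same lock number, one inode-type and one nopen-type, are distinct names and do not conflict.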
3. Reference count
atomic_t gl_count - Reference/usage count of ogfs_glock structure
This represents a depth of reference/usage/access for code reading or
writing the structure contents. It does *not* represent anything regarding
lock state, recursive locking, or exclusive access to a glock structure.
ogfs_get_glstruct() increments gl_count if structure found in glock cache,
or sets gl_count = 1 if new alloc (and does lots more!)
ogfs_put_glstruct() decrements gl_count (and does nothing more!)
ogfs_hold_glstruct() increments gl_count (and does nothing more)
gl_count > 0 keeps the glockd daemon from removing a glock from the
"notheld" glock cache chain and de-allocating its structure.
4. Flags
unsigned long gl_flags - Flags
These appear to be mostly (except for LOCK, SYNC, DIRTY, POISONED,
RECOVERY(?)) for glock cache maintenance.
GLF_HELD - lock is held by a process (in "held" or "perm" glock cache)
set/reset only within glock.c by:
hold_glock(), unhold_glock()
GLF_PERM - lock is expected to be held for a long time (in "perm" cache)
set/reset only within glock.c by:
perm_glock(), unperm_glock()
GLF_LOCK - mutex for exclusive access to all glock structure fields.
set/reset only within glock.c by:
try_lock_on_glock(), lock_on_glock(), unlock_on_glock()
GLF_SYNC - sync data and metadata to disk when process releases lock
GLF_DIRTY - the incore data/metadata !!!??? has changed
GLF_POISONED - transaction failed
GLF_RELEXCL - another computer node needs this lock in exclusive mode.
don't cache it (just drop it) when process releases it.
GLF_RELDFRD - another computer node needs this lock in deferred mode.
keep it cached in this node's "held" chain when process
releases it, in case this node needs it again.
GLF_RELSHRD - another computer node needs this lock in shared mode.
keep it cached in this node's "held" chain when process
releases it, in case this node needs it again.
GLF_STICKY - don't demote this glock. Used only in glocks for riinode,
jiinode, and transaction.
GLF_DEMOTEME - demote this glock. Used by arch-specific code to indicate
that there are no more buffers covered by this glock.
GLF_DEPCHECK - indicates that lock has dependencies. Used only within
scan_held_glocks().
GLF_RECOVERY - Set by ogfs_glock() when lock request has LM_FLAG_NOEXP.
Normally, ogfs_glock() resets this before returning.
In some error cases, though, it does not.
5. Lock Structure Ownership (Locking by a process)
long gl_pid - Process ID of process, if any, that owns the struct,
0 if no owner, -1 if GL_DISOWN
atomic_t gl_locked - recursive count of process ownership
spinlock_t gl_head_lock - spinlock that covers above 2 fields (only)
An audit shows that gl_pid is always covered by spinlock gl_head_lock.
gl_locked is sometimes covered by GLF_LOCK (which covers *entire* struct)
instead of gl_head_lock.
Once a node has acquired a lock, it must prevent corruption of its protected
resource (inode, block, etc.) by multiple processes on the node (which can
have more than one CPU). This protection is achieved through the concept of
ownership of the ogfs_glock_t structure.
Using the GL_DISOWN flag, the requesting process can ask not to be recorded
as the owner of the structure. This effectively prevents the same process
from further requesting recursive ownership of the structure, and allows other
processes to unlock the lock (is this sharing, or not? !!!investigate how
this is really used).
Caveat: There is no concept of a shared (read only) ownership of the
structure within a node. Thus, all read operations on the protected resource
are serialised within the node. !!!Investigate how much of a performance
penalty this is.
Caveat: Because of a race condition between the request_glock_ownership() and
request_glock_wait_or_abort() functions, requests for ownership can be
processed out of order, i.e. a process that requests ownership later than
another process may be granted ownership first. !!!Investigate if this can
cause deadlocks. !!!Investigate if a simple semaphore could be used instead.
Caveat: A deadlock occurs if a process requests ownership with GL_DISOWN and
later requests the same ownership again. !!!Investigate if this can happen.
6. Waiting for process' exclusive access to structure
wait_queue_head_t gl_wait_lock - Wait queue for exclusive access to glock
fields (see GLF_LOCK)
gl_wait_lock is a wait queue used for inter-process (not inter-node)
coordination. This is used with the GLF_LOCK bit in gl_flags, which provides
exclusive access to the fields of the glock structure, but does *not*
indicate anything relating to lock state!
7. Waiting for process' ownership of lock
wait_queue_head_t gl_wait_unlock - Wait queue for glock to be unlocked
by another process.
This is internal to this node, and does not relate to the inter-node lock
state, which must be locked if a process owns it, and will continue to be
locked as the new process takes ownership.
This has nothing to do with gl_wait_lock! This is the wait queue for a
process to wait until another process is done with the lock.
8. Lock operations
ogfs_glock_operations_t *gl_ops - Operations which get called at certain
events over the lifetime of a glock (e.g. just after locking
a lock, or just before unlocking one).
See separate section on glops.
9. Inter-node Lock State
unsigned int gl_state - The inter-node state of the lock
On each node, a lock can be in one of the following states:
LM_ST_UNLOCKED -- the node has not acquired the lock.
LM_ST_EXCLUSIVE -- the node has acquired the lock and no other node
may own or acquire the lock before it is released (write lock).
LM_ST_SHARED -- the node has acquired the lock and other nodes may own or
acquire it while this node owns it (read lock).
LM_ST_DEFERRED -- another shared mode, but cannot be shared with LM_ST_SHARED.
Note: It is unclear to me if and how this mode is used. If it is used,
the memexpd server seems to be the one to request that mode on its own,
without being told to do so by a node.
Currently, lock modes refer only to inter-node (not inter-process nor SMP)
locking. Therefore, a node may own the lock and hold it in exclusive, shared,
or deferred state, even though no process on the node currently has the glock
locked. The glock will be on the "held" glock cache chain in this situation.
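Based on the state descriptions above, inter-node compatibility can be modeled as below. The enum values and the helper states_compatible() are illustrative, not taken from the real headers: exclusive is compatible with nothing, shared shares only with shared, and deferred shares only with deferred.

```c
/* Inter-node state compatibility as described above (illustrative
 * enum values; the real LM_ST_* constants live in the lock headers). */
enum lock_state {
    LM_ST_UNLOCKED,
    LM_ST_EXCLUSIVE,
    LM_ST_DEFERRED,
    LM_ST_SHARED
};

/* Can two nodes hold the same lock in states a and b at once? */
int states_compatible(enum lock_state a, enum lock_state b)
{
    if (a == LM_ST_UNLOCKED || b == LM_ST_UNLOCKED)
        return 1;
    return a == b && a != LM_ST_EXCLUSIVE;
}
```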
10. Lock Module Private data
lm_lock_t *gl_lock - Per-lock private data for the lock module
The private data is never accessed by glock or filesystem layer code,
but these layers may pass around the pointer for use by the lock module.
This pointer is included in almost every call from the G-Lock layer to
the lock module (usually seen as gl->gl_lock). Not to be confused with
module private data pointer saved as sdp->sd_lockstruct.ls_lockspace,
which is not per-lock data, but rather module instance data.
11. Lock Relationships
struct ogfs_glock *gl_parent - This lock's parent lock (NULL if no parent)
struct list_head gl_parlist - Parent's list of children
struct list_head gl_children - List of children of this lock
Locks may be attached to a parent when allocating a lock with
ogfs_get_glstruct(). This fills the gl_parent member of this lock's
glstruct, and adds this lock to the parent's gl_children list.
Locks with identical gl_name values (i.e. identical lock number and type),
but attached to different parents, are considered unique and separate locks.
See find_glstruct().
I (bc) haven't been able to find where this is used by OpenGFS.
12. Lock Value Blocks
unsigned int gl_lvb_count - Number of LVB references held on this glock
lm_lvb_t *gl_lvb - LVB descriptor (which points to data)
The inter-node lock has a data area that can be used to store global data and
communicate that data to other nodes that acquire the lock. This "lock
value block" (LVB) currently has a size of 32 bytes. The G-Lock layer provides
a function interface to attach and detach data to/from a lock's LVB.
LVBs are used with inter-node locks on resource groups, to pass resource usage
statistics from node to node, when exchanging locks (see "Locking Resource
Groups"). LVBs are also used for plocks (POSIX locks).
13. Version number
uint64 gl_vn - Version number (incremented when cache is not valid any more)
14. Timestamp
osi_clock_ticks_t gl_stamp - Time of create or last unlock
The glock structure's gl_stamp member is used to remember when
major changes of state occur to the glock. G-Lock code marks
the time when it:
-- gets a glock structure via ogfs_get_glstruct()
-- unlocks the lock via ogfs_gunlock()
-- moves lock from "held" to "notheld" cache chain via
unhold_glock()
15. Protected Object
void *gl_object - The object the glock is protecting
16. Transaction being built
/* Modified under the glock (i.e. gl_locked > 0) */
struct list_head gl_new_list - List of glocks in transaction being built
struct list_head gl_new_bufs - List of buffers for this lock in transaction
being built
ogfs_trans_t gl_trans_t - The transaction being built
17. In-core Transaction
/* Modified under the log lock */
struct list_head gl_incore_list - List of glocks in incore transaction
struct list_head gl_incore_bufs - List of buffers for this lock in
incore transaction
ogfs_trans_t gl_incode_tr - The incore transaction
18. Dependent G-Locks
atomic_t gl_num_dep - The number of glocks that need to be synced
before this one can be released
struct list_head gl_depend - The list of glocks that need to be synced
before this one can be released
OGFS uses this to make sure that all inodes (with all associated data pages
and buffers) in a resource group are flushed to disk before the resource group
can be released. These fields are set by ogfs_add_gl_dependency(), which is
called only from blklist.c functions:
ogfs_blkfree() -- free a piece of data
ogfs_metafree() -- free a piece of metadata
ogfs_difree() -- free a dinode
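The dependency mechanism can be sketched as follows, assuming a fixed-size dependency list for simplicity. glock_t here is a toy struct; the function names only mirror ogfs_add_gl_dependency()/sync_dependencies(), and the real code syncs buffers to disk rather than setting a flag.

```c
/* Toy sketch of the release-ordering dependency: a resource group
 * glock collects dependent (e.g. inode) glocks that must be synced
 * before the rgrp lock itself can be released. */
#define MAX_DEP 8

typedef struct glock {
    int           num_dep;         /* cf. gl_num_dep */
    struct glock *dep[MAX_DEP];    /* cf. gl_depend  */
    int           synced;
} glock_t;

void add_gl_dependency(glock_t *rg, glock_t *d)
{
    rg->dep[rg->num_dep++] = d;
}

void sync_deps(glock_t *rg)
{
    int i;

    for (i = 0; i < rg->num_dep; i++)
        rg->dep[i]->synced = 1;    /* stands in for flushing to disk */
    rg->num_dep = 0;               /* now safe to release rg's lock  */
}
```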
19. Architecture-specific (i.e. kernel 2.4 vs. user-space) data
ogfs_glock_arch_t gl_arch - Architecture-specific data (struct ogfs_glock_arch)
Kernel-space (src/fs/arch_linux_2_4) uses this for a list of filesystem buffers
associated with the glock, for the purpose of interacting with the kernel
buffer cache. The list contains entries of type ogfs_bufdata_t, which is a
private data structure that filesystem code attaches to Linux kernel buffer
heads.
struct ogfs_glock_arch {
struct list_head gl_bufs; /* Buffer list for caching */
};
typedef struct ogfs_glock_arch ogfs_glock_arch_t;
User-space (src/fs/arch_user) defines ogfs_glock_arch as empty.
Expiring Locks and the Recovery Daemon
--------------------------------------
The lock module is responsible for detecting dead (expired) nodes. The memexp
protocol does this with a heartbeat counter for each client node (see
ogfs-memexp for more info). Note that there is no timeout on individual locks,
and no time restriction for how quickly a filesystem operation must complete.
Once a node is detected as "expired", each of the locks that it held in
shared (read) mode is freed, and each of the locks that it held in exclusive
(write) mode is marked as "expired". This is done by another node (the
"cleaner" node, assigned by the lock module). After freeing/marking,
recovery of the dead node's journal may be performed.
The ogfs_glock_cb() function provides an interface for the locking module
to inform the G-Lock layer, via the LM_CB_EXPIRED callback command, to replay
a dead node's journal. When the callback occurs, ogfs_glock_cb() sets a bit
in sdp->sd_dirty_j, a bitmap that indicates which journal needs recovery, and
then wakes up the process of the journal recovery daemon, ogfs_recoverd().
The recovery daemon normally runs every 60 seconds, and normally finds, when
checking sdp->sd_dirty_j, that no journals need to be replayed. The callback
is the only place where code sets a bit in sdp->sd_dirty_j, thus the callback
is the only method for triggering journal recovery of an expired node (is
there a need for the periodic daemon, then?).
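The sd_dirty_j handshake described above can be sketched as follows, assuming the bitmap fits in a single word indexed by journal ID; both helper names are invented.

```c
/* Sketch of the sd_dirty_j handshake: the LM_CB_EXPIRED callback sets
 * the bit for the dead node's journal and wakes ogfs_recoverd(),
 * which tests and clears bits as it replays journals. */
void mark_journal_dirty(unsigned long *dirty_j, unsigned int jid)
{
    *dirty_j |= 1UL << jid;
}

/* Returns 1 if journal jid needed recovery (and claims it). */
int take_dirty_journal(unsigned long *dirty_j, unsigned int jid)
{
    if (!(*dirty_j & (1UL << jid)))
        return 0;                  /* nothing to replay */
    *dirty_j &= ~(1UL << jid);
    return 1;                      /* caller replays journal jid */
}
```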
There is no need for more than one node to replay the dead node's journal.
The assignment to replay the journal (that is, the recipient(s) of the
LM_CB_EXPIRED callback) depends on the implementation of the locking backend.
When replaying a dead node's journal, the dead node's "expired" (i.e. exclusive
lock held by the dead node) journal lock is needed by the "cleaner" node to
write journal replay results to the filesystem. The special flag LM_FLAG_NOEXP,
contained in a call to the backend's lock function, allows the backend to grant
the lock, even though the lock is "expired". See comments in src/fs/glock.h.
LM_FLAG_NOEXP is also used during filesystem mount to obtain some special
locks that are absolutely needed at mount time, and which may be expired
due to the death of this or another node.
LM_FLAG_NOEXP is used to obtain the following locks:
during filesystem mount:
OGFS_MOUNT_LOCK -- exclusive lock owned only when mounting filesystem
(recovers lock from any node that died while mounting)
OGFS_LIVE_LOCK -- shared lock owned through lifetime of filesystem on this
node (recovers lock from any node that died while
mounting).
(since this lock is always shared, never exclusive,
would it ever be put in "expired" state? might
depend on implementation of locking module?)
journal lock -- exclusive lock for *this* machine's journal, owned
through lifetime of filesystem (NOEXP needed only
if this node died and fs is being remounted).
during journal recovery for any node's journal, *this* or other:
transaction lock -- exclusive lock when doing journal recovery,
keeps all other machines from writing to filesystem.
journal lock -- exclusive lock when doing journal recovery,
allows node to use and modify the journal.
Note that journal recovery is performed without regard to locks on any of the
recovered items. ogfs_recover_journal() grabs only the journal and transaction
locks mentioned above, then calls replay_metadata(), which writes to the
filesystem without grabbing locks on anything it is writing. This is why
it is important to stop all writes across the filesystem before doing a journal
replay.
G-Lock Interfaces
-----------------
The G-Lock layer defines a set of operations which an underlying locking
protocol must implement. These were described in section 2, "Lock Harness
and Lock Modules".
The G-Lock layer also offers a set of services that can be used by the file
system, independent of the underlying architecture and mounted locking
protocol:
Basic lock functions:
ogfs_get_glstruct - locate a pre-existing glock struct in G-Lock cache,
*or* allocate a new one from kernel, init it,
link with parent glock (if parent is in call),
call lock module to allocate a per-lock private data
structure, attach private data to glock, and place
into "notheld" chain in glock cache. Note that this
does not make this lock visible to other nodes, nor
does it fill in any current lock status.
ogfs_put_glstruct - decrement reference count (gl_count)
Note that this does *not* de-allocate the structure,
even if count decrements to 0. This is *not* the
opposite of ogfs_get_glstruct. De-alloc relies on
ogfs_glockd() daemon, which runs once every 5 seconds,
or LM_CB_DROPLOCKS callback from lock module,
to perform garbage collection.
ogfs_hold_glstruct - increment reference count (gl_count)
ogfs_glock - lock a lock
ogfs_gunlock - unlock a lock
"Wrappers" for basic lock functions. All except ogfs_glock_num() require
that glock structure has already been allocated via ogfs_get_glstruct():
ogfs_glock_i - lock an inode
ogfs_gunlock_i - unlock an inode
ogfs_glock_rg - lock a resource group
ogfs_gunlock_rg - unlock a resource group
ogfs_glock_num - lock a lock, given its number
ogfs_gunlock_num - unlock a lock, given its number
ogfs_glock_m - lock multiple locks, given a list
ogfs_gunlock_m - unlock multiple locks, given a list
LVB functions:
ogfs_hold_lvb - attach lock value block (LVB) to a glock
ogfs_unhold_lvb - detach lock value block (LVB) from a glock
ogfs_sync_lvb - sync a LVB (to lock storage, visible to other nodes)
Lock Dependency functions:
ogfs_add_gl_dependency - make release ordering dep. between two glocks
sync_dependencies - sync out dependent locks (to lock storage? fs?)
Callback:
ogfs_glock_cb - callback used by lock modules
For prototypes and flags see "src/fs/glock.h".
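A typical call sequence through this interface might look like the sketch below. The stub functions only record call order; the real functions are in src/fs/glock.c.

```c
#include <string.h>

/* Hypothetical usage of the service interface above. */
char trace[64];

void stub_get_glstruct(void) { strcat(trace, "get "); }
void stub_glock(void)        { strcat(trace, "lock "); }
void stub_gunlock(void)      { strcat(trace, "unlock "); }
void stub_put_glstruct(void) { strcat(trace, "put "); }

/* Typical pattern: find/allocate the glock struct, lock it, do the
 * protected work, unlock, then drop the struct reference. */
void access_protected_object(void)
{
    stub_get_glstruct();   /* ogfs_get_glstruct() */
    stub_glock();          /* ogfs_glock()        */
    /* ... read or modify the protected dinode, rgrp, etc. ... */
    stub_gunlock();        /* ogfs_gunlock()      */
    stub_put_glstruct();   /* ogfs_put_glstruct() */
}
```

Remember that ogfs_put_glstruct() does not free the structure; the glockd daemon reclaims it later.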
glops
-----
Each G-Lock has a vector of functions ("operations") attached to it via gl_ops.
These functions handle all the interesting behavior of the filesystem and
journal that must occur just after getting a lock or just before letting
one go, such as:
Just after getting a lock:
-- reading items from disk
-- reading LVB contents (rgrp usage statistics), sent from old lock owner
Just before giving up a lock:
-- flushing items to disk, so another computer can read them
-- invalidating local buffers, so we don't try to read them
while another computer is modifying their contents
-- filling LVB contents (rgrp usage statistics), for new lock owner to use
The operations are architecture-dependent (arch_user vs. arch_linux_2_4),
and are type-dependent on the protected resource (inode, resource group, etc.).
The operations are called only from within glock.c, and are all (except for one)
implemented in arch_*/glops.c.
The operations are defined in "struct ogfs_glock_operations". A
short description is given here, see src/fs/incore.h for details:
operations called by G-Lock layer:
go_sync - synchronise/flush dirty data (protected by a lock) to disk
meta, rgrp: sync glock's incore committed transaction logs
sync all glock's protected dirty data bufs to disk
inode: same, plus sync protected dirty *pages* to disk
other types: no action
go_acquire - create a lock
rgrp: copy rgrp usage data from LVB (loaded from lock
storage, contains latest data from any node)
other types: no action
go_release - release a glock to another node that needs it in exclusive mode.
meta, inode: sync glock's incore committed transactions
sync all glock's protected dirty data to disk
invalidate all glock's buffers/pages
rgrp: same, plus copy rgrp usage data to LVB (to
store and make visible to other nodes)
other types: no action
go_lock - if this is the process's first (recursive) lock on this glock:
inode: read fresh copy of inode from disk
rgrp: read fresh copy of rgrp bitmap from disk
other types: no action
go_unlock - if this process is finished with this glock:
inode: copy OGFS dinode attributes to kernel's VFS inode
(so kernel can pass it to other process, and/or
write to disk)
rgrp: brelse() rgrp bitmap (so kernel can pass it to
another process, and/or write it to disk)
copy usage stats to LVB structure (so glock layer
can pass to other process or node)
other types: no action
go_free - free all buffers associated with a glock.
used only when unmounting the lock protocol.
meta, inode, rgrp: free all buffers
other types: no action
data fields:
go_type - type of entity protected (e.g. inode, resource group, etc.)
go_name - human-readable string name for particular ogfs_glock_operations
definition
Different implementations exist for different types of entities to be protected,
e.g. inode, resource group, etc. Many of these types require only a few, or
none, of these operations, in which case the respective fields contain NULLs.
The go_type and go_name fields, however, are defined for each and every
ogfs_glock_operations implementation.
When requesting a lock structure, filesystem code selects the
ogfs_glock_operations implementation to be attached to the lock, via the
parameter "glops" in the call to ogfs_get_glstruct().
See Appendix H for an inventory and analysis of glops calls.
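The dispatch pattern described above (a per-glock vector of operations, with NULL entries for types that take no action) can be sketched as follows. This is a minimal illustration with hypothetical names, not the real definitions from src/fs/incore.h:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch of a glops-style operations vector. Types and names
 * here are illustrative; see struct ogfs_glock_operations in
 * src/fs/incore.h for the real thing. */
struct sketch_glock;                     /* stands in for ogfs_glock_t */

struct sketch_glops {
    void (*go_sync)(struct sketch_glock *gl);
    void (*go_acquire)(struct sketch_glock *gl);
    void (*go_release)(struct sketch_glock *gl);
    int         go_type;                 /* entity type protected */
    const char *go_name;                 /* human-readable name */
};

struct sketch_glock {
    const struct sketch_glops *gl_ops;   /* attached via gl_ops */
    int synced;                          /* demo state: did go_sync run? */
};

/* Many types leave most operations NULL, so the caller (glock.c in the
 * real code) must check each pointer before calling through it. */
static void glock_sync(struct sketch_glock *gl)
{
    if (gl->gl_ops && gl->gl_ops->go_sync)
        gl->gl_ops->go_sync(gl);
}

static void demo_sync(struct sketch_glock *gl) { gl->synced = 1; }

static const struct sketch_glops demo_meta_glops = {
    .go_sync = demo_sync, .go_type = 1, .go_name = "sketch_meta",
};

static const struct sketch_glops demo_nondisk_glops = {
    /* all function pointers NULL: "no action" for this type */
    .go_type = 2, .go_name = "sketch_nondisk",
};
```

Calling glock_sync() on a glock carrying demo_nondisk_glops is a no-op, mirroring the "other types: no action" entries in the table above.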
4. Using G-Locks in OpenGFS
---------------------------
The file system code uses G-Locks to protect the various structures on the disk
from concurrent access by multiple nodes. The protected ondisk objects can be
dinodes (including the resource group and journal index dinodes), resource
group headers, the superblock, buffer heads (blocks), and journals. In
addition, glocks are used for non-disk locks for cluster coordination.
The following information describes the locking strategies for various types
of locks and protected entities.
Special Non-Disk Locks
----------------------
In addition to the block-based number assignments, OpenGFS uses some
special, non-disk lock numbers. They are defined in src/fs/ogfs_ondisk.h
(even though they don't show up on disk), and are all of LM_TYPE_NONDISK:
OGFS_SB_LOCK (0) -- protects superblock read accesses from fs upgrades
that would re-write the superblock.
OGFS_MOUNT_LOCK (0) -- allows only one node to be mounting the filesystem
at any time. Locked in exclusive mode, with
nondisk glops, when mounting. Unlocked when mount
is complete, allowing another node to go ahead
and mount.
OGFS_LIVE_LOCK (1) -- protects?? Locked in shared mode, with nondisk
glops, when mounting. Unlocked when unmounting.
Indicates that at least one node has the
filesystem mounted.
OGFS_TRANS_LOCK (2) -- protects journal recovery operations from new
transactions. Used in shared mode by transactions,
so many transactions may be created simultaneously.
Used in exclusive mode by ogfs_recover_journal(),
to force other nodes and processes to finish
current transactions before journal recovery
begins, and keep them from starting new
transactions until the recovery is complete.
This allows the recovery process to have exclusive
write access to the entire filesystem. Note that
the recovery process does *not* bother to grab
locks for protected entities (inodes, etc.) that
it writes.
Always uses trans glops, and is the only lock
to do so.
The glock structure for this lock is allocated
during the filesystem mount, and stays attached
to the incore superblock structure as
sdp->sd_trans_gl.
OGFS_RENAME_LOCK (3) -- protects file/directory renaming/moving operations
from clobbering one another. Always used in
exclusive mode.
The glock structure for this lock is allocated
during the filesystem mount, and stays attached
to the incore superblock structure as
sdp->sd_rename_gl.
Unique On-Disk Locks
--------------------
Resource Group Index -- Protects filesystem reads of the resource group index
from filesystem expansion (or shrinking) operations.
Filesystem expansion cannot proceed until all nodes
unlock this lock, therefore all locks must be
temporary.
All filesystem accesses to the rindex, during the
normal course of filesystem operations, are read
accesses, protected in shared mode. The lock is
on the resource index's dinode (LM_TYPE_INODE),
identified by its filesystem block number. rindex
locking and unlocking is done by:
ogfs_get_riinode() -- initial read-in of rindex'
dinode (but not the rindex file itself),
during the filesystem mount sequence
(see _ogfs_read_super() in
src/fs/arch_linux_2_4/super_linux.c). Attaches
OGFS dinode structure to superblock structure as
sdp->sd_riinode, and the inode's glock structure
is attached to that structure.
Unlocks lock before leaving function, but sets
GLF_STICKY bit so it will stay in glock cache.
This and ogfs_get_jiinode() are the only two
functions that set the GLF_STICKY bit.
ogfs_rindex_hold() -- makes sure we have latest
rindex file contents in-core. Does *not*
unlock the lock unless error. Called from
many places in code that need to access
resource groups. THIS FUNCTION DETECTS
FILESYSTEM EXPANSION (or shrinkage). See below.
ogfs_rindex_release() -- unlocks lock asserted by
ogfs_rindex_hold(). An ogfs_rindex_hold() /
ogfs_rindex_release() pair (often/always?)
surrounds a transaction.
If a user invokes the user-space
ogfs_expand utility (see man page for ogfs_expand,
and source in src/tools/ogfs_expand/main.c),
it writes new resource group headers out to the
new space on disk. These are done outside of the
space that the filesystem knows about (yet), are
written using lseek() and write() calls to the
raw filesystem device, and require no locks.
Once done with resource groups, it writes a new
rindex, appending the descriptions of the new
resource groups to the current rindex file.
This is, of course, written to the filesystem proper
(i.e. not to the raw device, but rather to a file),
using an ioctl OGFS_JWRITE. This ioctl (see
ogfs_jwrite_ioctl() in src/fs/ioctl.c) grabs an
exclusive lock on the rindex inode (using its
block # as the lock #), and also creates a journal
transaction around the write. The exclusive lock
keeps the rindex write from proceeding until all
nodes have completed accessing resource groups.
Finally, the ioctl increments the version number of
the inode's glock, gl->gl_vn++. This is what tells
ogfs_rindex_hold that the rindex has changed. If
so, ogfs_rindex_hold reads the new rindex from disk.
Journal Index -- Protects filesystem reads of journal index from
journal addition (or removal) operations. Journal
addition/removal cannot proceed until all nodes
unlock this lock, therefore all locks must be
temporary.
Journal protection works just the same way as
resource index protection, with some name changes:
ogfs_get_jiinode() -- initial read-in of jindex inode
ogfs_jindex_hold() -- lock jindex, get latest data,
THIS FUNCTION DETECTS JOURNAL ADDITION/REMOVAL!
ogfs_jindex_release() -- unlock jindex lock
The user space utility that adds journals is
ogfs_jadd. See man page for ogfs_jadd,
and source in src/tools/ogfs_jadd/main.c.
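Both the rindex and jindex cases rely on the same version-number comparison: the superblock caches a version (e.g. sdp->sd_riinode_vn) and compares it against the glock's gl_vn, re-reading the index from disk on mismatch. A sketch of that check, with illustrative types (the real logic is in ogfs_rindex_hold()/ogfs_jindex_hold()):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of glock-version-based change detection. */
struct sketch_index {
    uint64_t cached_vn;     /* version of our in-core copy of the index */
    int reread_count;       /* how many times we re-read from disk */
};

/* Returns 1 if the in-core copy was stale and had to be re-read. */
static int index_hold_check(struct sketch_index *idx, uint64_t glock_vn)
{
    if (idx->cached_vn != glock_vn) {
        /* Another node changed the index (fs expansion, journal
         * addition): re-read from disk, catch up to glock version. */
        idx->reread_count++;
        idx->cached_vn = glock_vn;
        return 1;
    }
    return 0;
}
```

The writer side (the OGFS_JWRITE ioctl) only needs to increment gl_vn after its exclusive-mode write; every other node then notices the mismatch the next time it holds the lock.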
Locking Dinodes
---------------
Dinodes are locked in shared mode for read access, or in exclusive mode for
write access. Since dinodes are involved in almost all file system operations,
they are locked quite often in either mode.
Locking Resource Groups
-----------------------
Files and dinodes are stored in "resource groups" on disk (see OGFS "Filesystem
On-Disk Layout"). A resource group is a large set of contiguous blocks that
are managed together. The filesystem is divided into a number of equal sized
(except, perhaps, the first) resource groups. Each resource group on disk has
a header that contains information about used and free blocks within the
resource group.
A file may be spread over a number of resource groups. When a file or dinode
is manipulated, all resource groups that contain the file's data blocks or
meta data must be locked.
Since resource groups are spread evenly over the disks, reading them into
core memory each time they are accessed would incur a horrible performance
penalty. However, only a few fields of the information stored in a resource
group header ever change, namely the usage statistics.
To get around the need to ever read this information from disk once the file
system is mounted, the statistics are stored in the lock value block (LVB) of
the ogfs_glock_t structure that protects the resource group. When a node
modifies the resource group, it writes the new statistics to disk *and* into
the LVB of the G-Lock. The next node that acquires the lock can read this
information from the G-Lock instead of reading the disk block.
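The mechanics of this hand-off can be sketched as a pack/unpack of the usage statistics into the lock value block. Field names and the LVB size here are illustrative only; the real layout lives in the OpenGFS rgrp and LVB code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of caching rgrp usage statistics in an LVB. */
#define SKETCH_LVB_SIZE 32

struct sketch_rgrp_stats {
    uint32_t free_blocks;    /* illustrative usage statistics */
    uint32_t used_dinodes;
};

/* Writer: when releasing the rgrp lock, publish stats via the LVB. */
static void stats_to_lvb(const struct sketch_rgrp_stats *st,
                         unsigned char lvb[SKETCH_LVB_SIZE])
{
    memcpy(lvb, st, sizeof(*st));
}

/* Reader: when acquiring the rgrp lock, read stats from the LVB
 * instead of re-reading the rgrp header from disk. */
static void stats_from_lvb(const unsigned char lvb[SKETCH_LVB_SIZE],
                           struct sketch_rgrp_stats *st)
{
    memcpy(st, lvb, sizeof(*st));
}
```

The point of the design is that the LVB travels with the lock through lock storage, so the statistics reach the next lock holder without a disk read.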
Locking the Superblock
----------------------
The superblock is read and written only when the file system is mounted or
unmounted. It is locked only at these times.
Locking Journals
----------------
The journals usually need not be protected because they are used by only one
node each.
!!!
Locking Buffer Heads
--------------------
!!!
Appendices
----------
Appendix A. G-Lock Call Flags
Filesystem calls to ogfs_lock(), and its various wrappers (e.g. ogfs_glock_i()),
may use the following flags. If the GL_SHARED or GL_DEFERRED flag is not used,
then the request is for an exclusive lock:
GL_SHARED - the lock may be shared between processes / nodes
GL_DEFERRED - special lock mode, different from SHARED or EXCLUSIVE, but not
currently used by OpenGFS (so we don't know what it means!).
GL_PERM - lock will be held for a long time, and will reside in PERM cache
GL_DISOWN - disallow recursive locking, allow other process to unlock
GL_SKIP - skip "go_lock()" and "go_unlock()". In particular, used for
grabbing locks so LVBs are accessible, while skipping any
disk reads or flushes/writes that might otherwise occur.
Currently used only for resource group locks when doing
statfs() whole-filesystem block usage statistics gathering
operations, skipping time-consuming reads/writes of
rgrp header and block usage bitmaps.
Filesystem calls to ogfs_unlock(), and its various wrappers
(e.g. ogfs_gunlock_i()) may use the following flags:
GL_SYNC - all data and metadata protected by this lock shall be
synced to disk before the lock is released
GL_NOCACHE - lock shall be dropped (*not* cached by the lock layer) after
it has been unlocked
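The rule stated above (a request is exclusive unless GL_SHARED or GL_DEFERRED is passed) can be sketched as follows. The flag values and names here are illustrative, not the real definitions from src/fs/glock.h:

```c
#include <assert.h>

/* Illustrative flag values; the real ones are in src/fs/glock.h. */
#define SK_GL_SHARED   0x01
#define SK_GL_DEFERRED 0x02

enum sk_state { SK_EXCLUSIVE, SK_SHARED, SK_DEFERRED };

/* A lock request defaults to exclusive mode; GL_SHARED or GL_DEFERRED
 * selects one of the two non-exclusive modes. */
static enum sk_state requested_state(unsigned flags)
{
    if (flags & SK_GL_SHARED)
        return SK_SHARED;
    if (flags & SK_GL_DEFERRED)
        return SK_DEFERRED;
    return SK_EXCLUSIVE;
}
```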
Appendix B. G-Lock States
A G-Lock can be in any of the following states:
LM_ST_UNLOCKED - Unlocked
LM_ST_EXCLUSIVE - Exclusive lock
LM_ST_SHARED - Shared lock
LM_ST_DEFERRED - Another shared lock mode which does not share with
the LM_ST_SHARED mode.
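The sharing behavior implied by these states can be sketched as a compatibility check: SHARED shares with SHARED, DEFERRED with DEFERRED (but not with SHARED), and EXCLUSIVE with nothing. This table is inferred from the descriptions above, not copied from the lock modules:

```c
#include <assert.h>

/* Illustrative re-statement of the G-Lock states. */
enum lm_state { ST_UNLOCKED, ST_EXCLUSIVE, ST_SHARED, ST_DEFERRED };

/* Can two holders be granted these states at the same time? */
static int states_compatible(enum lm_state a, enum lm_state b)
{
    if (a == ST_UNLOCKED || b == ST_UNLOCKED)
        return 1;
    return a == b && a != ST_EXCLUSIVE;
}
```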
Appendix C. Callback Types
A lock module on a given node can use the "ogfs_glock_cb()" interface of
the node's G-Lock layer, to notify G-Lock about various situations. In
most cases, these messages originate from other nodes. The following types
of messages are specified so far:
LM_CB_EXPIRED - another node has expired (died), recover its journal
LM_CB_NEED_E - another node needs an exclusive lock
LM_CB_NEED_D - another node needs a deferred lock
LM_CB_NEED_S - another node needs a shared lock
LM_CB_DROPLOCKS - drop unused cached glocks (used when lock storage is
getting full)
The "EXPIRED" callback is discussed in section "Expiring Locks and the
Recovery Daemon".
The "NEED" callbacks are discussed in section "Caching G-Locks, Callbacks".
The "DROPLOCKS" callback is discussed in section "G-Lock Cache, G-Lock Daemon".
Appendix D. Example "touch foo" (dv and bc)
The following sequence of calls to ogfs_glock() has been observed after creating
a new OpenGFS filesystem (i.e. running mkfs.ogfs), mounting it at /ogfs, then
running the following command:
$ touch /ogfs/foo
ogfs_glock gets called for the following lock numbers (in the order listed):
20 20 19 17 21 21 2 21 20 21 2
Block 17 is the header of resource group 0 (block bitmap)
Block 19 is the resource group index inode.
Block 20 is the root directory inode.
Block 21 is the block with the new inode for foo.
Lock # 2 is the transaction lock (OGFS_TRANS_LOCK).
Appendix E. Flaws in the current design and implementation of locking
and related to locking (dv)
* The locking code takes care of way too many things:
- Inter-node locks
- Inter-process locks
- Caching locks
- Deadlock detection
- Watching node heartbeat
- Stomithing dead nodes
- Calling the journal recovery code
* All layers involved in the locking system are both active and passive
(calling and being called by each adjacent other layer).
* Deadlock detection places the restriction that single read() or write()
operations (or any other that uses locks) must be completed before the
lock expires. That limits the possible size of atomic transfers drastically
and can cause problems on systems with poor response times.
* Resource group locking has deadlocks with transactions that span multiple
resource groups.
* Due to potential deadlocks, any write() operation to a file must be served
by a single resource group. This further limits the flexibility of the file
system and, I think, violates (which?) specs. For example, on a 100 MB OGFS
with ten resource groups, the largest possible chunk that can be written with
a single write() call is 10 MB, or less than that if none of the resource
groups is empty.
* Inter-node locks are needed too often. A simple "touch /foo" on a pristine
ogfs file system needs no less than eleven locks.
* The default block allocation policy is that each node uses a single resource
group, selected at mount time based on the node's unique journal/cluster ID,
so each node uses a different resource group. This policy crams all
directories and inodes (created by a single cluster node) into a
single resource group, causing a catastrophic increase in lock collisions.
Other policies (random, and round-robin selection of resource group) are
available, but ignore the layout of the data on disk, possibly replacing
delays caused by locking with delays caused by disk seeks.
* There is no concept of shared ownership of inter-process locks. Sharing
such locks, instead of serializing them, would enhance read performance.
Appendix F. Analysis of potential resource group deadlocks (dv)
Fact 1:
Resource groups need to be locked only to allocate or deallocate blocks. It
is not necessary to lock the rg just to modify an inode or data block.
Fact 2:
There potentially are deadlocks if two or more resource groups are locked in
random order.
Fact 3:
When a new directory entry is created, the hash table of the directory might
grow, requiring allocation of additional blocks.
Fact 4:
When any data or meta data is allocated, the resource groups are locked
exclusively, one by one, until one with enough space is found. This can
cause many inter-node locks when the file system becomes full.
Now let us see how many *resource groups* are locked by various operations.
a) Modifying data blocks or inodes
No rg locks required.
b) Allocating an inode (mkdir(), create(), link(), symlink())
Creates a new directory entry in the parent directory. In current code,
if the directory grows (and thus needs new meta data blocks), the whole
directory hash table, plus any other new blocks are moved to/allocated in the
same resource group as the new inode. This localization, once accomplished,
minimizes the number of rgs that must be locked when accessing directory
entries ...
... It may not seem like such a big limitation, but the current code
tries to reserve enough space in that rg for the worst case of directory
growth (hash table is created and immediately explodes to maximum size). In
other words: in order to create a new inode, the target resource group must
have about 1 MB of free data plus meta data blocks.
c) Deallocating an inode
Locks the inode's rg to update the block bitmap. Since ogfs never frees the
space that has become unused in directories, the dir's rg is *not* locked.
d) Allocating file data / write() / writepage()
Only one rg is locked. A single ogfs_write() call never writes to more than
a single resource group. This is an unacceptable limitation of the write()
system call.
e) Truncating a file (spanning multiple rgs)
May need many rg locks. Sorts them before locking them.
f) Removing a file or directory / unlink()
Is done in two steps:
1). The directory entry is removed (no rg locks required, see above).
2). The inodes scheduled for removal are listed in the log, and their
blocks are freed only after the transaction has been completed
(i.e. flushed to disk?).
This second stage needs to truncate each file (to 0 size) and remove its
inode, sorting the corresponding rgs before locking them. (This description
may be a bit inaccurate).
g) Renaming plain files / rename()
Needs one rg lock (see (f)).
h) Renaming directories / rename()
Needs one rg lock (see (f)). In addition, another lock serializes directory
renaming operations.
i) statfs()
Locks all resource groups, one at a time, in order, while accumulating
statistics from each group.
j) mmap() shared writable
Would need many rg locks which could be ordered. Not implemented since it
would lock large parts of the file system for possibly long times.
k) flock()
Does not need any rg locks. It also prevents non-locking file access by
other processes and is thus not POSIX conformant.
Summary
In the current code, rg deadlocks are not possible, at least not
with above operations. But the price one pays is high:
- write() never writes more data than fits into the rg with
the most free space.
- Inodes can be created only in resource groups that have at least
1 MB of free space.
- Once allocated, empty directory blocks are never freed.
- A directory hash table is never shrunk.
- Meta data blocks are never converted back to data blocks.
- When a directory hash table grows it is copied to the same rg
as the new inode en bloc.
- When a new directory leaf is allocated, it is created in the
same rg as the new inode. This has the potential to scatter
the directory leaves all over the file system.
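The deadlock-avoidance rule used by truncate and unlink above (sort the resource group lock numbers into ascending order before acquiring any of them, so every node acquires in the same global order) can be sketched as:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch only: the real code sorts its rgrp descriptors,
 * not bare lock numbers. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);   /* avoids overflow of x - y */
}

/* Sort rg lock numbers ascending; acquiring them in this order on
 * every node rules out the circular waits described in Fact 2. */
static void sort_rg_locks(uint64_t *locknos, size_t n)
{
    qsort(locknos, n, sizeof(*locknos), cmp_u64);
}
```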
Appendix G. Inventory of calls to ogfs_get_glstruct()
glock.c:
ogfs_glock_num() -- Find (alloc if needed) and lock a lock, given number/type
Number: From caller
Type: From caller
Parent: From caller
CREATE: Yes
Calls: ogfs_get_glstruct()
ogfs_lock()
ogfs_gunlock_num() -- Find and unlock a lock, given number/type
Number: From caller
Type: From caller
Parent: From caller
CREATE: No
Calls: ogfs_get_glstruct()
ogfs_unlock()
ogfs_put_glstruct()
inode.c:
ogfs_lookupi() -- look up a filename in a directory, return its inode
Number: inum.no_formal_ino
Type: inode
Parent: No
CREATE: Yes
plock.c:
find_unused_plockgroup() -- find a value to use for the plock group
Number: bid, composed of journal id, plock group id, and lock id
Type: plock
Parent: No
CREATE: Yes
hold_plockgroup() --
Number: bid, composed of journal id, plock group id, and lock id
Type: plock
Parent: No
CREATE: Yes
load_plock_head() --
Number: ip->i_num.no_formal_ino
Type: inode
Parent: No
CREATE: Yes
load_plock_jid_head() --
Number: bid, composed of journal id, plock group id, and lock id
Type: plock
Parent: No
CREATE: Yes
load_other_plock_elems() --
Number: bid, composed of journal id, plock group id, and lock id
Type: plock
Parent: No
CREATE: Yes
add_plock_elem() -- adds a plock to a journal id's chain
Number: bid, composed of journal id, plock group id, and lock id
Type: plock
Parent: No
CREATE: Yes
recovery.c:
ogfs_get_log_header() -- read the log header for a given journal segment
Number: jdesc->ji_addr, block # of journal index
Type: meta
Parent: No
CREATE: No
Calls: ogfs_dread() to read the header block of the journal segment
ogfs_put_glstruct() as soon as read is done
foreach_descriptor() -- go through the active part of the log
Number: jdesc->ji_addr, block # of journal index
Type: meta
Parent: No
CREATE: No
Calls: ogfs_dread() to read a block of the journal
ogfs_put_glstruct() as soon as read is done
do_replay_local() -- replay a metadata block (when in local fs mode)
Number: jdesc->ji_addr, block # of journal index
Type: meta
Parent: No
CREATE: No
Calls: ogfs_get_glstruct() to get ptr to glstruct
ogfs_put_glstruct() as soon as ptr is obtained
Comments from code: The lock should (already) be held, so we don't need
to hold a count on the structure. We do need its pointer, though.
do_replay_multi() -- replay a metadata block (when in multi-host mode)
Number: jdesc->ji_addr, block # of journal index
Type: meta
Parent: No
CREATE: No
Calls: ogfs_get_glstruct() to get ptr to glstruct
ogfs_put_glstruct() as soon as ptr is obtained
Comments from code: The lock should (already) be held, so we don't need
to hold a count on the structure. We do need its pointer, though.
replay_metadata() -- replay a metadata block (when in multi-host mode)
Number: jdesc->ji_addr, block # of journal index
Type: meta
Parent: No
CREATE: No
Calls: ogfs_get_glstruct() to get ptr to glstruct
ogfs_put_glstruct() as soon as ptr is obtained
Comments from code: The lock should (already) be held, so we don't need
to hold a count on the structure. We do need its pointer, though.
clean_journal() -- mark a dirty journal as being clean
Number: jdesc->ji_addr, block # of journal index
Type: meta
Parent: No
CREATE: No
Calls: lots of things
ogfs_put_glstruct() at end of function
collect_nopen() -- called by foreach_descriptor to get nopen counts
Number: jdesc->ji_addr, block # of journal index
Type: meta
Parent: No
CREATE: No
Calls: ogfs_get_glstruct() to get ptr to glstruct
ogfs_put_glstruct() as soon as ptr is obtained
Comments from code: The lock should (already) be held, so we don't need
to hold a count on the structure. We do need its pointer, though.
arch_linux_2_4/super_linux.c:
_ogfs_read_super -- the filesystem mount function
Number: OGFS_TRANS_LOCK, the cluster-wide transaction lock
Type: trans
Parent: No
CREATE: Yes
_ogfs_read_super -- the filesystem mount function
Number: OGFS_RENAME_LOCK, the cluster-wide file rename/move lock
Type: nondisk
Parent: No
CREATE: Yes
Appendix H. Inventory of calls to ogfs_glock(), either direct, or indirect via:
glock.h:
ogfs_glock_i() -- lock glock for this inode (given ptr to inode struct)
ogfs_glock_rg() -- lock glock for this resource group (given ptr to rg struct)
glock.c
ogfs_glock_num() -- lock glock for this lock # (often, lock # = block #)
ogfs_glock_m() -- lock multiple glocks (given list of glock ptrs)
All except ogfs_glock_num() require that glock structure has already been
allocated via ogfs_get_glstruct().
Calls to ogfs_glock() directly (excluding those from ogfs_glock_*()):
--------------------------------------------------------------------
inode.c:
inode_dealloc()
ogfs_lookupi()
grabs a lock on the found inode, in shared (GL_SHARED) mode.
There are two cases here:
1. Found inode's lock name (block #) is < directory's lock name (block #):
Unlock directory's inode (give opportunity for someone else to
change the directory's knowledge of the inode's block location?)
Lock (1st) found inode
Lock directory's inode in shared (GL_SHARED) mode.
Do another ogfs_dir_search() for the inode (same name)
Compare 2nd found inode number with 1st found inode number
If same, we've found what we're looking for
If different, restart the search
2. Found inode's lock name (block #) is > directory's lock name (block #):
Lock (1st) found inode ... this is the one we're looking for
Case 1 seems to be a situation in which the inode is moving from block
to block, and the code is looking for the directory to be stable as to
the inode's final location (?).
Once the inode is found, the function reads the inode structure into core,
using ogfs_get_istruct().
ogfs_lookupi() has an option to release the locks on the directory and
the found inode at the end of the function.
Comments from Dominik: I think the dir lock is released and reacquired
after the file lock to keep acquiring locks in ascending order (deadlock
prevention). However, I'm not sure that nothing bad can happen between
releasing and reacquiring the lock.
See also ogfs_lookupi() discussed under "Calls to ogfs_glock_i()"
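The ordered-acquisition-with-retry logic of ogfs_lookupi() described above can be sketched as follows. Everything here is hypothetical: the "directory entry" is just a cell that a concurrent renamer could change, and all actual lock/unlock calls are elided:

```c
#include <assert.h>
#include <stdint.h>

/* Stands in for ogfs_dir_search(): return the inode (block) number the
 * directory currently records for a name. */
static uint64_t demo_search(const uint64_t *dir_entry)
{
    return *dir_entry;
}

/* Sketch of the two cases in ogfs_lookupi().  Locks must be taken in
 * ascending lock-number order, so if the found inode's number is below
 * the directory's, the directory lock is dropped, the inode locked
 * first, the directory relocked, and the search re-run to verify. */
static uint64_t lookup_ordered(uint64_t *dir_entry, uint64_t dir_lockno,
                               int *retries)
{
    for (;;) {
        uint64_t ino = demo_search(dir_entry);
        if (ino > dir_lockno)
            return ino;       /* case 2: already in ascending order */
        /* case 1: unlock dir, lock inode, relock dir shared
         * (locking elided in this sketch), then search again */
        uint64_t again = demo_search(dir_entry);
        if (again == ino)
            return ino;       /* directory unchanged: result stands */
        (*retries)++;         /* entry moved underneath us: restart */
    }
}
```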
plock.c:
find_unused_plockgroup()
load_plock_head()
load_plock_jid_head()
load_other_plock_elems()
load_my_plock_elems()
add_plock_elem()
recovery.c:
ogfs_recover_journal() ... replay a journal to recover consistent state
grabs the transaction lock (sdp->sd_trans_gl) in exclusive (0) mode,
with flags:
LM_FLAG_NOEXP -- Always used. Grab this lock even if it is "expired",
i.e. being recovered from a dead node. See "Expiring
Locks and the Recovery Daemon".
Grabbing this lock in exclusive mode prevents other nodes and processes
from creating new transactions while the journal recovery is proceeding.
This is the only function in which the transaction lock is grabbed in
*exclusive* mode. The lock is unlocked by this function as soon as the
journal replay is complete.
This is a special (non-disk) lock, ID #2:
From src/fs/ogfs_ondisk.h: #define OGFS_TRANS_LOCK (2).
src/fs/arch_linux_2_4, _ogfs_read_super() assigns:
sdp->sd_trans_gl = ogfs_get_glstruct(sdp, OGFS_TRANS_LOCK, ...);
trans.c:
ogfs_trans_begin() ... begin a new transaction
grabs the transaction lock (sdp->sd_trans_gl) in shared (read) mode
(GL_SHARED).
Grabbing this lock in shared mode allows other nodes and processes to
create transactions simultaneously, unless and until a journal recovery
occurs. See comments above.
This is the only function in which the transaction lock is grabbed in
*shared* mode. The lock is normally unlocked by ogfs_trans_end(),
but will be unlocked by ogfs_trans_begin() itself if a failure occurs.
arch_linux_2_4/inode_linux.c:
ogfs_rename() ... rename a file
grabs the rename lock (sdp->sd_rename_gl) in exclusive (0) mode.
Grabbing this lock in exclusive mode prevents other nodes and processes
from doing any renaming while this renaming is proceeding.
This is the only function in which the rename lock is grabbed at all.
The lock is released by this function, once the renaming is complete.
From src/fs/ogfs_ondisk.h: #define OGFS_RENAME_LOCK (3).
src/fs/arch_linux_2_4, _ogfs_read_super() assigns:
sdp->sd_rename_gl = ogfs_get_glstruct(sdp, OGFS_RENAME_LOCK, ...);
This function creates a complete transaction!
arch_linux_2_4/super_linux.c:
_ogfs_read_super() ... set up in-core superblock, mount the fs
grabs this node's journal lock (sdp->sd_my_jnl_gl) in exclusive (0) mode,
called with disown flag (GL_DISOWN).
The lock is then immediately unlocked!
The journal lock is a lock on the first block of this node's journal,
created and grabbed earlier, in exclusive mode, within the same function,
using ogfs_glock_num().
Comments in src/fs/glock.h say that GL_DISOWN tells the lock layer
to disallow recursive locking, and allow a different process to
unlock the lock. So, it seems, this negates the exclusivity of the lock
grabbed earlier in the function, while still holding a lock!?!
The earlier exclusive lock is unlocked by ogfs_put_super(), when
unmounting the filesystem.
Calls to ogfs_glock_i()
--------------------------------------------------------------------
blklist.c:
ogfs_rindex_hold() -- locks resource index, makes sure we have latest info
Resource : resource group index inode (sdp->sd_riinode) block #
Mode: shared (GL_SHARED)
Type: inode
Flags: --
gunlock: only in case of error doing ogfs_rgrp_update()
put_glstruct: No
unlock: elsewhere, by ogfs_rindex_release(), via ogfs_gunlock_i()
(no put_glstruct)
Calls: If resource group info is out-of-date (i.e. filesystem has
been expanded or shrunk), calls ogfs_rgrp_update() to read
new info from disk.
Called fm: ogfs_inplace_reserve(), blklist.c *
do_strip(), bmap.c
leaf_free(), dir.c
dinode_dealloc(), inode.c
ogfs_stat_rgrp_ioctl(), ioctl.c
ogfs_reclaim_one_ioctl(), ioctl.c
ogfs_reclaim_all_ioctl(), ioctl.c
ogfs_setup_dameth(), super.c
ogfs_stat_ogfs(), super.c
* calls ogfs_rindex_release() *only* in case of error. Normally relies
on ogfs_inplace_release() to do the unlock. All other functions unlock
before exiting. None does a put_glstruct().
Comments from code: we keep this lock for a long time
compared with other locks, since it is shared and very, very rarely
accessed in exclusive mode.
Comments (bc): What do they mean by "long time", or by "we"?
It looks to me that the lock is not very long-lived.
This function is the one that detects that a filesystem has grown or
shrunk! Filesystem size change requires the addition or removal of
resource groups, which in turn requires a change to the resource index.
This function compares a version number held in the filesystem superblock
structure (sdp->sd_riinode_vn) with a version number associated with
the glock (gl->gl_vn) to detect a change in the resource index.
This is the same lock grabbed by ogfs_get_riinode().
bmap.c:
ogfs_truncate() ... change file size
grabs a lock on the file's inode in exclusive (0) mode.
Comments: file size can grow, shrink, or stay the same.
glock.c:
ogfs_glock_m()
wrapper ... for each lock in the list, calls ogfs_glock_i() if the lock is
for an inode. Mode is determined by flags contained in the list.
inode.c:
ogfs_lookupi() ... look up a filename in a directory
grabs a lock on the directory inode, in shared mode (GL_SHARED).
In a special case (that the found inode is located in a lower block # than
the searched directory's inode), this function gives up the directory
lock, then re-acquires it to try the search again. Does the location
relationship indicate that something else is messing with the directory??
See discussion of ogfs_lookupi() under "Calls to ogfs_glock directly".
ogfs_create_i() ... find requested inode, or create a new one
grabs a lock on the directory inode, in exclusive (0) mode.
If requested inode matches a name search of the directory, this function
releases this exclusive lock before calling ogfs_lookupi(), which places
a shared lock on the same directory.
Comment from Dominik: It would be good to be able to downgrade the lock
from exclusive to shared, without first needing to unlock it entirely
(i.e. keep a lock locked while transitioning from exclusive to shared).
This feature is not currently provided in OGFS locking.
ogfs_update_atime() ... update inode's atime, if needed
grabs a lock on the inode, in exclusive (0) mode, if it needs to write
an update to the inode.
Called only from src/fs/arch_linux_2_4/file.c | inode_linux.c, mostly
via OGFS_UPDATE_ATIME macro (conditional on whether fs was mounted
noatime), but directly from ogfs_file_mmap() (also conditional, just
doesn't use the macro).
In all cases, the caller holds a shared lock on the inode, so
ogfs_update_atime() must release that shared lock just before grabbing the
exclusive lock, if it needs to write an update to the inode.
ogfs_update_atime() returns with a lock held (either the original shared
lock or the replacement exclusive lock). So, the calling function is
responsible for releasing the lock. It doesn't matter if the lock is held
as shared or exclusive at the time of release.
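The unlock-then-relock upgrade described above (OGFS cannot promote a shared lock to exclusive in place) can be sketched as follows. The lock calls are stubs and the state names are illustrative:

```c
#include <assert.h>

/* Illustrative lock modes; real ones are the LM_ST_* states. */
enum mode { M_UNLOCKED, M_SHARED, M_EXCLUSIVE };

struct sketch_lock { enum mode mode; };

/* Stub: stands in for the real unlock/lock calls. */
static void relock(struct sketch_lock *lk, enum mode m) { lk->mode = m; }

/* Sketch of ogfs_update_atime(): the caller holds a shared lock; if an
 * atime write is needed, drop it and regrab exclusive.  Either way,
 * return with *some* lock held for the caller to release. */
static void update_atime(struct sketch_lock *lk, int needs_write)
{
    if (needs_write) {
        relock(lk, M_UNLOCKED);    /* release caller's shared lock */
        relock(lk, M_EXCLUSIVE);   /* regrab exclusive for the write */
        /* ... write the dinode under the exclusive lock ... */
    }
    /* else: leave the caller's shared lock untouched */
}
```

Note the window between the unlock and the exclusive relock: another node may intervene, which is why such upgrade patterns must re-validate any state they cached under the shared lock.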
ioctl.c:
ogfs_print_frag() ... print info about block locations for an inode
grabs a lock on the inode, in shared (GL_SHARED) mode, to read it.
ogfs_jread_ioctl() ... read from a journaled file, via ioctl
grabs a lock on the file's inode, in shared (GL_SHARED) mode, to read
the inode and file, using ogfs_readi().
ogfs_jwrite_ioctl() ... write to a journaled file, via ioctl
grabs a lock on the file's inode, in exclusive (0) mode, to write it.
Lots of interesting stuff going on in this function, look again later!
super.c:
ogfs_jindex_hold() ... grab a lock on the journal index (ondisk)
grabs a lock on the ondisk journal index inode (sdp->sd_jiinode), in
shared (GL_SHARED) mode. Also uses LM_FLAG_TRY if caller does *not* want
to wait for the lock if it is currently unavailable.
Function compares version numbers of incore superblock's sdp->sd_jiinode_vn
and inode's glock version # ip->i_gl->gl_vn. If they are out of sync, then
incore journal index is out-of-date relative to ondisk jindex(?). To read
new journal index into core, function calls ogfs_ji_update().
arch_linux_2_4/dcache.c:
ogfs_drevalidate() ... validate lookup path from parent directory to inode
grabs a lock on parent directory, in shared (GL_SHARED) mode.
Function uses ogfs_dir_search(), and kernel's BKL (lock_kernel()).
arch_linux_2_4/file.c:
ogfs_read() ... read bytes from a file
grabs a lock on file's inode, in shared (GL_SHARED) mode.
Function uses kernel's generic_file_read(), and BKL (lock_kernel()).
ogfs_write() ... write bytes to a file
grabs a lock on file's inode, in exclusive (0) mode. Releases when done.
Calls ogfs_inplace_reserv(), ogfs_inplace_release(), ogfs_trans_begin(),
ogfs_trans_end(), ogfs_get_inode_buffer(), ogfs_trans_add_bh(),
ogfs_dinode_out()
Creates a complete transaction!
Function uses kernel's generic_file_write_nolock(), brelse(),
and BKL (lock_kernel()).
ogfs_readdir() ... read directory entries from a directory
grabs a lock on directory inode, in shared (GL_SHARED) mode. Releases
lock when done reading.
Function uses ogfs_dir_read().
ogfs_sync_file() ... sync file's dirty data to disk (across the cluster)
grabs a lock on file's inode, in exclusive (0) mode. Does *not*
release the lock, unless there is an error.
Interesting use of exclusive lock! Function does no more than just grab
the lock. This forces any other cluster member (that might own the file
lock) to flush data to disk.
Function uses kernel's BKL (lock_kernel()).
See also ogfs_irevalidate(), which uses a shared lock in a similar
way for opposite reasons!
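The trick can be modeled in a few lines: before a node surrenders an exclusive glock, its release callback flushes everything the lock covers, so merely acquiring the lock acts as a cluster-wide sync. All names below (disk, cache_a, node_a_release(), node_b_sync_file()) are hypothetical:

```c
/* Toy two-node model of the "grab the lock to force a flush" trick. */

#include <string.h>

static char disk[16]    = "old";   /* shared storage */
static char cache_a[16] = "new";   /* node A's dirty in-core copy */
static int  a_holds_excl = 1;      /* node A currently owns the glock */
static int  a_dirty = 1;

/* Node A's release path (go_sync/go_release): flush dirty data
 * covered by the lock, then surrender the lock. */
static void node_a_release(void)
{
    if (a_dirty) {
        memcpy(disk, cache_a, sizeof(disk));
        a_dirty = 0;
    }
    a_holds_excl = 0;
}

/* Node B "syncs" the file just by acquiring the exclusive glock.
 * Returns 1 if disk now matches node A's latest data. */
int node_b_sync_file(void)
{
    if (a_holds_excl)
        node_a_release();  /* lock module posts callback to the owner */
    return strcmp(disk, cache_a) == 0;
}
```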
ogfs_shared_nopage() ... support shared writeable mappings (mmap)
grabs a lock on vm area's inode (area->vm_file->f_dentry->d_inode),
in exclusive (0) mode. Releases lock when done.
Function uses kernel's filemap_nopage(), and BKL (lock_kernel()).
ogfs_private_nopage() ... do safe locking on private mappings (mmap)
grabs a lock on vm area's inode (area->vm_file->f_dentry->d_inode),
in shared (GL_SHARED) mode.
Function uses kernel's filemap_nopage(), and BKL (lock_kernel()).
ogfs_file_mmap() ... memory map a file, no sharing
grabs a lock on file's inode, (file->f_dentry->d_inode),
in shared (GL_SHARED) mode.
Shared lock is grabbed after mapping, before calling ogfs_update_atime()
(see ogfs_update_atime(), above).
Once the call returns to ogfs_file_mmap(), we release *a* lock ...
it may be the shared one we originally grabbed, or the exclusive
one that ogfs_update_atime() grabbed if it needed to write.
arch_linux_2_4/inode_linux.c:
ogfs_set_attr() ... change attributes of an inode
grabs a lock on the inode, in exclusive (0) mode, to write the inode.
Creates an entire transaction (ogfs_trans_begin() to ogfs_trans_end())
in a certain case.
ogfs_irevalidate() ... check that inode hasn't changed (ondisk?)
grabs a lock on the inode, in shared (GL_SHARED) mode, to read the inode.
Interesting use of shared lock! Function does no more than just grab
the lock. This forces the incore image of the inode to sync up with
the disk, if it's not already in sync.
Function uses kernel's BKL (lock_kernel()).
See also ogfs_sync_file(), which uses an exclusive lock in a similar
way for opposite reasons!
ogfs_readlink() ... read the value of a symlink (and copy_to_user).
grabs a lock on the link's inode, in shared (GL_SHARED) mode.
Function calls OGFS_UPDATE_ATIME (see ogfs_update_atime(), above),
and ogfs_get_inode_buffer().
Function uses kernel's vfs_follow_link(), and BKL (lock_kernel()).
ogfs_follow_link() ... follow a symbolic link (symlink)
grabs a lock on the link's inode, in shared (GL_SHARED) mode.
Function calls OGFS_UPDATE_ATIME (see ogfs_update_atime(), above),
and ogfs_get_inode_buffer().
Function uses kernel's BKL (lock_kernel()).
ogfs_readpage() ... read a page of data for a file
grabs a lock on the file's (?) inode (page->mapping->host), in shared
(GL_SHARED) mode.
Function calls stuffed_readpage() (src/fs/arch_linux_2_4/inode_linux.c).
Function uses kernel's UnlockPage(), block_read_full_page(),
and BKL (lock_kernel()).
ogfs_writepage() ... write a complete page for a file
grabs a lock on the file's (?) inode (page->mapping->host), in exclusive
(0) mode.
Calls ogfs_inplace_reserv(), ogfs_inplace_release(), ogfs_trans_begin(),
ogfs_trans_end(), and either stuffed_writepage(), or
block_write_full_page(), depending on whether file is "stuffed" in inode
block.
Creates a complete transaction!
Function uses kernel's BKL (lock_kernel()).
ogfs_bmap() ... block map
grabs a lock on the file (?) (mapping->host), in shared (GL_SHARED) mode.
Function uses kernel's generic_block_bmap(), and BKL (lock_kernel()).
Calls to ogfs_glock_rg()
--------------------------------------------------------------------
blklist.c:
ogfs_rgrp_lvb_init() ... init the data of a resource group lock value block
grabs 1 or 2 locks on the resource group:
If !force, grab a lock in shared (GL_SHARED) mode, with GL_SKIP flag.
GL_SKIP, used for both the lock and unlock phase, keeps the glops
lock_rgrp() from reading or writing resource group header/bitmap data
to or from disk ... all we need to get is the LVB data.
For all cases (except error), grab an exclusive (0) lock. Releases
lock when done.
Calls ogfs_rgrp_lvb_fill(), ogfs_rgrp_save_out(), ogfs_sync_lvb().
__get_best_rg_fit() ... find and lock the rg that best fits a reservation
grabs locks on all(?) rgs, one at a time in ascending order, in
exclusive (0) mode. The loop accumulates locks, without releasing
them unless/until a "best fit" rg is found, that is, an rg that can
accommodate the complete reservation.
Releases locks on all but the selected "best fit" rg.
Called only by ogfs_inplace_reserve(), as the "plan C" last resort
method of reserving space for something.
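The accumulate-then-release pattern might look like the following sketch (hypothetical names; the real function works on rgrp descriptors and glocks, not plain ints):

```c
/* Sketch of the __get_best_rg_fit() pattern: lock rgrps one at a time,
 * keeping each lock until a "best fit" is found, then release every
 * lock except the winner's. */

#define NUM_RG 4

int locked[NUM_RG];                          /* which rgrp locks we hold */
static const int free_blocks[NUM_RG] = { 2, 5, 9, 9 };

static void rg_lock(int i)   { locked[i] = 1; }   /* exclusive glock */
static void rg_unlock(int i) { locked[i] = 0; }

/* Return the index of the first rgrp that can hold `needed` blocks,
 * leaving only that rgrp locked; -1 (everything unlocked) if none fits. */
int best_rg_fit(int needed)
{
    int i, best = -1;

    for (i = 0; i < NUM_RG && best < 0; i++) {
        rg_lock(i);            /* keep the lock: the free count could
                                * change under us if we dropped it */
        if (free_blocks[i] >= needed)
            best = i;
    }
    for (i = 0; i < NUM_RG; i++)   /* release all but the winner */
        if (i != best)
            rg_unlock(i);
    return best;
}
```

Holding the already-scanned locks while moving ahead keeps another node from changing a rejected rgrp's free count, at the cost of holding many exclusive locks at once.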
ogfs_inplace_reserve() ... reserve space in the filesystem
grabs locks on a series of rgs, in exclusive (0) mode.
Calls ogfs_rgrpd_get_with_hint(), try_rgrp_fit(), __get_pure_metares(),
__get_best_rg_fit().
bmap.c:
do_strip() ... strip off a particular layer(?) of the file
grabs locks on a list of rgs, in exclusive (0) mode.
Calls ogfs_rindex_hold(), ogfs_rlist_add(), ogfs_rlist_sort(),
ogfs_trans_begin(), ogfs_trans_end(), ogfs_get_inode_buffer(),
ogfs_trans_add_bh(), ogfs_blkfree(), ogfs_metafree(), ogfs_dinode_out().
Calls kernel's brelse().
Creates a full transaction!
Comments from Dominik: I think "strips off a particular layer" refers
to layers of indirect data blocks. That is, when a file shrinks, the
number of indirections may be reduced, too. See "Filesystem On-Disk
Layout" for info on indirect data blocks.
dir.c:
leaf_free() ... deallocate a directory leaf
grabs locks on a list of rgs, in exclusive (0) mode.
Calls ogfs_rindex_hold(), ogfs_rlist_add(), ogfs_rlist_sort(),
ogfs_trans_begin(), ogfs_trans_end(), ogfs_get_leaf(), ogfs_leaf_in(),
ogfs_trans_add_bh(), ogfs_metafree(), ogfs_internal_write().
Calls kernel's brelse().
Creates a full transaction!
inode.c:
dinode_dealloc() ... deallocate a dinode
grabs a lock on the inode's resource group, in exclusive (0) mode.
Calls ogfs_rindex_hold(),
ogfs_trans_begin(), ogfs_trans_end(), ogfs_difree(),
ogfs_trans_nopen_change(), ogfs_trans_add_gl().
Creates a full transaction!
super.c:
ogfs_stat_ogfs() ... do a statfs, adding up statistics from all rgrps.
grabs a lock on each resource group in the filesystem, one by one,
in shared (GL_SHARED) mode, and with GL_SKIP flag. GL_SKIP skips any
reads or writes of resource group data on disk ... all we need to use
is the lock's LVB data.
Releases each lock after adding (accumulating) stats for its rgrp.
Calls ogfs_rindex_hold() and ogfs_rgrpd_get_with_hint().
Calls ogfs_lvb_init() if it thinks that a node crashed while writing the
LVB, to read rgrp statistics from disk and re-initialize the corrupt LVB.
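A stripped-down model of this accumulation loop, with hypothetical names and LVB contents (the GL_SKIP locking appears only as comments):

```c
/* Sketch of the ogfs_stat_ogfs() walk: take each rgrp's lock shared
 * with GL_SKIP (LVB data only, no bitmap I/O), add its counters to the
 * running totals, drop the lock, move on. */

struct rg_lvb { int total; int free; };   /* stats cached in each LVB */

static const struct rg_lvb lvbs[3] = { {100, 40}, {100, 10}, {50, 50} };

int fs_total_blocks(void)
{
    int i, sum = 0;
    for (i = 0; i < 3; i++) {
        /* ogfs_glock_rg(rg, GL_SHARED | GL_SKIP): LVB data only */
        sum += lvbs[i].total;
        /* ogfs_gunlock_rg(rg): drop before the next rgrp */
    }
    return sum;
}

int fs_free_blocks(void)
{
    int i, sum = 0;
    for (i = 0; i < 3; i++)
        sum += lvbs[i].free;    /* same lock/unlock bracket as above */
    return sum;
}
```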
Calls to ogfs_glock_num()
ogfs_glock_num() embeds an ogfs_get_glstruct() within the call.
ogfs_gunlock_num() does the unlock, and embeds ogfs_put_glstruct() within call.
--------------------------------------------------------------------
flock.c:
ogfs_flock() -- acquire an flock on a file
Resource : ip->i_num.no_formal_ino
Mode: shared (GL_SHARED), if flock is shared (otherwise exclusive)
Type: flock
Parent: No
Flags: GL_PERM -- for all flocks
GL_DISOWN -- for all flocks
LM_FLAG_TRY -- if !wait
gunlock_num: only on error
unlock: in ogfs_funlock(), via ogfs_gunlock_num()
See several interesting comments in this function.
inode.c:
read_dinode() -- read an inode from disk into the incore OGFS inode cache
Resource : ip->i_num.no_formal_ino
Mode: shared (GL_SHARED)
Type: nopen
Parent: No
Flags: GL_PERM --
GL_DISOWN --
gunlock_num: see below
put_glstruct: Yes
unlock: in ??()
Calls: ogfs_copyin_dinode().
In error situation, this function calls ogfs_gunlock() with GL_NOCACHE
flag, since the lock will not be used in the future. It then calls
ogfs_put_glstruct() to decrement the usage count on the glock structure
(that's *all* that ogfs_put_glstruct() does).
In normal situation, this function just calls ogfs_put_glstruct(), to
decrement the usage count. It does not unlock the lock, since presumably
something else wants to do something with the block, after it's been
read in.
Where/when would this lock be unlocked??
inode_dealloc() -- deallocate an inode
Resource : inum.no_formal_ino
Mode: exclusive
Type: inode
Parent: No
Flags: --
gunlock: Yes, before leaving function
put_glstruct: Yes, before leaving function
Calls:
ogfs_glock() to get nopen lock
ogfs_get_istruct() to read inode (ip) from disk
ogfs_gunlock() to unlock nopen lock
ogfs_dir_exhash_free() ???
ogfs_shrink() to truncate the file to 0 size (deallocate data blocks)
dinode_dealloc() to deallocate the inode block
ogfs_put_istruct() to decrement usage count on ip istruct
ogfs_destroy_istruct() to deallocate istruct from memory
ogfs_dealloc_inodes() ... go through the list of inodes to be deallocated
Resource : inum.no_addr
Mode: exclusive
Type: nopen
Parent: No
Flags: LM_FLAG_TRY, see below
gunlock: Yes, before leaving function
put_glstruct: Yes, before leaving function
LM_FLAG_TRY -- if lock not immediately available, the function makes
note of this as a "stuck" inode. This keeps us from spinning if the
list can't be totally purged. (Why would an inode have a lock on it if
it is de-allocatable?).
Calls:
ogfs_pitch_inodes() to throw away any inodes flagged to be discarded
ogfs_nopen_find() to search sdp->sd_nopen_ic_list for a deallocatable
inode.
ogfs_gunlock() to unlock the inode (so following call can lock it
in exclusive mode).
inode_dealloc() to remove the inode and associated data
ogfs_put_glstruct() to deallocate the lock structure
Unlocks the lock and calls ogfs_put_glstruct() when done with each
inode, as noted in Calls above.
ogfs_createi() ... create a new inode
grabs a lock on the inode, in two ways, one right after the other:
Resource : inum.no_formal_ino
Mode: exclusive
Type: inode
Parent: No
Flags: --
gunlock: No (where unlocked?), except after error
put_glstruct: Yes(!), before leaving function
Resource : inum.no_addr
Mode: shared (GL_SHARED)
Type: nopen
Parent: No
Flags: --
gunlock: Yes, before leaving function
put_glstruct: Yes, before leaving function
Note: Current implementation sets inum.no_formal_ino = inum.no_addr
(see fs/inode.c pick_formal_ino()). These two locks are differentiated
only by their glops/type, since the lock number is the same!
This function creates a complete transaction!
ioctl.c:
ogfs_get_super() ... dump disk superblock into user-space buffer
Resource : OGFS_SB_LOCK (lock "0"), the cluster-wide superblock lock
Mode: shared (GL_SHARED)
Type: meta
Parent: No
Flags: --
gunlock: Yes, before leaving function
put_glstruct: Yes, before leaving function
Calls: ogfs_dread() to read the block from disk
copy_to_user() (kernel) to copy into user-space buffer.
Normally, you would think that the superblock should be static, so why
lock it for a read? To protect it against filesystem upgrades, as rare
as they may be! The only other place SB_LOCK is grabbed is in
_ogfs_read_super(), when mounting. See below.
recovery.c:
ogfs_recover_journal() ... do a replay on a given journal
Resource : sdp->sd_jindex[jid]->ji_addr, requested journal's first block
Mode: exclusive
Type: meta
Parent: No
Flags: LM_FLAG_NOEXP, always
LM_FLAG_TRY, see below
gunlock: Yes, before leaving function
put_glstruct: Yes, before leaving function
LM_FLAG_NOEXP -- Always used. Grab this lock even if it is "expired",
i.e. being recovered from a dead node. See "Expiring
Locks and the Recovery Daemon".
LM_FLAG_TRY -- Conditionally used. If this is *not* the first node to
mount into the cluster, don't block when waiting for the lock.
Instead, if the lock is not immediately available, print
"OGFS: Busy" to the console, *don't* replay the journal, and exit
with a successful return code.
This looks at a boolean member of the lock module structure
(sdp->sd_lockstruct.ls_first). When a computer mounts a lock module,
the module sets this value to TRUE to indicate that the computer is
the first one in the cluster to mount the module. The memexp protocol
is accurate in this (it can check with the memexp lock server), but the
nolock protocol unconditionally sets this value to TRUE (it has no
server to check). The stats protocol passes along the value set
by the protocol which *it* mounted (stats is a stacking protocol).
When mounting the filesystem, _ogfs_read_super() will replay *all*
of the filesystem's journals if ls_first is TRUE, calling
ogfs_recover_journal() once for each journal. In this case,
we must block when waiting for each journal lock (we *must* replay
each journal before proceeding).
Once all journals have been replayed, _ogfs_read_super() calls
ogfs_others_may_mount() (allowing other nodes that are blocked
within the protocol mount() call to proceed), and sets ls_first
to FALSE.
If ls_first is FALSE, _ogfs_read_super() will replay only its own
journal. In this case, we grab the lock with LM_FLAG_TRY.
If we fail to get the lock, it just means some other computer is
currently replaying the journal; there's no need for us to replay it,
so we return with "success"!
Note: The lock could also be held if a computer is doing a filesystem
upgrade, but my guess is that the sequence of events would make it
impossible for an upgrade to happen at the same time that we're
mounting the filesystem???
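The resulting decision logic reduces to a small function. This sketch uses hypothetical names and encodes "lock busy" as a flag rather than a real LM_FLAG_TRY failure:

```c
/* Sketch of the replay decision in ogfs_recover_journal(). */

enum { REPLAYED, SKIPPED_BUSY };

/* first_mounter: sdp->sd_lockstruct.ls_first
 * lock_busy:     another node currently holds this journal's lock */
int recover_journal(int first_mounter, int lock_busy)
{
    if (first_mounter) {
        /* no LM_FLAG_TRY: block until the lock is granted, then
         * replay (mount cannot proceed until every journal is done) */
        return REPLAYED;
    }
    if (lock_busy) {
        /* LM_FLAG_TRY failed: some other node is replaying this
         * journal right now, so there is nothing left for us to do */
        return SKIPPED_BUSY;    /* reported as success */
    }
    return REPLAYED;            /* got the lock on the first try */
}
```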
super.c:
ogfs_do_upgrade() -- upgrade a filesystem to a newer version
Resource : sdp->sd_jindex[jid]->ji_addr, each journal's first block
Mode: exclusive
Type: meta
Parent: No
Flags: LM_FLAG_TRY, see below
gunlock: Yes, after ogfs_find_jhead() for each journal
put_glstruct: Yes, after ogfs_find_jhead() for each journal
This function checks, before upgrading a filesystem, to make sure that
each and every journal in the filesystem is unmounted. So, for each
journal, it grabs a lock, calls ogfs_find_jhead(), and checks for the
OGFS_LOG_HEAD_UNMOUNT flag. This flag is present in a "shutdown"
journal header, and indicates that the journal has been unmounted.
(Does it mean that the journal is empty?).
If it does not find the UNMOUNT flag in the current journal head, or
if it can't immediately acquire the journal lock, the function stops
and reports an error -EBUSY.
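A sketch of that per-journal check, with hypothetical names and -1 standing in for -EBUSY:

```c
/* Sketch of the ogfs_do_upgrade() safety check: every journal must be
 * try-lockable and carry the UNMOUNT flag in its head. */

#define LOG_HEAD_UNMOUNT 0x1   /* flag in a "shutdown" journal head */

/* Returns 0 if the upgrade may proceed, or -1 (standing in for
 * -EBUSY) at the first busy or still-mounted journal. */
int upgrade_ok(const int *lock_busy, const int *head_flags, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (lock_busy[i])
            return -1;     /* LM_FLAG_TRY failed: journal in use */
        /* lock held: ogfs_find_jhead() would read the log head here */
        if (!(head_flags[i] & LOG_HEAD_UNMOUNT))
            return -1;     /* journal not cleanly unmounted */
        /* gunlock this journal, go check the next one */
    }
    return 0;
}
```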
ogfs_get_riinode() -- reads resource index inode from disk, inits incore image
Resource : resource index inode's block #
(sdp->sd_sb.sb_rindex_di.no_formal_ino)
Mode: shared (GL_SHARED)
Type: inode
Parent: No
Flags: GLF_STICKY -- applied by set_bit()
gunlock: Yes, before leaving function
put_glstruct: Yes, before leaving function
Calls: ogfs_get_istruct() to read inode from disk.
Called fm: filesystem mount function, _ogfs_read_super().
Sets sdp->sd_riinode_vn = gl->gl_vn - 1. Is this to force
ogfs_rindex_hold() to read new resource index from disk?
This is the same lock grabbed by ogfs_rindex_hold().
ogfs_get_jiinode() -- reads journal index inode from disk, inits incore image
Resource : jindex inode (sdp->sd_sb.sb_jindex_di.no_formal_ino)
Mode: shared (GL_SHARED)
Type: inode
Parent: No
Flags: GLF_STICKY -- applied by set_bit()
gunlock: Yes, before leaving function
put_glstruct: Yes, before leaving function
Calls: ogfs_get_istruct() to read inode from disk.
Called fm: filesystem mount function, _ogfs_read_super().
Sets sdp->sd_jiinode_vn = gl->gl_vn - 1. Is this to force
ogfs_jindex_hold() to read new journal index from disk?
arch_linux_2_4/file.c:
ogfs_open_by_number() ... open a file by inode number
grabs a lock on the inode (inum.no_formal_ino), in shared (GL_SHARED) mode.
Resource : inum.no_formal_ino
Mode: shared (GL_SHARED)
Type: inode
Parent: No
Flags: --
gunlock: Yes, after ogfs_get_istruct()
put_glstruct: Yes, after ogfs_get_istruct()
Calls: ogfs_get_istruct() to read inode from disk.
arch_linux_2_4/super_linux.c:
_ogfs_read_super() ... mount the filesystem
1) grabs a lock on OGFS_MOUNT_LOCK (non-disk lock # 0), in exclusive (0)
mode, using ogfs_nondisk_glops, with flags:
GL_PERM --
LM_FLAG_NOEXP -- Always used. Grab this lock even if it is "expired",
i.e. being recovered from a dead node. See "Expiring
Locks and the Recovery Daemon".
2) grabs a lock on OGFS_LIVE_LOCK (non-disk lock # 1), in shared (GL_SHARED)
mode, using ogfs_nondisk_glops, with flags:
GL_PERM --
GL_DISOWN --
LM_FLAG_NOEXP -- Always used. Grab this lock even if it is "expired",
i.e. being recovered from a dead node. See "Expiring
Locks and the Recovery Daemon".
3) grabs a lock on OGFS_SB_LOCK (meta-data lock # 0), in shared (GL_SHARED)
mode, using ogfs_meta_glops.
Uses exclusive mode if mount argument calls for filesystem upgrade.
4) grabs a lock on the machine's journal (sdp->sd_my_jdesc.ji_addr),
in exclusive (0) mode, using ogfs_meta_glops, with flags:
GL_PERM --
LM_FLAG_NOEXP -- Always used. Grab this lock even if it is "expired",
i.e. being recovered from a dead node. See "Expiring
Locks and the Recovery Daemon".
5) grabs a lock on the root inode (sdp->sd_sb.sb_root_di.no_formal_ino),
in shared (GL_SHARED) mode, using ogfs_inode_glops.
ogfs_iget_for_nfs() ... get an inode based on its number
grabs a lock on the inode (inum->no_formal_ino), in shared (GL_SHARED)
mode, using ogfs_inode_glops.
Calls to ogfs_glock_m()
--------------------------------------------------------------------
arch_linux_2_4/inode_linux.c:
ogfs_link() ... link to a file
grabs 2 locks, both in exclusive (0) mode, using ogfs_inode_ops, on:
1) inode of directory containing new link
2) inode being linked
The function has a local variable array of 2 ogfs_lockop_t structures
that it zeroes, then fills the lo_ip fields with inode pointers for the
two lock targets. Exclusive mode is set by the zeroes, and the
ogfs_inode_ops are selected by the fact that the lo_ip fields are used.
See fs/glock.c ogfs_glock_m().
ogfs_unlink() ... unlink a file
See ogfs_link(), above. The same locks are grabbed the same way.
ogfs_rmdir() ... remove a directory
See ogfs_link(), above. Locks for directory and its parent are grabbed
the same way.
ogfs_rename() ... rename/move a file
grabs up to 4 locks, all in exclusive (0) mode, using ogfs_inode_ops, on:
1) inode of old parent directory
2) inode of new parent directory
3) inode of new name(?) (if pre-existing?)
4) inode of old directory(?) (if moving a directory?)
Appendix I. Inventory of calls to glops.
All calls to go_* are from glock.c.
All implementations (except ogfs_free_bh()) are in src/fs/arch_*/glops.c.
go_sync:
Called from:
sync_dependencies() -- sync out any locks dependent on this one
ogfs_gunlock() -- unlock a glock
Implementations:
sync_meta() -- sync to disk all dirty data for a metadata glock
used for types: meta, rgrp
calls: test_bit() -- GLF_DIRTY (any dirty data to flush?)
ogfs_log_flush() -- flush glock's incore committed transactions
ogfs_sync_bh() -- flush all glock's buffers
clear_bit() -- clear GLF_DIRTY, GLF_SYNC
also called by: release_meta(), release_rgrp()
sync_inode() -- sync to disk all dirty data for an inode glock
used for type: inode
calls: test_bit() -- GLF_DIRTY (any dirty data to flush?)
ogfs_log_flush() -- flush glock's incore committed transactions
ogfs_sync_page() -- flush all glock's pages
ogfs_sync_bh() -- flush all glock's buffers
clear_bit() -- clear GLF_DIRTY, GLF_SYNC
also called by: release_inode()
go_acquire:
Called from:
xmote_glock() -- promote a glock
Implementations:
acquire_rgrp() -- done after an rgrp lock is acquired
used for type: rgrp
calls: ogfs_rgrp_save_in() -- read rgrp data from glock's LVB
go_release:
Called from:
cleanup_glock() -- prepare (an exclusive?) glock to be released to
another node
Implementations:
release_meta()
used for type: meta
calls: sync_meta() -- sync to disk all dirty data assoc with glock
ogfs_inval_bh() -- invalidate all buffers assoc with glock
release_inode()
used for type: inode
calls: ogfs_flush_meta_cache()
sync_inode() -- sync to disk all dirty data assoc with glock
ogfs_inval_pg() -- invalidate all pages assoc with glock
ogfs_inval_bh() -- invalidate all buffers assoc with glock
release_rgrp() -- prepare an rgrp lock to be released
used for type: rgrp
calls: sync_meta() -- sync to disk all dirty data assoc with glock
ogfs_inval_bh() -- invalidate all buffers assoc with glock
ogfs_rgrp_save_out() -- write rgrp data out to glock's LVB
release_trans() -- prepare *the* transaction lock to be released
used for type: transaction
calls: ogfs_log_flush() -- flush glock's incore committed transactions
fsync_no_super() -- (kernel) flush this fs' dirty buffers
go_lock: -- get fresh copy of inode or rgrp bitmap from disk
returns 0 on success, error code on failure of read
(this is the only go_* function that returns anything)
Called from:
ogfs_glock()
Implementations:
lock_inode()
used for type: inode
calls: atomic_read() -- gl_locked, recursive cnt of process ownership
ogfs_copyin_dinode() -- get fresh copy of inode from disk
lock_rgrp() -- done after an rgrp lock is locked by a process
used for type: rgrp
calls: atomic_read() -- gl_locked, recursive cnt of process ownership
ogfs_rgrp_read() -- get fresh copy of rgrp bitmap from disk
go_unlock: -- copy inode attributes to VFS inode, or
release rgrp bitmap blocks, copy rgrp stats to LVB struct
Called from:
ogfs_gunlock()
Implementations:
unlock_inode()
used for type: inode
calls: atomic_read() -- gl_locked, recursive cnt of process ownership
test_and_clear_bit() -- GLF_POISONED (gl_vn++ if so)
test_bit() -- GLF_DIRTY (have inode attributes changed?)
ogfs_inode_attr_in() -- copy attributes fm dinode -> VFS inode
unlock_rgrp() -- prepare an rgrp lock to be unlocked by a process
used for type: rgrp
calls: atomic_read() -- gl_locked, recursive cnt of process ownership
ogfs_rgrp_relse() -- release (i.e. brelse()) rgrp bitmaps
test_and_clear_bit() -- GLF_POISONED (gl_vn++ if so)
test_bit() -- GLF_DIRTY (have rgrp usage stats changed?)
ogfs_rgrp_lvb_fill() -- copy rgrp usage stats to LVB struct
go_free:
Called from:
ogfs_clear_gla() -- clear all glocks before unmounting the lock protocol
Implementation (in arch_*/dio_arch.c):
ogfs_free_bh() -- free all buffers associated with a G-Lock
used for types: meta, inode, rgrp
calls: list_del() -- removes private bufdata from glock list
ogfs_put_glstruct() -- decrement reference count gl_count
ogfs_free_bufdata() -- kmem_cache_free private bufdata from
ogfs_bufdata_cache
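A glops table like the ones inventoried above is essentially a per-lock-type vtable. The following is a hypothetical reconstruction of its shape (the field names follow the go_* names used in this appendix; the struct layout is illustrative, not the real OGFS definition):

```c
/* Hypothetical shape of a glops vtable: each glock type (meta, inode,
 * rgrp, ...) supplies the callbacks the generic glock code invokes. */

struct glock;   /* opaque here; the real structure lives in glock.c */

struct glock_operations {
    void (*go_sync)(struct glock *gl);     /* flush dirty data */
    void (*go_acquire)(struct glock *gl);  /* after inter-node grant */
    void (*go_release)(struct glock *gl);  /* before another node gets it */
    int  (*go_lock)(struct glock *gl);     /* after a process locks it */
    void (*go_unlock)(struct glock *gl);   /* before a process unlocks it */
    void (*go_free)(struct glock *gl);     /* teardown at unmount */
};

static int synced;
static void sync_meta_stub(struct glock *gl) { (void)gl; synced = 1; }

/* a "meta" lock type fills in only the callbacks it needs */
static const struct glock_operations meta_glops = {
    .go_sync = sync_meta_stub,
};

/* the generic glock code dispatches through the table */
int glops_demo(void)
{
    synced = 0;
    if (meta_glops.go_sync)
        meta_glops.go_sync(0);
    return synced;
}
```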
Appendix J. Some info from earlier ogfs-internals doc
NOTE (bc): Some of this information is no longer exactly accurate, but
provides interesting reading nonetheless.
A Guide to the Internals of the OpenGFS File System
copyright 2001 The OpenGFS Project.
copyright 2000 Sistina Software Inc.
The G-Lock Functions
The concept of a G-Lock is fundamental to OpenGFS. A G-Lock represents an
abstraction of some underlying locking protocol and is essential to
maintaining consistency in an OpenGFS filesystem. The G-Lock layer provides
the glue required between the abstract lock_harness code and the
filesystem operations. The lock_harness itself is the subject of
a separate document and not covered here.
The G-Locks are held in a hash table contained in the OpenGFS specific
portion of the superblock. Each hash chain has three separate lists
plus associated counters and a read/write lock. The lists are
associated with the state in which the G-Locks happen to be. The
not_held state is for locks which are not held by the client, but
the structure still exists (used to reduce the number of memory
allocations/deallocations). The held state is for locks which are
held by the client (although not in use by any processes). These
locks can be dropped immediately upon a request from another client
or upon memory pressure. The third state (perm) is used for locks
which are both locked and in use.
In order to release a G-Lock so that another client may access the data
which it protects, all the data which that G-Lock covers must be flushed
to disk. Also further accesses to the data on the client releasing the lock
must be prevented until such time as the client requires the lock. Clients
can cache data even when they don't have the G-Lock which covers that
data provided they check the validity of the data the next time they
acquire the lock and reread it if it has changed on disk.
We use various techniques to improve the efficiency of the glock
layer. Read/write locks are used upon the G-Lock lists so that the
more usual lookup operations can occur in parallel with each other
and only write operations (moving G-Locks between lists or creating
or deleting them) need the exclusive lock. Also when a lock has
been locked, it is not unlocked until it has aged a certain number
of seconds. This is done to increase the chances of a future lock
request being able to reuse the lock instead of requiring a separate
locking operation. Of course, if another client requires the lock, it
must post a callback to the lock holding client to request it. This is
done by marking the lock with a special flag which causes it to unlock as
soon as the current operation has completed (or immediately if there is
no current operation).
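The flag mechanism in the last sentence can be modeled like this (all names are illustrative):

```c
/* Toy model of the demote-on-callback flag: a remote request marks the
 * held lock, which is dropped as soon as the current local operation
 * completes, or immediately if the lock is idle. */

static int held = 1;            /* we currently hold the glock */
static int in_use = 0;          /* a local operation is using it */
static int demote_pending = 0;  /* the "special flag" described above */

/* another client posted a callback asking for the lock */
static void remote_callback(void)
{
    if (in_use)
        demote_pending = 1;     /* drop it when the operation completes */
    else
        held = 0;               /* idle: release immediately */
}

static void operation_done(void)
{
    in_use = 0;
    if (demote_pending) {
        held = 0;
        demote_pending = 0;
    }
}

/* callback arrives mid-operation: the lock is kept for now */
int callback_while_busy(void)
{
    held = 1; in_use = 1; demote_pending = 0;
    remote_callback();
    return held;
}

/* ... and surrendered as soon as the operation finishes */
int finish_operation(void)
{
    operation_done();
    return held;
}

/* callback arrives while idle: the lock is dropped at once */
int callback_while_idle(void)
{
    held = 1; in_use = 0; demote_pending = 0;
    remote_callback();
    return held;
}
```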
There is a daemon function (glockd) which runs periodically to clear
the G-Lock cache of old entries. It does this in a two-stage process.
The first stage of ogfs_glockd_scan is really a part of the inode
functions, and not part of the glock code, but it fits nicely here.
The first part of the glockd scanning function looks at all the held
locks and demotes any which have been held too long. The second part
deletes any G-Locks which have exceeded the timeout for not_held
locks.
see opengfs/src/fs/glock.c