OS_Interface_Operations (Oct 18 2003)

                OpenGFS Methods (interfaces to Linux kernel)

Copyright 2003 The OpenGFS Project

Author:
Ben Cahill (bc), ben.m.cahill@intel.com

1.0  Introduction
------------

This document contains details of the OpenGFS "ops" interfaces presented
to the Linux kernel.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of how OpenGFS interfaces with the
Linux kernel's VFS.

This document is not intended as a user guide to OpenGFS.  Look in the OpenGFS
WHATIS-ogfs document for an overview of OpenGFS.  Look in the OpenGFS
HOWTO-generic or HOWTO-nopool for details of configuring and setting up OpenGFS.

This document may contain inaccurate statements, based on the author's limited
understanding.  Please contact the author (bc) if you see anything wrong or
unclear.


2.0. Operations overview

OpenGFS presents several ops structures to the Linux kernel, exposing
functions within OpenGFS that perform standard methods expected by the 
operating system's Virtual Filesystem (VFS) layer.

Method categories include:

-- Miscellaneous (proc, SSI mnt)
-- Superblock (s)
-- File (f)
-- Inode (i)
-- Address Space (a)
-- Directory Entry (dentry, d)
-- Page/mmap (vm)

Each category exposes a structure containing the functions that OpenGFS needs
to substitute (hook) in place of default functions used by the kernel's VFS.
Sometimes, OpenGFS simply wraps a glock (see ogfs-locking) around one of the
kernel's default functions.  In other cases, OpenGFS executes a complete file
operation itself.
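
The hooking idea can be sketched in user space as follows.  This is only an
illustration, not OGFS or kernel code: every demo_* name is hypothetical, the
"glock" is just a counter, and demo_generic_read() merely stands in for a
kernel default such as generic_file_read().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a glock: just counts holders. */
struct demo_glock {
	int holders;
	int acquisitions;
};

/* Hypothetical ops table, in the spirit of struct file_operations. */
struct demo_file_ops {
	int (*read)(struct demo_glock *gl, int nbytes);
	int (*llseek)(int offset);	/* NULL means "use the default" */
};

/* Stand-in for a kernel default read function. */
static int demo_generic_read(int nbytes)
{
	return nbytes;
}

/* Stand-in for the kernel's default seek. */
static int demo_default_llseek(int offset)
{
	return offset;
}

/* Hooked method: wrap a lock around the default function. */
static int demo_hooked_read(struct demo_glock *gl, int nbytes)
{
	int ret;

	assert(gl != NULL);
	gl->holders++;			/* acquire */
	gl->acquisitions++;
	ret = demo_generic_read(nbytes);
	gl->holders--;			/* release */
	return ret;
}

static const struct demo_file_ops demo_fops = {
	.read = demo_hooked_read,
	/* .llseek is left NULL: not hooked */
};

/* What a VFS-like caller does: use the hook if present, else the default. */
static int demo_vfs_llseek(const struct demo_file_ops *ops, int offset)
{
	if (ops->llseek)
		return ops->llseek(offset);
	return demo_default_llseek(offset);
}
```

Calling demo_fops.read() goes through the lock wrapper, while
demo_vfs_llseek() falls through to the default because the table leaves
that slot empty.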

It's interesting to consider not only the functions that OpenGFS hooks,
but also those that it does not.


3.0  Miscellaneous (proc, SSI mnt) Operations:

3.1. Proc

OGFS uses the following, in fs/arch_linux_2_4/main.c, to call out the proc
entry point for OGFS:

	ogfs_proc_entry =
	    create_proc_read_entry("fs/ogfs", S_IFREG | 0200, NULL, NULL, NULL);
	ogfs_proc_entry->write_proc = ogfs_proc_write;

The proc entry point /proc/fs/ogfs supports setting mount options when they
cannot be given on a mount command line, e.g. when mounting from an initial
ramdisk.


3.2. Single System Image (SSI) Operations

This set of operations, in fs/arch_linux_2_4/super_linux.c, supports OGFS
in an SSI environment.  I'll skip these for now.

#ifdef CONFIG_SSI
struct vfs_operations ogfs_mnt_ops = {
	get_uniqueid: 	ogfs_get_uniqueid,
	cluster_mount:	ogfs_notify_all,
	cluster_umount:	ogfs_client_umount,
	get_mount:	ogfs_get_mount,
};
#endif /* CONFIG_SSI */


4.0  Superblock (s) Operations:

OpenGFS has 1 set of superblock operations, in fs/arch_linux_2_4/super_linux.c.

struct super_operations ogfs_sops = {
	read_inode:	ogfs_panic_inode,
	read_inode2:	ogfs_read_inode,
	put_super:	ogfs_put_super,
	statfs:		ogfs_statfs,
	clear_inode:	ogfs_clear_inode,
	remount_fs:	ogfs_remount_fs,
	write_super:	ogfs_write_super
};

In addition, the mount operation ogfs_read_super() is called out in
fs/arch_linux_2_4/main.c as:

static DECLARE_FSTYPE_DEV(ogfs_fs_type, "ogfs", ogfs_read_super);
	error = register_filesystem(&ogfs_fs_type);


4.1.  Hooked Superblock functions

4.1.0.  ogfs_read_super()

This is the call to mount the filesystem.  The mount procedure and mount
options deserve a doc of their own, since they set up all the components
that make OGFS run.

In case of SSI support, there is a top level ogfs_read_super() that calls
the regular mount call, renamed to _ogfs_read_super(), with various options
depending on SSI cluster state.


4.1.0.  ogfs_panic_inode() (read_inode)

According to comments in code, OGFS never gets called for this, but it's
needed to make NFS happy.


4.1.1.  ogfs_read_inode() (read_inode2)

???

4.1.2.  ogfs_put_super()

This is the unmount function.  Undoes all the stuff done by ogfs_read_super().


4.1.3.  ogfs_statfs()

OGFS keeps block usage statistics distributed among the resource groups
in the filesystem.  OGFS looks at each resource group descriptor, accumulating
the stats before filling in the stats buffer passed from the OS.
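
A minimal sketch of that accumulation, assuming hypothetical structure and
field names (demo_rgrp, demo_statfs); the real resource group descriptor and
the kernel's statfs buffer carry more fields than shown here:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-resource-group counters; field names are assumptions. */
struct demo_rgrp {
	unsigned long total_blocks;
	unsigned long free_blocks;
};

/* Hypothetical subset of the stats buffer handed in by the OS. */
struct demo_statfs {
	unsigned long f_blocks;
	unsigned long f_bfree;
};

/* Walk every resource group and accumulate its counters into buf. */
static void demo_statfs_fill(const struct demo_rgrp *rgs, size_t n,
			     struct demo_statfs *buf)
{
	size_t i;

	assert(buf != NULL);
	assert(rgs != NULL || n == 0);
	buf->f_blocks = 0;
	buf->f_bfree = 0;
	for (i = 0; i < n; i++) {
		buf->f_blocks += rgs[i].total_blocks;
		buf->f_bfree += rgs[i].free_blocks;
	}
}
```

The sketch ignores locking; in a cluster, each descriptor would presumably
need to be read under its own lock to get current numbers.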


4.1.4.  ogfs_clear_inode()

Called after VFS is done with an inode.  OGFS "puts" its own inode structure,
i.e. decrements its usage count, but does not deallocate it from memory.

OGFS keeps a count of how many times it has freed inode structures.  Once the
count reaches 50, OGFS launches the glockd daemon to clean them out, along
with associated glock structures.
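
The threshold logic can be sketched as below.  The structure, the function
name, and the assumption that the counter restarts after the daemon is kicked
are all hypothetical; only the value 50 comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_RECLAIM_THRESHOLD 50	/* value taken from the text above */

/* Hypothetical per-filesystem state. */
struct demo_sb {
	int freed_count;	/* inode structures released so far */
	int daemon_wakeups;	/* times the cleanup daemon was kicked */
};

/* Called once per released inode; returns 1 if the daemon was kicked. */
static int demo_clear_inode(struct demo_sb *sb)
{
	assert(sb != NULL);
	if (++sb->freed_count >= DEMO_RECLAIM_THRESHOLD) {
		sb->freed_count = 0;	/* assumption: counter restarts */
		sb->daemon_wakeups++;
		return 1;
	}
	return 0;
}
```

Batching the cleanup this way keeps the common path (one decrement and one
compare) cheap, and defers the expensive teardown to the daemon.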


4.1.5.  ogfs_remount_fs()

OGFS does not remount when this is called.  It just copies the access time
(atime) flags (MS_NOATIME, MS_NODIRATIME), sent from VFS, into the OGFS incore
superblock structure, and sets the incore superblock's sd_sync_mount flag to
FALSE.  TRUE would force a sync-to-disk of any glock-protected data, when the
glock is unlocked by a local process.  FALSE allows the sync to be determined
on an individual glock basis.

It also sets the MS_NOATIME and MS_NODIRATIME flags to return to VFS, to
prevent VFS from updating the atime (OGFS takes responsibility for updating).
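
The flag handling might look roughly like this.  The DEMO_* constants mirror
the numeric values of Linux's MS_NOATIME and MS_NODIRATIME, but the names,
the demo_sbd structure, and its sd_atime_flags field are assumptions made for
the sketch; only sd_sync_mount is named in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror Linux's MS_NOATIME / MS_NODIRATIME; names are local. */
#define DEMO_MS_NOATIME		1024
#define DEMO_MS_NODIRATIME	2048
#define DEMO_ATIME_MASK		(DEMO_MS_NOATIME | DEMO_MS_NODIRATIME)

/* Hypothetical incore superblock; only sd_sync_mount comes from the text. */
struct demo_sbd {
	int sd_atime_flags;	/* hypothetical field */
	int sd_sync_mount;
};

/*
 * Record the atime flags the caller asked for, clear sd_sync_mount, and
 * hand both atime flags back so the VFS-like caller stops touching atime.
 */
static void demo_remount(struct demo_sbd *sbd, int *flags)
{
	assert(sbd != NULL && flags != NULL);
	sbd->sd_atime_flags = *flags & DEMO_ATIME_MASK;
	sbd->sd_sync_mount = 0;
	*flags |= DEMO_ATIME_MASK;
}
```

Note that the flags are saved before they are overwritten, so the filesystem
still knows what the user actually requested.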


4.1.6.  ogfs_write_super()

OGFS flushes its journal to disk.


4.2.  Non-Hooked Superblock functions

4.2.0.  alloc_inode()

4.2.1.  destroy_inode()

4.2.2.  dirty_inode()

4.2.3.  write_inode()

4.2.4.  put_inode()

OGFS does not need to know when an individual process releases an inode, just
when VFS is done with it altogether, via clear_inode() (see 4.1.4).


4.2.5.  delete_inode()

4.2.6.  write_super_lockfs()

4.2.7.  unlockfs()

4.2.8.  umount_begin()

NFS only.

4.2.9.  fh_to_dentry()

4.2.10.  dentry_to_fh()

4.2.11.  show_options()



5.0  File (file, f) Operations

OpenGFS has 2 sets of file ops vectors, in fs/arch*/file.c.

One set for files:

struct file_operations ogfs_file_fops = {
	read:		ogfs_read,
	write:		ogfs_write,
	ioctl:		ogfs_ioctl,
	mmap:		ogfs_file_mmap,
	open:		ogfs_open,
	fsync:		ogfs_sync_file,
	lock:		ogfs_lock_file
};


and one for directories:

struct file_operations ogfs_dir_fops = {
	readdir:	ogfs_readdir,
	ioctl:		ogfs_ioctl,
	fsync:		ogfs_sync_file
};

The ioctl and fsync ops/functions are common to both sets, but all the other
ops/functions are unique.


5.1.  Hooked File Ops functions


5.1.0. ogfs_read()

OGFS wraps a shared glock, on the file's inode, around a call to the kernel's
generic_file_read() function.

It also updates the access time (atime) for the file.

It also wraps the Big Kernel Lock (BKL) around the whole thing.


5.1.1. ogfs_write()

OGFS wraps the following around a call to the kernel's 
generic_file_write_nolock() function:

-- down/up on kernel inode structure's exclusive access semaphore.
-- exclusive glock on inode (no other node may access file while we're writing).
-- block reservation/release (reserve the max # blocks that we may need,
      then release the excess that we didn't use, after we're done).
-- journal transaction begin/end.

After the generic_file_write_nolock(), OGFS updates its on-disk inode with
a new size, mode, and modification and change times (mtime, ctime).

It also wraps a BKL around the whole thing.
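
The layering can be pictured with a small trace, sketched below.  The nesting
order is an assumption taken from the order of the list above, as is placing
the inode update inside the transaction; the step names and demo_* functions
are hypothetical and do no real locking or I/O.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical trace buffer so the nesting order can be inspected. */
struct demo_trace {
	char buf[256];
};

/* Append one step name, followed by ';', to the trace. */
static void demo_step(struct demo_trace *t, const char *name)
{
	assert(strlen(t->buf) + strlen(name) + 2 < sizeof(t->buf));
	strcat(t->buf, name);
	strcat(t->buf, ";");
}

/*
 * Reserve max_blocks, "write" using used_blocks of them, and return the
 * number of blocks given back.  Layers are released in reverse order.
 */
static int demo_write(struct demo_trace *t, int max_blocks, int used_blocks)
{
	int excess;

	assert(used_blocks >= 0 && used_blocks <= max_blocks);
	demo_step(t, "bkl+");
	demo_step(t, "sem+");
	demo_step(t, "glock_ex+");
	demo_step(t, "reserve");
	demo_step(t, "trans+");
	demo_step(t, "generic_write");
	demo_step(t, "inode_update");	/* assumption: inside the transaction */
	demo_step(t, "trans-");
	excess = max_blocks - used_blocks;
	demo_step(t, "release_excess");
	demo_step(t, "glock_ex-");
	demo_step(t, "sem-");
	demo_step(t, "bkl-");
	return excess;
}
```

The caller must start with an empty trace (buf[0] set to '\0').  Reserving
the maximum up front means the write cannot run out of space midway; the
unused part of the reservation is simply handed back afterwards.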


5.1.2. ogfs_ioctl()

OGFS performs the specific ioctl.  These are used by OGFS user-space utilities
for filesystem expansion, gathering current block usage statistics, debugging,
etc.


5.1.3. ogfs_file_mmap()

OGFS attaches one of two vm_ops structures to the vm_area_struct provided
by the kernel:

-- ogfs_shared_mmap
-- ogfs_private_mmap

These are described in section 9.

OGFS also wraps a shared glock, on the mmapped file's inode, around an update
to the file's access time (atime).  Note that the glock is *shared*.  If
the atime is actually updated (depending upon how long it's been since the
last update), the update function releases the shared lock, then immediately
grabs an exclusive lock.  A lock *conversion* might work well here.
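
The release-then-reacquire pattern can be sketched as follows.  All names are
hypothetical, and DEMO_ATIME_QUANTUM is an invented stand-in for "how long
it's been since the last update"; the sketch stops once the exclusive lock is
held and atime is written.

```c
#include <assert.h>

enum demo_state { DEMO_UNLOCKED, DEMO_SHARED, DEMO_EXCLUSIVE };

/* Hypothetical lock and inode. */
struct demo_glock {
	enum demo_state state;
	int unlocked_windows;	/* times the lock was fully dropped */
};

struct demo_inode {
	struct demo_glock gl;
	long atime;
};

#define DEMO_ATIME_QUANTUM 3600L	/* assumed value, in seconds */

/*
 * Caller holds gl shared.  If atime is stale, drop the shared lock and
 * take it exclusive before writing; there is a window with no lock held.
 * Returns 1 if atime was updated, 0 if it was recent enough.
 */
static int demo_update_atime(struct demo_inode *ip, long now)
{
	assert(ip->gl.state == DEMO_SHARED);
	if (now - ip->atime < DEMO_ATIME_QUANTUM)
		return 0;

	ip->gl.state = DEMO_UNLOCKED;	/* release shared */
	ip->gl.unlocked_windows++;
	ip->gl.state = DEMO_EXCLUSIVE;	/* grab exclusive */
	ip->atime = now;
	return 1;
}
```

The unlocked_windows counter marks the gap in which another node could take
the lock; a shared-to-exclusive conversion would avoid that gap.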


5.1.4. ogfs_open()

Very lightweight function.  OGFS just verifies a couple of things related
to LARGEFILE (size > 0x7fffffff bytes), and exclusive access.


5.1.5. ogfs_sync_file()

OGFS syncs the file to disk, across all computers in the cluster, in a two-step
operation:

-- grab exclusive glock on the file's inode.  This will cause any other computer
   that is writing, or has recently written to the file, to flush data to disk,
   before releasing the lock to *this* machine.  Note that the glock could be
   sitting in another machine's glock cache for several minutes after that
   machine performs a write to the file, before the data gets flushed to disk.
   Grabbing the exclusive glock forces any other machine to release the lock
   from its cache, and flush.

-- release exclusive glock, with GL_SYNC flag.  This will force *this*
   computer to flush all data to disk.  Without the flag, the flush might
   not occur until *this* machine's glock cache finally releases the glock.

It also wraps a BKL around the whole thing.
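
The two steps can be modeled with two nodes and a page counter, as sketched
below.  Everything here is hypothetical (demo_node, demo_disk_pages, the
DEMO_GL_SYNC value); it only models who flushes when, not real glock caching.

```c
#include <assert.h>

#define DEMO_GL_SYNC 0x1	/* stands in for the GL_SYNC flag named above */

/* Hypothetical per-node view of one file. */
struct demo_node {
	int holds_lock;		/* lock held or cached on this node */
	int dirty_pages;	/* data not yet on disk */
};

static int demo_disk_pages;	/* pages that have reached "disk" */

/* Move a node's dirty pages to "disk". */
static void demo_flush(struct demo_node *n)
{
	demo_disk_pages += n->dirty_pages;
	n->dirty_pages = 0;
}

/* Step 1: taking the lock exclusively makes the other holder flush. */
static void demo_lock_exclusive(struct demo_node *me, struct demo_node *other)
{
	if (other->holds_lock) {
		demo_flush(other);
		other->holds_lock = 0;
	}
	me->holds_lock = 1;
}

/* Step 2: with DEMO_GL_SYNC, flush now; otherwise data stays dirty. */
static void demo_unlock(struct demo_node *me, int flags)
{
	assert(me->holds_lock);
	if (flags & DEMO_GL_SYNC)
		demo_flush(me);
	/* the lock remains cached on this node either way */
}

/* The two-step sync: lock exclusive, then unlock with the sync flag. */
static void demo_sync_file(struct demo_node *me, struct demo_node *other)
{
	demo_lock_exclusive(me, other);
	demo_unlock(me, DEMO_GL_SYNC);
}
```

Unlocking without the flag leaves dirty_pages untouched, which corresponds to
data lingering until the cached lock is finally given up.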


5.1.6. ogfs_lock_file()

OGFS does either an flock or a plock operation (lock, unlock, etc.), depending
on the flag field of the kernel's file structure.  OGFS uses special types of
glocks (FLOCK or PLOCK) to extend the lock across the entire cluster.

It also wraps a BKL around the whole thing.


5.1.7. ogfs_readdir()

OGFS wraps a shared glock on the directory's inode, around an atime update,
and a call to ogfs_dir_read().


5.2.  Non-hooked functions

5.2.0.  llseek()

No need for OGFS to be aware of the file pointer until it's time to read or
write.


5.2.1.  poll()

No need for OGFS to be aware that a process is waiting for something.


5.2.2.  flush()

No need for OGFS to know when a reference to an open file is closed.


5.2.3.  release()

No need for OGFS to know when the last reference to an open file is closed.


5.2.4.  fasync()

No need for OGFS to be aware that I/O notification by signals is being
enabled or disabled.


5.2.5.  readv()

5.2.6.  writev()

5.2.7.  sendpage()

5.2.8.  get_unmapped_area()





6.0.  Inode (i) Operations

OGFS has 4 sets of inode operations, in src/fs/arch_linux_2_4/inode_linux.c.

One set for files (identical to set for device nodes):

struct inode_operations ogfs_file_iops = {
	revalidate:	ogfs_irevalidate,
	setattr:	ogfs_setattr
};


One set for device nodes (identical to set for files):

struct inode_operations ogfs_dev_iops = {
	revalidate:	ogfs_irevalidate,
	setattr:	ogfs_setattr
};


One set for directories (the last two entries are the same as above):

struct inode_operations ogfs_dir_iops = {
	create:		ogfs_create,
	lookup:		ogfs_lookup,
	link:		ogfs_link,
	unlink:		ogfs_unlink,
	symlink:	ogfs_symlink,
	mkdir:		ogfs_mkdir,
	rmdir:		ogfs_rmdir,
	mknod:		ogfs_mknod,
	rename:		ogfs_rename,
	revalidate:	ogfs_irevalidate,
	setattr:	ogfs_setattr
};


And one set for symlinks (the last entry is the same as in the other 3 sets):

struct inode_operations ogfs_symlink_iops = {
	readlink:	ogfs_readlink,
	follow_link:	ogfs_follow_link,
	revalidate:	ogfs_irevalidate
};


6.1.   Hooked inode functions
6.1.0. ogfs_irevalidate()
6.1.1. ogfs_setattr()
6.1.2. ogfs_create()
6.1.3. ogfs_lookup()
6.1.4. ogfs_link()
6.1.5. ogfs_unlink()
6.1.6. ogfs_symlink()
6.1.7. ogfs_mkdir()
6.1.8. ogfs_rmdir()
6.1.9. ogfs_mknod()
6.1.10. ogfs_rename()
6.1.11. ogfs_readlink()
6.1.12. ogfs_follow_link()

6.2.   Non-hooked inode functions
6.2.0. truncate()
6.2.1. permission()
6.2.2. getattr()
6.2.3. setxattr()
6.2.4. getxattr()
6.2.5. listxattr()
6.2.6. removexattr()




7.0.  Address Space (a) Operations

OGFS has 1 set of address space operations, along with the inode ops,
in src/fs/arch_linux_2_4/inode_linux.c.

One set for files:

struct address_space_operations ogfs_file_aops = {
	writepage:	ogfs_writepage,
	readpage:	ogfs_readpage,
	sync_page:	block_sync_page,
	prepare_write:	ogfs_prepare_write,
	commit_write:	ogfs_commit_write,
	bmap:		ogfs_bmap,
};


7.1.  Hooked address space functions

7.1.0.  ogfs_writepage()

7.1.1.  ogfs_readpage()

7.1.2.  block_sync_page() (generic kernel function, not an OGFS function)

7.1.3.  ogfs_prepare_write()

7.1.4.  ogfs_commit_write()

7.1.5.  ogfs_bmap()


7.2.   Non-hooked address space functions

7.2.0.  flushpage()

7.2.1.  releasepage()

7.2.2.  direct_IO()

7.2.3.  removepage()




8.0  Directory Entry (dentry, d) Operations:

OGFS has one set of dentry operations, in src/fs/arch_linux_2_4/dcache.c:

struct dentry_operations ogfs_dops = {
	d_revalidate:	ogfs_drevalidate,
	d_release:	ogfs_drelease
};


8.1  Hooked Dentry Operations:


8.1.0. ogfs_drevalidate()

Verifies validity of incore dentry object before VFS uses it for translating a
file pathname.  While the default VFS function does nothing (VFS keeps dentries
current within a single computer node), OGFS requires, and performs, a
revalidation, in case another node has changed the directory tree.

OGFS determines whether the dentry is current by comparing two version
numbers: one that OGFS attaches to the VFS dentry's d_fsdata (fs private
data), and one attached to the glock of the inode represented by the dentry.
The glock version # increments whenever ????.

If the dentry version # does not match the glock version #, ogfs_drevalidate()
performs a new directory search, then updates the dentry version # to match the
glock's version #.
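
The version comparison can be sketched as below.  The structures and function
names are hypothetical, and the return value is chosen for the sketch only; it
does not follow the kernel's d_revalidate return convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the two version numbers come from the text. */
struct demo_glock {
	unsigned long version;
};

struct demo_dentry {
	unsigned long cached_version;	/* kept in fs-private data */
	int lookups;			/* directory searches performed */
};

/* Stand-in for a fresh directory search. */
static void demo_dir_search(struct demo_dentry *d)
{
	d->lookups++;
}

/*
 * If the versions differ, search again and adopt the glock's version.
 * Returns 1 if a new search was needed, 0 if the dentry was current.
 */
static int demo_drevalidate(struct demo_dentry *d, const struct demo_glock *gl)
{
	assert(d != NULL && gl != NULL);
	if (d->cached_version == gl->version)
		return 0;
	demo_dir_search(d);
	d->cached_version = gl->version;
	return 1;
}
```

When nothing has changed, the check costs one integer compare, so the
directory search is paid for only after the glock's version has moved.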


8.1.1. ogfs_drelease()

Called just before VFS frees a dentry object.  OGFS frees the small
ogfs_dcached_t structure that it had previously attached to the dentry object.




9.0. Page/mmap (virtual memory, vm) Operations:


static struct vm_operations_struct ogfs_shared_mmap = {
	nopage:		ogfs_shared_nopage,
};

static struct vm_operations_struct ogfs_private_mmap = {
	nopage:		ogfs_private_nopage,
};