Filesystem On-Disk Layout (Aug 30 2003)



		OpenGFS File System Layout on Disk

Copyright 2003 The OpenGFS Project.

Introduction
------------

This document contains details of the data and structures contained in the
OpenGFS file system on the physical disks.  This document is aimed at
developers, potential developers, students, and anyone who wants to know about
the low level details of the file system.

This document is not intended as a user guide to OpenGFS.  Look in the OpenGFS
HOWTO-generic for details of configuration and setting up OpenGFS.

Terminology
-----------

The term "sector" refers to a 512 block as it is used in the Linux kernel.  The
term "block" refers to a block in the OpenGFS file system.  The size of a block
can be a power of two, from 512 to 65536.  The terms log and journal mean the
same throughout this document.

General Information
-------------------

1. Byte Order

All data in the OpenGFS structures is stored in big endian byte order on
disk.  In memory, host byte order is used.  The disk byte order is defined by
the OGFS_ENDIAN_BIG macro in fs/ogfs_ondisk.h.  No guarantees can be made that
the current code works with little endian structures too.

Pool Layout
-----------

1. Real Pool Layout

The OpenGFS file system is usually not stored directly on a disk partition, but
uses a pool device that is provided by the pool kernel module.  Please refer to
the ogfs-pool document for details.  A pool consists of a number of subpools.
Each subpool contains one or more disk partitions.

The pool device /dev/pool/ supplies a linear address space for the
sectors available on the disk partitions.  Example: Let us assume that the pool
was configured so that subpools of type ogfs_data are at the beginning of the
address space, while subpools of type ogfs_journal are placed at the end.  Let
us further assume the pool has n data subpools with a total of N sectors and m
journal subpools with a total of M sectors.  The picture below shows the layout
of the example pool device:

  /dev/pool/:
  +-----------+---------+-----------+-----------+---------+-----------+
  |   data    |   ...   |   data    |  journal  |   ...   |  journal  |
  | subpool 1 |         | subpool n | subpool 1 |         | subpool m |
  +-----------+---------+-----------+-----------+---------+-----------+
  ^                                 ^                     ^
  |_ sector 0                       |_ sector N           |_ sector N+M-1

  Pool layout example


The subpools may appear in any order, mixing data and journal subpools.

2. Faked Pool Layout

When the file system is created, the mkfs.ogfs utility can be told to ignore
the pool layout.  This can be useful for testing on a real pool, without
worrying about the actual pool setup.  Another application is to run an OpenGFS
file system locally on a disk partition or a file.  In any case, mkfs.ogfs can
be told to ignore the normal pool layout and fake one without actually
accessing the pool driver:

  $ mkfs.ogfs -i -j m -J s -b bs target
              ^    ^    ^    ^   ^
              |    :    |    :   |___ target device or file
              |    :    |    :....... block size
              |    :    |____________ size of each journal in MB
              |    :................. number of journals
              |______________________ ignore pool layout

To initialise the file system, mkfs.ogfs internally fakes a pool layout similar
to a real pool:  (m*s) MB are reserved at the end of the available space for
the journals.  The remaining sectors are put in a data subpool that starts at
the beginning of the target.  Assuming the target has T sectors and (m*s) MB
consist of M sectors:

  target:
  +-----------+-----------+---------+-----------+
  |   data    |  journal  |   ...   |  journal  |
  |  subpool  | subpool 1 |         | subpool m |
  +-----------+-----------+---------+-----------+
  ^           ^                     ^
  |_ sector 0 |_ sector T-M         |_ sector T-1


3. Subpool Limitations Imposed by mkfs.ogfs

The mkfs.ogfs utility imposes a number of limits on the actual or faked pool
geometry:

 * Each journal must have a size of at least 32 MB.  (The default without the
   -J option is 128 MB).
 * Subpools must be aligned to block boundaries.  The mkfs.ogfs utility
   internally cuts off the excess sectors at the beginning and the end of a
   subpool.  These sectors are never used by the file system.
 * Each data subpool must consist of at least 100 blocks.
 * The first data subpool must provide extra space for the superblock and other
   information.  The first 64 kB of the file system are filled with zeros.  In
   total, the size of the first data subpool must be at least (101 * block
   size) + (128 * sector size).
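
The limits above can be summarised in a small sketch.  It is illustrative
only; the macro and function names are made up, and only the constants are
taken from the list above:

  #include <stdint.h>

  #define OGFS_SECTOR_SIZE 512            /* "sector", see Terminology */

  /* Smallest allowed size of the first data subpool, in bytes:  101 blocks
   * plus the 128 sectors (64 kB) of zeros at the start of the file system. */
  static uint64_t min_first_data_subpool_bytes(uint32_t block_size)
  {
          return (uint64_t)101 * block_size + 128 * OGFS_SECTOR_SIZE;
  }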

Overall File System Layout
--------------------------

The OpenGFS file system consists of four top level components:  the
superblock, resource groups, journals and an unused area of 64 kB at the
beginning.  The distribution of the journals among the journal subpools has
been discussed above, so let us focus on the layout of the data subpools.

  data subpool 1                 data subpool n    journal subpools
  +------------------------+-----+-----------------+----------+-----+----------+
  |+---+--+---------------+|     |+---------------+|+---------+     +---------+|
  ||   |  |    |     |    ||     ||   |     |     |||         |     |         ||
  || 0 |sb|rg 1| ... |rg r|| ... ||rg | ... | rg  |||journal 1| ... |journal m||
  ||   |  |    |     |    ||     ||r+t|     |r+t+u|||         |     |         ||
  |+---+--+---------------+|     |+---------------+|+---------+     +---------+|
   ^   ^  ^          ^            ^         ^       ^               ^
   :   |  :          :            :         :       |               |
   :   |  :          :            :         :       |___________ journals
   :   |  :..........:............:.........:................... resource groups
   :   |________________________________________________________ superblock
   :............................................................ 64kB zeros


The available blocks in each data subpool are divided into resource groups.
Please refer to the "Resource Groups" section for details.

The first data subpool holds some additional information:  The superblock
begins at a 64 kB offset from the beginning of the subpool.  The remaining
blocks are split into resource groups as described above.  The first resource
group also contains the journal index and the resource group index (note that
both indices may be spread over several resource groups by the ogfs_expand
tool).  These two structures are discussed later.

Caveat:  Upon file system creation, the first resource group must be big
enough to hold both the journal index and the resource group index.  This can
become a problem if you have a lot of data subpools with a lot of resource
groups, but the first data subpool is very small.  At the moment, mkfs.ogfs
only prints an error message when it is already halfway through file system
creation.  A possible workaround is to start with only one data subpool,
create the file system with mkfs.ogfs, then add more data subpools to the
pool and use ogfs_expand.

Dinodes
-------

1. Distributed Inodes (Dinodes)

OpenGFS stores file meta data in inode like structures called distributed
inodes or dinodes in OpenGFS terminology.  Dinodes are similar to inodes stored
in other file systems.

Every dinode occupies a whole block.  One advantage is that this reduces the
likelihood of multiple clients trying to lock the same block.  Another is
that the block address of the dinode uniquely identifies the dinode and is
thus used as the number of the dinode.  Dinodes are therefore allocated and
deallocated as they are needed, and any block can contain either data or a
dinode.  On the other hand, at least in some cases a lot of space is wasted
in comparison to other file systems.  When the file system is created with
mkfs.ogfs, only three dinodes are generated:

  journal index dinode        - first data block in first resource group
  resource group index dinode - second data block in first resource group
  root directory dinode       - third data block in first resource group

As the size of a block is chosen when the file system is created, the size of a
dinode varies.  A constant portion of the available space is consumed by meta
data, and the remaining space is used for data or block addresses.  As of 25th
of February, 2002, the meta data header occupies 232 bytes.

  dinode layout:
  +-----------+------------------------+
  | meta data | data area              |
  +-----------+------------------------+
  0           232                      bs  <- byte offset


Since block addresses use 64 bits or 8 bytes, 35 block addresses can populate a
dinode with a block size of 512 bytes.  For a size of 65536 bytes, 8163
addresses can be stored.
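
For illustration, the number of addresses that fit into a dinode can be
computed as follows (a sketch, not actual OpenGFS code):

  /* Number of 8 byte block addresses that fit into the data area of a
   * dinode, given the 232 byte meta data header described above.
   * bs = 512 yields 35, bs = 65536 yields 8163. */
  static unsigned int addresses_per_dinode(unsigned int bs)
  {
          return (bs - 232) / 8;
  }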

Note:  Proper tuning of the block size is essential to overall performance.

2. Freeing Dinodes

When a dinode is freed, the block is stored at the beginning of a list of free
dinodes in the resource group header and the OGFS_DIF_UNUSED flag is set.

Note:  Once converted to a dinode, a block cannot hold data anymore.  For
example - with a bs of 8192 - if at one time a directory with a million files
resided in a resource group, the dinodes occupy 8 GB.  This space becomes
unavailable for file data, even if the directory is deleted.  However, these
free meta data blocks can be reclaimed online with the "ogfs_tool" utility to
return the lost space.

3. Allocating Dinodes

When a dinode is needed, the first unused dinode is taken off the head of the
free dinode list in the target resource group.  In case there are no free
dinodes left, the resource group bitmap is scanned for a free meta data block
to be converted to a dinode.  Should this search bring up no free block, the
first 64 (OGFS_META_CLUMP) free non-meta data blocks (type OGFS_BLKST_FREE) in
the resource group are converted to type OGFS_BLKST_FREEMETA and the new dinode
is allocated from the newly available space.
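
The allocation order described above can be sketched with a small,
self-contained toy model.  The data layout and helper code below are made up
for illustration; only the ordering of the three steps and the OGFS_BLKST_*
and OGFS_META_CLUMP constants follow the text:

  #include <stddef.h>
  #include <stdint.h>

  #define OGFS_BLKST_FREE      0   /* values: see the block bitmap section */
  #define OGFS_BLKST_USED      1
  #define OGFS_BLKST_FREEMETA  2
  #define OGFS_BLKST_USEDMETA  3
  #define OGFS_META_CLUMP      64

  struct toy_rgrp {
          uint8_t state[1024];     /* one byte per block (toy bitmap)       */
          size_t  nblocks;         /* number of blocks, at most 1024        */
          long    free_dinode;     /* head of free dinode list, -1 if empty */
  };

  static long toy_alloc_dinode(struct toy_rgrp *rg)
  {
          size_t i, converted = 0;

          /* 1. Reuse a dinode from the free dinode list in the rg header. */
          if (rg->free_dinode >= 0) {
                  long blk = rg->free_dinode;
                  rg->free_dinode = -1;      /* toy list of length one      */
                  return blk;
          }
          /* 2. Otherwise convert a free meta data block into a dinode.    */
          for (i = 0; i < rg->nblocks; i++)
                  if (rg->state[i] == OGFS_BLKST_FREEMETA) {
                          rg->state[i] = OGFS_BLKST_USEDMETA;
                          return (long)i;
                  }
          /* 3. Last resort: convert the first OGFS_META_CLUMP free data
           *    blocks to free meta data blocks, then retry step 2.        */
          for (i = 0; i < rg->nblocks && converted < OGFS_META_CLUMP; i++)
                  if (rg->state[i] == OGFS_BLKST_FREE) {
                          rg->state[i] = OGFS_BLKST_FREEMETA;
                          converted++;
                  }
          return converted ? toy_alloc_dinode(rg) : -1;  /* -1: group full */
  }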

Caveat:  The decision in which resource group the dinode is to be allocated is
made before the list of free dinodes is examined.  This may be somewhat
inefficient if there is another resource group that can accommodate the data
equally well but also has a free dinode.

4. Dinode Incarnations

When a dinode is created initially, the incarnation number (di_incarn field in
the ogfs_dinode_t) is set to zero.  Freeing the dinode keeps the incarnation
number intact.  When a new file or directory is created in a previously used
dinode, the incarnation number is increased by one.

The incarnation number is used to initialise the i_incarnation field in a Linux
inode structure.

Files
-----

Where and how a file is stored depends on whether the file is larger than (bs -
232) bytes or not.  If it is small enough, the data is stored directly in the
dinode ("stuffed").  This significantly reduces the need for block reads for
small files and especially directories.

  stuffed dinode layout:
  +-----------+------------------------+
  | meta data | direct data            |
  +-----------+------------------------+
  0           232                      bs  <- byte offset


Caveat:  When a file grows beyond what fits in the dinode, the data has to be
copied ("unstuffed") into a separate data block.

Caveat:  Once a file has outgrown the dinode and has been unstuffed, it is not
stuffed again if it shrinks enough to fit into the dinode.  To stuff the file
again, it must be deleted and created from scratch.

For larger files, the data area is filled with addresses of direct or indirect
data blocks.  The depth of indirection is practically unlimited, allowing any
number of blocks per file, up to the theoretical maximum of 16 exabytes (2^64
bytes).  Block addresses occupy 8 bytes.  Therefore, (bs - 232) / 8 addresses
can be stored in the dinode.

  unstuffed dinode layout with direct data blocks:
  +-----------+------------------------+
  | meta data | block addresses        |
  |           |_|_|_|_|_|_|_|_| ...    |
  +-----------+|-|-|-|-|---------------+
               | | | | |
    ___________| | | | |___________   ...
   |        _____| | |_____        |
   |       |       |       |       |
   v       v       v       v       v
   +-----+ +-----+ +-----+ +-----+ +-----+  data blocks
   |  1  | |  2  | |  3  | |  4  | |  5  |  ...
   +-----+ +-----+ +-----+ +-----+ +-----+


Up to (bs / 8) =: b block addresses can be stored in a file system block.  In
case indirect data blocks are needed:

  unstuffed dinode layout with indirect data blocks:
  +-----------+------------------------+
  | meta data | block addresses        |
  |           |_|_|_|_|_|_|_|_| ...    |
  +-----------+|-|---------------------+
               | |
   ____________| |________   ...
  |                       |
  v                       v
  +--------------------+ +--------------------+
  | block addresses    | | block addresses    |  indirect blocks
  |_|_|_|     ...    |_| |_|_|_|     ...    |_|  ...
  +|------------------|+ +|------------------|+
   |                  |   |                  |
   |              ____|   |              ____|
   |             |        |             |
   v             v        v             v
   +-----+       +-----+  +-----+       +-----+  data blocks
   |  1  |  ...  |  b  |  | b+1 |  ...  | 2b  |  ...
   +-----+       +-----+  +-----+       +-----+


Additional levels of indirection are added as needed, but all file data is
always stored at the same level of indirection.  The following table shows the
maximum file size given a block size bs and the desired level of indirection:

     bs | stuffed      direct    indirect     2x ind.     3x ind.     4x ind.
  ------+---------------------------------------------------------------------
    512 |     280       17920     1120 KB       70 MB     4480 MB      280 GB
   1024 |     792      101376     6336 KB      792 MB       99 GB    12672 GB
   2048 |    1816      464896    29056 KB     7264 MB     1816 GB      454 TB
   4096 |    3864     1978368   123648 KB    61824 MB    30912 GB    15456 TB
   8192 |    7960     8151040   509440 KB   509440 MB   509440 GB   509440 TB
  16384 |   16152    33079296     2019 MB     4038 GB     8076 TB    16152 PB
  32768 |   32536   133267456     8134 MB    32536 GB   130144 TB      *16 EB
  65536 |   65304   534970368    32652 MB   261216 GB  2089728 TB      *16 EB

     bs |  5x ind.     6x ind.     7x ind.     8x ind.     9x ind.
  ------+----------------------------------------------------------
    512 | 17920 GB     1120 TB       70 PB     4480 PB      *16 EB
   1024 |  1584 TB      198 PB      *16 EB         N/A         N/A
   2048 |116224 TB      *16 EB         N/A         N/A         N/A
   4096 |  7728 PB      *16 EB         N/A         N/A         N/A
   8192 |   *16 EB         N/A         N/A         N/A         N/A
  16384 |   *16 EB         N/A         N/A         N/A         N/A
  32768 |      N/A         N/A         N/A         N/A         N/A
  65536 |      N/A         N/A         N/A         N/A         N/A

  Maximum file size for a given block size bs and level of indirection

  * Actual value is higher, but the file size is limited to a 64 bit value or
    16 exabytes.


Note:  The journal index and the resource group index are stored as normal
files.  They each have a dinode with associated data.  Instead of being
referenced by a directory entry somewhere below the root directory, the
addresses of the associated dinodes are recorded in the superblock.

Resource Group Index
--------------------

The resource group index is a special file; its dinode is not referenced by
any directory entry, but its address is stored directly in the superblock.
There is one index entry for each
resource group in the file system.  The index entries describe the position and
size of the resource groups, the position and number of the data blocks in the
resource groups and the size of the block bitmaps.  For more details refer to
appendix A.  Depending on the number of resource groups and the available space
in a dinode, the index is stored directly in the dinode or in separate blocks.

Small resource group index in the dinode:

  resource group index in a stuffed dinode:
  +------------------+---------------------------------+
  | dinode meta data |__|__|__|__|    ...    |__|__|__||
  |                  | resource group index entries    |
  +------------------+---------------------------------+


Larger resource group index in explicit data blocks:

  resource group index with an unstuffed dinode:
  +------------------+---------------------------------+
  | dinode meta data |__|__|__|__|    ...              |
  |                  | |  |  |  |   block addresses    |
  +------------------+-|--|--|--|----------------------+
         ______________|  |  |  |______________
        |             ____|  |____             | ...   possibly via indirect
        |            |            |            |       blocks
        |            |            |            |
        |            |            |            |
        v            v            v            v resource index blocks
  +--+---------+ +--+---------+ +--+---------+ +--+---------+
  |md|i| ... |i| |md|i| ... |i| |md|i| ... |i| |md|i| ... |i|  ...
  +--+---------+ +--+---------+ +--+---------+ +--+---------+
  ^  ^       ^   ^  ^       ^   ^  ^       ^   ^  ^       ^
  |  :       :   |  :       :   |  :       :   |  :       :
  |  :.......:...|..:.......:...|..:.......:...|..:.......:.. index entries
  |______________|______________|______________|_____________ meta data


Note:  Index entries may be split between two adjacent blocks.

Resource Groups
---------------

1. General Information

Data subpools are divided into resource groups.  The file system core tries to
keep the blocks of a file in the same resource group to reduce the necessary
disk seeks and the number of locks required to protect the data from other
clients, and thus to reduce the overall performance degradation due to
locking.  A resource group is made up of three components:  the resource group
header, the block bitmap and the data blocks.

The number of resource groups in each data subpool depends on the number f of
available blocks in the subpool.  If f is equal to or greater than 573440, the
number of resource groups in that subpool is (f / 57344), rounded down.  If f
is smaller than that, the number of resource groups is 10.  The remaining (f %
57344) blocks are given to the first resource group of the subpool.  All
resource groups in all the data subpools are numbered consecutively, beginning
with 1 for the resource group following the superblock.
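
The rule can be written down as a small helper (illustrative only, not the
actual mkfs.ogfs code):

  #include <stdint.h>

  /* Number of resource groups mkfs.ogfs creates in a data subpool with f
   * available blocks; the remaining (f % 57344) blocks are given to the
   * first resource group of the subpool. */
  static unsigned int resource_groups_in_subpool(uint64_t f)
  {
          return f >= 573440 ? (unsigned int)(f / 57344) : 10;
  }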

The table below shows the typical resource group size for each possible block
size.

        | typical resource
     bs | group size
  ------+------------------
    512 |    28 MB
   1024 |    56 MB
   2048 |   112 MB
   4096 |   224 MB
   8192 |   448 MB
  16384 |   896 MB
  32768 |  1792 MB
  65536 |  3584 MB

  Typical resource group size for the possible block sizes.

Caveat:  The typical resource group size seems to be too large to efficiently
reduce the chance of multiple clients trying to lock the same resource group in
parallel.  More investigation and possibly tuning options are necessary.

  resource group layout:
  +------+------------+-------------------------------------------------------+
  | meta |    block   |____|____|____|____|____|____|      ...      |____|____|
  | data |   bitmap   | data blocks                                           |
  +------+------------+-------------------------------------------------------+


2. Resource Group Header

The resource group header contains information about the number of free and
used blocks for data, dinodes and meta data, and holds the head of the list of
unused dinodes (the list itself is chained through the unused dinodes).

3. Block Bitmap

Each block is represented by a pair of bits in the bitmap.  The lower bit
indicates whether the block is free (0) or used (1).  The upper bit indicates
whether the block (potentially) contains meta data (1) or normal data (0).
Normal data includes blocks with direct data or indirect data.  All other
blocks, including all dinodes, have the meta data flag set.

The bitmap begins in the first block of the resource group, right after the
resource group header.  In case it does not fit into that block, further blocks
are allocated which carry a header structure and more of the bitmap data.  The
header structure is a plain ogfs_meta_header_t structure without any bitmap
specific information.

  Bitmap layout:
  +-----------------------+----------------------------------------+---
  | resource |00 10 10 01 |  bitmap  |10 10<11 01 11 01 10 10 11 01|
  |  group   |11          |   block  |00 00|01 10 00 00 01         |   ...
  |  header  |10   bitmap |  header  |00 01|11 10 01 00 01  bitmap |    more
  |          |01          |          |01>01|10 01 01 01 01         |    blocks
  +-----------------------+-------------:--|-----------------------+---
             ^^           ^           ^ :  |
             |:           |           : :  |__ lowest bit pair in second byte
             |:           |           : :..... highest bit pair in second byte
             |:           |___________________ dedicated bitmap block
             |:.......................:....... bytes in the bitmap
             |________________________________ part of bitmap in first block


The mapping between the bytes in the bitmap and the blocks is linear:  the
first byte corresponds to the first four blocks, the second byte to the second
four blocks, and so on.  The lower bits within a byte stand for lower block
numbers:

  Bit pair layout:
     ________________________ high bit in byte
    |                    ____ low bit in byte
    |                   |
    v   bitmap byte     v
  +-----------------------+
  | M U | M U | M U | M U |
  +-----------------------+
    ^ ^   ^ ^   ^ ^   ^ ^
    : |   : |   : |   : |
    : |___:_|___:_|___:_|____ lower bits: block unused (0) or used (1)
    :.....:.....:.....:...... higher bits: used for data (0) or meta data (1)


  Example bitmap byte:
     ________________________ high bit
    |                    ____ low bit
    |                   |
    v   bitmap byte n   v
  +-----------------------+
  | 1 0 | 0 0 | 0 1 | 1 1 |
  +-----------------------+
     ^     ^     ^     ^
     :     |     :     |_____ block 4n is used by meta data
     :     |     :........... block 4n+1 is used by data
     :     |_________________ block 4n+2 is an unused data block
     :....................... block 4n+3 is an unused meta data block


Possible bit combinations:

  bits | C code macro         meaning
  -----+-------------------------------------------------
   00  | OGFS_BLKST_FREE       free block
   01  | OGFS_BLKST_USED       block occupied by data
   10  | OGFS_BLKST_FREEMETA   free meta data block
   11  | OGFS_BLKST_USEDMETA   block occupied by meta data
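
A small helper illustrates how the two bit state of a block could be read from
the bitmap, assuming the bitmap bytes have been copied into one contiguous
buffer (on disk the bitmap is interrupted by the resource group header and the
bitmap block headers):

  #include <stdint.h>

  /* Two bit state of block number blk (relative to the resource group).
   * Each byte covers four blocks, lower bit pairs stand for lower block
   * numbers; the low bit of a pair is the "used" flag and the high bit the
   * "meta data" flag, so the result is one of the OGFS_BLKST_* values
   * (0 = free, 1 = used, 2 = free meta, 3 = used meta). */
  static unsigned int block_state(const uint8_t *bitmap, uint64_t blk)
  {
          uint8_t  byte  = bitmap[blk / 4];
          unsigned shift = (unsigned)(blk % 4) * 2;

          return (byte >> shift) & 0x3;
  }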


4. Data Blocks

The blocks not used for the resource group header and the bitmap are
used for data.  If the number of data blocks is not a multiple of
four, the excess blocks are not used.

Journal Index
-------------

The journal index is very similar to the resource group index, but stores
information about the journals instead of the resource groups.  The layout is
identical except that a journal index entry has a different format.

Journals
--------

1. Journal Segmentation

The journal of each node is divided into sequences of blocks called segments.
The number of blocks in each segment is specified when the file system is
created and must be at least two.  The default value is 16.  All journals have
the same segment size.  In case the first block number of the journal is not a
multiple of the segment size, the surplus blocks up to the first segment
boundary from the start of the journal are cut off and not used.  Likewise,
some blocks may be cut from the end of the journal.

  journal:
  +-----+-----------------+---------------------------+-----------------+-----+
  |__|__|__|__|__|__|__|__|            ...            |__|__|__|__|__|__|__|__|
  |     | segment blocks  |                           | segment blocks  | pad |
  +-----+-----------------+---------------------------+-----------------+-----+
  ^     ^                                           ^                 ^
  :     |                                           |    pad. blks ...:
  :     |___________________________________________|___ journal segments
  :..................................................... padding blocks
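
The truncation to whole segments can be sketched as follows (the function and
parameter names are made up; first is the journal's first block number, end is
one past its last block number):

  #include <stdint.h>

  /* Round the start of the journal up and its end down to the next segment
   * boundary; only the blocks in between are used for log segments. */
  static void usable_journal_range(uint64_t first, uint64_t end,
                                   uint64_t segsize,
                                   uint64_t *seg_start, uint64_t *seg_end)
  {
          *seg_start = (first + segsize - 1) / segsize * segsize;
          *seg_end   = end / segsize * segsize;
  }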


The segments are used to store meta data information about unfinished write
transactions.  The data itself is not logged.

2. Segments

A segment is a sequential range of blocks.  The first block acts as a header
block while the following blocks store the journal data.  The header block
carries an ogfs_log_header_t structure at the beginning of the block and an
exact copy at the end.  When the file system is mounted, the two copies of the
header structure are compared.  In case they do not match, the segment is
invalid, for example if the machine crashed before the transaction was
completely written to disk.

  segment:
  +--------------------+--------------------+-----------+--------------------+
  | hdr | unused | hdr |    journal data    |    ...    |    journal data    |
  |     | space  |     |                    |           |                    |
  +--------------------+--------------------+-----------+--------------------+
  ^                    ^                                ^
  :                    |________________________________|_ data blocks
  :....................................................... segment header block
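
The mount-time check can be sketched like this (a simplified illustration; the
exact placement of the second copy at the end of the block is an assumption
here):

  #include <string.h>

  /* A segment header block is valid if the ogfs_log_header_t at the start
   * of the block and the copy at its end are identical. */
  static int segment_header_valid(const unsigned char *block,
                                  unsigned int block_size,
                                  unsigned int header_size)
  {
          return memcmp(block, block + block_size - header_size,
                        header_size) == 0;
  }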


Apart from the usual meta data that describes the block type, the header
contains information about the sequence number of the transaction the segment
belongs to, a flag whether the journal was cleanly unmounted (!!!??? in which
segment is this flag valid?), !!!???.

3. Transactions and Log Entries

For every write operation in the file system, a transaction structure is built
in memory and a log entry describing the transaction is written to the
journal.

The journal is written as a ring.  New transactions are appended to the current
head segment of the log, pushing it forwards and wrapping around at the last
log segment.  Transactions that have not been finished follow the head up to
the tail segment and are completed from the tail towards the head (first in
first out).

For each transaction, the following steps are performed:

  a) The maximum possible disk space for a transaction is reserved on the disk.
  b) Locks and buffers for the transaction are collected in memory.
  c) The log information about the transaction is written to the log.
  d) An end marker is written to the log.
  e) The write operations associated with the transaction are performed.
  f) The log entry is closed.

If the machine drops out of the cluster before (d) has been completed, no data
has been written to disk and no recovery is necessary for this transaction.  It
may be necessary to identify the incomplete log entry, though.  After that
point, the log entry is replayed when the node fails.  After step (f) the log
is again in a clean state and the log entry can be overwritten if need be.

Bug: If the transaction sequence number has an integer overflow, the current
code will not properly repair the file system.

The log entry for a transaction may consist of one or more segments.  The
header of each segment carries the sequence number of the log entry which is
the sequence number of the last log entry plus one.  The head of the log is
identified when the journal is mounted by looking for the log entry with the
highest sequence number that was completely written to the log.  Once all data
has been written to a log entry, a new sequence number is written to the
header of the first segment after the just written log entry.  This indicates
that the log entry has been written successfully.

  Log entry:
  +---------+---------+-------+---------+---------+
  |  first  | second  |  ...  |  last   |  empty  |
  | segment | segment |       | segment | segment |
  +---------+---------+-------+---------+---------+
  ^         ^                 ^         ^
  |         |                 |         :..........  segment with end mark
  |_________|_________________|____________________  segments of the log entry

  Physical structure of a log entry
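
A minimal sketch of the head detection described above; it only compares
sequence numbers and omits the check that the winning log entry was completely
written (all names are made up):

  #include <stdint.h>

  struct toy_segment { uint64_t sequence; };

  /* Return the index of the segment with the highest sequence number. */
  static unsigned int find_log_head(const struct toy_segment *seg,
                                    unsigned int nsegs)
  {
          unsigned int i, head = 0;

          for (i = 1; i < nsegs; i++)
                  if (seg[i].sequence > seg[head].sequence)
                          head = i;
          return head;
  }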


4. Transaction Chunks

As pointed out above, the first block of each segment is reserved for
the segment header and carries only journal meta data.  The real transaction
data is stored in the blocks following the segment headers.  Disregarding the
headers, a log entry can be thought of as a sequence of blocks with the
information needed to bring the file system to a sane state if the node fails
before the transaction was completely written to disk.  This sequence of blocks
is divided into several chunks that each contain information of a different
kind.

  Log entry:
  +---------------------+---------------------+-------+---------------------+
  | block | ... | block | block | ... | block |  ...  | block | ... | block |
  |   1   |     |   k   |   1   |     |   l   |       |   1   |     |   m   |
  +---------------------+---------------------+-------+---------------------+
  ^                     ^                             ^
  |                     :                             |_________  last chunk
  |                     :.......................................  second chunk
  |_____________________________________________________________  first chunk

  Logical structure of a log entry


The first block of each chunk has a header of type ogfs_log_descriptor_t that
denotes the chunk type, the length of the chunk in blocks and holds some chunk
specific information (see Appendix A for structure details).

!!!log entry chunk types

!!!recovery process
When the file system is mounted or recovered, it is essential to identify the
head and tail segments of the log.

Directories
-----------

Directories in OpenGFS come in three flavours, depending on their size.  Each
directory has a dinode that stores the usual data like permission and ownership
information and other meta data.  If a directory is small enough, the data area
of the directory dinode is used to store the directory entries directly.  If the
directory is too big to fit or grows too much, the file names of the directory
entries are hashed with a 32 bit CRC algorithm.  The hash table is stored as
if the directory dinode were a plain file dinode and the hash table its data.
That means that the hash table is stored directly in the data area of the
dinode if possible, and direct or indirect data blocks are
allocated if it does not fit.  For details about the hash table, please refer
to the Hash Table Layout section below.

Note:  Directory entries are always contained in a single block.

1. Stuffed Directories

A directory is stuffed into its dinode if the total length of all its directory
entries (ogfs_dirent_t) is no larger than (bs - 232):

  stuffed directory dinode:
  +------+-------------------------+
  | meta | . | .. | foobar |  ...  |
  | data |    directory entries    |
  +------+-------------------------+


2. Directories with a Stuffed Hashtable

If the directory needs a hash table, but the hash table is no larger than
(bs - 232) bytes, it is stored in the dinode:

  directory dinode with stuffed hash table:
  +------+-------------------------+
  | meta |      hash      | unused |
  | data |      table     |  area  |
  +------+-------------------------+


3. Directories with an Unstuffed Hashtable

Once the hash table grows beyond the free space in the directory dinode, it is
stored in direct or indirect data blocks pointed to from the dinode:

  directory dinode and unstuffed hash table:
  +-----------+------------------------+
  | meta data | block addresses        |
  |           |_|_|_|_|_|_|_|_| ...    |
  +-----------+|-|-|-|-|---------------+
               | | | | |
    ___________| | | | |___________   ... (indirection levels in between)
   |        _____| | |_____        |
   |       |       |       |       |
   v       v       v       v       v
   +-----+ +-----+ +-----+ +-----+ +-----+  hash table blocks
   |  1  | |  2  | |  3  | |  4  | |  5  |  ...
   +-----+ +-----+ +-----+ +-----+ +-----+


Directory Hash Tables
---------------------

1. Hash Table Layout

OpenGFS calculates hash values of the file names with a 32 bit CRC algorithm
and uses an arbitrary number of the 32 hash bits in the hash table (higher bits
are used first).  Each hash value is associated with a block (a "leaf") via the
hash table.

When a hash table is created, its initial size is half of the block size.

  hash table entries:
  +--------------+--------------+--------------+
  | leaf address | leaf address | leaf address |  ...
  +--------------+--------------+--------------+
  0              8              16             24 <- byte offset


To reduce disk space consumption, multiple hash values can be linked to the
same leaf.  The directory entries are stored sequentially in the leaves:

  hash table entries:
  +-------+-------+-------+-------+
  |   00  |   01  |   10  |   11  |
  | addr. | addr. | addr. | addr. |
  +-------+-------+-------+-------+
      |       |       |       |
      |_ _____|       |_ _____|
        |               |
        v               v
  +-----------+   +-----------+  leaves
  | 000 "aaa" |   | 110 "xxx" |
  | 100 "bbb" |   | 110 "yyy" |
  | 001 "ccc" |   |           |
  +-----------+   +-----------+

  Example with a maximum hash size of 3 bits
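
For illustration, the lookup of a leaf could look like this.  The function
ogfs_crc32() is a stand-in for the (otherwise unspecified) 32 bit CRC,
dir_depth is the number of hash bits currently used (di_depth, between 1 and
32), and the table entries are 8 byte leaf block addresses:

  #include <stdint.h>

  uint32_t ogfs_crc32(const char *name, unsigned int len);   /* assumed */

  /* Map a file name to the block address of its leaf.  Because the higher
   * hash bits are used first, the table index is simply the top dir_depth
   * bits of the hash value. */
  static uint64_t lookup_leaf(const uint64_t *hash_table,
                              unsigned int dir_depth,
                              const char *name, unsigned int len)
  {
          uint32_t hash = ogfs_crc32(name, len);

          return hash_table[hash >> (32 - dir_depth)];
  }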


2. Adding Directory Entries to a Hash Table

After adding a new directory entry named "zzz" with the hash value 110, the
hash table looks like this:

  hash table entries:
  +-------+-------+-------+-------+
  |   00  |   01  |   10  |   11  |
  | addr. | addr. | addr. | addr. |
  +-------+-------+-------+-------+
      |       |       |       |
      |_ _____|       |_ _____|
        |               |
        v               v
  +-----------+   +-----------+  leaves
  | 000 "aaa" |   | 110 "xxx" |
  | 100 "bbb" |   | 110 "yyy" |
  | 001 "ccc" |   | 110 "zzz" |
  +-----------+   +-----------+


Algorithm (adding a new directory entry)

Step 1

The target leaf for the new entry is located by looking up the file name's hash
value in the hash table.

Step 2

Try to insert the directory entry into the target leaf.  If it fits, we are
done.  But if the target leaf overflows, proceed with step 3.

Step 3

Check whether the target leaf is pointed to by multiple entries in the hash
table and continue with step 4 if not.  For such a leaf, the depth of the leaf
(lf_depth member of ogfs_leaf_t) is smaller than the depth of the directory
dinode (di_depth member of ogfs_dinode_t).  A new leaf is allocated, the depth
of both leaves is increased by one, and the corresponding directory entries are
migrated from the target leaf to the new one.  Set the target leaf to the new
leaf and go back to step 2.

Note:  The moved entries leave gaps in the old target leaf as if they had been
deleted, adding to leaf fragmentation.  Since both leaves have to be written
anyway, this would be a good time to strip the entries in the old leaf of their
free space and add it all to the last entry.

Example:  above hash table after adding file "ddd" with hash value 100:

  hash table entries:
  +-------+-------+-------+-------+
  |   00  |   01  |   10  |   11  |
  | addr. | addr. | addr. | addr. |
  +-------+-------+-------+-------+
      |       |       |       |
      |_      |       |_______|_________
        |     |_________                |
        |               |               |
        v               v               v
  +-----------+   +-----------+   +-----------+  leaves
  | 000 "aaa" |   | 001 "ccc" |   | 110 "xxx" |
  | 100 "bbb" |   |           |   | 110 "yyy" |
  | 100 "ddd" |   |           |   | 110 "zzz" |
  +-----------+   +-----------+   +-----------+


Step 4

If the hash table has reached the allowed maximum size, proceed with step 5.  If
not, double the size of the hash table and go back to step 2.  Example: adding
file "eee" with hash value 000:

  hash table entries:
  +-------+-------+-------+-------+-------+-------+-------+-------+
  |  000  |  001  |  010  |  011  |  100  |  101  |  110  |  111  |
  | addr. | addr. | addr. | addr. | addr. | addr. | addr. | addr. |
  +-------+-------+-------+-------+-------+-------+-------+-------+
      |       :       |       |       ;       :       |       |
      |       :       |_______|_______________:_______|_______|
      |       :                       ;       :         |
      |_      :................................         |
        |                _._._._._._._; :               |
        |               ;               :               |
        v               v               v               v
  +-----------+   +-----------+   +-----------+   +-----------+  leaves
  | 000 "aaa" |   | 100 "bbb" |   | 001 "ccc" |   | 110 "xxx" |
  | 000 "eee" |   | 100 "ddd" |   |           |   | 110 "yyy" |
  |           |   |           |   |           |   | 110 "zzz" |
  +-----------+   +-----------+   +-----------+   +-----------+


Step 5

If the target leaf is the last leaf in its chain, proceed with step 6.  If
not, the target leaf points to another leaf in the chain; set the target leaf
to the next leaf and go back to step 2.

Step 6

Allocate a new leaf and chain it to the target leaf of the current chain of
leaves.  Set the target leaf to the new leaf and go back to step 2.

The maximum hash table size has been reached.  The leaves are chained to the
overflowing leaf.  Example:  adding file "vvv" with hash value 110:

  hash table entries:
  +-------+-------+-------+-------+-------+-------+-------+-------+
  |  000  |  001  |  010  |  011  |  100  |  101  |  110  |  111  |
  | addr. | addr. | addr. | addr. | addr. | addr. | addr. | addr. |
  +-------+-------+-------+-------+-------+-------+-------+-------+
      |       |       :       :       !       |       |       :
      |       |       :       :       !       |       |       :
      |       |       :...............!.......|.......|..................
      |       |_______________________!_______|       |                 :
      |_                              ! |             |_                :
        |                _.__.__.__.__! |               |               :
        |               !               |               |               :
        v               v               v               v               v
  +-----------+   +-----------+   +-----------+   +-----------+   +-----------+
  | 000 "aaa" |   | 100 "bbb" |   | 001 "ccc" |   | 110 "xxx" |   |           |
  | 000 "eee" |   | 100 "ddd" |   |           |   | 110 "yyy" |   |           |
  |           |   |           |   |           |   | 110 "zzz" |   |           |
  +-----------+   +-----------+   +-----------+   +-----------+   +-----------+
                                                        |
                                                        v
                                                  +-----------+
                                                  | 110 "vvv" |
                                                  |           |
                                                  |           |
                                                  +-----------+


3. Splitting and Merging Leaves

Cost of splitting a leaf:

 * Read one block
 * Write two blocks
 * Read all entries in old leaf
 * Possibly copy entries to new leaf
 * Overall cost: O(bs) disk I/O, O(bs) CPU, O(hs) memory

 hs:  hash table size
 bs: block size

Caveat:  Once two leaves have been split, they are never merged again.

4. Growing and Shrinking a Hash Table

As the hash table is stored as regular file data, doubling the hash table is
quite simple.  The total length of the file is doubled, and its file system
blocks are written from last to first.

Cost of doubling a hash table of hs bytes:

 * (hs / 2bs) block reads (rounded up)
 * 2(hs /bs) block writes
 * hs memory byte reads
 * 2hs memory byte writes
 * Overall cost: O(hs) disk I/O, O(hs) CPU, O(bs) memory

 hs: hash table size
 bs: block size

Note:  The number of disk writes could be cut in half and the memory writes
could be eliminated completely if the low bits of the CRC were used in the hash
table instead of the high bits.

Caveat:  Hash tables can only grow, never shrink.  Once a hash table has
reached a certain size, the only way to free the occupied space is to delete
the directory.

Directory Leaves
----------------

Unless the directory is stuffed into the dinode, the directory entries are
stored in the leaves.  The meta data of a leaf contains, among other things,
the number of entries in this leaf.

  directory leaf block:
  +------+-----------------------------------------------------+
  | meta |__|_____|_|______|________|  ...  |_____| free space |
  | data | directory entries                                   |
  +------+-----------------------------------------------------+


Directory Entries
-----------------

1. Directory Entry Layout

The contents of a directory entry are shown in the figure below.

  directory entry layout:
  +------------------------------------------------+-----------+
  | dinum | hash  | dirent | name   | file | spare | "foobar"  |
  |       | value | length | length | type | space | file name |
  +------------------------------------------------+-----------+
  0       16      20       22       24     26      40


The size of a directory entry (ogfs_dirent_t) is the size of the file name plus
40 bytes for the header, rounded up to a multiple of 8 (OGFS_DIR_REC_LEN
macro).  The file name is not terminated with a zero byte.
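
In code, the record length of an entry could be computed like this
(illustrative only; the real OGFS_DIR_REC_LEN macro may differ in form):

  /* On-disk size of a directory entry: 40 byte header plus the unterminated
   * file name, rounded up to a multiple of 8 bytes. */
  static unsigned int dirent_record_length(unsigned int name_length)
  {
          return (40 + name_length + 7) & ~7u;
  }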

2. Directory Entry Allocation and Deallocation

When a directory entry is deallocated, the space it occupied (de_rec_len field
in the ogfs_dirent_t) is added to the previous directory entry in the same
leaf.  The first directory entry of the leaf has no predecessor.  Instead, the
dinode number is set to zero.

Note:  Since the first entry in a directory is the "." entry, it is unlikely
that this ever happens.

  before removing the second entry:
  +-------------------------+-------------------------+-----------------------+
  | header | "file 1" | pad | header | "file 2" | pad | free space            |
  +-------------------------+-------------------------+-----------------------+
  leaf

  after removing the entry:
  +---------------------------------------------------------------------------+
  | header | "file 1" | pad | free space                                      |
  +---------------------------------------------------------------------------+
  leaf

  Deleting a directory entry


When an entry is added to a specific chain of leaves, the chain is searched
from the start for an allocated entry, or one with the dinode number set to
zero, that has enough free space to store the new directory entry.  If such an
entry is found, the new entry is stored at the beginning of the unused space
held by the old entry.  The remaining unused space is assigned to the new
entry:

  old directory entry:
  +---------------------------------------------------------------------------+
  | header | "oldfile" | pad | free space                                     |
  +---------------------------------------------------------------------------+
  leaf

  after adding the new entry:
  +-----------------------------------------------------+---------------------+
  | header | "oldfile" | pad | header | "newfile" | pad | free space          |
  +-----------------------------------------------------+---------------------+
  leaf

  Adding a directory entry in an existing entry
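
The reuse of an entry's free space can be sketched as follows.  The structure
and the helper are simplified stand-ins, not the on-disk ogfs_dirent_t;
de_rec_len spans the entry plus its trailing free space:

  #include <stddef.h>
  #include <stdint.h>

  struct toy_dirent {
          uint64_t dinum;          /* 0 marks an unused first entry          */
          uint16_t de_rec_len;     /* distance to the next entry, in bytes   */
          uint16_t name_len;
          /* hash value, file type and the name itself are omitted here      */
  };

  /* Try to place a new entry of record length "needed" in the space held by
   * "old".  The old entry keeps only the space it needs itself; the rest
   * becomes the record length of the new entry. */
  static struct toy_dirent *split_dirent(struct toy_dirent *old,
                                         unsigned int needed)
  {
          unsigned int total = old->de_rec_len;
          unsigned int keep  = old->dinum
                  ? (40u + old->name_len + 7u) & ~7u   /* minimal entry size */
                  : 0u;                                /* whole record free  */
          struct toy_dirent *new_entry;

          if (total < keep + needed)
                  return NULL;                         /* does not fit here  */

          old->de_rec_len = (uint16_t)keep;
          new_entry = (struct toy_dirent *)((unsigned char *)old + keep);
          new_entry->de_rec_len = (uint16_t)(total - keep);
          return new_entry;
  }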


Caveat:  This algorithm can cause fragmentation of the space in the leaf chains
and almost double the space required to store the directory
entries.  Furthermore, fragmentation increases the average number of block reads
required to locate a directory entry in a chain of leaves.

Caveat:  The space occupied by a chain of leaves can only grow, but not shrink.
To free the occupied space, the directory has to be deleted.

Caveat:  If a new directory entry is too big to be stored in the free space of
any other entry, but the last directory entry in the chain of leaves has some
free space, the new entry does not use the remaining space but is appended in
new space.

  old directory entry:
      +---------------------------------------+
  ... | header | "oldfile" | pad | free space |
      +---------------------------------------+
                                              ^
                          end of leaf chain __|

      +---------------------------------------+-------------------------+
  ... | header | "oldfile" | pad | free space | header | "newfile"| pad |
      +---------------------------------------+-------------------------+
                                       ^                                ^
               possibly wasted space __|            end of leaf chain __|

  Adding an entry at the end of a chain of leaves.


If no leaf is found that can hold the new directory entry, a new leaf is
allocated and all available space in the leaf is given to the new directory
entry:

  adding a directory entry to a new leaf:
  +---------------------------------------------------------------------------+
  | header | "newfile" | pad | free space                                     |
  +---------------------------------------------------------------------------+
  ^                                                                           ^
  |__ start of new leaf                                         end of leaf __|


Special Layout of the First Resource Group
------------------------------------------

When the file system is created with mkfs.ogfs, the first resource group has a
special layout.  In addition to normal data, it contains special inodes for the
journal index, the resource group index and the root directory.  The journal
and resource group indices are initially created in the first resource group
but may be extended to other resource groups later with ogfs_expand.

The first data blocks are reserved for the three special inodes and the two
indices:

  first resource group:
  +----+------+------+------+------+----------------------------------------+
  |meta|fs blk|jindex|rindex| root |__|__|__|__|__|__|__|__|__|  ...  |__|__|
  |data|bitmap|dinode|dinode|dinode| jindex | rindex | other data blocks    |
  +----+------+------+------+------+----------------------------------------+
  ^    ^      ^      ^      ^      ^        ^        ^
  :    :      |      |      |      :        :        |___ normal data
  :    :      |      |      |      :........:............ index data
  :    :      |______|______|____________________________ special dinodes
  :....:................................................. normal meta data


Meta data generations
---------------------

All the OpenGFS meta data headers carry a so-called generation number.  The
generation number starts with zero when a block is filled with meta data
(including dinodes) for the first time.  Every time a (logged) transaction is
flushed to disk, the generation in each of the involved meta data blocks is
increased by one.  Since the generation is a 64 bit value, an overflow is
practically impossible.

Generations are used only during the recovery process.  The generation of the
transactions in the log of a failed node is compared to the actual generation
of the data on disk to determine which transactions have to be replayed.

APPENDIX

A. Structure Details
--------------------

This section contains a complete list of the structures stored in the file
system.  Unless otherwise stated, all types are unsigned integers of the given
length.

Security risk:  At least in some places, padding bytes and the reserved space
in some of the structures are filled with random data from the stack.  This
might contain sensitive information.

1. Generic Headers (ogfs_meta_header_t and ogfs_inum_t)

These structures are used in many of the other structures.

Generic block header structure:

       ogfs_meta_header_t:
   0   +-----------------+
       | 0x01161970      | Magic number
   4   +-----------------+
       | meta type       | Type of header (OGFS_METATYPE_XX)
   8   +-----------------+
       | generation      | Generation number
  16   +-----------------+
       | format          | Structure layout of header (OGFS_FORMAT_XX)
  20   +-----------------+
       | pad             | Padding bytes
  24   +-----------------+
  ^
  |___ byte offset


Dinode address type (ogfs_inum_t):

      ogfs_inum_t:
  0   +-------------------+
      | no_formal_ino     | Formal dinode number (0 or value of next field)
  8   +-------------------+
      | block address     | Address of the dinode's block
  16  +-------------------+


2. Superblock (ogfs_sb_t)

Note:  The block holding the superblock is only partially filled by the
superblock data.

      ogfs_sb_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_SB|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_SB  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++
      | file system fmt.  | Superblock format* (OGFS_FORMAT_FS)
  28  +-------------------+
      | multihost format  | Multihost format* (OGFS_FORMAT_MULTI)
  32  +-------------------+
      | flags             | No flags yet
  36  +-------------------+
      | block size        | Size of blocks in bytes (512 - 65536)
  40  +-------------------+
      | block size shift  | log2(block size); (9 - 16)
  44  +-------------------+
      | segment size      | Journal segment size in blocks
  48  +-------------------+
      | jindex dinode     | Journal index dinode (ogfs_inum_t)
  64  +-------------------+
      | rindex dinode     | Resource group index dinode (ogfs_inum_t)
  80  +-------------------+
      | root dir dinode   | Root directory dinode (ogfs_inum_t)
  96  +-------------------+
      | locking protocol  | Name of the locking protocol to use (char[])
  160 +-------------------+
      | lock table name   | Name of lock table for this FS (char[])
  224 +-------------------+
      | reserved          | Reserved for future use
  352 +-------------------+

  * The format fields are hard coded to a constant version number.  When the
    superblock is read, the version numbers on disk are compared with these
    constants.  This mechanism could be used to automatically upgrade the
    superblock format.  Currently, the new version numbers are written.

3. Resource Group Index Block Header (ogfs_jdata_t)

In case the resource group index is stored in multiple blocks, each of these
blocks has this header:

      ogfs_jdata_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_JD|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_JD  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++


4. Resource Group Index Entry (ogfs_rindex_t)

      ogfs_rindex_t:
  0   +-------------------+
      | starting block    | Starting block of the resource group
  8   +-------------------+
      | num. header blks. | Size of resource group header in blocks
  12  +-------------------+
      | pad               | Padding bytes
  16  +-------------------+
      | first data loc.   | First data block in resource group
  24  +-------------------+
      | num. data blks.   | Number of data block in resource group
  28  +-------------------+
      | bitmap size       | Size of block bitmap in bytes
  32  +-------------------+
      | reserved          | Reserved for future use (char[])
  96  +-------------------+


5. Resource Group Header (ogfs_rgrp_t)

      ogfs_rgrp_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_RG|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_RG  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++
      | flags             | No flags yet
  28  +-------------------+
      | num. free         | Number of free data blocks
  32  +-------------------+
      | num. dinodes      | Number of dinodes
  36  +-------------------+
      | num. free dinodes | Number of unused dinodes
  40  +-------------------+
      | free dinode list  | List of unused dinodes (ogfs_inum_t)
  56  +-------------------+
      | num. used meta    | Number of used meta data blocks (except dinodes)
  60  +-------------------+
      | num. free meta    | Number of unused meta data blocks
  64  +-------------------+
      | reserved          | Reserved for future use (char[])
  128 +-------------------+


6. Bitmap Block Header (ogfs_rgrpbits_t)

      ogfs_rgrpbits_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_RB|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_RB  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++


7. Journal Index Block Header (ogfs_jdata_t)

Same as a resource group index block header.

8. Journal Index Entry (ogfs_jindex_t)

      ogfs_jindex_t:
  0   +-------------------+
      | starting block    | Starting block of the journal
  8   +-------------------+
      | num. segments     | Number of segments in the journal
  12  +-------------------+
      | pad               | Padding bytes
  16  +-------------------+
      | reserved          | Reserved for future use (char[])
  80  +-------------------+


9. Dinode (ogfs_dinode_t)

      ogfs_dinode_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_DI|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_DI  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++
      | dinode num.       | Number of this dinode*
  40  +-------------------+
      | mode              | UNIX mode and permission bits
  44  +-------------------+
      | owner uid         | Owner's user id
  48  +-------------------+
      | owner gid         | Owner's group id
  52  +-------------------+
      | link count        | Number of links to this dinode (hard link count)
  56  +-------------------+
      | file size         | File size in bytes
  64  +-------------------+
      | num blocks        | Number of blocks in file (1 = stuffed dinode)
  72  +-------------------+
      | atime             | Last access time (int64)
  80  +-------------------+
      | mtime             | Last modification time (int64)
  88  +-------------------+
      | ctime             | Last change time (int64)
  96  +-------------------+
      | device major num. | Device major number
  100 +-------------------+
      | device minor num. | Device minor number
  104 +-------------------+
      | res. group blk.   | First block of dinode's resource group
  112 +-------------------+
      | goal resource grp.| Resource group to allocate next block from
  120 +-------------------+
      | goal data blk.    | block num. in resource group to allocate data
  124 +-------------------+
      | goal meta blk.    | block num. in resource group to allocate meta data
  128 +-------------------+
      | flags             | Flags**
  132 +-------------------+
      | payload format    | Content type of the dinode (see below***)
  136 +-------------------+
      | file type         | Type of file (FILE_XXX macros in ogfs_ondisk.h)
  138 +-------------------+
      | tree height       | Height of data block tree (0 = stuffed dinode)
  140 +-------------------+
      | incarnation       | Incarnation number
  144 +-------------------+
      | pad               | Padding bytes
  146 +-------------------+
      | directory depth   | Number of used bits in hashtable (only in dirs)
  148 +-------------------+
      | num. direntries   | Number of entries in directory (only in dirs)
  152 +-------------------+
      | next unused dinum | Next unused dinode (in unused dinodes; ogfs_inum_t)
  168 +-------------------+
      | reserved          | Reserved for future use (char[])
  232 +-------------------+

  * The formal dinode number is set to zero when the block is first converted
    to a dinode.  It is set to the block address as soon as a file or directory
    is created in the dinode and never changes again.

  ** Flag bits

    OGFS_DIF_JDATA (0x01)
      1: Set for the journal and resource index and all directory inodes
      0: File dinode
      This flag is used only in the ogfsck program.

    OGFS_DIF_EXHASH (0x02)
      1: Directory uses a hash table
      0: Directory is stuffed into the dinode

    OGFS_DIF_UNUSED (0x04)
      1: Dinode is unused.  A list of unused dinodes is kept in the resource
         group header.
      0: Dinode is in use

  *** Valid payload formats

    OGFS_FORMAT_JI (1000)
      Used in the journal index dinode.

    OGFS_FORMAT_RI (1100)
      Used in the resource group index dinode.

    OGFS_FORMAT_DE (1200)
      Used in stuffed directory inodes.

    none (0)
      Used for directory inodes with a hash table and file inodes.

10. Indirect Block Header (ogfs_indirect_t)

      ogfs_indirect_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_IN|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_IN  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++
      | reserved          | Reserved for future use (char[])
  88  +-------------------+


11. Directory Leaf Node Block Header (ogfs_leaf_t)

      ogfs_leaf_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_LF|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_LF  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++
      | depth             | Number of hash table bits used to address the leaf
  26  +-------------------+
      | num. entries      | Number of directory entries in this leaf node
  28  +-------------------+
      | entry format      | Always OGFS_FORMAT_DE
  32  +-------------------+
      | next leaf         | Next leaf block address if this one overflowed
  40  +-------------------+
      | reserved          | Reserved for future use (char[])
  104 +-------------------+


12. Directory Entry (ogfs_dirent_t)

      ogfs_dirent_t:
  0   +-------------------+
      | dinum             | Dinode number*
  16  +-------------------+
      | file name hash    | Hash of the file name
  20  +-------------------+
      | length            | Length of the directory entry structure in bytes
  22  +-------------------+
      | name length       | Length of the file name in bytes
  24  +-------------------+
      | file type         | Type of the dinode this directory entry points to
  26  +-------------------+
      | reserved          | Reserved for future use (char[])
  40  +-------------------+
      | file name         | File name
      +-------------------+
      | pad               | Padding bytes (to multiples of 8 bytes)
      +-------------------+

  * The formal dinode number is set to zero when the last directory entry in a
    leaf is deleted.


13. Journal Segment Header (ogfs_log_header_t)

      ogfs_log_header_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_LH|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_LH  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++
      | flags             | Flags*
  28  +-------------------+
      | pad               | Padding bytes
  32  +-------------------+
      | trans first block | !!!??? Block num. of first hdr in this trans
  40  +-------------------+
      | trans seq. number | Sequence number of trans this segment belongs to
  48  +-------------------+
      | log tail block    | !!!??? /* Block number of log tail */
  56  +-------------------+
      | last nopd block   | !!!??? /* block number of last nopen dump */
  64  +-------------------+
      | reserved          | Reserved for future use (char[])
  128 +-------------------+

  * Flag bits

    OGFS_LOG_HEAD_UNMOUNT (0x01)
      1: The journal was unmounted cleanly
      0: The journal is not clean (mounted or crashed while mounted)

14. Journal Chunks (ogfs_log_descriptor_t)

      ogfs_log_descriptor_t:
  0   ++-----------------++
      || 0x01161970      || Magic number
  4   |+-----------------+|
      || OGFS_METATYPE_LD|| Type of header
  8   |+-----------------+|
      || generation      || Generation number
  16  |+-----------------+|
      || OGFS_FORMAT_LD  || Format of header
  20  |+-----------------+|
      || pad             || Padding bytes
  24  ++-----------------++
      | ld_type           | !!!??? Type of data in this log entry chunk*
  28  +-------------------+
      | ld_length         | Number of buffers in this chunk
  32  +-------------------+
      | ld_data1          | Log entry chunk type specific data**
  36  +-------------------+
      | ld_data2          | Log entry chunk type specific data**
  40  +-------------------+
      | reserved          | Reserved for future use (char[])
  104 +-------------------+

*!!!

**!!!