OpenGFS File System Layout on Disk Copyright 2003 The OpenGFS Project. Introduction ------------ This document contains details of the data and structures contained in the OpenGFS file system on the physical disks. This document is aimed at developers, potential developers, students, and anyone who wants to know about the low level details of the file system. This document is not intended as a user guide to OpenGFS. Look in the OpenGFS HOWTO-generic for details of configuration and setting up OpenGFS. Terminology ----------- The term "sector" refers to a 512 block as it is used in the Linux kernel. The term "block" refers to a block in the OpenGFS file system. The size of a block can be a power of two, from 512 to 65536. The terms log and journal mean the same throughout this document. General Information ------------------- 1. Byte Order All data in the OpenGFS structures is stored in big endian byte order on disk. In memory, host byte order is used. The disk byte order is defined by the OGFS_ENDIAN_BIG macro in fs/ogfs_ondisk.h. No guarantees can be made that the current code works with little endian structures too. Pool Layout ----------- 1. Real Pool Layout The OpenGFS file system is usually not stored directly on a disk partition, but uses a pool device that is provided by the pool kernel module. Please refer to the ogfs-pool document for details. A pool consists of a number of subpools. Each subpool contains one or more disk partitions. The pool device /dev/pool/supplies a linear address space for the sectors available on the disk partitions. Example: Let us assume that the pool was configured so that subpools of type ogfs_data are at the beginning of the address space, while subpools of type ogfs_journal are placed at the end. Let us further assume the pool has n data subpools with a total of N sectors and m journal subpools with a total of M sectors. The picture below shows the layout of the example pool device: /dev/pool/ : +-----------+---------+-----------+-----------+---------+-----------+ | data | ... | data | journal | ... | journal | | subpool 1 | | subpool n | subpool 1 | | subpool m | +-----------+---------+-----------+-----------+---------+-----------+ ^ ^ ^ |_ sector 0 |_ sector N |_ sector N+M-1 Pool layout example The subpools may appear in any order, mixing data and journal subpools. 2. Faked Pool Layout When the file system is created, the mkfs.ogfs utility can be told to ignore the pool layout. This can be useful for testing on a real pool, without worrying about the actual pool setup. Another application is to run an OpenGFS file system locally on a disk partition or a file. In any case, mkfs.ogfs can be told to ignore the normal pool layout and fake one without actually accessing the pool driver: $ mkfs.ogfs -i -j m -J s -b bs target ^ ^ ^ ^ ^ | : | : |___ target device or file | : | :....... block size | : |____________ size of each journal in MB | :................. number of journals |______________________ ignore pool layout To initialise the file system, mkfs.ogfs internally fakes a pool layout similar to a real pool: (m*s) MB are reserved at the end of the available space for the journals. The remaining sectors are put in a data subpool that starts at the beginning of the target. Assuming the target has T sectors and (m*s) MB consist of M sectors: target: +-----------+-----------+---------+-----------+ | data | journal | ... 
| journal | | subpool | subpool 1 | | subpool m | +-----------+-----------+---------+-----------+ ^ ^ ^ |_ sector 0 |_ sector T-M |_ sector T-1

3. Subpool Limitations Imposed by mkfs.ogfs

The mkfs.ogfs utility imposes a number of limits on the actual or faked pool geometry:

* Each journal must have a size of at least 32 MB. (The default without the -J option is 128 MB.)
* Subpools must be aligned to block boundaries. The mkfs.ogfs utility internally cuts off the excess sectors at the beginning and the end of a subpool. These sectors are never used by the file system.
* Each data subpool must consist of at least 100 blocks.
* The first data subpool must provide extra space for the superblock and other information. The first 64 kB of the file system are filled with zeros. In total, the size of the first data subpool must be at least (101 * block size) + (128 * sector size).

Overall File System Layout
--------------------------

The OpenGFS file system consists of four top level components: the superblock, resource groups, journals and an unused area of 64 kB at the beginning. The distribution of the journals among the journal subpools has been discussed above, so let us focus on the layout of the data subpools.

data subpool 1 data subpool n journal subpools +------------------------+-----+-----------------+----------+-----+----------+ |+---+--+---------------+| |+---------------+|+---------+ +---------+| || | | | | || || | | ||| | | || || 0 |sb|rg 1| ... |rg r|| ... ||rg | ... | rg |||journal 1| ... |journal m|| || | | | | || ||r+t| |r+t+u||| | | || |+---+--+---------------+| |+---------------+|+---------+ +---------+| ^ ^ ^ ^ ^ ^ ^ ^ : | : : : : | | : | : : : : |___________ journals : | :..........:............:.........:................... resource groups : |________________________________________________________ superblock :............................................................ 64kB zeros

The available blocks in each data subpool are divided into resource groups. Please refer to the "Resource Groups" section for details.

The first data subpool holds some additional information: The superblock begins at a 64 kB offset from the beginning of the subpool. The remaining blocks are split into resource groups as described above. The first resource group also contains the journal index and the resource group index (note that both indices may be spread over several resource groups by the ogfs_expand tool). These two structures are discussed later.

Caveat: Upon file system creation, the first resource group must be big enough to hold both the journal index and the resource group index. This can be a problem if you have a lot of data subpools with a lot of resource groups, but the first data subpool is very small. At the moment, mkfs.ogfs prints an error message only when it is already halfway through file system creation. However, it seems to be possible to start with only one data subpool, create the file system with mkfs.ogfs, then add more data subpools to the pool and use ogfs_expand.

Dinodes
-------

1. Distributed Inodes (Dinodes)

OpenGFS stores file meta data in inode-like structures called distributed inodes or dinodes in OpenGFS terminology. Dinodes are similar to inodes stored in other file systems.

Every dinode occupies a whole block. An advantage is that this reduces the likelihood of multiple clients trying to lock the same block. Another one is that the block address of the dinode uniquely identifies the dinode and is thus used as the number of the dinode.
Dinodes are therefore allocated and deallocated as they are needed, and any block can contain either data or a dinode. On the other hand, a lot of space is wasted in comparison to other file systems, at least in some cases.

When the file system is created with mkfs.ogfs, only three dinodes are generated:

journal index dinode        - first data block in first resource group
resource group index dinode - second data block in first resource group
root directory dinode       - third data block in first resource group

As the size of a block is chosen when the file system is created, the size of a dinode varies. A constant portion of the available space is consumed by meta data, and the remaining space is used for data or block addresses. As of the 25th of February, 2002, the meta data header occupies 232 bytes.

dinode layout: +-----------+------------------------+ | meta data | data area | +-----------+------------------------+ 0 232 bs <- byte offset

Since block addresses use 64 bits or 8 bytes, 35 block addresses can be stored in a dinode with a block size of 512 bytes. For a block size of 65536 bytes, 8163 addresses can be stored.

Note: Proper tuning of the block size is essential to overall performance.

2. Freeing Dinodes

When a dinode is freed, the block is stored at the beginning of a list of free dinodes in the resource group header and the OGFS_DIF_UNUSED flag is set.

Note: Once converted to a dinode, a block cannot hold normal data anymore. For example - with a bs of 8192 - if at one time a directory with a million files resides in a resource group, the dinodes occupy 8 GB. This space becomes unavailable for file data, even if the directory is deleted. These free meta data blocks can, however, be reclaimed online with the "ogfs_tool" utility to return the lost space.

3. Allocating Dinodes

When a dinode is needed, the first unused dinode is taken off the head of the free dinode list in the target resource group. In case there are no free dinodes left, the resource group bitmap is scanned for a free meta data block to be converted to a dinode. Should this search bring up no free block, the first 64 (OGFS_META_CLUMP) free non-meta data blocks (type OGFS_BLKST_FREE) in the resource group are converted to type OGFS_BLKST_FREEMETA and the new dinode is allocated from the newly available space.

Caveat: The decision in which resource group the dinode is to be allocated is made before the list of free dinodes is examined. This may be somewhat inefficient if there were another resource group that could accommodate the data equally well but also has a free dinode.

4. Dinode Incarnations

When a dinode is created initially, the incarnation number (di_incarn field in the ogfs_dinode_t) is set to zero. Freeing the dinode keeps the incarnation number intact. When a new file or directory is created in a dinode that has been used before, the incarnation number is increased by one. The incarnation number is used to initialise the i_incarnation field in a Linux inode structure.

Files
-----

Where and how a file is stored depends on whether the file is larger than (bs - 232) bytes or not. If it is small enough, the data is stored directly in the dinode ("stuffed"). This significantly reduces the number of block reads needed for small files and especially directories.

stuffed dinode layout: +-----------+------------------------+ | meta data | direct data | +-----------+------------------------+ 0 232 bs <- byte offset

Caveat: When a file grows beyond what fits in the dinode, the data has to be copied ("unstuffed") into a separate data block.
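The capacity figures above follow directly from the 232 byte meta data header and the 8 byte block addresses. The sketch below only illustrates this arithmetic; the constants are the values quoted in this document (the real code derives them from the structure definitions in fs/ogfs_ondisk.h), so treat it as an illustration rather than as the OpenGFS implementation.

stuffed dinode capacity (illustrative C sketch):

    /* How many bytes of stuffed data and how many 64 bit block addresses
     * fit into a dinode of block size bs.  The 232 byte header size is
     * the value quoted in the text above. */
    #include <stdio.h>

    #define DINODE_HEADER_SIZE 232   /* bytes of dinode meta data (see text) */
    #define BLOCK_ADDR_SIZE      8   /* size of a 64 bit block address */

    static unsigned long stuffed_capacity(unsigned long bs)
    {
        return bs - DINODE_HEADER_SIZE;                 /* bytes of direct data */
    }

    static unsigned long addrs_per_dinode(unsigned long bs)
    {
        return stuffed_capacity(bs) / BLOCK_ADDR_SIZE;  /* block addresses */
    }

    int main(void)
    {
        unsigned long bs;

        for (bs = 512; bs <= 65536; bs *= 2)
            printf("bs %6lu: %6lu stuffed bytes, %5lu addresses\n",
                   bs, stuffed_capacity(bs), addrs_per_dinode(bs));
        return 0;   /* prints 35 addresses for bs 512 and 8163 for bs 65536 */
    }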
Caveat: Once a file has outgrown the dinode and has been unstuffed, it is not stuffed again if it shrinks enough to fit into the dinode. To stuff the file again, it must be deleted and created from scratch. For larger files, the data area is filled with direct on indirect file system numbers. The depth of indirection is practically unlimited, allowing any number of blocks per file, up to the theoretical maximum of 16 exabytes (2^64 bytes). block addresses occupy 8 bytes. Therefore, (bs - 232) / 8 addresses can be stored in a stuffed dinode. unstuffed dinode layout with direct data blocks: +-----------+------------------------+ | meta data | block addresses | | |_|_|_|_|_|_|_|_| ... | +-----------+|-|-|-|-|---------------+ | | | | | ___________| | | | |___________ ... | _____| | |_____ | | | | | | v v v v v +-----+ +-----+ +-----+ +-----+ +-----+ data blocks | 1 | | 2 | | 3 | | 4 | | 5 | ... +-----+ +-----+ +-----+ +-----+ +-----+ Up to (bs / 8) =: b block addresses can be stored in a file system. In case indirect data blocks are needed: unstuffed dinode layout with indirect data blocks: +-----------+------------------------+ | meta data | block addresses | | |_|_|_|_|_|_|_|_| ... | +-----------+|-|---------------------+ | | ____________| |________ ... | | v v +--------------------+ +--------------------+ | block addresses | | block addresses | indirect blocks |_|_|_| ... |_| |_|_|_| ... |_| ... +|------------------|+ +|------------------|+ | | | | | ____| | ____| | | | | v v v v +-----+ +-----+ +-----+ +-----+ data blocks | 1 | ... | b | | b+1 | ... | 2b | ... +-----+ +-----+ +-----+ +-----+ Additional levels of indirection are added as needed, but all file data is always stored at the same level of indirection. The following table shows the maximum file size given a block size bs and the desired level of indirection: bs | stuffed direct indirect 2x ind. 3x ind. 4x ind. ------+--------------------------------------------------------------------- 512 | 280 17920 1120 KB 70 MB 4480 MB 280 GB 1024 | 792 101376 6336 KB 792 MB 99 GB 12672 GB 2048 | 1816 464896 29056 KB 7264 MB 1816 GB 454 TB 4096 | 3864 1978368 123648 KB 61824 MB 30912 GB 15456 TB 8192 | 7960 8151040 509440 KB 509440 MB 509440 GB 509440 TB 16384 | 16152 33079296 2019 MB 4038 GB 8076 TB 16152 PB 32768 | 32536 133267456 8134 MB 32536 GB 130144 TB *16 EB 65536 | 65304 534970368 32652 MB 261216 GB 2089728 TB *16 EB bs | 4x ind. 5x ind. 6x ind. 7x ind. 8x ind. ------+---------------------------------------------------------- 512 | 17920 GB 1120 TB 70 PB 4480 PB *16 EB 1024 | 1584 TB 198 PB *16 EB N/A N/A 2048 | 16224 TB *16 EB N/A N/A N/A 4096 | 7728 PB *16 EB N/A N/A N/A 8192 | *16 EB N/A N/A N/A N/A 16384 | *16 EB N/A N/A N/A N/A 32768 | N/A N/A N/A N/A N/A 65536 | N/A N/A N/A N/A N/A Maximum file size for a given block size bs and level of indirection * Actual value is higher, but the file size is limited to a 64 bit value or 16 exabytes. Note: The journal index and the resource group index are stored as normal files. They have a dinode each with associated data. Instead of a directory entry somewhere below the root directory, the associated dinodes are stored at a fixed position in the file system. Resource Group Index -------------------- The resource group index is a special file; the dinode is not stored in any directory but directly in the superblock. There is one index entry for each resource group in the file system. 
The index entries describe the position and size of the resource groups, the position and number of the data blocks in the resource groups and the size of the block bitmaps. For more details refer to appendix A. Depending on the number of resource groups and the available space in a dinode, the index is stored directly in the dinode or in separate blocks. Small resource group index in the dinode: resource group index in a stuffed dinode: +------------------+---------------------------------+ | dinode meta data |__|__|__|__| ... |__|__|__|| | | resource group index entries | +------------------+---------------------------------+ Larger resource group index in explicit data blocks: resource group index with an unstuffed dinode: +------------------+---------------------------------+ | dinode meta data |__|__|__|__| ... | | | | | | | block addresses | +------------------+-|--|--|--|----------------------+ ______________| | | |______________ | ____| |____ | ... possibly via indirect | | | | blocks | | | | | | | | v v v v resource index blocks +--+---------+ +--+---------+ +--+---------+ +--+---------+ |md|i| ... |i| |md|i| ... |i| |md|i| ... |i| |md|i| ... |i| ... +--+---------+ +--+---------+ +--+---------+ +--+---------+ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ | : : | : : | : : | : : | :.......:...|..:.......:...|..:.......:...|..:.......:.. index entries |______________|______________|______________|_____________ meta data Note: Index entries may be split between two adjacent blocks. Resource Groups --------------- 1. General Information Data subpools are divided into resource groups. The file system core tries to keep the blocks of a file in the same resource group to reduce necessary disk seeks and to reduce the number of locks required to protect the data from other clients and thus reducing overall performance degradation due to locking. A resource group is made up of three components: the resource group header, the block bitmap and the data blocks. The number of resource groups in each data subpool depends on the number f of available blocks in the subpool. If f is equal to or greater than 573440, the number of resource groups in that subpool is (f / 57344), rounded down. If f is smaller than that, the number of resource groups is 10. The remaining (f % 57344) blocks are given to the first resource group of the subpool. All resource groups in all the data subpools are numbered continuously, beginning with 1 for the resource group following the superblock. The table below shows the typical resource group size for each possible block size. | typical resource bs | group size ------+------------------ 512 | 28 MB 1024 | 56 MB 2048 | 112 MB 4096 | 224 MB 8192 | 448 MB 16384 | 896 MB 32768 | 1792 MB 65536 | 3584 MB Typical resource group size for the possible block sizes. Caveat: The typical resource group size seems to be too large to efficiently reduce the chance of multiple clients trying to lock the same resource group in parallel. More investigation and possibly tuning options are necessary. resource group layout: +------+------------+-------------------------------------------------------+ | meta | block |____|____|____|____|____|____| ... |____|____| | data | bitmap | data blocks | +------+------------+-------------------------------------------------------+ 2. Resource Group Header The resource group header contains information about the number of free and used block for data, dinodes and meta data and keeps a list of unused dinodes in another dinode. 3. 
Block Bitmap Each block is represented by a pair of bits in the bitmap. The lower bit indicates whether the block is free (0) or used (1). The upper bit indicates whether the block (potentially) contains meta data (1) or normal data (0). Normal data includes blocks with direct data or indirect data. All other blocks, including all dinodes, have the meta data flag set. The bitmap begins in the first block of the resource group, right after the resource group header. In case it does not fit into that block, further blocks are allocated which carry a header structure and more of the bitmap data. The header structure is a plain ogfs_meta_header_t structure without any bitmap specific information. Bitmap layout: +-----------------------+----------------------------------------+--- | resource |00 10 10 01 | bitmap |10 10<11 01 11 01 10 10 11 01| | group |11 | block |00 00|01 10 00 00 01 | ... | header |10 bitmap | header |00 01|11 10 01 00 01 bitmap | more | |01 | |01>01|10 01 01 01 01 | blocks +-----------------------+-------------:--|-----------------------+--- ^^ ^ ^ : | |: | : : |__ lowest bit pair in second byte |: | : :..... highest bit pair in second byte |: |___________________ dedicated bitmap block |:.......................:....... bytes in the bitmap |________________________________ part of bitmap in first block The mapping between the bytes in the bitmap and the blocks is linear: the first byte corresponds to the first four blocks, the second byte to the second four blocks, and so on. The lower bits within a byte stand for lower block numbers: Bit pair layout: ________________________ high bit in byte | ____ low bit in byte | | v bitmap byte v +-----------------------+ | M U | M U | M U | M U | +-----------------------+ ^ ^ ^ ^ ^ ^ ^ ^ : | : | : | : | : |___:_|___:_|___:_|____ lower bits: block unused (0) or used (1) :.....:.....:.....:...... higher bits: used for data (0) or meta data (1) Example bitmap byte: ________________________ high bit | ____ low bit | | v bitmap byte n v +-----------------------+ | 1 0 | 0 0 | 0 1 | 1 1 | +-----------------------+ ^ ^ ^ ^ : | : |_____ block 4n is used by meta data : | :........... block 4n+1 is used by data : |_________________ block 4n+2 is an unused data block :....................... block 4n+3 is an unused meta data block Possible bit combinations: bits | C code macro meaning -----+------------------------------------------------- 00 | OGFS_BLKST_FREE free block 01 | OGFS_BLKST_USED block occupied by data 10 | OGFS_BLKST_FREEMETA free meta data block 11 | OGFS_BLKST_USEDMETA block occupied by meta data 4. Data Blocks The blocks not used for the resource group header and the bitmap are used for data. If the number of data blocks is not a multiple of four, the excess blocks are not used. Journal Index ------------- The journal index is very similar to the resource group index, but stores information about the journals instead of the resource groups. The layout is identical except that a journal index entry has a different format. Journals -------- 1. Journal Segmentation The journal of each node is divided into suqeuences of blocks called segments. The number of blocks in each segments is specified when the file system is created and must be at least two. The default value is 16. All journals have the same segment size. In case the first block number of the journal is not a multiple of the segment size, the surplus blocks up to the first segment boundary from the start of the journal are cut off and not used. 
Likewise, some blocks may be cut from the end of the journal. journal: +-----+-----------------+---------------------------+-----------------+-----+ |__|__|__|__|__|__|__|__| ... |__|__|__|__|__|__|__|__| | | segment blocks | | segment blocks | pad | +-----+-----------------+---------------------------+-----------------+-----+ ^ ^ ^ ^ : | | pad. blks ...: : |___________________________________________|___ journal segments :..................................................... padding blocks The segments are used to store meta data information about unfinished write transactions. The data itself is not logged. 2. Segments A segment is a sequential range of blocks. The first block acts as a header block while the following blocks store the journal data. The header block carries a ogfs_log_header_t structure at the beginning of the block and an exact copy at the end. When the file system is mounted, the two copies of the header structure are compared. In case they do not match, the segment is invalid, for example if the machine crashed before the transaction was completely written to disk. segment: +--------------------+--------------------+-----------+--------------------+ | hdr | unused | hdr | journal data | ... | journal data | | | space | | | | | +--------------------+--------------------+-----------+--------------------+ ^ ^ ^ : |________________________________|_ data blocks :....................................................... segment header block Apart from the usual meta data that describes the block type, the header contains information about the sequence number of the transaction the segment belongs to, a flag whether the journal was cleanly unmounted (!!!??? in which segment is this flag valid?), !!!???. 3. Transactions and Log Entries For every write operation in the file system, a transaction structure is built in memory and a log entry describing the transaction is written to the journal. The journal is written as a ring. New transactions are appended to the current head segment of the log, pushing it forwards and wrapping around at the last log segment. Transactions that have not been finished follow the head up to the tail segment and are completed from the tail towards the head (first in first out). For each transaction, the following steps are performed: a) The maximum possible disk space for a transaction is reserved on the disk. b) Locks and buffers for the transaction are collected in memory. c) The log information about the transaction is written to the log. d) An end marker is written to the log. e) The write operations associated with the transaction is performed. f) The log entry is closed. If the machine drops out of the cluster beforce (d) has been completed, no data has been written to disk and no recovery is necessary for this transaction. It may be necessary to identify the incomplete log entry, though. After that point, the log entry is replayed when the node fails. After step (f) the log is again in a clean state and the log entry can be overwritten if need be. Bug: If the transaction sequence number has an integer overflow, the current code will not properly repair the file system. The log entry for a transaction may consist of one or more segments. The header of each segment carries the sequence number of the log entry which is the sequence number of the last log entry plus one. The head of the log is identified when the journal is mounted by looking for the log entry with the highest sequence number that was completely written to the log. 
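The head search described above can be pictured with a short sketch. It is deliberately simplified: the segment header is reduced to a sequence number plus a flag that stands for "both on-disk header copies matched", and the additional check that the log entry was completely written (the end marker described below) is left out. None of this is the actual OpenGFS mount code.

finding the log head (illustrative C sketch):

    #include <stdint.h>
    #include <stdio.h>

    struct log_header {        /* simplified stand-in for ogfs_log_header_t */
        uint64_t sequence;     /* sequence number of the log entry */
        int      valid;        /* 1 if both header copies in the segment matched */
    };

    /* Return the index of the valid segment with the highest sequence
     * number, or -1 if no segment is valid.  Note that a plain "greater
     * than" comparison is exactly where the sequence number overflow
     * problem mentioned in the Bug above would bite. */
    static long find_log_head(const struct log_header *seg, long num_segments)
    {
        long head = -1;
        long i;

        for (i = 0; i < num_segments; i++) {
            if (!seg[i].valid)
                continue;                       /* header copies differ: skip */
            if (head < 0 || seg[i].sequence > seg[head].sequence)
                head = i;
        }
        return head;
    }

    int main(void)
    {
        /* four segments of a tiny made-up journal */
        struct log_header j[4] = { { 7, 1 }, { 8, 1 }, { 5, 1 }, { 6, 1 } };

        printf("head segment: %ld\n", find_log_head(j, 4));   /* prints 1 */
        return 0;
    }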
Once all data has been written to a log entry, a new sequence number is written to the header of the first segment after the just written log entry. This indicates that the log entry has been written successfully. Log entry: +---------+---------+-------+---------+---------+ | first | second | ... | last | empty | | segment | segment | | segment | segment | +---------+---------+-------+---------+---------+ ^ ^ ^ ^ | | | :.......... segment with end mark |_________|_________________|____________________ segments of the log entry Physical structure of a log entry 4. Transaction Chunks As it was pointed out above, the first block of each segment is reserved for the segment header and carries only journal meta data. The real transaction data is stored in the blocks following the segment headers. Disregarding the headers, a log entry can be thought of as a sequence of blocks with the information needed to bring the file system to a sane state if the node fails before the transaction was completely written to disk. This sequence of blocks is divided into several chunks that each contain information of a different kind. Log entry: +---------------------+---------------------+-------+---------------------+ | block | ... | block | block | ... | block | ... | block | ... | block | | 1 | | k | 1 | | l | | 1 | | m | +---------------------+---------------------+-------+---------------------+ ^ ^ ^ | : |_________ last chunk | :....................................... second chunk |_____________________________________________________________ first chunk Logical structure of a log entry The first block of each chunk has a header of type ogfs_log_descriptor_t that denotes the chunk type, the length of the chunk in blocks and holds some chunk specific information (see Appendix A for structure details). !!!log entry chunk types !!!recovery process When the file system is mounted or recovered, it is essential to identify the head and tail segments of the log. Directories ----------- Directories in OpenGFS come in three flavours, depending on their size. Each directory has a dinode that stores the usual data like permission and ownership information and other meta data. If a directory is small enough, the data area of the directory dinode is used to store the directory entries directly. If the directory is too big to fit or when the directory grows too much, the file names of the directory entries are hashed with a 32 bit CRC algorithm. The hash table is stored as if the directory dinode were a plain file dinode and the hashtable its data. That means, that the hash table is stored directly in the data area of the dinode if possible, and direct or indirect data blocks are allocated if it does not fit. For details about the hash table, please refer to the Hash Table Layout section below. Note: Directory entries are always contained in a single block. 1. Stuffed Directories A directory is stuffed into its dinode if the total length of all its directory entries (ogfs_dirent_t) is no larger than (bs - 232): stuffed directory dinode: +------+-------------------------+ | meta | . | .. | foobar | ... | | data | directory entries | +------+-------------------------+ 2. Directories with a Stuffed Hashtable If the directory needs a hash table, but the hash table is up to (bs - 232) bytes large, it is stored in the dinode: directory dinode with stuffed hash table: +------+-------------------------+ | meta | hash | unused | | data | table | area | +------+-------------------------+ 3. 
Directories with an Unststuffed Hashtable Once the hash table grows beyond the free space in the directory dinode, it is stored in direct or indirect data blocks pointed to from the dinode: directory dinode and unstuffed hash table: +-----------+------------------------+ | meta data | block addresses | | |_|_|_|_|_|_|_|_| ... | +-----------+|-|-|-|-|---------------+ | | | | | ___________| | | | |___________ ... (indirection levels in between) | _____| | |_____ | | | | | | v v v v v +-----+ +-----+ +-----+ +-----+ +-----+ hash table blocks | 1 | | 2 | | 3 | | 4 | | 5 | ... +-----+ +-----+ +-----+ +-----+ +-----+ Directory Hash Tables --------------------- 1. Hash Table Layout OpenGFS calculates hash values of the file names with a 32 bit CRC algorithm and uses an arbitrary number of the 32 hash bits in the hash table (higher bits are used first). Each hash value is associated with a block (a "leaf") via the hash table. When a hash table is created, its initial size is half of the block size. hash table entries: +--------------+--------------+--------------+ | leaf address | leaf address | leaf address | ... +--------------+--------------+--------------+ 0 8 16 24 <- byte offset To reduce disk space consumption, multiple hash values can be linked to the same leaf. The directory entries are stored sequentially in the leaves: hash table entries: +-------+-------+-------+-------+ | 00 | 01 | 10 | 11 | | addr. | addr. | addr. | addr. | +-------+-------+-------+-------+ | | | | |_ _____| |_ _____| | | v v +-----------+ +-----------+ leaves | 000 "aaa" | | 110 "xxx" | | 100 "bbb" | | 110 "yyy" | | 001 "ccc" | | | +-----------+ +-----------+ Example with a maximim hash size of 3 bits 2. Adding Directory Entries to a Hash Table After adding a new directory named "zzz" with the hash value 110 the hash table looks like this: hash table entries: +-------+-------+-------+-------+ | 00 | 01 | 10 | 11 | | addr. | addr. | addr. | addr. | +-------+-------+-------+-------+ | | | | |_ _____| |_ _____| | | v v +-----------+ +-----------+ leaves | 000 "aaa" | | 110 "xxx" | | 100 "bbb" | | 110 "yyy" | | 001 "ccc" | | 110 "zzz" | +-----------+ +-----------+ Algorithm (adding a new directory entry) Step 1 The target leaf for the new entry is located by looking up the file name's hash value in the hash table. Step 2 Try to insert the directory entry into the target leaf. If it fits, we are done. But if the target leaf overflows, proceed with step 3. Step 3 Check whether the target leaf is pointed to by multiple entries in the hash table and continue with step 4 if not. For such a leaf, the depth of the leaf (lf_depth member of ogfs_leaf_t) is smaller than the depth of the directory dinode (di_depth member of ogfs_dinode_t). A new leaf is allocated, the depth of both leaves is increased by one, and the corresponding directory entries are migrated from the target leaf to the new one. Set the target leaf to the new leaf and go back to step 2. Note: The moved entries leave gaps in the old target leaf as if they had been deleted, adding to leaf fragmentation. Since both leaves have to be written anyway, this would be a good time to strip the entries in the old leaf of their free space and add it all to the last entry. Example: above hash table after adding file "ddd" with hash value 100: hash table entries: +-------+-------+-------+-------+ | 00 | 01 | 10 | 11 | | addr. | addr. | addr. | addr. 
| +-------+-------+-------+-------+ | | | | |_ | |_______|_________ | |_________ | | | | v v v +-----------+ +-----------+ +-----------+ leaves | 000 "aaa" | | 001 "ccc" | | 110 "xxx" | | 100 "bbb" | | | | 110 "yyy" | | 100 "ddd" | | | | 110 "zzz" | +-----------+ +-----------+ +-----------+ Step 4 If the hash table has reached the allowed maximum size, proceed with step 5. If not, double the size of the hash table and go back to step 2. Example: adding file "eee" with hash value 000: hash table entries: +-------+-------+-------+-------+-------+-------+-------+-------+ | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | | addr. | addr. | addr. | addr. | addr. | addr. | addr. | addr. | +-------+-------+-------+-------+-------+-------+-------+-------+ | : | | ; : | | | : |_______|_______________:_______|_______| | : ; : | |_ :................................ | | _._._._._._._; : | | ; : | v v v v +-----------+ +-----------+ +-----------+ +-----------+ leaves | 000 "aaa" | | 100 "bbb" | | 001 "ccc" | | 110 "xxx" | | 000 "eee" | | 100 "ddd" | | | | 110 "yyy" | | | | | | | | 110 "zzz" | +-----------+ +-----------+ +-----------+ +-----------+ Step 5 If the target target leaf is the last leaf in its chain, proceed with step 6. If not, the target leaf points to another leaf in the chain, set the target leaf to the next leaf and go back to step 2. Step 6 Allocate a new leaf and chain it to the target leaf of the current chain of leaves. Set the target leaf to the new leaf and go back to step 2. The maximum hash table size has been reached. The leaves are chained to the overflowing leaf. Example: adding file "vvv" with hash value 110: hash table entries: +-------+-------+-------+-------+-------+-------+-------+-------+ | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | | addr. | addr. | addr. | addr. | addr. | addr. | addr. | addr. | +-------+-------+-------+-------+-------+-------+-------+-------+ | | : : ! | | : | | : : ! | | : | | :...............!.......|.......|.................. | |_______________________!_______| | : |_ ! | |_ : | _.__.__.__.__! | | : | ! | | : v v v v v +-----------+ +-----------+ +-----------+ +-----------+ +-----------+ | 000 "aaa" | | 100 "bbb" | | 001 "ccc" | | 110 "xxx" | | | | 000 "eee" | | 100 "ddd" | | | | 110 "yyy" | | | | | | | | | | 110 "zzz" | | | +-----------+ +-----------+ +-----------+ +-----------+ +-----------+ | v +-----------+ | 110 "vvv" | | | | | +-----------+ 3. Splitting and Merging Leaves Cost of splitting a leaf: * Read one block * Write two blocks * Read all entries in old leaf * Possibly copy entries to new leaf * Overall cost: O(bs) disk I/O, O(bs) CPU, O(hs) memory hs: hash table size bs: block size Caveat: Once two leaves have been split, they are never merged again. 4. Growing and Shrinking a Hash Table As the hash table is stored as regular file data, doubling the hash table is quite simple. The total length of the file is doubled, and its file system are written from last to first. Cost of doubling a hash table of h bytes: * (hs / 2bs) block reads (rounded up) * 2(hs /bs) block writes * hs memory byte reads * 2hs memory byte writes * Overall cost: O(hs) disk I/O, O(hs) CPU, O(bs) memory hs: hash table size bs: block size Note: The number of disk writes could be cut in half and the memory writes could be eliminated completely if the low bits of the CRC were used in the hash table instead of the high bits. Caveat: Hash tables can only grow, never shrink. 
Once a hash table has reached a certain size, the only way to free the occupied space is to delete the directory.

Directory Leaves
----------------

Unless the directory is stuffed into the dinode, the directory entries are stored in the leaves. The meta data of a leaf contains, among other things, the number of entries in this leaf.

directory leaf block: +------+-----------------------------------------------------+ | meta |__|_____|_|______|________| ... |_____| free space | | data | directory entries | +------+-----------------------------------------------------+

Directory Entries
-----------------

1. Directory Entry Layout

The contents of a directory entry are shown in the figure below.

directory entry layout: +------------------------------------------------+-----------+ | dinum | hash | dirent | name | file | spare | "foobar" | | | value | length | length | type | space | file name | +------------------------------------------------+-----------+ 0 16 20 22 24 26 40

The size of a directory entry (ogfs_dirent_t) is the size of the file name plus 40 bytes for the header, rounded up to a multiple of 8 (OGFS_DIR_REC_LEN macro). The file name is not terminated with a zero byte.

2. Directory Entry Allocation and Deallocation

When a directory entry is deallocated, the space it occupied (de_rec_len field in the ogfs_dirent_t) is added to the previous directory entry in the same leaf. The first directory entry of the leaf has no predecessor. Instead, its dinode number is set to zero.

Note: Since the first entry in a directory is the "." entry, it is unlikely that this ever happens.

before removing the second entry: +-------------------------+-------------------------+-----------------------+ | header | "file 1" | pad | header | "file 2" | pad | free space | +-------------------------+-------------------------+-----------------------+ leaf after removing the entry: +---------------------------------------------------------------------------+ | header | "file 1" | pad | free space | +---------------------------------------------------------------------------+ leaf Deleting a directory entry

When an entry is added to a specific chain of leaves, the chain is searched from the start for an allocated entry, or one with the dinode number set to zero, that has enough free space to store the new directory entry. If such an entry is found, the new entry is stored at the beginning of the unused space held by the old entry. The remaining unused space is stored in the new entry:

old directory entry: +---------------------------------------------------------------------------+ | header | "oldfile" | pad | free space | +---------------------------------------------------------------------------+ leaf after adding the new entry: +-----------------------------------------------------+---------------------+ | header | "oldfile" | pad | header | "newfile" | pad | free space | +-----------------------------------------------------+---------------------+ leaf Adding a directory entry in an existing entry

Caveat: This algorithm can cause fragmentation of the space in the leaf chains and almost double the space required to store the directory entries. Furthermore, fragmentation increases the average number of block reads required to locate a directory entry in a chain of leaves.

Caveat: The space occupied by a chain of leaves can only grow, but not shrink. To free the occupied space, the directory has to be deleted.
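The on-disk size of an entry, and therefore whether it fits into the free space of an existing entry, follows from the rounding rule given under "Directory Entry Layout" above. The following sketch only reproduces that rule; it is an illustration of the described behaviour of OGFS_DIR_REC_LEN, not the macro itself.

directory entry record length (illustrative C sketch):

    /* On-disk size of a directory entry for a given file name length:
     * 40 byte header plus the name, rounded up to a multiple of 8
     * (cf. the OGFS_DIR_REC_LEN rule quoted above). */
    #include <stdio.h>

    #define DIRENT_HEADER_SIZE 40

    static unsigned int dirent_rec_len(unsigned int name_len)
    {
        return (DIRENT_HEADER_SIZE + name_len + 7) & ~7u;
    }

    int main(void)
    {
        printf("%u\n", dirent_rec_len(1));   /* "."       -> 48 bytes */
        printf("%u\n", dirent_rec_len(6));   /* "foobar"  -> 48 bytes */
        printf("%u\n", dirent_rec_len(9));   /* 9 chars   -> 56 bytes */
        return 0;
    }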
Caveat: If a new directory entry is too big to be stored in the free space of any other entry, but the last directory entry in the chain of leaves has some free space, the new entry does not use the remaining space but is appended in new space. old directory entry: +---------------------------------------+ ... | header | "oldfile" | pad | free space | +---------------------------------------+ ^ end of leaf chain __| +---------------------------------------+-------------------------+ ... | header | "oldfile" | pad | free space | header | "newfile | pad | +---------------------------------------+-------------------------+ ^ ^ possibly wasted space __| end of leaf chain __| Adding an entry at the end of a chain of leaves. If no leaf is found that can hold the new directory entry, a new leaf is allocated and all available space in the leaf is given to the new directory entry: adding a directory entry to a new leaf: +---------------------------------------------------------------------------+ | header | "newfile" | pad | free space | +---------------------------------------------------------------------------+ ^ ^ |__ start of new leaf end of leaf __| Special Layout of the First Resource Group ------------------------------------------ When the file system is created with mkfs.ogfs, the first resource group has a special layout. In addition to normal data, it contains special inodes for the journal index, the resource group index and the root directory. The journal and resource group indices are initially created in the first resource group but may be extended to other resource groups later with ogfs_expand. The first data blocks are reserved for the three special inodes and the two indices: first resource group: +----+------+------+------+------+----------------------------------------+ |meta|fs blk|jindex|rindex| root |__|__|__|__|__|__|__|__|__| ... |__|__| |data|bitmap|dinode|dinode|dinode| jindex | rindex | other data blocks | +----+------+------+------+------+----------------------------------------+ ^ ^ ^ ^ ^ ^ ^ ^ : : | | | : : |___ normal data : : | | | :........:............ index data : : |______|______|____________________________ special dinodes :....:................................................. normal meta data Meta data generations --------------------- All the OpenGFS meta data headers carry a so called generation number. The generation number starts with zero when a block is filled with meta data (including dinodes) for the first time. Every time a (logged) transaction is flushed to disk, the generation in each of the involved meta data blocks is increased by one. Since the generation is a 64 bit value, an overflow is prectically impossible. Generations are used only during the recovery process. The generation of the transactions in the log of a failed node is compared to the actual generation of the data on disk to determine which transactions have to be replayed. APPENDIX A. Structure Details -------------------- This section contains a complete list of the structures stored in the file system. Unless otherwise stated, all types are unsigned integers of the given length. Security risk: At least in some places, padding bytes and the reserved space in some of the structures are filled with random data from the stack. This might contain sensitive information. 1. Generic Headers (ogfs_meta_header_t and ogfs_inum_t) These structures are used in many of the other structures. 
Generic block header structure: ogfs_meta_header_t: 0 +-----------------+ | 0x01161970 | Magic number 4 +-----------------+ | meta type | Type of header (OGFS_METATYPE_XX) 8 +-----------------+ | generation | Generation number 16 +-----------------+ | format | Structure layout of header (OGFS_FORMAT_XX) 20 +-----------------+ | pad | Padding bytes 24 +-----------------+ ^ |___ byte offset Dinode address type (ogfs_inum_t): ogfs_inum_t: 0 +-------------------+ | no_formal_ino | Formal dinode number (0 or value of next field) 8 +-------------------+ | block address | Address of the dinode's block 16 +-------------------+ 2. Superblock (ogfs_sb_t) Note: The block holding the superblock is only partially filled by the superblock data. ogfs_sb_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_SB|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_SB || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ | file system fmt. | Superblock format* (OGFS_FORMAT_FS) 28 +-------------------+ | multihost format | Multihost format* (OGFS_FORMAT_MULTI) 32 +-------------------+ | flags | No flags yet 36 +-------------------+ | block size | Size of blocks in bytes (512 - 65536) 40 +-------------------+ | block size shift | log2(block size); (9 - 16) 44 +-------------------+ | segment size | Journal segment size in blocks 48 +-------------------+ | jindex dinode | Journal index dinode (ogfs_inum_t) 64 +-------------------+ | rindex dinode | Resource group index dinode (ogfs_inum_t) 80 +-------------------+ | root dir dinode | Root directory dinode (ogfs_inum_t) 96 +-------------------+ | locking protocol | Name of the locking protocol to use (char[]) 160 +-------------------+ | lock table name | Name of lock table for this FS (char[]) 224 +-------------------+ | reserved | Reserved for future use 352 +-------------------+ * The format fields are hard coded to a constant version number. When the superblock is read, the version numbers on disk are compared with these constants. This mechanism could be used to automatically upgrade the superblock format. Currently, the new version numbers are written. 3. Resource Group Index Block Header (ogfs_jdata_t) In case the resource group index is stored in multiple blocks, each of these blocks has this header: ogfs_jdata_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_JD|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_JD || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ 4. Resource Group Index Entry (ogfs_rindex_t) ogfs_rindex_t: 0 +-------------------+ | starting block | Starting block of the resource group 8 +-------------------+ | num. header blks. | Size of resource group header in blocks 12 +-------------------+ | pad | Padding bytes 16 +-------------------+ | first data loc. | First data block in resource group 24 +-------------------+ | num. data blks. | Number of data block in resource group 28 +-------------------+ | bitmap size | Size of block bitmap in bytes 32 +-------------------+ | reserved | Reserved for future use (char[]) 96 +-------------------+ 5. 
Resource Group Header (ogfs_rgrp_t) ogfs_rgrp_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_RG|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_RG || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ | flags | No flags yet 28 +-------------------+ | num. free | Number of free data blocks 32 +-------------------+ | num. dinodes | Number of dinodes 36 +-------------------+ | num. free dinodes | Number of unused dinodes 40 +-------------------+ | free dinode list | List of unused dinodes (ogfs_inum_t) 56 +-------------------+ | num. used meta | Number of used meta data blocks (except dinodes) 60 +-------------------+ | num. free meta | Number of unused meta data blocks 64 +-------------------+ | reserved | Reserved for future use (char[]) 128 +-------------------+ 6. Bitmap Block Header (ogfs_jdata_t) ogfs_rgrpbits_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_RB|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_RB || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ 7. Journal Index Block Header (ogfs_jdata_t) Same as a resource group index block header. 8. Journal Index Entry (ogfs_jindex_t) ogfs_jindex_t: 0 +-------------------+ | starting block | Starting block of the journal 8 +-------------------+ | num. segments | Number of segments in the journal 12 +-------------------+ | pad | Padding bytes 16 +-------------------+ | reserved | Reserved for future use (char[]) 80 +-------------------+ 9. Dinode (ogfs_dinode_t) ogfs_dinode_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_DI|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_DI || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ | dinode num. | Number of this dinode* 40 +-------------------+ | mode | UNIX mode and permission bits 44 +-------------------+ | owner uid | Owner's user id 48 +-------------------+ | owner gid | Owner's group id 52 +-------------------+ | link count | Number of dinodes of this file (hard link count) 56 +-------------------+ | file size | File size in bytes 64 +-------------------+ | num blocks | Number of blocks in file (1 = stuffed dinode) 72 +-------------------+ | atime | Last access time (int64) 80 +-------------------+ | mtime | Last modification time (int64) 88 +-------------------+ | ctime | Last change time (int64) 96 +-------------------+ | device major num. | Device major number 100 +-------------------+ | device minor num. | Device minor number 104 +-------------------+ | res. group blk. | First block of dinode's resource group 112 +-------------------+ | goal resource grp.| Resource group to allocate next block from 120 +-------------------+ | goal data blk. | block num. in resource group to allocate data 124 +-------------------+ | goal meta blk. | block num. 
in to allocate meta data 128 +-------------------+ | flags | Flags** 132 +-------------------+ | payload format | Content type of the dinode (see below***) 136 +-------------------+ | file type | Type of file (FILE_XXX macros in ogfs_ondisk.h) 138 +-------------------+ | tree height | Height of data block tree (0 = stuffed dinode) 140 +-------------------+ | incarnation | Incarnation number 144 +-------------------+ | pad | Padding bytes 146 +-------------------+ | directory depth | Number of used bits in hashtable (only in dirs) 148 +-------------------+ | num. direntries | Number of entries in directory (only in dirs) 152 +-------------------+ | next unused dinum | Next unused dinode (in unused dinodes; ogfs_inum_t) 168 +-------------------+ | reserved | Reserved for future use (char[]) 232 +-------------------+ * The formal dinode number is set to zero when the block is first converted to a dinode. It is set to the block address as soon as a file or directory is created in the dinode and never changes again. ** Flag bits OGFS_DIF_JDATA (0x01) 1: Set for the journal and resource index and all directory inodes 0: File dinode This flag is used only in the ogfsck program. OGFS_DIF_EXHASH (0x02) 1: Directory uses a hash table 0: Directory is stuffed into the dinode OGFS_DIF_UNUSED (0x04) 1: Dinode is unused. A list of unused dinodes is kept in the resource group header. 0: Dinode is in use *** Valid payload formats OGFS_FORMAT_JI (1000) Used in the journal index dinode. OGFS_FORMAT_RI (1100) Used in the resource group index dinode. OGFS_FORMAT_DE (1200) Used in stuffed directory inodes. none (0) Used for directory inodes with a hash table and file inodes. 10. Indirect Block Header (ogfs_indirect_t) ogfs_indirect_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_IN|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_IN || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ | reserved | Reserved for future use (char[]) 88 +-------------------+ 11. Directory Leaf Node Block Header (ogfs_leaf_t) ogfs_leaf_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_LF|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_LF || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ | depth | Number of hash table bits used to address the leaf 26 +-------------------+ | num. entries | Number of directory entries in this leaf node 28 +-------------------+ | entry format | Always OGFS_FORMAT_DE 32 +-------------------+ | next leaf | Next leaf block address if this one overflowed 40 +-------------------+ | reserved | Reserved for future use (char[]) 104 +-------------------+ 12. 
Directory Entry (ogfs_dirent_t) ogfs_dirent_t: 0 +-------------------+ | dinum | Dinode number* 16 +-------------------+ | file name hash | Hash of the file name 20 +-------------------+ | length | Length of the directory entry structure in bytes 22 +-------------------+ | name length | Length of the file name in bytes 24 +-------------------+ | file type | Type of the dinode this directory entry points to 26 +-------------------+ | reserved | Reserved for future use (char[]) 40 +-------------------+ | file name | File name +-------------------+ | pad | Padding bytes (to multiples of 8 bytes) +-------------------+ * The formal dinode number is set to zero when the last directory entry in a leaf is deleted. 13. Journal Segment Header (ogfs_log_header_t) ogfs_log_header_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_LH|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_LH || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ | flags | Flags* 28 +-------------------+ | pad | Padding bytes 32 +-------------------+ | trans first block | !!!??? Block num. of first hdr in this trans 40 +-------------------+ | trans seq. number | Sequence number of trans this segment belongs to 48 +-------------------+ | log tail block | !!!??? /* Block number of log tail */ 56 +-------------------+ | last nopd block | !!!??? /* block number of last nopen dump */ 64 +-------------------+ | reserved | Reserved for future use (char[]) 128 +-------------------+ * Flag bits OGFS_LOG_HEAD_UNMOUNT (0x01) 1: The journal was unmounted cleanly 0: The journal is not clean (mounted or crashed while mounted) 14. Journal Chunks (ogfs_log_descriptor_t) ogfs_log_descriptor_t: 0 ++-----------------++ || 0x01161970 || Magic number 4 |+-----------------+| || OGFS_METATYPE_LD|| Type of header 8 |+-----------------+| || generation || Generation number 16 |+-----------------+| || OGFS_FORMAT_LD || Format of header 20 |+-----------------+| || pad || Padding bytes 24 ++-----------------++ | ld_type | !!!??? Type of data in this log entry chunk* 28 +-------------------+ | ld_length | Number of buffers in this chunk 32 +-------------------+ | ld_data1 | Log entry chunk type specific data** 36 +-------------------+ | ld_data2 | Log entry chunk type specific data** 40 +-------------------+ | reserved | Reserved for future use (char[]) 104 +-------------------+ *!!! **!!!
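All of the structures in this appendix begin with the generic header from section 1, and all fields are stored in big endian byte order as described under "General Information". The sketch below illustrates how such a header can be converted to host byte order and checked for the magic number. The struct layout follows the byte offsets given in section 1, but the field names and the helper are an illustrative reconstruction, not the definitions from fs/ogfs_ondisk.h.

generic header conversion (illustrative C sketch):

    #include <stdint.h>
    #include <endian.h>              /* be32toh()/be64toh(), glibc */

    #define OGFS_MAGIC 0x01161970u   /* magic number, see appendix A.1 */

    struct meta_header_ondisk {      /* 24 bytes, cf. ogfs_meta_header_t */
        uint32_t mh_magic;           /* 0x01161970 */
        uint32_t mh_type;            /* OGFS_METATYPE_XX */
        uint64_t mh_generation;      /* generation number */
        uint32_t mh_format;          /* OGFS_FORMAT_XX */
        uint32_t mh_pad;             /* padding bytes */
    };

    /* Convert a header read from disk to host byte order and check the
     * magic number.  Returns 0 on success, -1 on a bad magic number. */
    static int meta_header_in(struct meta_header_ondisk *h)
    {
        h->mh_magic      = be32toh(h->mh_magic);
        h->mh_type       = be32toh(h->mh_type);
        h->mh_generation = be64toh(h->mh_generation);
        h->mh_format     = be32toh(h->mh_format);

        return h->mh_magic == OGFS_MAGIC ? 0 : -1;
    }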