About | Buy Stuff | Industry Watch | Learning Curve | Products | Search | Twitter
Home » Learning Curve

HFS+

Taking a walk in the orchard.


Every disk access on OS X goes to disk twice. At least twice.

That's right: accessing a file under OS X is slow. It has to be. OS X uses the HFS+ file system.

HFS+ stands for 'HFS Plus'. HFS stands for 'Hierarchical File System'.

It may sound like Apple first came up with the idea of a hierarchical file system, but that would be way off the mark.

The popularisation of the idea can most likely be attributed to Ken Thompson and Dennis Ritchie, creators of Unix.

Apple's HFS+ has very little to do with the Unix file system (UFS).

Users are given the option of installing OS X with UFS, but such a move is almost guaranteed to leave the computer in a state of ruin. The only file system an Apple computer running OS X can reasonably use is HFS+.

MFS

HFS+ builds on HFS, Apple's original 'hierarchical file system', the re-write of the best forgotten Macintosh Filing System (MFS), which Apple researched for two years prior to the release of the first Macintosh.

MFS was a true train wreck of an idea. An MFS disk contained no directories whatsoever. The user of MFS was led to believe there were in fact directories, as the Open and Save As dialogs intentionally gave this impression, but on disk everything was one big glorious mess.

The wonder is that it took the crew in Cupertino two years to realise they had a disaster of an idea on their hands.

HFS is what happened when MFS was finally scrapped.

Interlude (UFS)

The Unix file system is deceptively simple. Directories contain almost no information whatsoever. All they contain are the file names and the so-called inode numbers.

Older Unix systems allowed file names of up to 14 characters, preceded by a 16-bit inode number. Directory entries were of a static length of 16 bytes. Directories could be read with the Unix program od (octal dump).

Later Unix systems admit of file names of up to 255 characters and use a 32-bit inode system.

The inode list on any Unix device contains records for all the files on the device. The size of an inode block is static. Current sizes are approximately 128 bytes.

All the file system driver has to do when accessing a file is note the inode number, multiply that by the size of the inode block (eg 128), and it's got the offset of the file's inode block in the inode list, and thereby its physical location.

It is this physical location which the driver uses to access the complete file information - in a single disk access operation.

Deceptively simple.

HFS

Back in the days when the MS-DOS file system was waiting for a 32-bit upgrade, Apple's HFS was mucking about with the same capacities. Any disk on either a PC or a Mac would admit of approximately 65,536 files.

But where directory information on MS-DOS was linear, the engineers at Apple decided to introduce a different approach: the made-to-order binary tree.

A binary tree is similar to a binary search. A binary search performs simple comparison operations on a target that can be assumed to be sorted in a known manner. Each time a comparison is made, half of the target is eliminated. Because the search halves the target for any one comparison, a match - or the lack of a match - can be arrived at a lot faster than if one compared every single item.

The key here is that the target must be sorted in a manner agreeable to the search method used. This is the binary tree of HFS.

An assumption, when using this method, instead of the linear approach of both MS-DOS and Unix, is that disk directories can be overstuffed with several thousand files.

As the disk access is the bottleneck, any algorithm that reduces the number of disk accesses for any one user request is going to be faster. The time needed for CPU operations is insignificant in this context.

On high-performance Unix systems, the inode idea still works, because the driver can immediately know exactly where all the file information is located: it's the beginning of the inode list plus an offset equal to the inode number multiplied by the size of the inode block.

Would that things were so simple in Cupertino...

The Binary Apple Orchard

In order to work, the HFS binary tree has to be pre-sorted in a benign way. The top root node of the tree must always point to the middle of the tree, seen from the sorting algorithm used.

If a new file is introduced, so that the current root of the tree is no longer in the middle of the sort, the entire tree must be reorganised.

Every binary tree node has children which reach all the way down to the actual file data. The middle value becomes the root of the tree. Whenever a new file is added to a directory, the entire directory tree must be rewritten, to ensure that the root is in the middle.

This must be done every time a directory is 'edited': every time a file name changes, every time a file is added, every time a file is deleted.

When files are moved from one directory to another, both binary trees for both affected directories must be rewritten.

To alleviate a little of the pain, Apple engineers invented a new sort of binary tree which they call the B*-tree. This is a curious animal to say the least: although the name 'binary' implies 'two', the Apple B*-tree has not two, but three nodes.

Even more incomprehensible, the new third node is never used [sic]. This new node has a value, but the tree is constructed in such as way that no node ever uses it. All searches through the tree must have keys which always sort higher [sic].

Only the nodes at the very bottom of the tree, the so-called leaf nodes, have any file information. All the rest of the nodes merely point onward (downward) ultimately to these leaf nodes.

Lost in the Orchard

A directory entry contains the file name, the file type, the creator type (proprietary to the Mac), the modification date and time, and a pointer to the entry's actual contents. This information is only found at the bottom of the tree - in the 'leaf' nodes.

All the other nodes have only file names and pointers to other nodes.

Because nodes can never sort below their new third pointer (never point 'left'), every index level must repeat the lowest numbered value that pointed to it, and the root node has to start with the lowest numbered value in the entire tree.

For if the directory is not kept in this condition, sooner or later something will point 'to the left', and by definition, Apple B*-trees must never point left...

In each node, starting with root, you find the record with the greatest value less than or equal to the value you're after. You then drop down to where that record points and repeat the comparison process until you reach a leaf node at the bottom. At that point you might find the value you want or not. If you don't find it, you know (in theory) that the file you are looking for does not exist.

When the nodes fill up, the file system has to create a new node and then re-sort the whole tree 'from left to right' in each node and in every level of the tree.

To HFS, disk space is parceled out in 'allocation blocks' (clusters), which are always a multiple of the disk sector or disk block. An HFS drive is limited, by virtue of the storage used for denoting a block index, to 64K allocation blocks. HFS+, which uses 32 bits instead of 16, can handle 4MB of allocation blocks.

Mutation in the Orchard

HFS+ was the expansion of HFS to take into account the use of larger storage devices and larger files on those storage devices. Put simply, the indexes and file size fields are increased to 32 bits each, roughly as MS-DOS and Unix eventually did.

But because the HFS B*-tree system would otherwise be too lugubrious on these bigger disks, the catalog file  was invented.

All the nodes for an HFS+ directory are stored in a catalog file, the location of which on the drive is fixed. Drivers can find this location by consulting the disk configuration found in block 2.

This block contains the creation and modification dates and times for the volume, the count of files and folders on the volume, the disk size, the number of free blocks, where the driver is to put the next file created, and where it can find the catalog file.

HFSPlusVolumeHeader

The official HFS+ volume header definition follows. UInt16 is a 16-bit unsigned integer, UInt32 a 32-bit unsigned integer, and so forth.

struct HFSPlusVolumeHeader {
    UInt16              signature;
    UInt16              version;
    UInt32              attributes;
    UInt32              lastMountedVersion;
    UInt32              reserved;

    UInt32              createDate;
    UInt32              modifyDate;
    UInt32              backupDate;
    UInt32              checkedDate;

    UInt32              fileCount;
    UInt32              folderCount;

    UInt32              blockSize;
    UInt32              totalBlocks;
    UInt32              freeBlocks;

    UInt32              nextAllocation;
    UInt32              rsrcClumpSize;
    UInt32              dataClumpSize;
    HFSCatalogNodeID    nextCatalogID;

    UInt32              writeCount;
    UInt64              encodingsBitmap;

    UInt8               finderInfo[32];

    HFSPlusForkData     allocationFile;
    HFSPlusForkData     extentsFile;
    HFSPlusForkData     catalogFile;
    HFSPlusForkData     attributesFile;
    HFSPlusForkData     startupFile;
};
typedef struct HFSPlusVolumeHeader HFSPlusVolumeHeader;

[Note that the file system has to scoot back to block 2 all the time to keep modifyDate, fileCount, folderCount, freeBlocks, nextAllocation, and nextCatalogID current. That's a lot of wear and tear on a disk used with an operating system (Unix) that invented 'lazy write' for a very good reason. If you find your OS X hard drives crashing on you, this might be why.]

catalogFile Layout

The catalog file (catalogFile) is grouped into nodes just like the directory tree. The size of a node is set when a volume is initialised. It must be a power of two between 9 and 32 - yielding a block size of 512 to 32,768 bytes.

Each node begins with a 14-byte descriptor, which contains information about what kind of node (index, leaf), how many records it has, where it is in the B*-tree, where the next and previous nodes are, and so on.

All the above information is used whenever the file system has to 'balance' the tree again.

Following the descriptor are the records - as many as fit. The two bytes at the end of the node point back to the location of the first record in the node, which is always at offset 14. The two bytes which immediately precede these final two bytes point to the second record in the node.

The effect is that records build from the front of the node (at offset 14) towards the middle of the node, while pointers to these records at the end of the node work backwards towards that same middle.

Naturally, if a file or directory is deleted, its record must be deleted as well, and the records leading from offset 14 must be written anew, as must the pointers at the end.

Put all these nodes together, and you have the 'catalog file'. The catalog file contains a header node that describes the entire B*-tree structure, including the size of each node and which nodes are in use.

The header node also knows the depth of the tree, and the number of the root where searches must start.

Empty catalogFiles

HFS+ drives are initialised with an empty catalog file with nodes ready to be used. They're pre-allocated to prevent fragmenting, and when this initial structure is all used up, the system extends the catalog file to make room for more nodes.

The number of each node is fixed, as is its position in the catalog file. The node can, however, shift position in the tree, depending on the values in other nodes.

OS X tries to cache as much of this catalog file in RAM as possible, but the catalog file can take upwards of 20MB. So a deep tree search can end up jumping around through the extents of the catalog file on disk.

The good news, according to Apple engineers, is that their system will guarantee that with a file system with 500,000 files, you'll still find what you're looking for in at most nineteen node hops.

[The good news with the Unix file system is that it's never even as much as two.]

Apple engineers also assure us that most HFS+ disks never have more than three or four catalog file extents. Still, they admit to understanding that the upkeep of the catalog file means a lot of wear and tear on the disk head which is jumping back and forth all the time...

CNIDs

The HFS+ node keys are so-called CNIDs  and file names concatenated together. The catalog node ID (CNID) is a unique 32-bit number between 10h and ffffffffh. Each CNID is unique per disk. If a file is deleted, its CNID may never be reused.

CNIDs are usually assigned sequentially. If you reach CNID ffffffffh, no matter that the disk be empty, it can no longer be used.

If this should ever happen (with a file server, whatever), you have to back up and restore the drive, whereupon each file gets a new CNID.

The CNID is why Carbon and older legacy Mac software cannot access files with a path. These applications need to specify the CNID for the containing folder. Getting the CNID for the containing folder is of course another matter...

The key to accessing a file is the parent folder's CNID concatenated together with the file name.

If a program is looking for a folder named 'Kerberos' in a parent folder with CNID 65,534, the search key would be the 32-bit value 0000fffeh followed immediately by the Unicode form of 'Kerberos'.

00|00|00|00|0f|0f|0f|0e|00|4b|00|65|00|72|00|62|00|65|00|72|00|6f|00|73|00|00

Implications, Conclusions

Thus the search for a file must always start with searching for the CNID of the parent folder - and if it has a parent, finding that folder's CNID, and so forth, until one arrives at the Unix 'root' directory ('/').

Thus all files on disk end up being sorted in the above order in the B*-tree. The effect is a sort by sub-folder, so listing a folder's contents is a matter of walking the forward and backward links between index nodes at the right level.

The implication of this 'caveat' is that a disk access in Mac OS X requires at least two searches through the catalog file tree: the first to determine the parent folder's CNID (this can be recursive), and the second (or final) to find the file once the CNID of the parent folder can be concatenated with the file name.


Word has it the NextStep people in Cupertino hate HFS+. How very strange.

NextStep of course used UFS - not HFS+.

Had Apple used UFS instead of HFS+, OS X might have been released several years sooner, Mac hardware would today last a lot longer, and Mac file systems would crash a lot less often.

About | ACP | Buy | Forum | Industry Watch | Learning Curve | Search | Twitter | Xnews
Copyright © Rixstep. All rights reserved.