Btrfs - Design

Design

Ohad Rodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used as on-disk data structures for databases, could not efficiently support copy-on-write-based snapshots because its leaf nodes were linked together: if a leaf was copy-on-written, its siblings and parents would have to be as well, as would their siblings and parents and so on until the entire tree was copied. He suggested instead a modified B-tree (which has no leaf linkage), with a refcount in each tree node and certain relaxations to the tree's balancing algorithms to make them copy-on-write friendly. The result would be a data structure suitable for a high-performance object store that could perform copy-on-write snapshots, while maintaining good concurrency.

At Oracle later that year, Chris Mason began work on a snapshot-capable file system that would use this data structure almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on-write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.

Btrfs is structured as several layers of such trees, all using the same B-tree implementation. The trees store generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object id. The middle 8 bits are an item type field; its use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values, objects can further put items of the same type in a particular order.

Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.

Read more about this topic:  Btrfs

Famous quotes containing the word design:

    If I knew for a certainty that a man was coming to my house with the conscious design of doing me good, I should run for my life ... for fear that I should get some of his good done to me,—some of its virus mingled with my blood.
    Henry David Thoreau (1817–1862)

    Delay always breeds danger; and to protract a great design is often to ruin it.
    Miguel De Cervantes (1547–1616)

    I always consider the settlement of America with reverence and wonder, as the opening of a grand scene and design in providence, for the illumination of the ignorant and the emancipation of the slavish part of mankind all over the earth.
    John Adams (1735–1826)