Btrfs - Design

Design

Ohad Rodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used as on-disk data structures for databases, could not efficiently support copy-on-write-based snapshots because its leaf nodes were linked together: if a leaf was copy-on-written, its siblings and parents would have to be as well, as would their siblings and parents and so on until the entire tree was copied. He suggested instead a modified B-tree (which has no leaf linkage), with a refcount in each tree node and certain relaxations to the tree's balancing algorithms to make them copy-on-write friendly. The result would be a data structure suitable for a high-performance object store that could perform copy-on-write snapshots, while maintaining good concurrency.

At Oracle later that year, Chris Mason began work on a snapshot-capable file system that would use this data structure almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on-write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.

Btrfs is structured as several layers of such trees, all using the same B-tree implementation. The trees store generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object id. The middle 8 bits are an item type field; its use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values, objects can further put items of the same type in a particular order.

Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.

Read more about this topic:  Btrfs

Famous quotes containing the word design:

    Nowadays the host does not admit you to his hearth, but has got the mason to build one for yourself somewhere in his alley, and hospitality is the art of keeping you at the greatest distance. There is as much secrecy about the cooking as if he had a design to poison you.
    Henry David Thoreau (1817–1862)

    Westerners inherit
    A design for living
    Deeper into matter—
    Not without due patter
    Of a great misgiving.
    Robert Frost (1874–1963)

    Delay always breeds danger; and to protract a great design is often to ruin it.
    Miguel De Cervantes (1547–1616)