IBM General Parallel File System - Architecture


GPFS provides high performance by allowing data to be accessed over multiple computers at once. Most existing file systems are designed for a single server environment, and adding more file servers does not improve performance. GPFS provides higher input/output performance by "striping" blocks of data from individual files over multiple disks, and reading and writing these blocks in parallel. Other features provided by GPFS include high availability, support for heterogeneous clusters, disaster recovery, security, DMAPI, HSM and ILM.

According to (Schmuck and Haskin), a file that is written to the filesystem is broken up into blocks of a configured size, less than 1 megabyte each. These blocks are distributed across multiple filesystem nodes, so that a single file is fully distributed across the disk array. This results in high reading and writing speeds for a single file, as the combined bandwidth of the many physical drives is high. This makes the filesystem vulnerable to disk failures -any one disk failing would be enough to lose data. To prevent data loss, the filesystem nodes have RAID controllers — multiple copies of each block are written to the physical disks on the individual nodes. It is also possible to opt out of RAID-replicated blocks, and instead store two copies of each block on different filesystem nodes.

Other features of the filesystem include

  • Distributed metadata, including the directory tree. There is no single "directory controller" or "index server" in charge of the filesystem.
  • Efficient indexing of directory entries for very large directories. Many filesystems are limited to a small number of files in a single directory (often, 65536 or a similar small binary number). GPFS does not have such limits.
  • Distributed locking. This allows for full Posix filesystem semantics, including locking for exclusive file access.
  • Partition Aware. The failure of the network may partition the filesystem into two or more groups of nodes that can only see the nodes in their group. This can be detected through a heartbeat protocol, and when a partition occurs, the filesystem remains live for the largest partition formed. This offers a graceful degradation of the filesystem — some machines will remain working.
  • Filesystem maintenance can be performed online. Most of the filesystem maintenance chores (adding new disks, rebalancing data across disks) can be performed while the filesystem is live. This ensures the filesystem is available more often, so keeps the supercomputer cluster itself available for longer.

It is interesting to compare this with Hadoop's HDFS filesystem, which is designed to store similar or greater quantities of data on commodity hardware — that is, datacenters without RAID disks and a Storage Area Network (SAN).

  1. HDFS also breaks files up into blocks, and stores them on different filesystem nodes.
  2. HDFS does not expect reliable disks, so instead stores copies of the blocks on different nodes. The failure of a node containing a single copy of a block is a minor issue, dealt with by re-replicating another copy of the set of valid blocks, to bring the replication count back up to the desired number. In contrast, while GPFS supports recovery from a lost node, it is a more serious event, one that may include a higher risk of data being (temporarily) lost.
  3. GPFS makes the location of the data transparent — applications are not expected to know or care where the data lies. In contrast, Google GFS and Hadoop HDFS both expose that location, so that MapReduce programs can be run near the data.
  4. GPFS supports full Posix filesystem semantics. HDFS and GFS do not support full Posix compliance.
  5. GPFS distributes its directory indices and other metadata across the filesystem. Hadoop, in contrast, keeps this on the Primary and Secondary Namenodes, large servers which must store all index information in-RAM.
  6. GPFS breaks files up into small blocks. Hadoop HDFS likes blocks of 64 MB or more, as this reduces the storage requirements of the Namenode. Small blocks or many small files fill up a filesystem's indices fast, so limit the filesystem's size.

Read more about this topic:  IBM General Parallel File System

Famous quotes containing the word architecture:

    They can do without architecture who have no olives nor wines in the cellar.
    Henry David Thoreau (1817–1862)