Suffix Tree - External Construction

External Construction

Suffix trees quickly outgrow the main memory on standard machines for sequence collections in the order of gigabytes. As such, their construction calls for external memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However the overall intricacy of this algorithm has prevented, so far, its practical implementation.

On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few) GB/hours. The state of the art methods are TDD, TRELLIS, DiGeST, and B2ST.

TDD and TRELLIS scale up to the entire human genome – approximately 3GB – resulting in a disk-based suffix tree of a size in the tens of gigabytes,. However, these methods cannot handle efficiently collections of sequences exceeding 3GB. DiGeST performs significantly better and is able to handle collections of sequences in the order of 6GB in about 6 hours. . All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST, scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16GB RAM. On a simple Linux cluster with 16 nodes (4GB RAM per node), ERA can index the entire human genome in less than 9 minutes.

Read more about this topic:  Suffix Tree

Famous quotes containing the words external and/or construction:

    Given that external reality is a fiction, the writer’s role is almost superfluous. He does not need to invent the fiction because it is already there.
    —J.G. (James Graham)

    Striving toward a goal puts a more pleasing construction on our advance toward death.
    Mason Cooley (b. 1927)