Suffix Tree - Implementation

Implementation

If each node and edge can be represented in space, the entire tree can be represented in space. The total length of all the strings on all of the edges in the tree is, but each edge can be stored as the position and length of a substring of S, giving a total space usage of computer words. The worst-case space usage of a suffix tree is seen with a fibonacci word, giving the full nodes.

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running time properties. We are interested in:

  • The cost of finding the child on a given character.
  • The cost of inserting a child.
  • The cost of enlisting all children of a node (divided by the number of children in the table below).

Let be the size of the alphabet. Then you have the following costs:

Lookup Insertion Traversal
Sibling lists / unsorted arrays
Hash maps
Balanced search tree
Sorted arrays
Hash maps + sibling lists

Note that the insertion cost is amortised, and that the costs for hashing are given perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8 (For array including LCP values built within 32-bit address space and 8-bit characters. This factor depends on this properties and may reach 2 with usage of 4-byte wide characters (needed to contain any symbol in some UNIX-like systems, see wchar_t) on 32-bit systems), and researchers have continued to find smaller indexing structures.

Read more about this topic:  Suffix Tree