Suffix Array - Applications

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern within the string . Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two binary searches. The first search locates the starting position of the interval, and the second one determines the end position:

def search(P): l = 1; r = n + 1 while l < r: mid = (l+r) / 2 if P > suffixAt(A): l = mid + 1 else: r = mid s = l; r = n + 1 while l < r: mid = (l+r) / 2 if P == suffixAt(A): l = mid else: r = mid - 1 return (s, r)

Finding the substring pattern of length in the string of length takes time, given that a single suffix comparison needs to compare characters. Manber & Myers (1990) describe how this bound can be improved to time using LCP information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. Abouelhoda, Kurtz & Ohlebusch (2004) improve the bound even further and achieve a search time of as known from suffix trees.

Suffix sorting algorithms can be used to compute the Burrows–Wheeler transform (BWT). The BWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated BWT matrix corresponds to the order of suffixes in a suffix array. The BWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the BWT string: .

Suffix arrays can also be used to look up substrings in Example-Based Machine Translation, demanding much less storage than a full phrase table as used in Statistical machine translation.

Many additional applications of the suffix array require the LCP array. Some of these are detailed in the application section of the latter.

Read more about this topic:  Suffix Array