FM-index - FM-index Data Structure

FM-index Data Structure

An FM-index is created by first taking the Burrows-Wheeler transform (BWT) of the input text. For example, the BWT of the string T = "abracadabra" is "ard$rcaaaabb", and here it is represented by the matrix M where each row is a rotation of the text that has been sorted. The transform corresponds to the last column labeled L.

F L
$ abracadabr a
a $abracadab r
a bra$abraca d
a bracadabra $
a cadabra$ab r
a dabra$abra c
b ra$abracad a
b racadabra$ a
c adabra$abr a
d abra$abrac a
r a$abracada b
r acadabra$a b

The BWT in itself allows for some compression with, for instance, move to front and Huffman encoding, but the transform has even more uses. The rows in the matrix are essentially the sorted suffixes of the text and the first column F of the matrix shares similarities with suffix arrays. How the suffix array relates to the BWT lies at the heart of the FM-index.

C of "ard$rcaaaabb"
c $ a b c d r
C 0 1 6 8 9 10

It is possible to make a last-to-first column mapping LF(i) from a character L to F, with the help of a table C and a function Occ(c, k). C is a table that, for each character c in the alphabet, contains the number of occurrences of lexically smaller characters in the text. The function Occ(c, k) is the number of occurrences of character c in the prefix L. Ferragina and Manzini showed that it is possible to compute Occ(c, k) in constant time.

Occ(c, k) of "ard$rcaaaabb"
a r d $ r c a a a a b b
1 2 3 4 5 6 7 8 9 10 11 12
$ 0 0 0 1 1 1 1 1 1 1 1 1
a 1 1 1 1 1 1 2 3 4 5 5 5
b 0 0 0 0 0 0 0 0 0 0 1 2
c 0 0 0 0 0 1 1 1 1 1 1 1
d 0 0 1 1 1 1 1 1 1 1 1 1
r 0 1 1 1 2 2 2 2 2 2 2 2

The last-to-first mapping can now be defined as LF(i) = C] + Occ(L, i). For instance, on row 9, L is a and the same a can be found on row 5 in the first column F, so LF(9) should be 5 and LF(9) = C + Occ(a, 9) = 5. For any row i of the matrix, the character in the last column L precedes the character in the first column F also in T. Finally, if L = T, then L = T, and using the equality it is possible to extract a string of T from L.

The FM-index itself is a compression of the string L together with C and Occ in some form, as well as information that maps a selection of indices in L to positions in the original string T.

Read more about this topic:  FM-index

Famous quotes containing the words data and/or structure:

    This city is neither a jungle nor the moon.... In long shot: a cosmic smudge, a conglomerate of bleeding energies. Close up, it is a fairly legible printed circuit, a transistorized labyrinth of beastly tracks, a data bank for asthmatic voice-prints.
    Susan Sontag (b. 1933)

    Man is more disposed to domination than freedom; and a structure of dominion not only gladdens the eye of the master who rears and protects it, but even its servants are uplifted by the thought that they are members of a whole, which rises high above the life and strength of single generations.
    Karl Wilhelm Von Humboldt (1767–1835)