FM-index - FM-index Data Structure

FM-index Data Structure

An FM-index is created by first taking the Burrows-Wheeler transform (BWT) of the input text. For example, the BWT of the string T = "abracadabra" is "ard$rcaaaabb", and here it is represented by the matrix M where each row is a rotation of the text that has been sorted. The transform corresponds to the last column labeled L.

F L
$ abracadabr a
a $abracadab r
a bra$abraca d
a bracadabra $
a cadabra$ab r
a dabra$abra c
b ra$abracad a
b racadabra$ a
c adabra$abr a
d abra$abrac a
r a$abracada b
r acadabra$a b

The BWT in itself allows for some compression with, for instance, move to front and Huffman encoding, but the transform has even more uses. The rows in the matrix are essentially the sorted suffixes of the text and the first column F of the matrix shares similarities with suffix arrays. How the suffix array relates to the BWT lies at the heart of the FM-index.

C of "ard$rcaaaabb"
c $ a b c d r
C 0 1 6 8 9 10

It is possible to make a last-to-first column mapping LF(i) from a character L to F, with the help of a table C and a function Occ(c, k). C is a table that, for each character c in the alphabet, contains the number of occurrences of lexically smaller characters in the text. The function Occ(c, k) is the number of occurrences of character c in the prefix L. Ferragina and Manzini showed that it is possible to compute Occ(c, k) in constant time.

Occ(c, k) of "ard$rcaaaabb"
a r d $ r c a a a a b b
1 2 3 4 5 6 7 8 9 10 11 12
$ 0 0 0 1 1 1 1 1 1 1 1 1
a 1 1 1 1 1 1 2 3 4 5 5 5
b 0 0 0 0 0 0 0 0 0 0 1 2
c 0 0 0 0 0 1 1 1 1 1 1 1
d 0 0 1 1 1 1 1 1 1 1 1 1
r 0 1 1 1 2 2 2 2 2 2 2 2

The last-to-first mapping can now be defined as LF(i) = C] + Occ(L, i). For instance, on row 9, L is a and the same a can be found on row 5 in the first column F, so LF(9) should be 5 and LF(9) = C + Occ(a, 9) = 5. For any row i of the matrix, the character in the last column L precedes the character in the first column F also in T. Finally, if L = T, then L = T, and using the equality it is possible to extract a string of T from L.

The FM-index itself is a compression of the string L together with C and Occ in some form, as well as information that maps a selection of indices in L to positions in the original string T.

Read more about this topic:  FM-index

Famous quotes containing the words data and/or structure:

    To write it, it took three months; to conceive it three minutes; to collect the data in it—all my life.
    F. Scott Fitzgerald (1896–1940)

    Women over fifty already form one of the largest groups in the population structure of the western world. As long as they like themselves, they will not be an oppressed minority. In order to like themselves they must reject trivialization by others of who and what they are. A grown woman should not have to masquerade as a girl in order to remain in the land of the living.
    Germaine Greer (b. 1939)