Sequence Mining - String Mining

String Mining

String mining typically deals with a limited alphabet for items that appear in a sequence, but the sequence itself may be typically very long. Examples of an alphabet can be those in the ASCII character set used in natural language text, nucleotide bases 'A', 'G', 'C' and 'T' in DNA sequences, or amino acids for protein sequences. In biology applications analysis of the arrangement of the alphabet in strings can be used to examine gene and protein sequences to determine their properties. Knowing the sequence of letters of a DNA a protein is not an ultimate goal in itself. Rather, the major task is to understand the sequence, in terms of its structure and biological function. This is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between the strings becomes complicated when insertions, deletions and mutations occur in a string.

A survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented in the paper String Mining in Bioinformatics, which include:

  • Repeat-related problems: that deal with operations on single sequences and can be based on exact string matching or approximate string matching methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
  • Alignment problems: that deal with comparison between strings by first aligning one or more sequences; examples of popular methods include BLAST for comparing a single sequence with multiple sequences in a database, and ClustalW for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See sequence alignment.

Read more about this topic:  Sequence Mining

Famous quotes containing the words string and/or mining:

    A culture may be conceived as a network of beliefs and purposes in which any string in the net pulls and is pulled by the others, thus perpetually changing the configuration of the whole. If the cultural element called morals takes on a new shape, we must ask what other strings have pulled it out of line. It cannot be one solitary string, nor even the strings nearby, for the network is three-dimensional at least.
    Jacques Barzun (b. 1907)

    It’s a mining town in lotus land.
    F. Scott Fitzgerald (1896–1940)