GOR Method - Algorithm

Algorithm

The mathematics and algorithm of the GOR method build on an earlier series of studies by Robson and colleagues, reported mainly in the Journal of Molecular Biology and The Biochemical Journal. The latter describes the information-theoretic expansions in terms of conditional information measures. The word "simple" in the title of the GOR paper reflected the fact that the earlier methods relied on proofs and techniques that were somewhat daunting because they were unfamiliar in protein science in the early 1970s; even Bayesian methods were then unfamiliar and controversial.
An important feature of these early studies, which survived in the GOR method, was the treatment of the sparse protein sequence data of the early 1970s by expected information measures, that is, expectations on a Bayesian basis taken over the distribution of plausible information measure values given the actual frequencies (numbers of observations). The expectation measures resulting from integration over this and similar distributions may now be seen as composed of "incomplete" or extended zeta functions, e.g. z(s, observed frequency) − z(s, expected frequency), with the incomplete zeta function z(s, n) = 1 + (1/2)^s + (1/3)^s + (1/4)^s + … + (1/n)^s. The GOR method used s = 1.
Also, in the GOR method and the earlier methods, the measure for the contrary state to, e.g., helix H, i.e. ~H, was subtracted from that for H, and similarly for beta sheet, turns, and coil or loop. Thus the method can be seen as employing a zeta function estimate of log predictive odds. An adjustable decision constant could also be applied, which implies a decision theory approach; the GOR method allowed the option to use decision constants to optimize predictions for different classes of protein.
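The quantities described above can be illustrated with a minimal Python sketch. It is not the original GOR implementation: the function names, the restriction to integer frequencies, the sign convention for the decision constant, and the example counts are all assumptions made for illustration.

```python
def incomplete_zeta(s, n):
    """z(s, n) = 1 + (1/2)^s + ... + (1/n)^s, with z(s, 0) = 0.

    Assumes n is a non-negative integer count.
    """
    return sum(k ** -s for k in range(1, n + 1))


def expected_information(observed, expected, s=1):
    """z(s, observed frequency) - z(s, expected frequency)."""
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


def predictive_odds_estimate(obs_state, exp_state, obs_contrary, exp_contrary,
                             s=1, decision_constant=0.0):
    """Measure for a state (e.g. H) minus the measure for its contrary (~H).

    Subtracting the decision constant is an assumed convention; a positive
    result would favour the state under this sketch.
    """
    for_state = expected_information(obs_state, exp_state, s)
    for_contrary = expected_information(obs_contrary, exp_contrary, s)
    return for_state - for_contrary - decision_constant


# Hypothetical counts: a residue seen 4 times in helix where 2 were expected,
# and 2 times outside helix where 4 were expected.
score = predictive_odds_estimate(4, 2, 2, 4)
print("predict H" if score > 0 else "predict ~H", round(score, 4))
```

With s = 1 the incomplete zeta function is a harmonic number, so each measure is a difference of two short sums and stays finite even when a count is zero, which is what makes it usable on sparse data.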
The expected information measure used as a basis for the information expansion was less important by the time the GOR method was published, because protein sequence data had become more plentiful, at least for the terms considered at that time. For s = 1, the expression z(s, observed frequency) − z(s, expected frequency) approaches the natural logarithm of (observed frequency / expected frequency) as the frequencies increase. However, this measure (including the use of other values of s) remains important in later, more general applications with high-dimensional data, where data for the more complex terms in the information expansion are inevitably sparse.
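The large-count behaviour for s = 1 can be checked numerically. The sketch below is an illustration only: the helper names and the counts (an observed frequency fixed at twice the expected frequency) are assumptions. It prints the gap between the zeta-based measure and the natural logarithm of the frequency ratio as the counts grow.

```python
import math


def incomplete_zeta(s, n):
    """z(s, n) = 1 + (1/2)^s + ... + (1/n)^s for a non-negative integer n."""
    return sum(k ** -s for k in range(1, n + 1))


def gap_to_log_ratio(observed, expected):
    """|z(1, observed) - z(1, expected) - ln(observed / expected)|."""
    measure = incomplete_zeta(1, observed) - incomplete_zeta(1, expected)
    return abs(measure - math.log(observed / expected))


# Keep the ratio observed/expected at 2 while scaling the counts up.
for scale in (1, 10, 100, 1000):
    print(scale, gap_to_log_ratio(2 * scale, scale))
```

The gap shrinks as the counts increase, which matches the statement that the expected information measure matters most when the data are sparse.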