Algorithms For Calculating Variance - Online Algorithm

Online Algorithm

It is often useful to be able to compute the variance in a single pass, inspecting each value only once; for example, when the data are being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation. For such an online algorithm, a recurrence relation is required between quantities from which the required statistics can be calculated in a numerically stable fashion.

The following formulas can be used to update the mean and (estimated) variance of the sequence, for an additional element . Here, xn denotes the sample mean of the first n samples (x1, ..., xn), s2n their sample variance, and σ2N their population variance.

It turns out that a more suitable quantity for updating is the sum of squares of differences from the (current) mean, here denoted :

A numerically stable algorithm is given below. It also computes the mean. This algorithm is due to Knuth, who cites Welford, and it has been thoroughly analyzed. It is also common to denote and .

def online_variance(data): n = 0 mean = 0 M2 = 0 for x in data: n = n + 1 delta = x - mean mean = mean + delta/n M2 = M2 + delta*(x - mean) variance = M2/(n - 1) return variance

This algorithm is much less prone to loss of precision due to massive cancellation, but might not be as efficient because of the division operation inside the loop. For a particularly robust two-pass algorithm for computing the variance, first compute and subtract an estimate of the mean, and then use this algorithm on the residuals.

The parallel algorithm below illustrates how to merge multiple sets of statistics calculated online.

Read more about this topic:  Algorithms For Calculating Variance