Algorithms For Calculating Variance - Higher-order Statistics

Higher-order Statistics

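Terriberry extends Chan's formulae to calculating the third and fourth central moments, needed for example when estimating skewness and kurtosis. With $\delta = \bar{x}_B - \bar{x}_A$ and $n_X = n_A + n_B$:

$\begin{align} M_{3,X} = M_{3,A} + M_{3,B} & + \delta^3\frac{n_A n_B \left(n_A - n_B\right)}{n_X^2} + 3\delta\frac{n_AM_{2,B} - n_BM_{2,A}}{n_X} \\ M_{4,X} = M_{4,A} + M_{4,B} & + \delta^4\frac{n_A n_B \left(n_A^2 - n_A n_B + n_B^2\right)}{n_X^3} \\ & + 6\delta^2\frac{n_A^2 M_{2,B} + n_B^2 M_{2,A}}{n_X^2} + 4\delta\frac{n_AM_{3,B} - n_BM_{3,A}}{n_X} \\ \end{align}$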

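Here the $M_k$ are again the sums of powers of differences from the mean, $\sum (x - \bar{x})^k$, giving

skewness: $g_1 = \frac{\sqrt{n} M_3}{M_2^{3/2}}$

kurtosis: $g_2 = \frac{n M_4}{M_2^2} - 3$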
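The pairwise combination can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `moments` and `combine_moments` and the tuple layout `(n, mean, M2, M3, M4)` are assumptions made for this example, and the second-moment and mean updates use Chan's pairwise formulae with $\delta = \bar{x}_B - \bar{x}_A$.

```python
def moments(data):
    """Two-pass reference: return (n, mean, M2, M3, M4) for a non-empty sequence."""
    n = len(data)
    mean = sum(data) / n
    M2 = sum((x - mean) ** 2 for x in data)
    M3 = sum((x - mean) ** 3 for x in data)
    M4 = sum((x - mean) ** 4 for x in data)
    return n, mean, M2, M3, M4


def combine_moments(a, b):
    """Merge the summaries of two disjoint partitions A and B into one for X."""
    n_a, mean_a, M2_a, M3_a, M4_a = a
    n_b, mean_b, M2_b, M3_b, M4_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Chan's pairwise update for the second central moment sum.
    M2 = M2_a + M2_b + delta**2 * n_a * n_b / n
    # Third and fourth central moment sums (the extension to higher orders).
    M3 = (M3_a + M3_b
          + delta**3 * n_a * n_b * (n_a - n_b) / n**2
          + 3 * delta * (n_a * M2_b - n_b * M2_a) / n)
    M4 = (M4_a + M4_b
          + delta**4 * n_a * n_b * (n_a**2 - n_a * n_b + n_b**2) / n**3
          + 6 * delta**2 * (n_a**2 * M2_b + n_b**2 * M2_a) / n**2
          + 4 * delta * (n_a * M3_b - n_b * M3_a) / n)
    return n, mean, M2, M3, M4
```

Because each summary is a fixed-size tuple, partitions can be processed independently (for example on separate workers) and merged afterwards in any order.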

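For the incremental case (i.e., $B = \{x\}$), this simplifies to:

$\begin{align} \delta & = x - m \\ m' & = m + \frac{\delta}{n} \\ M_2' & = M_2 + \delta^2\frac{n-1}{n} \\ M_3' & = M_3 + \delta^3\frac{(n-1)(n-2)}{n^2} - \frac{3\delta M_2}{n} \\ M_4' & = M_4 + \delta^4\frac{(n-1)(n^2 - 3n + 3)}{n^3} + \frac{6\delta^2 M_2}{n^2} - \frac{4\delta M_3}{n} \end{align}$

where $n$ is the number of samples including the new value $x$. By preserving the value $\delta / n$, only one division operation is needed and the higher-order statistics can thus be calculated for little incremental cost.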

An example of the online algorithm for kurtosis implemented as described is:

    def online_kurtosis(data):
        n = 0
        mean = 0
        M2 = 0
        M3 = 0
        M4 = 0

        for x in data:
            n1 = n                      # sample count before adding x
            n = n + 1
            delta = x - mean
            delta_n = delta / n         # the only division per sample
            delta_n2 = delta_n * delta_n
            term1 = delta * delta_n * n1
            mean = mean + delta_n
            # Update M4 first, then M3, then M2: each uses the old lower-order sums.
            M4 = M4 + term1 * delta_n2 * (n*n - 3*n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3
            M3 = M3 + term1 * delta_n * (n - 2) - 3 * delta_n * M2
            M2 = M2 + term1

        # Excess kurtosis; undefined (division by zero) if M2 is zero.
        kurtosis = (n*M4) / (M2*M2) - 3
        return kurtosis
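The same running sums also yield the skewness. The variant below is a sketch under stated assumptions: the function name `online_skew_kurt` is hypothetical, and returning `None` when $M_2 = 0$ (empty or constant input) is a choice made for this example rather than part of the algorithm as described.

```python
import math


def online_skew_kurt(data):
    """Return (skewness, excess kurtosis) in one pass, or None if undefined."""
    n = 0
    mean = M2 = M3 = M4 = 0.0
    for x in data:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Higher orders are updated first because they need the old M2 and M3.
        M4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3
        M3 += term1 * delta_n * (n - 2) - 3 * delta_n * M2
        M2 += term1
    if M2 == 0:
        # Assumption for this sketch: report "undefined" instead of dividing by zero.
        return None
    skewness = math.sqrt(n) * M3 / M2 ** 1.5
    kurtosis = n * M4 / (M2 * M2) - 3
    return skewness, kurtosis
```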

Pébay further extends these results to arbitrary-order central moments, for the incremental and the pairwise cases. One can also find there similar formulas for covariance.

Choi and Sweetman offer two alternative methods to compute the skewness and kurtosis, each of which can save substantial memory and CPU time in certain applications. The first approach is to compute the statistical moments by separating the data into bins and then computing the moments from the geometry of the resulting histogram, which effectively becomes a one-pass algorithm for higher moments. One benefit is that the statistical moment calculations can be carried out to arbitrary accuracy, so that the computations can be tuned to the precision of, e.g., the data storage format or the original measurement hardware. A relative histogram of a random variable can be constructed in the conventional way: the range of potential values is divided into bins, and the number of occurrences within each bin is counted and plotted such that the area of each rectangle equals the portion of the sample values within that bin:

$H(x_k) = \frac{h(x_k)}{A}$

where $h(x_k)$ and $H(x_k)$ represent the frequency and the relative frequency at bin $x_k$ and $A= \sum_{k=1}^{K} h(x_k) \,\Delta x_k$ is the total area of the histogram. After this normalization, the $n$ raw moments and central moments of $x(t)$ can be calculated from the relative histogram:

$m_n^{(h)} = \sum_{k=1}^{K} x_k^n \, H(x_k) \Delta x_k = \frac{1}{A} \sum_{k=1}^{K} x_k^n \, h(x_k) \Delta x_k$

$\theta_n^{(h)}= \sum_{k=1}^{K} \Big(x_k-m_1^{(h)}\Big)^n \, H(x_k)\Delta x_k = \frac{1}{A} \sum_{k=1}^{K} \Big(x_k-m_1^{(h)}\Big)^n \, h(x_k) \Delta x_k$

where the superscript $^{(h)}$ indicates the moments are calculated from the histogram. For constant bin width $\Delta x_k=\Delta x$ these two expressions can be simplified using $I= A/\Delta x$:

$m_n^{(h)}= \frac{1}{I} {\sum_{k=1}^{K} x_k^n \, h(x_k)}$

$\theta_n^{(h)}= \frac{1}{I}{\sum_{k=1}^{K} \Big(x_k-m_1^{(h)}\Big)^n \, h(x_k)}$
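For constant bin width, the two simplified sums translate directly into code. The sketch below assumes the histogram is given as two equal-length lists, bin centres $x_k$ and counts $h(x_k)$; the function name `histogram_moments` and the return layout are illustrative choices. It relies on the observation that $I = A/\Delta x = \sum_k h(x_k)$ when all bins have the same width.

```python
def histogram_moments(centers, counts, order=4):
    """Return (raw, central) lists where index n holds the n-th moment.

    centers: bin centres x_k; counts: frequencies h(x_k), constant bin width.
    """
    total = sum(counts)  # I = A / (bin width) = sum of h(x_k)
    raw = [
        sum(x**n * h for x, h in zip(centers, counts)) / total
        for n in range(order + 1)
    ]
    mean = raw[1]  # first raw moment
    central = [
        sum((x - mean) ** n * h for x, h in zip(centers, counts)) / total
        for n in range(order + 1)
    ]
    return raw, central


# Example: three bins of width 1 centred at 0.5, 1.5 and 2.5.
raw, central = histogram_moments([0.5, 1.5, 2.5], [1, 2, 1])
```

The results approximate the moments of the underlying samples only to the resolution of the bins, since every sample in a bin is treated as sitting at the bin centre.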

The second approach from Choi and Sweetman is an analytical methodology to combine statistical moments from individual segments of a time-history such that the resulting overall moments are those of the complete time-history. This methodology could be used for parallel computation of statistical moments with subsequent combination of those moments, or for combination of statistical moments computed at sequential times.

If $Q$ sets of statistical moments are known: $(\gamma_{0,q},\mu_{q},\sigma^2_{q},\alpha_{3,q},\alpha_{4,q}) \quad$ for $q=1,2,\ldots,Q$, then each $\gamma_n$ can be expressed in terms of the equivalent $n$ raw moments:

$\gamma_{n,q}= m_{n,q} \gamma_{0,q} \qquad \quad \textrm{for} \quad n=1,2,3,4 \quad \text{ and } \quad q = 1,2, \dots ,Q$

where $\gamma_{0,q}$ is generally taken to be the duration of the $q^{th}$ time-history, or the number of points if $\Delta t$ is constant.

The benefit of expressing the statistical moments in terms of $\gamma$ is that the $Q$ sets can be combined by addition, and there is no upper limit on the value of $Q$.

$\gamma_{n,c}= \sum_{q=1}^{Q}\gamma_{n,q} \quad \quad \textrm{for} \quad n=0,1,2,3,4$

where the subscript $_c$ represents the concatenated time-history or combined $\gamma$. These combined values of $\gamma$ can then be inversely transformed into raw moments representing the complete concatenated time-history

$m_{n,c}=\frac{\gamma_{n,c}}{\gamma_{0,c}} \quad \textrm{for} \quad n=1,2,3,4$

Known relationships between the raw moments and the central moments are then used to compute the central moments of the concatenated time-history. Finally, the statistical moments of the concatenated history are computed from the central moments:

$\mu_c=m_{1,c} \ \ \ \ \ \sigma^2_c=\theta_{2,c} \ \ \ \ \ \alpha_{3,c}=\frac{\theta_{3,c}}{\sigma_c^3} \ \ \ \ \ \alpha_{4,c}=\frac{\theta_{4,c}}{\sigma_c^4}$
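The whole combination procedure can be sketched as below. The function names `to_gammas` and `combine_segments` are hypothetical, each segment is assumed to be a tuple $(\gamma_0, \mu, \sigma^2, \alpha_3, \alpha_4)$ with $\sigma^2$ the population variance and $\alpha_4$ the non-excess kurtosis $\theta_4/\sigma^4$, and the raw-to-central conversions are the standard identities for moments up to order four. The sketch assumes the combined variance is non-zero.

```python
def to_gammas(gamma0, mu, var, alpha3, alpha4):
    """Convert one segment's statistics to [gamma_0, ..., gamma_4]."""
    sigma = var ** 0.5
    theta3 = alpha3 * sigma**3          # third central moment
    theta4 = alpha4 * sigma**4          # fourth central moment
    # Central-to-raw moment identities.
    m1 = mu
    m2 = var + mu**2
    m3 = theta3 + 3 * mu * var + mu**3
    m4 = theta4 + 4 * mu * theta3 + 6 * mu**2 * var + mu**4
    return [gamma0, m1 * gamma0, m2 * gamma0, m3 * gamma0, m4 * gamma0]


def combine_segments(segments):
    """Combine per-segment statistics into those of the concatenated history."""
    # gamma_{n,c} is the plain sum of gamma_{n,q} over all segments q.
    gammas_c = [sum(col) for col in zip(*(to_gammas(*s) for s in segments))]
    gamma0_c = gammas_c[0]
    m1, m2, m3, m4 = (g / gamma0_c for g in gammas_c[1:])
    # Raw-to-central moment identities.
    theta2 = m2 - m1**2
    theta3 = m3 - 3 * m1 * m2 + 2 * m1**3
    theta4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4
    sigma = theta2 ** 0.5
    return gamma0_c, m1, theta2, theta3 / sigma**3, theta4 / sigma**4
```

Because this route passes through raw moments, it can lose precision when the mean is large relative to the standard deviation; the pairwise central-moment formulae given earlier in this section avoid that cancellation.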
