Bessel's Correction - The Source of The Bias

Suppose the mean of the whole population is 2050, but the statistician does not know that and must estimate it from this small sample chosen randomly from the population:

2051, 2053, 2055, 2050, 2051

One may compute the sample average:

\begin{align} \frac{1}{5}\left(2051 + 2053 + 2055 + 2050 + 2051\right) = 2052
\end{align}

This may serve as an observable estimate of the unobservable population average, which is 2050. Now we face the problem of estimating the population variance. That is the average of the squares of the deviations from 2050. If we knew that the population average is 2050, we could proceed as follows:

\begin{align} {} & \frac{1}{5}\left[(2051 - 2050)^2 + (2053 - 2050)^2 + (2055 - 2050)^2 + (2050 - 2050)^2 + (2051 - 2050)^2\right] \\ =\; & \frac{36}{5} = 7.2
\end{align}

But our estimate of the population average is the sample average, 2052, not 2050. Therefore we do what we can:

\begin{align} {} & \frac{1}{5}\left[(2051 - 2052)^2 + (2053 - 2052)^2 + (2055 - 2052)^2 + (2050 - 2052)^2 + (2051 - 2052)^2\right] \\ =\; & \frac{16}{5} = 3.2
\end{align}

This is a substantially smaller estimate. Now a question arises: is the estimate of the population variance that arises in this way using the sample mean always smaller than what we would get if we used the population mean? The answer is yes except when the sample mean happens to be the same as the population mean.
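The two calculations can be reproduced with a short Python sketch. The helper name `mean_squared_deviation` is made up for this illustration, and the population mean is hard-coded only because the example assumes it is known; the five observations are the ones that appear in the expansions in this section.

```python
# The five observations used throughout this section.
sample = [2051, 2053, 2055, 2050, 2051]
population_mean = 2050  # unknown in practice; assumed known here for comparison


def mean_squared_deviation(values, center):
    """Average of the squared deviations of values from center."""
    return sum((x - center) ** 2 for x in values) / len(values)


sample_mean = sum(sample) / len(sample)
around_population_mean = mean_squared_deviation(sample, population_mean)
around_sample_mean = mean_squared_deviation(sample, sample_mean)

print(sample_mean)             # 2052.0
print(around_population_mean)  # 7.2
print(around_sample_mean)      # 3.2
```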

In intuitive terms, we are seeking the sum of squared distances from the population mean, but end up calculating the sum of squared differences from the sample mean, which, as will be seen, is the number that minimizes that sum of squared distances. So unless the sample happens to have the same mean as the population, this estimate will always underestimate the population variance.
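As a numerical check rather than a proof, the following sketch evaluates the sum of squared deviations around several candidate centers. The grid of integer candidates from 2048 to 2056 is an arbitrary choice for illustration, and `sum_of_squares` is a hypothetical helper name.

```python
sample = [2051, 2053, 2055, 2050, 2051]


def sum_of_squares(values, center):
    """Sum of the squared deviations of values from center."""
    return sum((x - center) ** 2 for x in values)


# Arbitrary grid of candidate centers around the data (an assumption).
candidates = range(2048, 2057)
for center in candidates:
    print(center, sum_of_squares(sample, center))

# The candidate with the smallest sum of squares.
best = min(candidates, key=lambda c: sum_of_squares(sample, c))
print(best)  # 2052
```

On this grid the smallest sum, 16, occurs at the sample mean 2052, while the population mean 2050 gives 36.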

To see why this happens, we use a simple identity in algebra:

\begin{align} (a + b)^2 = a^2 + 2ab + b^2
\end{align}

with a representing the deviation of an individual observation from the sample mean, and b representing the deviation of the sample mean from the population mean. Note that we have simply decomposed the actual deviation from the (unknown) population mean into two components: the deviation from the sample mean, which we can compute, and the additional deviation of the sample mean from the population mean, which we cannot. Now apply that identity to the squares of deviations from the population mean:

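\begin{align} {[}\,\underbrace{2053 - 2050}_{\text{Deviation from the population mean}}\,]^2 & = [\,\overbrace{(2053 - 2052)}^{\text{This is }a.} + \overbrace{(2052 - 2050)}^{\text{This is }b.}\,]^2 \\ & = \overbrace{(2053 - 2052)^2}^{\text{This is }a^2.} + \overbrace{2(2053 - 2052)(2052 - 2050)}^{\text{This is }2ab.} + \overbrace{(2052 - 2050)^2}^{\text{This is }b^2.}
\end{align}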

Now apply this to all five observations and observe certain patterns:

\begin{align} \overbrace{(2051 - 2052)^2}^{\text{This is }a^2.}\ +\ \overbrace{2(2051 - 2052)(2052 - 2050)}^{\text{This is }2ab.}\ +\ \overbrace{(2052 - 2050)^2}^{\text{This is }b^2.} \\ (2053 - 2052)^2\ +\ 2(2053 - 2052)(2052 - 2050)\ +\ (2052 - 2050)^2 \\ (2055 - 2052)^2\ +\ 2(2055 - 2052)(2052 - 2050)\ +\ (2052 - 2050)^2 \\ (2050 - 2052)^2\ +\ 2(2050 - 2052)(2052 - 2050)\ +\ (2052 - 2050)^2 \\ (2051 - 2052)^2\ +\ \underbrace{2(2051 - 2052)(2052 - 2050)}_{\begin{smallmatrix} \text{The sum of the entries in this} \\ \text{middle column must be 0.} \end{smallmatrix}}\ +\ (2052 - 2050)^2
\end{align}

The sum of the entries in the middle column must be zero because the sum of the deviations from the sample average must be zero. When the middle column has vanished, we then observe that

  • The sum of the entries in the first column (a²) is the sum of the squares of the deviations from the sample mean;
  • The sum of all of the entries in the remaining two columns (a² and b²) is the sum of squares of the deviations from the population mean, because of the way we started with (2053 − 2050)², and did the same with the other four entries;
  • The sum of all the entries must be bigger than the sum of the entries in the first column, since every entry in the last column is positive (except when the population mean is the same as the sample mean, in which case all of the numbers in the last column will be 0).

Therefore:

  • The sum of squares of the deviations from the population mean will be bigger than the sum of squares of the deviations from the sample mean (except when the population mean is the same as the sample mean, in which case the two are equal).
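The column sums can be checked directly. This is a minimal sketch using the section's numbers; the variable names `first_column`, `middle_column` and `last_column` are chosen here for illustration only.

```python
sample = [2051, 2053, 2055, 2050, 2051]
population_mean = 2050
sample_mean = sum(sample) // len(sample)  # 2052; the division is exact for this sample

# a: deviation of each observation from the sample mean.
a = [x - sample_mean for x in sample]
# b: deviation of the sample mean from the population mean (same for every row).
b = sample_mean - population_mean

first_column = sum(ai ** 2 for ai in a)      # sum of the a^2 terms
middle_column = sum(2 * ai * b for ai in a)  # sum of the 2ab terms
last_column = len(sample) * b ** 2           # sum of the b^2 terms

# Sum of squared deviations from the population mean, computed directly.
total = sum((x - population_mean) ** 2 for x in sample)

print(first_column, middle_column, last_column, total)  # 16 0 20 36
```

The middle column sums to 0, so the total 36 splits into 16 from the first column and 20 from the last, and the first column alone falls short by exactly the last column's sum.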

That is why averaging the squared deviations from the sample mean gives a value that is too small to be an unbiased estimate of the population variance.
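The effect on the average can be illustrated without random simulation by enumerating every possible sample. The sketch below assumes a toy population of four values and samples of size 2 drawn with replacement; both choices, and the helper name `mean_squared_deviation`, are hypothetical and not part of the example above. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical toy population
n = 2                      # hypothetical sample size


def mean_squared_deviation(values, center):
    """Average of the squared deviations of values from center, as an exact fraction."""
    return sum((Fraction(x) - center) ** 2 for x in values) / len(values)


population_mean = Fraction(sum(population), len(population))
population_variance = mean_squared_deviation(population, population_mean)

# Every ordered sample of size n drawn with replacement, all equally likely.
samples = list(product(population, repeat=n))

# For each sample, average the squared deviations from that sample's own mean.
naive_estimates = [mean_squared_deviation(s, Fraction(sum(s), n)) for s in samples]
expected_naive = sum(naive_estimates) / len(naive_estimates)

print(population_variance)  # 5/4
print(expected_naive)       # 5/8
```

Averaged over all 16 equally likely samples, the estimate built on the sample mean is 5/8, half of the true variance 5/4. For this sample size the shortfall is the factor (n − 1)/n = 1/2.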
