Imputation (statistics)

In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Once all missing values have been imputed, the data set can be analysed using standard techniques for complete data. Imputation theory is constantly developing and thus requires consistent attention to new research on the subject. Many approaches have been proposed to account for missing data, but the majority of them introduce bias. A few of the well-known attempts to deal with missing data include: hot-deck and cold-deck imputation; listwise and pairwise deletion; mean imputation; regression imputation; last observation carried forward; and stochastic imputation.

A once-common method of imputation was hot-deck imputation, in which a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on punched cards and indicates that the information donors come from the same data set as the recipients. The stack of cards was "hot" because it was currently being processed.

Cold-deck imputation, by contrast, selects donors from another data set. As computing power has advanced and punched cards are no longer used, more sophisticated methods, such as nearest-neighbour hot-deck imputation and the approximate Bayesian bootstrap, have generally superseded the original random and sorted hot-deck techniques.
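
A minimal sketch of hot-deck imputation in Python (the data frame, the grouping column "region", and the helper hot_deck are invented for illustration): each missing value is filled with the value of a randomly chosen donor record from the same data set, matched here on the grouping column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "income": [34_000, np.nan, 28_000, 31_000, np.nan],
})

def hot_deck(group: pd.Series) -> pd.Series:
    """Fill missing values in a group with randomly sampled observed donors."""
    donors = group.dropna()
    # Sample one donor value (with replacement) for every missing entry.
    fill = donors.sample(group.isna().sum(), replace=True, random_state=0).values
    out = group.copy()
    out[out.isna()] = fill
    return out

df["income"] = df.groupby("region")["income"].transform(hot_deck)
print(df)
```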

Listwise deletion removes all cases (rows) that have a missing value. If the data are missing completely at random, listwise deletion adds no bias, but it does reduce the power of the analysis by discarding every case with a missing value on any predictor variable, thereby decreasing the sample size N. If the cases are not missing completely at random, listwise deletion introduces bias, because the sub-sample of cases with missing data represents a unique population, and listwise deletion discards the information contributed by that population. Pairwise deletion instead drops a case only from analyses that require a variable on which that case has a missing value, while keeping it in every other analysis, so the total N is not consistent across parameter estimates. Because different parameters are estimated on different subsets of cases, pairwise deletion can produce mathematically impossible results, such as correlations greater than 1 (Enders, 2010).
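
A small pandas illustration of the difference, using invented toy data: dropna() performs listwise deletion, while DataFrame.corr() computes pairwise-complete correlations by default, so each entry of the correlation matrix can rest on a different N.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 2.9, 4.2, 4.8, 6.3],
    "z": [0.5, 1.1, 1.8, np.nan, 3.0, 3.2],
})

# Listwise deletion: only fully observed rows remain (rows 0, 4 and 5 here).
listwise = df.dropna()
print(listwise.corr())

# Pairwise deletion: each correlation uses every row in which both
# variables happen to be observed, so the effective N varies by cell.
print(df.corr())
```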

Mean imputation distorts the correlation between predictors and outcomes because it changes how the two variables co-vary. If two variables normally move in tandem, give or take some error, then filling the missing values of the predictor with its mean destroys that natural relationship: the imputed values are all the same and remain static, while the outcome continues to vary independently.
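
A brief sketch of mean imputation with scikit-learn's SimpleImputer, on made-up data: every missing value in a column is replaced by that column's observed mean, so the imputed cases all collapse onto a single value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [np.nan, 40.0],
              [5.0, 50.0]])

# Each missing entry becomes the mean of the observed values in its column.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)
```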

Last observation carried forward uses the cell value immediately prior to the missing data and carries it forward into each empty cell until the next cell with an observed value is reached.
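
With pandas, last observation carried forward can be sketched with ffill(), which copies the most recent observed value downward until the next observed value appears; the series below is an invented example.

```python
import numpy as np
import pandas as pd

s = pd.Series([7.0, np.nan, np.nan, 9.0, np.nan, 11.0])
# Each missing entry takes the last observed value before it.
print(s.ffill())   # -> 7.0, 7.0, 7.0, 9.0, 9.0, 11.0
```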

Regression imputation has the opposite problem of mean imputation. A regression model is estimated to predict the missing values, and the missing data are imputed from its predictions. The problem is that the imputed values include no error term, so they fall exactly on the regression line without any residual variance; this overstates the strength of the relationships and suggests greater precision than is warranted. Stochastic regression imputation was a fairly successful attempt to correct the missing error term by adding random noise with the average regression (residual) variance to the regression predictions. Stochastic regression shows much less bias than the techniques mentioned above, but it still misses one thing: if data are imputed, then intuitively more noise should be introduced to the problem than simple residual variance (Enders, 2010).
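
A sketch of both variants on simulated data: a linear model fitted to the complete cases predicts the missing outcomes, and the stochastic variant adds normal noise with the estimated residual variance so the imputations do not sit exactly on the regression line. The variable names and the 20% missingness rate are assumptions of the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
y[rng.random(100) < 0.2] = np.nan            # ~20% of the outcomes missing

obs = ~np.isnan(y)
model = LinearRegression().fit(x[obs, None], y[obs])
resid_var = np.var(y[obs] - model.predict(x[obs, None]))

pred = model.predict(x[~obs, None])

# Deterministic regression imputation: values lie exactly on the fitted line.
y_regression = y.copy()
y_regression[~obs] = pred

# Stochastic regression imputation: add noise with the residual variance.
y_stochastic = y.copy()
y_stochastic[~obs] = pred + rng.normal(scale=np.sqrt(resid_var), size=pred.size)
```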

To deal with this additional noise due to imputation, Rubin (1987) developed a method for averaging the outcomes across multiply imputed data sets. Imputation processes similar to stochastic regression are run on the same data set several times, and each imputed data set is saved and analysed separately. The point estimates are then averaged, while the standard error (SE) is pooled differently: it is constructed from the average within-imputation variance of each data set together with the between-imputation variance of the estimates across data sets. These two variances are combined, and the square root of the total gives the SE, so both the residual variance and the noise due to imputation are carried into the regression model.
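
A hedged sketch of the pooling step (Rubin's combining rules): the per-data-set estimates and squared standard errors passed in below are placeholders for the results of some analysis run m times. The pooled variance is the average within-imputation variance plus the between-imputation variance inflated by 1 + 1/m, and its square root is the pooled SE.

```python
import numpy as np

def pool(estimates, variances):
    """Pool point estimates and squared standard errors from m imputed data sets."""
    estimates = np.asarray(estimates)
    variances = np.asarray(variances)
    m = len(estimates)
    q_bar = estimates.mean()                  # pooled point estimate
    w_bar = variances.mean()                  # average within-imputation variance
    b = estimates.var(ddof=1)                 # between-imputation variance
    total = w_bar + (1 + 1 / m) * b           # Rubin's total variance
    return q_bar, np.sqrt(total)              # estimate and pooled SE

# Placeholder results from five imputed-and-analysed data sets.
est, se = pool([0.52, 0.48, 0.55, 0.50, 0.47],
               [0.010, 0.011, 0.009, 0.010, 0.012])
print(est, se)
```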

Rubin and Little are two of the most influential researchers in modern missing data theory. Rubin proposed the highly influential classification of missing data mechanisms in 1976, and in 1987 proposed the way of combining analyses from multiple imputed data sets described above. Little advanced maximum likelihood approaches to missing data in the mid-1970s and worked with Rubin on further developments of both maximum likelihood and multiple imputation. These two methods are currently regarded as the best available ways of dealing with missing data.

Maximum likelihood estimates the model parameters directly from the observed data, using Bayesian estimation techniques, maximum likelihood estimation of a specified density function, or the EM algorithm, which alternates between computing expected values for the missing information and re-estimating the parameters until the likelihood converges. Maximum likelihood estimation is equivalent to multiple imputation in most cases; it is easier to implement and is therefore often preferred, but multiple imputation is applicable in a slightly wider range of situations.
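
A minimal EM sketch, assuming a bivariate normal model with values missing only in the second variable: the E-step fills in conditional expectations (and the associated conditional variance) given the current parameter estimates, and the M-step re-estimates the mean and covariance, so the parameters are estimated from the observed data without producing a completed data set for further analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
y[rng.random(n) < 0.3, 1] = np.nan           # ~30% of the second column missing

mu = np.nanmean(y, axis=0)                   # crude starting values
sigma = np.cov(y[~np.isnan(y[:, 1])].T)      # covariance from complete cases
miss = np.isnan(y[:, 1])

for _ in range(100):
    # E-step: conditional mean and variance of the missing entries given the
    # observed first column and the current parameter estimates.
    slope = sigma[0, 1] / sigma[0, 0]
    cond_mean = mu[1] + slope * (y[miss, 0] - mu[0])
    cond_var = sigma[1, 1] - slope * sigma[0, 1]

    filled = y.copy()
    filled[miss, 1] = cond_mean

    # M-step: re-estimate mean and covariance, adding the conditional
    # variance so the expected values do not understate uncertainty.
    mu = filled.mean(axis=0)
    centred = filled - mu
    sigma = centred.T @ centred / n
    sigma[1, 1] += cond_var * miss.sum() / n

print(mu, sigma)
```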

In machine learning, it is sometimes possible to train a classifier directly on the original data without imputing it first. This has been shown to yield better performance in cases where the missing data are structurally absent, rather than missing due to measurement noise.
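
As one hedged illustration, scikit-learn's HistGradientBoostingClassifier accepts NaN entries directly, so no separate imputation step is needed; the synthetic data and the roughly 20% missingness rate are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.2] = np.nan        # knock out ~20% of the entries

# The gradient-boosted trees route NaN values natively during training,
# so the classifier is fit on the incomplete data as-is.
clf = HistGradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```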