This is not intended to be an in-depth mathematical treatment of the subject although some prior linear algebra and matrix methodology is needed to understand some of theconcepts. Rather, it is more a statement of the underlying problem, some of the assumptions involved and an intuitive view of how the methods work. Some prior understanding of multivariate statistical structures and how relationships are measured is also assumed, but I will try to keep it as short and simple as possible.
The Imputation Problem and Assumptions
For our underlying example, we will take a set series of temperature measurements at each of N times and p locations. The problem is that some of the temperature values are missing and we would like to guess what these values might look like.
It is fairly obvious that in order to do any meaning computation, we need to make some assumptions about the structure in which the data is generated. Unless the temperatures at the various locations are somehow related (due to proximity of location or common influence of broader climate behavior) knowing what happens at one location will not help us to infer what the missing values at another location might have been.
The usual assumption is to assume that the vector of temperatures for all locations at a given time simultaneously behaves like multivariate normal variable: a set of individually normally distributed values linked together through a set of parameters called covariances (or interchangeably, correlations). The covariance measure the pair-wise (linear) relationship between two of the variables at a time. The reason that this assumption is useful is because the behavior of the temperatures at a given time is completely characterized by the mean vector, µ, (consisting of the mean temperatures for each of the p locations) and the pxp covariance matrix, Σ. The diagonal of the matrix consists of the variances of each temperature series.
When µ and Σ are known, it can be shown that the optimum way of guessing at the missing values is to do a linear regression where the known temperatures at that time are the predictors. The coefficients of the regression equation can be calculated directly from the mean vector and the covariance matrix (both of which are unknown population values). But we could estimate them from the data only if no data values were missing. So we are in a circular bind.
The EM Algorithm
This is where the EM algorithm comes in. It is a more general procedure which can be applied in many situations. I tried searching Google for a simple description (and found more “gentle” explanations loaded with heavy math) with no success. You can try Wiki for a definition but even there it quickly becomes unreadable.
The underlying situation is as we have already described it. There is data with an associated model. The model contains one or more parameters. Some of the data is either latent (unobservable) or simply missing. The algorithm works quite simply:
- Start with a guess at the missing values (although you could instead start by estimating the unknown parameters).
-Estimate the parameters since now all of the sample values are “known”.
- Use these new parameter estimates to optimally “predict” the unknown values.
-Alternate back and forth until there is little change in the results.
It sounds simple but there are entire books written on the procedure. And, there are pitfalls. For a given problem, the final answer can depend on the starting values since there is an optimization procedure involved and it is possible that convergence is to a local, not a global overall optimum. Furthermore, depending on the number of values to be imputed, it is possible that the problem becomes mathematically intractable and further manipulation is needed to find a solution.
In our situation, EM has mainly been used not to impute values, but to estimate the mean vector and covariance matrix (when a few values are missing at random and not for systematic reasons) so that hypotheses tests could be carried out on the population parameters. The use of this method for reconstructing large numbers of values has originated in other disciplines.
EM and Imputation of Temperatures
Back to our problem:
In our case, the parameters are µ and Σ which are estimated from the sample mean vector and the sample covariance matrix. Various methods are possible for imputing the missing values:
-As mentioned previously, using all of the known values to linearly predict the missing ones. One possible criticism of this is that it could be subject to “overfitting” because the number of predictors is large.
-Principle component regression. Take the known values (which are possibly related) and extract a given number of PCs. Use the PCs as predictors in a linear regression as above.
-Total Least Squares (also referred to as Errors In Variables). Here it is assumed that both the predictors and the missing values have errors and the residuals for data points are calculated orthogonally (perpendicularly) from the predicting planar surface, rather than “vertically” in the direction of the points being predicted. The method again involves singular value decomposition (SVD) of the data. This seems to have the drawback that the results change if the data are rotated in non-orthogonal fashion.
-Truncation of the SVD decomposition of the data matrix. This appears to be what Steven McIntyre’s regem_pca function in R does. The data matrix is decomposed using SVD, all eigenvalues smaller than a given value are replaced by zero and the data matrix then reconstructed. The missing values are then replaced by their counterparts from the newly reconstructed matrix.
-Inverse calibration. Here the value being reconstructed is used as the predictor in a reverse of the first method. I have not seen this method used anywhere, but for fun I have written an R function for doing so, although I haven’t really looked at the effectiveness of the method so far.
It is not easy to evaluate the relative efficiency of each of these methods because the asymptotic results do not have simple closed mathematical forms . Proper comparison should be done on a theoretical, mathematical basis. However, this is rarely the approach in climate science literature where the preferred “proof” is to apply the methodology to one or two “surrogate” model outputs as justification for the efficacy of techniques.
RegEM
Where does RegEM fit in? If the number of values to be imputed is very large, the number of parameters to be estimated may be larger than the available amount of real information in the data. Try fitting a regression with ten predictors using only 5 data points. In that case, the math fails (matrices which need to be inverted are singular and are therefore not invertible) and you can get infinitely many perfect fits. In order to get a “unique” solution, the problem must be “regularized” by the imposition of extra conditions on the fitting equation. This leads to methods such as (partial) truncated total least squares (what a mouthful!) and ridge regression. You might wish to look at this paper (starting at page 5) for a somewhat mathematical description of these two methods). Extra regularization parameters for “fine tuning” must be introduced so that the results become subject to the choices of the analyst. The procedure can also become unstable with lack of convergence in some situations. Because of the complexity, it is not easy to recognize possible bias in the results and the amount of possible underestimation of the variability of those results. Justifying that the what you end up with is meaningful when creating so many imputations is not a trivial task under any circumstances.
No heavy duty math so far. Maybe more another time.