Imputation, EM and RegEM

This is not intended to be an in-depth mathematical treatment of the subject although some prior linear algebra and matrix methodology is needed to understand some of theconcepts. Rather, it is more a statement of the underlying problem, some of the assumptions involved and an intuitive view of how the methods work. Some prior understanding of multivariate statistical structures and how relationships are measured is also assumed, but I will try to keep it as short and simple as possible.

The Imputation Problem and Assumptions

For our underlying example, we will take a set series of temperature measurements at each of N times and p locations. The problem is that some of the temperature values are missing and we would like to guess what these values might look like.

It is fairly obvious that in order to do any meaning computation, we need to make some assumptions about the structure in which the data is generated. Unless the temperatures at the various locations are somehow related (due to proximity of location or common influence of broader climate behavior) knowing what happens at one location will not help us to infer what the missing values at another location might have been.

The usual assumption is to assume that the vector of temperatures for all locations at a given time simultaneously behaves like multivariate normal variable: a set of individually normally distributed values linked together through a set of parameters called covariances (or interchangeably, correlations). The covariance measure the pair-wise (linear) relationship between two of the variables at a time. The reason that this assumption is useful is because the behavior of the temperatures at a given time is completely characterized by the mean vector, µ, (consisting of the mean temperatures for each of the p locations) and the pxp covariance matrix, Σ. The diagonal of the matrix consists of the variances of each temperature series.

When µ and Σ are known, it can be shown that the optimum way of guessing at the missing values is to do a linear regression where the known temperatures at that time are the predictors. The coefficients of the regression equation can be calculated directly from the mean vector and the covariance matrix (both of which are unknown population values). But we could estimate them from the data only if no data values were missing. So we are in a circular bind.

The EM Algorithm

This is where the EM algorithm comes in. It is a more general procedure which can be applied in many situations. I tried searching Google for a simple description (and found more “gentle” explanations loaded with heavy math) with no success. You can try Wiki for a definition but even there it quickly becomes unreadable.

The underlying situation is as we have already described it. There is data with an associated model. The model contains one or more parameters. Some of the data is either latent (unobservable) or simply missing. The algorithm works quite simply:

- Start with a guess at the missing values (although you could instead start by estimating the unknown parameters).

-Estimate the parameters since now all of the sample values are “known”.

- Use these new parameter estimates to optimally “predict” the unknown values.

-Alternate back and forth until there is little change in the results.

It sounds simple but there are entire books written on the procedure. And, there are pitfalls. For a given problem, the final answer can depend on the starting values since there is an optimization procedure involved and it is possible that convergence is to a local, not a global overall optimum. Furthermore, depending on the number of values to be imputed, it is possible that the problem becomes mathematically intractable and further manipulation is needed to find a solution.

In our situation, EM has mainly been used not to impute values, but to estimate the mean vector and covariance matrix (when a few values are missing at random and not for systematic reasons) so that hypotheses tests could be carried out on the population parameters. The use of this method for reconstructing large numbers of values has originated in other disciplines.

EM and Imputation of Temperatures

Back to our problem:

In our case, the parameters are µ and Σ which are estimated from the sample mean vector and the sample covariance matrix. Various methods are possible for imputing the missing values:

-As mentioned previously, using all of the known values to linearly predict the missing ones. One possible criticism of this is that it could be subject to “overfitting” because the number of predictors is large.

-Principle component regression. Take the known values (which are possibly related) and extract a given number of PCs. Use the PCs as predictors in a linear regression as above.

-Total Least Squares (also referred to as Errors In Variables). Here it is assumed that both the predictors and the missing values have errors and the residuals for data points are calculated orthogonally (perpendicularly) from the predicting planar surface, rather than “vertically” in the direction of the points being predicted. The method again involves singular value decomposition (SVD) of the data. This seems to have the drawback that the results change if the data are rotated in non-orthogonal fashion.

-Truncation of the SVD decomposition of the data matrix. This appears to be what Steven McIntyre’s regem_pca function in R does. The data matrix is decomposed using SVD, all eigenvalues smaller than a given value are replaced by zero and the data matrix then reconstructed. The missing values are then replaced by their counterparts from the newly reconstructed matrix.

-Inverse calibration. Here the value being reconstructed is used as the predictor in a reverse of the first method. I have not seen this method used anywhere, but for fun I have written an R function for doing so, although I haven’t really looked at the effectiveness of the method so far.

It is not easy to evaluate the relative efficiency of each of these methods because the asymptotic results do not have simple closed mathematical forms . Proper comparison should be done on a theoretical, mathematical basis. However, this is rarely the approach in climate science literature where the preferred “proof” is to apply the methodology to one or two “surrogate” model outputs as justification for the efficacy of techniques.

RegEM

Where does RegEM fit in? If the number of values to be imputed is very large, the number of parameters to be estimated may be larger than the available amount of real information in the data. Try fitting a regression with ten predictors using only 5 data points. In that case, the math fails (matrices which need to be inverted are singular and are therefore not invertible) and you can get infinitely many perfect fits. In order to get a “unique” solution, the problem must be “regularized” by the imposition of extra conditions on the fitting equation. This leads to methods such as (partial) truncated total least squares (what a mouthful!) and ridge regression. You might wish to look at this paper (starting at page 5) for a somewhat mathematical description of these two methods). Extra regularization parameters for “fine tuning” must be introduced so that the results become subject to the choices of the analyst. The procedure can also become unstable with lack of convergence in some situations. Because of the complexity, it is not easy to recognize possible bias in the results and the amount of possible underestimation of the variability of those results. Justifying that the what you end up with is meaningful when creating so many imputations is not a trivial task under any circumstances.

No heavy duty math so far. Maybe more another time.

About these ads

14 Comments

Filed under Uncategorized

14 responses to “Imputation, EM and RegEM

  1. Ryan O

    Roman, thanks for the very clear and basic overview. Funny . . . I actually had that Tingley paper but hadn’t read it yet. I’ve been collecting .pdfs about EM since I started playing with Tapio’s RegEM.

    This isn’t really a question – more of just a comment on what I am thinking at the moment. I cannot figure out why Steig used the PCs in the post-1982 timeframe instead of using the actual AVHRR data to predict the ground temps. I understand using the PCs in the pre-1982 timeframe, but the use post-1982 is just mystifying.

    I do understand why they did not simply infill missing ground data with satellite: though the data is close, there are some non-linear relationships between the ground and satellite data, especially near ocean or at the ice shelf/continent transitions. So infilling with the satellite data would be a bit complicated. However, it makes no sense to me whatsoever to use the PCs when you have perfectly good satellite data available.

    Anyway, just brain dump.

    Thanks again for the overview. Much appreciated, Roman!

  2. RomanM

    Ryan, I think that there could be several reasons for this.

    One is the fact that using the raw data might have made the computing portion too unwieldy. For example, the correlation matrix for the combined station and satellite data would be approximately 5550 x 5550 = 30802500 elements in size.

    The other is the possibility that the procedure might not have converged with the raw data in place of the three PCs.

  3. Ryan O

    I don’t mean using all of the raw data. I mean using a limited set of the data in the immediate geographical vicinity of the ground station – or even simply the closest gridcell. The correlation coefficients between the ground data and the closest cloudmasked AVHRR gridcells are typically ~0.6 or higher. You wouldn’t necessarily need to use an EM algorithm for this purpose, because you have at least as many predictors as you have predictands. This would allow you to have the post-1982 portion of the matrix completed in a way that would seem to be less susceptible to error.

  4. Ryan O

    Just to clarify, the purpose of the above wouldn’t be to do the imputation back to 1957. It would simply be to make the 42-station predictor matrix complete in the post-1982 timeframe. You’d still impute back to 1957 using the PCs.

  5. RomanM

    I take this to mean that the satellite data would be used to infill the ground data post-1982. On one level, this would make sense, but then a second stage would still be needed to complete the process. I don’t know if they would have thought of that because I don’t recall anyone else doing something similar in climate science circles.

    Where such a process would be useful would be for calculating values for the initial infill instead of the usual zeroes. One might expect that the iteration would converge faster and possibly avoid unreasonable solutions.

    Note: We crossposted – I understood you correctly.

    • Ryan O

      Actually, the Shuman paper you linked during the discussion about where the reported Byrd and Siple trends came from in the Steig paper did something similar. Shuman calibrated the 37 GHz satellite data to the ground data and the spliced the sets. It wasn’t used for reconstruction, of course, but the principle would be the same.

      When I was trying to get a better calibration of the AVHRR data to the ground stations to look for satellite offsets, I just did a local regression of the residuals (local regression because the AVHRR data tails off on the low end for ice shelf stations and tails off on the high end for stations near the ocean) and used that to predict ground from AVHRR and vice versa. Seemed to work pretty well. I may try doing that in two ways – 1) for the initial infill, and 2) for direct splicing such that there is no imputation that occurs during the post-1982 timeframe. Compare the results and see what the differences are.

      • RomanM

        Actually, I have been thinking more about it and I have come to agree with you. I’d like to see the results.

  6. Pingback: A Short Post on EM « the Air Vent

  7. Thanks Roman, I put a link on tAV to your post.

    Thanks also for the link to the link to the Tingley paper which I’m reading now.

  8. davidc

    Roman,

    “Alternate back and forth until there is little change in the results”

    What is the reason for believing that this convergence is to the reality, not to the guesses? My starting point here is multiple linear regression, already a slippery beast when there is lots of data. When there is not enough data nothing can be done. So we do some make-believe so something can be done. Is there anything in the math (say, a matrix inversion) that tracks the real part separate from the make-believe?

  9. RomanM

    In statistical procedures such as this one, you really don’t expect for your calculations to “converge to reality”.

    What you are looking for is convergence to a (hopefully unique) mathematical solution which satisfies whatever conditions you have set out beforehand in the algorithm you are using. Whether or not it relates at all to reality depends on the data you are using, the assumptions you have made and the properties of the procedure in question.

    The more complicated the procedure, the more tenuous the result. Multiple regression (and multivariate statistics in general) are difficult at the best of times – particular with opaque operations such as RegEM. Yes, infilling huge amounts of data (more than half!) and then purporting to be able to identify trends in fractions of a degree should be taken with a large dose of scepticism.

    The best bet for tracking the “real part” is to avoid the “make-believe”. I have always followed the KISS principle when doing consulting (Keep It Statistically Simple). The results are more easily evaluated and the conclusions more justifiable if it is clear that necessary assumptions are satisfied and there is nothing questionable going on. If one is reduced to using more complicated procedures because of circumstances, then proper realistic error bounds must be calculated to reflect the added possible divergence from reality.

  10. I’m a big fan of Winbugs and it seems that imputation is something it can do rather well.
    I came across this: Multiple Imputation for Missing Data at http://www.medicine.mcgill.ca/epidemiology/Joseph/…/missing.regression.pdf
    This is a very nice tutorial on imputation using R with conventional imputation vs WinBugs MCMC.
    I’m an engineer (retired) and have used Bayes (and Markov chain Monte Carlo) to crack stat problems that other methods can’t even start on. (Reliability estimates with very sparse data – sometimes with just operating times and no failure events.)
    I have never had to deal with imputation type problems but would like to learn more. I would be very interested to follow any work someone might decide to pursue on this suggestion.

    • RomanM

      I was not familiar with Winbugs so I did some googling and ended up downloading the BRugs library from R to take a look at it.

      It looks interesting, but will take a while to figure out what it can do. However, I must warn you that I am not a fan of bayesian methodology for a number of reasons. Despite that, I am always willing to learn new things.

      Imputation of missing values has been a topic under consideration in statistical methods for a long time. The main reason for its existence is to enable the use of procedures which require a complete set of data for the procedure to be properly implemented. Rather than deleting observations which may be missing portions of information, one uses relationships among the variables to estmate the value of the missing piece.

      Unfortunately, the folks in climate science seem to have stretched this from just a few values to the creation of vast amounts of data and interpreted the results as gospel truth.

      I would be interested in seeing what the paper that you linked says about the BUGS methods for imputation, but the link does not seem to work for me.

      • The link I pasted was trashed but I’m not sure how. This one should work, http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-677/missing.regression.pdf
        My error for not proofing the copy before sending.
        I welcome a Bayesian “non-fan”. Being largely self taught on this topic, I have not had the benefits of a competent critic.
        My journey into Bayes land began with the discovery of this paperback, Data Analysis: A Bayesian Tutorial (Oxford Science Publications), by D.S. Sivia, while wandering about the campus book store at UC Irvine. (The issue I have (1996) must impress some. It is a paperback and is offered for as much as $346 US on Amazon.. Happily the 2nd edition is priced at $37.64 US.)
        One intro problem, Example 3: The light house problem, stunned me with how little prior information when combined with the obvious geometry of the problem could yield such a result.
        I understand that a similar (Bayes) approach was use to find the US submarine, USS Scorpion. And is now the basis of CASP (Computer Assisted Search Program) used by the US Coast Guard (and probably everyone else during search and rescue.)
        As you might guess I am get rather worked up over these ideas.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s