## Rejected Again… Open Mind Isn’t!

I hate feeling rejected! It’s happened to me again.  Back in August 2009, I experienced that for the first time at RealClimate.   A comment which  I submitted on statistical methodology used in the Steig Antarctic paper did not pass their “peer review” and abruptly disappeared.

Today, I posted the following comment at Tamino’s Open Mind on the thread “Combining Stations” – one of a series of posts about processing station temperature data:

You don’t seem to get a lot of technical observations from your regular commenters in a thread such as this so I thought I might offer some comments about the method you are using to estimate the “offsets” for each station.

I don’t know if you are aware of it, but the “optimal” method appears to be a disguised two factor weighted analysis of variance. One factor is the stations (whose coefficients are the “offsets”) and the second is time (in this case months) treated as a numeric variable. This is not immediately obvious from the form in which you formulated the sum of squares that you are minimizing, but some fairly simple algebra can re-express that SS in a way which is more obviously consistent with an anova. The weights for a given temperature measurement are proportional to the number of stations which have available data for that month. I would have run some comparisons for your example in R, but your aversion to posting data and code makes that more difficult. However, I wrote a simple R program for calculating the offsets as well as some scripts for the anova and ran them on randomly generated data with identical results for both methods.

The side benefit of this is that the estimated monthly grid values are calculated simultaneously with the offsets rather than as (possibly sub-optimal averages) of the offset adjusted temperature series. Variability estimates can also be computed more easily as well.

Within an hour the comment dissipated in electron limbo.  I fail to understand why.

Was it the fact that I mistyped the phrase that time is ”treated as a numeric variable” (I meant to sat a “categorical” variable)?  Was it my observation that there were not a lot of technical contributions by the readers (there were 14 comments in the thread none of which contributed to the specific material being developed by the author)? Or maybe, my chiding remark that Tamino does not reveal his code or data (thereby preventing most of his readers from real participation)?  I honestly don’t know.

I should be pretty irritated but I take comfort in the fact that I am now in good company :)  – with people like Jeff Id and Lucia, both of whom I believe were summarily deleted from that blog at one time or another.

I had expected a real discussion regarding the methodology – that he would be gratified that someone actually expressed enough interest in the material to do further analysis (with some pretty interesting insights into his procedure), but I now suspect that he did not even understand exactly what I was telling him.    My guess would be that Tamino could actually have learned something from such a conversation, but it appears that he is more interested in appearing to be a mathematical “expert” to his small coterie of readers preferring to pontificate in one-sided expositions

I’ll also opine that the optimal method can definitely be called superior, both because it has no dependence on the order in which stations are processed, and because it simply minimizes the sum of squared differences between all pairs of station records.

rather than interacting with his audience in a meaningful and productive fashion to get a better grasp of the science behind it.

Open Mind? What a laugh!

Filed under Uncategorized

## Met Office and the International Date Line

Since the release of the Met Office Subset recently, I have been looking further at the data trying to evaluate what exactly is contained therein.  One of the checks that one would like to do – determining how the global gridded data is calculated – is not really possible.  The data set is self-admittedly incomplete and, anyway, a list of which stations are used and which aren’t is not provided.  However, I decided to look at the grid cells information of the provided stations anyway to see what was going on.

Using another fairly extensive file of station information which I already possessed, ghcnd-stations.txt (which I believe to be downloaded from the previous incarnation of ClimateAudit.org), I did some checking of Met station information against the ghcn file.  Using only those that I could easily match up in the two data sets, I found differences in coordinates and elevations of a number of stations.  Although the heights (altitudes) need to be looked into further, because some of the differences can be quite large, the grid squares of most of the numerically different coordinates seemed   to be unaffected by those differences.  Except …

Filed under Uncategorized

## Met Office and the Cru Data

Some days ago, the UK Met Office released a subset of the HadCrut3 data set at this web page.  I thought that it might be useful for people to be able to download the data and read it into the R statistical package for further analysis.  Along the way, I discovered, like Harry, that not everything is as quite as advertised so hopefully my observations on the differences will also prove useful.

The released data is available as a windows zipfile.   In the set, each station is available as a single text file (with no extension).  The files are contained in 85 directories (at this point in time) whose two digit names are the first two digits of the station numbers of the files in that directory.  It makes it easy to locate a station, but adds a bit of complication to trying to read the entire set simultaneously.  However, it turns out that  R actually can do it pretty easily J.  One word of warning:  When I downloaded the zipfile yesterday, I had found that new stations had been added to the set without any announcement of the changes on the web change that the set had been altered.  In the future, it is a good idea to keep an eye on the size of the file to determine if it is being altered.  Note to Met Office – it would be nice to have a date attached to the file indicating the most recent version.

Filed under Uncategorized

## More Pictures

Here are four more pictures which are among my animal favorites.  Click on a picture to enlarge it:

Clicking on the enlarged picture opens a slightly larger version.

Filed under Photos

In his blog post, Giorgio Gilestro claims to show that the adjustments made by GHCN to temperature data do not induce  artificial trends into the data based mainly on this histogram of adjustments.

Arguments were made on the blog that the temporal order of the adjustments makes a difference and in fact these arguments seem to be correct.   I checked this through the following reasonably simple analysis:

Filed under Uncategorized

## Imputation, EM and RegEM

This is not intended to be an in-depth mathematical treatment of the subject although some prior linear algebra and matrix methodology is needed to understand some of theconcepts. Rather, it is more a statement of the underlying problem, some of the assumptions involved and an intuitive view of how the methods work. Some prior understanding of multivariate statistical structures and how relationships are measured is also assumed, but I will try to keep it as short and simple as possible.

The Imputation Problem and Assumptions

For our underlying example, we will take a set series of temperature measurements at each of N times and p locations. The problem is that some of the temperature values are missing and we would like to guess what these values might look like.

It is fairly obvious that in order to do any meaning computation, we need to make some assumptions about the structure in which the data is generated. Unless the temperatures at the various locations are somehow related (due to proximity of location or common influence of broader climate behavior) knowing what happens at one location will not help us to infer what the missing values at another location might have been.

The usual assumption is to assume that the vector of temperatures for all locations at a given time simultaneously behaves like multivariate normal variable: a set of individually normally distributed values linked together through a set of parameters called covariances (or interchangeably, correlations). The covariance measure the pair-wise (linear) relationship between two of the variables at a time. The reason that this assumption is useful is because the behavior of the temperatures at a given time is completely characterized by the mean vector, µ, (consisting of the mean temperatures for each of the p locations) and the pxp covariance matrix, Σ. The diagonal of the matrix consists of the variances of each temperature series.

When µ and Σ are known, it can be shown that the optimum way of guessing at the missing values is to do a linear regression where the known temperatures at that time are the predictors. The coefficients of the regression equation can be calculated directly from the mean vector and the covariance matrix (both of which are unknown population values). But we could estimate them from the data only if no data values were missing. So we are in a circular bind.

The EM Algorithm

This is where the EM algorithm comes in. It is a more general procedure which can be applied in many situations. I tried searching Google for a simple description (and found more “gentle” explanations loaded with heavy math) with no success. You can try Wiki for a definition but even there it quickly becomes unreadable.

The underlying situation is as we have already described it. There is data with an associated model. The model contains one or more parameters. Some of the data is either latent (unobservable) or simply missing. The algorithm works quite simply:

- Start with a guess at the missing values (although you could instead start by estimating the unknown parameters).

-Estimate the parameters since now all of the sample values are “known”.

- Use these new parameter estimates to optimally “predict” the unknown values.

-Alternate back and forth until there is little change in the results.

It sounds simple but there are entire books written on the procedure. And, there are pitfalls. For a given problem, the final answer can depend on the starting values since there is an optimization procedure involved and it is possible that convergence is to a local, not a global overall optimum. Furthermore, depending on the number of values to be imputed, it is possible that the problem becomes mathematically intractable and further manipulation is needed to find a solution.

In our situation, EM has mainly been used not to impute values, but to estimate the mean vector and covariance matrix (when a few values are missing at random and not for systematic reasons) so that hypotheses tests could be carried out on the population parameters. The use of this method for reconstructing large numbers of values has originated in other disciplines.

EM and Imputation of Temperatures

Back to our problem:

In our case, the parameters are µ and Σ which are estimated from the sample mean vector and the sample covariance matrix. Various methods are possible for imputing the missing values:

-As mentioned previously, using all of the known values to linearly predict the missing ones. One possible criticism of this is that it could be subject to “overfitting” because the number of predictors is large.

-Principle component regression. Take the known values (which are possibly related) and extract a given number of PCs. Use the PCs as predictors in a linear regression as above.

-Total Least Squares (also referred to as Errors In Variables). Here it is assumed that both the predictors and the missing values have errors and the residuals for data points are calculated orthogonally (perpendicularly) from the predicting planar surface, rather than “vertically” in the direction of the points being predicted. The method again involves singular value decomposition (SVD) of the data. This seems to have the drawback that the results change if the data are rotated in non-orthogonal fashion.

-Truncation of the SVD decomposition of the data matrix. This appears to be what Steven McIntyre’s regem_pca function in R does. The data matrix is decomposed using SVD, all eigenvalues smaller than a given value are replaced by zero and the data matrix then reconstructed. The missing values are then replaced by their counterparts from the newly reconstructed matrix.

-Inverse calibration. Here the value being reconstructed is used as the predictor in a reverse of the first method. I have not seen this method used anywhere, but for fun I have written an R function for doing so, although I haven’t really looked at the effectiveness of the method so far.

It is not easy to evaluate the relative efficiency of each of these methods because the asymptotic results do not have simple closed mathematical forms . Proper comparison should be done on a theoretical, mathematical basis. However, this is rarely the approach in climate science literature where the preferred “proof” is to apply the methodology to one or two “surrogate” model outputs as justification for the efficacy of techniques.

RegEM

Where does RegEM fit in? If the number of values to be imputed is very large, the number of parameters to be estimated may be larger than the available amount of real information in the data. Try fitting a regression with ten predictors using only 5 data points. In that case, the math fails (matrices which need to be inverted are singular and are therefore not invertible) and you can get infinitely many perfect fits. In order to get a “unique” solution, the problem must be “regularized” by the imposition of extra conditions on the fitting equation. This leads to methods such as (partial) truncated total least squares (what a mouthful!) and ridge regression. You might wish to look at this paper (starting at page 5) for a somewhat mathematical description of these two methods). Extra regularization parameters for “fine tuning” must be introduced so that the results become subject to the choices of the analyst. The procedure can also become unstable with lack of convergence in some situations. Because of the complexity, it is not easy to recognize possible bias in the results and the amount of possible underestimation of the variability of those results. Justifying that the what you end up with is meaningful when creating so many imputations is not a trivial task under any circumstances.

No heavy duty math so far. Maybe more another time.

Filed under Uncategorized

There has been a good deal of discussion regarding the correlation between temperatures at various locations throughout Antarctica. Several people have looked at the relationships between correlation and distance by creating graphs linking the two. One of the difficulties in interpreting these is that they are affected by a variety of factors, including the shape and topography of the continent and by the fact that the place is completely surrounded by a large pool of water.

It is informative to pick several locations and to see how the AVHHR series at that location is related to all other locations. I selected two points: the tip of the peninsula (Steig series 1) and the obvious interior point: the South Pole (grid point 1970 is the closest).

For a selected site, after calculating the 5509 correlations, we graph them using a color scale to represent the correlation (as usual red is positive and blue is negative and white areas represent zero correlation). The location of the grid site is represented by a green +. Keep in mind that these are correlations measuring relationships between temperatures at the grid point, not positive or negative trends.

First, we take the latest revelation from Steig, the cloud masked AVHHR data. The grid point is on the tip of the peninsula.

Several things stand out in the graph. Obviously, the region immediately adjacent to the grid point is strongly correlated, but what is somewhat surprising is that the correlation drops off fairly becoming negative while still in the Western Antarctic area. The relatively low correlation continues to the rest of the continent.

Next, we take the original reconstruction: ant_recon.txt. This was supposedly reconstructed from the previous data using RegEM and the manned surface stations:

The correlation has strengthened dramatically in the Western Antarctic so that now the pattern exhibited by the reconstruction at the tip of the peninsula seems to be reflected by the entire west. As wll, the Eastern portion has now become more strongly inversely correlated with the peninsula.