# Category Archives: Uncategorized

## Faster Version for Combining Series

I have written a faster version for combining temperature time series (allowing for weights) in R.  I had hoped to post it along with an example using it, but I got sidetracked, so the example is not completed yet.  However, I am posting the new function so that Jeff can use it in his latest efforts.  This version runs a LOT faster and more efficiently than the one posted earlier.

As well, I have found a “bug” in that version which causes the script to fail when any series is missing all of its values for some month.  When the data are run through the newer version, you get results for all months, but I don’t think the results are necessarily realistic for the month in question.  There is no way to infer what the values for that month might look like for that station without making further assumptions, so, at the moment, the best bet is to remove the offending series from the analysis and run the program without it.  I have included a short program to identify possible problem series.

Anyway, here is the updated version.  I might get the example done tomorrow if the nice weather we are currently having goes away. 😉

```r
#### Function for combining series
# For even faster calculation, use all=F
# to speed up multiple grid calculations

temp.combine = function(tsdat, wts = NULL, all = T) {
  ##### version 2.0
  ### subfunction to compute the pseudoinverse
  psx.inv = function(mat, tol = NULL) {
    if (NCOL(mat) == 1) return(mat / sum(mat^2))
    msvd = svd(mat)
    dind = msvd$d
    if (is.null(tol)) tol = max(NROW(mat), NCOL(mat)) * max(dind) * .Machine$double.eps
    dind[dind < tol] = 0
    dind[dind > 0] = 1 / dind[dind > 0]
    inv = msvd$v %*% diag(dind, length(dind)) %*% t(msvd$u)
    inv
  }
  ### subfunction to compute the offsets
  calcx.offset = function(tdat, wts) {
    ## new version
    nr = length(wts)
    delt.mat = !is.na(tdat)
    delt.vec = rowSums(delt.mat)
    row.miss = (delt.vec == 0)
    delt2 = delt.mat / (delt.vec + row.miss)
    co.mat = diag(colSums(delt.mat)) - (t(delt.mat) %*% delt2)
    co.vec = colSums(delt.mat * tdat, na.rm = T) -
             colSums(rowSums(delt.mat * tdat, na.rm = T) * delt2)
    co.mat[nr, ] = wts
    co.vec[nr] = 0
    psx.inv(co.mat) %*% co.vec
  }
  ### main routine
  nr = nrow(tsdat)
  nc = ncol(tsdat)
  dims = dim(tsdat)
  if (is.null(wts)) wts = rep(1, nc)
  wts = wts / sum(wts)
  off.mat = matrix(NA, 12, nc)
  dat.tsp = tsp(tsdat)
  for (i in 1:12) off.mat[i, ] = calcx.offset(window(tsdat, start = c(dat.tsp[1], i), deltat = 1), wts)
  colnames(off.mat) = colnames(tsdat)
  rownames(off.mat) = month.abb
  matoff = matrix(NA, nr, nc)
  for (i in 1:nc) matoff[, i] = rep(off.mat[, i], length = nr)
  temp = rowMeans(tsdat - matoff, na.rm = T)
  pred = NULL
  residual = NULL
  if (all == T) {
    pred = c(temp) + matoff
    residual = tsdat - pred
  }
  list(temps = ts(temp, start = c(dat.tsp[1], 1), freq = 12),
       pred = pred, residual = residual, offsets = off.mat)
}
```

```r
# Pick out those series which have at least nn + 1 observations in every month.
# Outputs a logical vector with TRUE indicating that the series is OK.
dat.check = function(tsdat, nn = 0) {
  good = rep(NA, ncol(tsdat))
  for (i in 1:ncol(tsdat)) good[i] = (min(rowSums(!is.na(matrix(tsdat[, i], nrow = 12)))) > nn)
  good
}
```

Filed under Uncategorized

## Combining Stations (Plan C)

Sometimes a dog can be not only a “best friend”, but also a catalyst for solving statistical problems.  A case in point: last night at 2:30 AM, our dog, Mr. Chips, woke me up to let him out to do his business.  When I couldn’t go back to sleep immediately, I lay in the dark rethinking my current math problem, in this case the “offset matching” on which I have already posted several times.

I had been concerned with the implementation of weights for combining series and several things did not sit quite right in my mind.  So, as is good statistical procedure, I reformulated the problem not from the point of view of Tamino’s sum of squared differences of pairs of offset station values, but rather from the consideration of the estimation problem itself.  It put everything into place for me.

Filed under Uncategorized

## Weighted Seasonal Offsets

I have updated the R script for offset matching when combining multiple temperature time series so that it can now be used with different station weights.  This is useful when area weighting is desired in computing a mean grid cell series or when combining grid cells from different latitudes.  This method can now be used at every stage of determining a global series, from combining all sources at a single station to averaging mean latitudinal series to calculate a global mean – a good example of the adage “when all you have is a hammer, everything looks like a nail”.  🙂

Filed under Uncategorized

## Comparing Single and Monthly Offsets

On the Open Mind blog, carrot eater posted the following comment with an inline response from Tamino:

Are the offsets mu specific by month? I’m curious at what point you take out the seasonal cycle.

[Response: There’s a single mu for a given station. Only after combining stations do I compute anomalies to remove the seasonal cycle. Computing a separate mu for each month is a viable alternative, but forces all station records to have the same average seasonal cycle as the reference station. If one wishes to study the seasonal cycle itself, it’s better to use a single offset for the station as a whole.]

So, is Tamino correct that the difference is only important with regard to the seasonal cycle or is the situation more complicated than that?

Filed under Uncategorized

## Combining Stations (Plan B)

In his post on combining station monthly temperature series, Tamino proposes a method for calculating an “offset” for each station.  This offset is intended to place the series for all of the stations at the same level.  What makes such an accommodation necessary is the fact that there are often missing values in a given record – sometimes for complete years at a stretch.  Simple averages in such a situation can provide a distorted view of what the overall picture looks like.

In his proposed “optimal” method, Tamino suggests a procedure based on least squares methods to calculate a set of coefficients for shifting a station record up or down before calculating a combined series for the set of stations.  The starting point is a sum of squares (which for calculational and discussion purposes I have rewritten in a slightly different form).
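(An aside of my own for readers: Tamino’s actual notation is in his post, but based on how the criterion is described elsewhere in these posts – as the sum of squared differences of pairs of offset station values – it presumably has the form

```latex
SS(\mu_1,\dots,\mu_p) \;=\; \sum_{t}\ \sum_{i<j}
\Big[\big(x_i(t)-\mu_i\big)-\big(x_j(t)-\mu_j\big)\Big]^2 ,
```

where \(x_i(t)\) is the reading of station \(i\) in month \(t\), \(\mu_i\) is its offset, and the inner sum runs over pairs of stations which both report a value in month \(t\).)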

Filed under Uncategorized

## Rejected Again… Open Mind Isn’t!

I hate feeling rejected! It’s happened to me again.  Back in August 2009, I experienced that for the first time at RealClimate.  A comment which I submitted on statistical methodology used in the Steig Antarctic paper did not pass their “peer review” and abruptly disappeared.

Today, I posted the following comment at Tamino’s Open Mind on the thread “Combining Stations” – one of a series of posts about processing station temperature data:

You don’t seem to get a lot of technical observations from your regular commenters in a thread such as this so I thought I might offer some comments about the method you are using to estimate the “offsets” for each station.

I don’t know if you are aware of it, but the “optimal” method appears to be a disguised two factor weighted analysis of variance. One factor is the stations (whose coefficients are the “offsets”) and the second is time (in this case months) treated as a numeric variable. This is not immediately obvious from the form in which you formulated the sum of squares that you are minimizing, but some fairly simple algebra can re-express that SS in a way which is more obviously consistent with an anova. The weights for a given temperature measurement are proportional to the number of stations which have available data for that month. I would have run some comparisons for your example in R, but your aversion to posting data and code makes that more difficult. However, I wrote a simple R program for calculating the offsets as well as some scripts for the anova and ran them on randomly generated data with identical results for both methods.

The side benefit of this is that the estimated monthly grid values are calculated simultaneously with the offsets rather than as (possibly sub-optimal) averages of the offset-adjusted temperature series. Variability estimates can also be computed more easily.
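For readers who want to check the claimed equivalence for themselves, here is a self-contained sketch (my own construction, not the program referred to in the comment): simulate stations sharing a common signal but having different baselines and random gaps, then fit a two-factor model with `lm()`, treating month as a categorical variable. The differences between the estimated station effects recover the true offsets.

```r
# Sketch (my illustration): station offsets as coefficients of a
# two-factor anova with station and month-as-category factors.
set.seed(1)
n.mon = 60; n.stn = 4
true.off = c(0, 2, -1, 0.5)             # baselines to be recovered
signal = rnorm(n.mon)                   # common monthly signal
dat = outer(signal, rep(1, n.stn)) + outer(rep(1, n.mon), true.off) +
      matrix(rnorm(n.mon * n.stn, sd = 0.05), n.mon, n.stn)
dat[sample(length(dat), 40)] = NA       # knock out some values at random
df = data.frame(temp = c(dat),
                stn = factor(rep(1:n.stn, each = n.mon)),
                mon = factor(rep(1:n.mon, times = n.stn)))
fit = lm(temp ~ 0 + stn + mon, data = df)    # NA rows dropped automatically
off.hat = coef(fit)[1:n.stn]            # estimated station effects
round(unname(off.hat - off.hat[1]), 2)  # differences track 0, 2, -1, 0.5
```

The absolute level of the effects depends on the constraint used, so it is the differences between stations that are comparable across methods.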

Within an hour the comment dissipated in electron limbo.  I fail to understand why.

Was it the fact that I mistyped the phrase that time is “treated as a numeric variable” (I meant to say a “categorical” variable)?  Was it my observation that there were not a lot of technical contributions by the readers (there were 14 comments in the thread, none of which contributed to the specific material being developed by the author)? Or maybe my chiding remark that Tamino does not reveal his code or data (thereby preventing most of his readers from real participation)?  I honestly don’t know.

I should be pretty irritated but I take comfort in the fact that I am now in good company 🙂  – with people like Jeff Id and Lucia, both of whom I believe were summarily deleted from that blog at one time or another.

I had expected a real discussion regarding the methodology – that he would be gratified that someone actually expressed enough interest in the material to do further analysis (with some pretty interesting insights into his procedure), but I now suspect that he did not even understand exactly what I was telling him.  My guess would be that Tamino could actually have learned something from such a conversation, but it appears that he is more interested in appearing to be a mathematical “expert” to his small coterie of readers, preferring to pontificate in one-sided expositions rather than interacting with his audience in a meaningful and productive fashion to get a better grasp of the science behind it.

I’ll also opine that the optimal method can definitely be called superior, both because it has no dependence on the order in which stations are processed, and because it simply minimizes the sum of squared differences between all pairs of station records.

Open Mind? What a laugh!

Filed under Uncategorized

## Met Office and the International Date Line

Since the release of the Met Office subset recently, I have been looking further at the data, trying to evaluate what exactly is contained therein.  One of the checks that one would like to do – determining how the global gridded data is calculated – is not really possible.  The data set is self-admittedly incomplete and, anyway, a list of which stations are used and which aren’t is not provided.  However, I decided to look at the grid cell information for the provided stations anyway to see what was going on.

Using another fairly extensive file of station information which I already possessed, ghcnd-stations.txt (which I believe to be downloaded from the previous incarnation of ClimateAudit.org), I did some checking of Met station information against the ghcn file.  Using only those that I could easily match up in the two data sets, I found differences in coordinates and elevations for a number of stations.  Although the heights (altitudes) need to be looked into further, because some of the differences can be quite large, the grid squares of most of the numerically different coordinates seemed to be unaffected by those differences.  Except …

Filed under Uncategorized

## Met Office and the Cru Data

Some days ago, the UK Met Office released a subset of the HadCrut3 data set at this web page.  I thought that it might be useful for people to be able to download the data and read it into the R statistical package for further analysis.  Along the way, I discovered, like Harry, that not everything is quite as advertised, so hopefully my observations on the differences will also prove useful.

The released data is available as a Windows zipfile.  In the set, each station is stored as a single text file (with no extension).  The files are contained in 85 directories (at this point in time) whose two-digit names are the first two digits of the station numbers of the files in that directory.  This makes it easy to locate a station, but adds a bit of complication to trying to read the entire set at once.  However, it turns out that R can actually do it pretty easily 🙂.  One word of warning: when I downloaded the zipfile yesterday, I found that new stations had been added to the set without any announcement on the web page that the set had been altered.  In the future, it is a good idea to keep an eye on the size of the file to determine whether it is being altered.  Note to the Met Office – it would be nice to have a date attached to the file indicating the most recent version.

Filed under Uncategorized

In his blog post, Giorgio Gilestro claims to show that the adjustments made by GHCN to temperature data do not induce artificial trends into the data, basing the claim mainly on this histogram of adjustments.

Arguments were made on the blog that the temporal order of the adjustments makes a difference, and in fact these arguments seem to be correct.  I checked this through the following reasonably simple analysis:
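(The analysis itself is not reproduced in this excerpt, but a minimal toy illustration – mine, not Gilestro’s – shows why the histogram alone cannot settle the question: the very same set of adjustments, and hence the very same histogram, adds a completely different trend depending on the order in which the adjustments occur in time.)

```r
# A zero-mean, perfectly symmetric set of adjustments...
adj = seq(-0.5, 0.5, length.out = 100)
yr = 1:100
trend = function(y) unname(coef(lm(y ~ yr))[2])   # OLS slope per time step
trend(adj)        # applied in increasing order: a positive trend is added
trend(rev(adj))   # same histogram, reversed in time: a negative trend
```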

Filed under Uncategorized

## Imputation, EM and RegEM

This is not intended to be an in-depth mathematical treatment of the subject, although some prior linear algebra and matrix methodology is needed to understand some of the concepts. Rather, it is more a statement of the underlying problem, some of the assumptions involved and an intuitive view of how the methods work. Some prior understanding of multivariate statistical structures and how relationships are measured is also assumed, but I will try to keep it as short and simple as possible.

The Imputation Problem and Assumptions

For our underlying example, we will take a set of temperature series measured at each of N times and p locations. The problem is that some of the temperature values are missing and we would like to guess what these values might look like.

It is fairly obvious that in order to do any meaningful computation, we need to make some assumptions about the structure from which the data is generated. Unless the temperatures at the various locations are somehow related (due to proximity of location or common influence of broader climate behavior), knowing what happens at one location will not help us to infer what the missing values at another location might have been.

The usual assumption is that the vector of temperatures for all locations at a given time behaves like a multivariate normal variable: a set of individually normally distributed values linked together through a set of parameters called covariances (or, interchangeably, correlations). The covariances measure the pair-wise (linear) relationship between two of the variables at a time. The reason that this assumption is useful is because the behavior of the temperatures at a given time is completely characterized by the mean vector, µ (consisting of the mean temperatures for each of the p locations), and the p×p covariance matrix, Σ. The diagonal of the matrix consists of the variances of each temperature series.

When µ and Σ are known, it can be shown that the optimum way of guessing at the missing values is to do a linear regression where the known temperatures at that time are the predictors. The coefficients of the regression equation can be calculated directly from the mean vector and the covariance matrix (both of which are unknown population values). But we could estimate them from the data only if no data values were missing. So we are in a circular bind.
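To make that “optimum way of guessing” concrete, here is a small sketch (my own illustration, not code from any of the papers discussed): the conditional mean of the missing components given the observed ones, computed directly from a known µ and Σ.

```r
# Conditional-mean imputation for one time step, treating mu and Sigma
# as known.  The regression coefficients come straight from Sigma.
cond.impute = function(x, mu, Sigma) {
  miss = is.na(x)
  if (!any(miss)) return(x)
  # coefficients for regressing the missing components on the observed ones
  B = Sigma[miss, !miss, drop = FALSE] %*% solve(Sigma[!miss, !miss])
  x[miss] = mu[miss] + B %*% (x[!miss] - mu[!miss])
  x
}

# toy example with three locations, the first one unobserved
mu = c(10, 12, 8)
Sigma = matrix(c(4, 3, 1,
                 3, 4, 1,
                 1, 1, 2), 3, 3)
x.hat = cond.impute(c(NA, 14, 9), mu, Sigma)
x.hat[1]   # 10 + (5/7)*(14 - 12) + (1/7)*(9 - 8), about 11.57
```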

The EM Algorithm

This is where the EM algorithm comes in. It is a more general procedure which can be applied in many situations. I tried searching Google for a simple description (and found more “gentle” explanations loaded with heavy math) with no success. You can try Wiki for a definition but even there it quickly becomes unreadable.

The underlying situation is as we have already described it. There is data with an associated model. The model contains one or more parameters. Some of the data is either latent (unobservable) or simply missing. The algorithm works quite simply:

- Start with a guess at the missing values (although you could instead start by estimating the unknown parameters).

- Estimate the parameters, since now all of the sample values are “known”.

- Use these new parameter estimates to optimally “predict” the unknown values.

- Alternate back and forth until there is little change in the results.
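The four steps can be sketched in a few lines of R (my own toy version: it alternates column-mean starting values, parameter estimation and regression prediction; a full EM step would also add a correction term to the covariance estimate, which this sketch omits):

```r
# Toy EM-style alternation for imputing multivariate normal data.
em.impute = function(X, max.iter = 50, tol = 1e-8) {
  miss = is.na(X)
  # step 1: initial guess -- fill each hole with its column mean
  for (j in 1:ncol(X)) X[miss[, j], j] = mean(X[, j], na.rm = TRUE)
  for (iter in 1:max.iter) {
    old = X[miss]
    mu = colMeans(X)                      # step 2: estimate the parameters
    S  = cov(X)
    for (i in which(rowSums(miss) > 0)) { # step 3: re-predict the holes
      m = miss[i, ]
      B = S[m, !m, drop = FALSE] %*% solve(S[!m, !m])
      X[i, m] = mu[m] + B %*% (X[i, !m] - mu[!m])
    }
    # step 4: alternate until the imputed values stop changing
    if (max(abs(X[miss] - old)) < tol) break
  }
  X
}

# toy run: three correlated series with holes punched in the second one
set.seed(7)
x1 = rnorm(200)
X = cbind(x1, 0.9 * x1 + rnorm(200, sd = 0.2), -0.5 * x1 + rnorm(200, sd = 0.2))
X.miss = X
X.miss[sample(200, 10), 2] = NA
X.hat = em.impute(X.miss)   # no NAs remain; imputations close to the truth
```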

It sounds simple, but there are entire books written on the procedure. And there are pitfalls. For a given problem, the final answer can depend on the starting values, since there is an optimization procedure involved and it is possible that convergence is to a local, rather than a global, optimum. Furthermore, depending on the number of values to be imputed, it is possible that the problem becomes mathematically intractable and further manipulation is needed to find a solution.

In our situation, EM has mainly been used not to impute values, but to estimate the mean vector and covariance matrix (when a few values are missing at random and not for systematic reasons) so that hypothesis tests could be carried out on the population parameters. The use of this method for reconstructing large numbers of values originated in other disciplines.

EM and Imputation of Temperatures

Back to our problem:

In our case, the parameters are µ and Σ which are estimated from the sample mean vector and the sample covariance matrix. Various methods are possible for imputing the missing values:

- As mentioned previously, using all of the known values to linearly predict the missing ones. One possible criticism of this is that it could be subject to “overfitting” because the number of predictors is large.

- Principal component regression. Take the known values (which are possibly related) and extract a given number of PCs. Use the PCs as predictors in a linear regression as above.

- Total Least Squares (also referred to as Errors In Variables). Here it is assumed that both the predictors and the missing values have errors, and the residuals for data points are calculated orthogonally (perpendicularly) from the predicting planar surface, rather than “vertically” in the direction of the points being predicted. The method again involves a singular value decomposition (SVD) of the data. This seems to have the drawback that the results change if the data are transformed in a non-orthogonal fashion.

- Truncation of the SVD of the data matrix. This appears to be what Steven McIntyre’s regem_pca function in R does. The data matrix is decomposed using SVD, all singular values smaller than a given value are replaced by zero, and the data matrix is then reconstructed. The missing values are then replaced by their counterparts from the newly reconstructed matrix.

- Inverse calibration. Here the value being reconstructed is used as the predictor in a reverse of the first method. I have not seen this method used anywhere, but for fun I have written an R function for doing so, although I haven’t really looked at the effectiveness of the method so far.
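As an illustration of the truncation idea (my own sketch of the general approach, not the actual regem_pca code): fill the holes, truncate the SVD to k components, copy the reconstructed entries back into the holes and iterate.

```r
# Iterative SVD-truncation imputation (toy sketch).
svd.impute = function(X, k = 1, max.iter = 100, tol = 1e-8) {
  miss = is.na(X)
  for (j in 1:ncol(X)) X[miss[, j], j] = mean(X[, j], na.rm = TRUE)
  for (iter in 1:max.iter) {
    old = X[miss]
    s = svd(X)
    recon = s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*%
            t(s$v[, 1:k, drop = FALSE])
    X[miss] = recon[miss]       # copy the truncated fit into the holes only
    if (max(abs(X[miss] - old)) < tol) break
  }
  X
}

# a rank-one matrix with one entry removed is recovered almost exactly
M = outer(1:5, 1:3)             # M[i, j] = i * j, so M[2, 3] = 6
M.miss = M
M.miss[2, 3] = NA
M.hat = svd.impute(M.miss, k = 1)
M.hat[2, 3]                     # converges to about 6
```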

It is not easy to evaluate the relative efficiency of each of these methods because the asymptotic results do not have simple closed mathematical forms. Proper comparison should be done on a theoretical, mathematical basis. However, this is rarely the approach in the climate science literature, where the preferred “proof” is to apply the methodology to one or two “surrogate” model outputs as justification for the efficacy of the techniques.

RegEM

Where does RegEM fit in? If the number of values to be imputed is very large, the number of parameters to be estimated may be larger than the available amount of real information in the data. Try fitting a regression with ten predictors using only 5 data points. In that case, the math fails (matrices which need to be inverted are singular and therefore not invertible) and you can get infinitely many perfect fits. In order to get a “unique” solution, the problem must be “regularized” by the imposition of extra conditions on the fitting equation. This leads to methods such as (partial) truncated total least squares (what a mouthful!) and ridge regression. You might wish to look at this paper (starting at page 5) for a somewhat mathematical description of these two methods. Extra regularization parameters for “fine tuning” must be introduced, so the results become subject to the choices of the analyst. The procedure can also become unstable, with lack of convergence in some situations. Because of the complexity, it is not easy to recognize possible bias in the results or the amount of possible underestimation of the variability of those results. Justifying that what you end up with is meaningful when creating so many imputations is not a trivial task under any circumstances.
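The singularity problem and the ridge fix can be seen in miniature (my own sketch of the regularization idea, not the RegEM code) with exactly the ten-predictors-five-points situation mentioned above:

```r
# With p > n the normal equations are singular; adding lambda on the
# diagonal makes them invertible, at the price of a tuning parameter.
set.seed(3)
n = 5; p = 10
X = matrix(rnorm(n * p), n, p)
y = rnorm(n)
# ordinary least squares fails: t(X) %*% X is 10x10 but has rank at most 5
singular = inherits(try(solve(t(X) %*% X), silent = TRUE), "try-error")
# ridge regression: the regularized system solves without complaint
lambda = 0.1
beta = solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
```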

No heavy duty math so far. Maybe more another time.