Faster Version for Combining Series

I have written a faster version for combining temperature time series (allowing for weights) in R.  I had hoped to post it along with an example using it, but I got sidetracked so the example is not completed yet.  However, I am posting the new function so that Jeff can use it in his latest efforts.    This version runs a LOT faster and more efficiently than the one posted earlier .

As well, I have found a “bug” in that version which causes the script to fail when any series is missing all of the values for some  month.  When the data is run in the newer version, you get results for all months, but I don’t think the results are necessarily realistic for the month in question.  There is no way to infer what the values for that month might look like for that station without making further assumptions so, at the moment, the best bet is to remove the offending series from the analysis and run the program without it.  I have included a short program to identify possible problem series.

Anyway, here is the updated version.  I might get the example done tomorrow if the nice weather we are currently having goes away. 😉

####Function for combining series
# For even faster calculation, use all=F
# to speed up multiple grid calculations

temp.combine = function(tsdat, wts=NULL, all=T) {
##### version2.0
### subfunction to do pseudoinverse
psx.inv = function(mat,tol = NULL) {
 if (NCOL(mat)==1) return( mat /sum(mat^2))
msvd = svd(mat)
 dind = msvd$d
if (is.null(tol)) {tol = max(NROW(mat),NCOL(mat))*max(dind)*.Machine$double.eps}
 dind[dind<tol]=0
 dind[dind>0] = 1/dind[dind>0]
 inv = msvd$v %*% diag(dind, length(dind)) %*% t(msvd$u)
inv}
### subfunction to do offsets
calcx.offset = function(tdat,wts) {
## new version
 nr = length(wts)
 delt.mat = !is.na(tdat)
 delt.vec = rowSums(delt.mat)
 row.miss= (delt.vec ==0)
 delt2 = delt.mat/(delt.vec+row.miss)
 co.mat = diag(colSums(delt.mat)) - (t(delt.mat)%*% delt2)
 co.vec = colSums(delt.mat*tdat,na.rm=T) - colSums(rowSums(delt.mat*tdat,na.rm=T)*delt2)
 co.mat[nr,] = wts
 co.vec[nr]=0
 psx.inv(co.mat)%*%co.vec }
### main routine
 nr = nrow(tsdat)
 nc = ncol(tsdat)
 dims = dim(tsdat)
 if (is.null(wts)) wts = rep(1,nc)
 wts=wts/sum(wts)
 off.mat = matrix(NA,12,nc)
 dat.tsp = tsp(tsdat)
 for (i in 1:12) off.mat[i,] = calcx.offset(window(tsdat,start=c(dat.tsp[1],i), deltat=1), wts)
 colnames(off.mat) = colnames(tsdat)
 rownames(off.mat) = month.abb
 matoff = matrix(NA,nr,nc)
 for (i in 1:nc) matoff[,i] = rep(off.mat[,i],length=nr)
 temp = rowMeans(tsdat-matoff,na.rm=T)
 pred=NULL
 residual=NULL
 if (all==T) { pred =  c(temp) + matoff
 residual = tsdat-pred }
list(temps = ts(temp,start=c(dat.tsp[1],1),freq=12),pred =pred, residual = residual, offsets=off.mat) }


#pick out those series with have at least nn + 1 observations in every month
#Outputs a logical vector with TRUE indicating that that sereis is OK
dat.check = function(tsdat, nn=0) {  good = rep(NA,ncol(tsdat))
 for (i in 1:ncol(tsdat)) good[i]= (min(rowSums(!is.na(matrix(tsdat[,i],nrow=12))))>nn)
 good }

12 Comments

Filed under Uncategorized

Combining Stations (Plan C)

Sometimes a dog can be not only a “best friend”, but also a catalyst for solving statistical problems.  A case in point: last night at 2:30 AM, our dog, Mr. Chips, woke me up to let him out to do his business.  When I couldn’t go back to sleep immediately, I lay in the dark rethinking my current math problem, in this case, the “offset matching” on which I have already posted several  times.

I had been concerned with the implementation of weights for combining series and several things did not sit quite right in my mind.  So, as is good statistical procedure, I reformulated the problem not from the point of view of Tamino’s sum of squared differences of pairs of offset station values, but rather from the consideration of the estimation problem itself.  It put everything into place for me.

Continue reading

41 Comments

Filed under Uncategorized

Weighted Seasonal Offsets

I have updated the R script for offset matching when combining multiple temperature time series so that it can now be used with different station weights.  This is useful when area weighting is desired in computing a mean grid cell series or when combining grid cells from different latitudes.  This method can now be used at every stage of determining a global series, from combining all sources at a single station to averaging mean latitudinal series to calculate a global mean – a good example of the adage “when all you have is a hammer, everything looks like a nail”.  🙂

Continue reading

17 Comments

Filed under Uncategorized

Comparing Single and Monthly Offsets

On the Open Mind blog, carrot eater posted the following comment with an inline response from Tamino:

Are the offsets mu specific by month? I’m curious at what point you take out the seasonal cycle.

[Response: There’s a single mu for a given station. Only after combining stations do I compute anomalies to remove the seasonal cycle. Computing a separate mu for each month is a viable alternative, but forces all station records to have the same average seasonal cycle as the reference station. If one wishes to study the seasonal cycle itself, it’s better to use a single offset for the station as a whole.]

So, is Tamino correct that the difference is only important with regard to the seasonal cycle or is the situation more complicated than that?

Continue reading

47 Comments

Filed under Uncategorized

Combining Stations (Plan B)

In his post on combining  station monthly temperature series, Tamino proposes a method for calculating an “offset” for each station.  This offset is intended to place the series for the stations all on at the same level.  The reason that makes such an accommodation necessary is the fact that there are often missing values in a given record – sometimes for complete years at a stretch.   Simple averages  in such a situation can provide a distorted view of what the overall picture looks like.

In his proposed “optimal” method, Tamino suggests  a procedure based on least square methods to calculate a set of coefficients for shifting a station record up or down before calculating a combined series for the set of stations.  The starting point is a sum of squares (which for calculational and discussion purposes I have rewritten in a slightly different form): Continue reading

35 Comments

Filed under Uncategorized

Rejected Again… Open Mind Isn’t!

I hate feeling rejected! It’s happened to me again.  Back in August 2009, I experienced that for the first time at RealClimate.   A comment which  I submitted on statistical methodology used in the Steig Antarctic paper did not pass their “peer review” and abruptly disappeared.

Today, I posted the following comment at Tamino’s Open Mind on the thread “Combining Stations” – one of a series of posts about processing station temperature data:

You don’t seem to get a lot of technical observations from your regular commenters in a thread such as this so I thought I might offer some comments about the method you are using to estimate the “offsets” for each station.

I don’t know if you are aware of it, but the “optimal” method appears to be a disguised two factor weighted analysis of variance. One factor is the stations (whose coefficients are the “offsets”) and the second is time (in this case months) treated as a numeric variable. This is not immediately obvious from the form in which you formulated the sum of squares that you are minimizing, but some fairly simple algebra can re-express that SS in a way which is more obviously consistent with an anova. The weights for a given temperature measurement are proportional to the number of stations which have available data for that month. I would have run some comparisons for your example in R, but your aversion to posting data and code makes that more difficult. However, I wrote a simple R program for calculating the offsets as well as some scripts for the anova and ran them on randomly generated data with identical results for both methods.

The side benefit of this is that the estimated monthly grid values are calculated simultaneously with the offsets rather than as (possibly sub-optimal averages) of the offset adjusted temperature series. Variability estimates can also be computed more easily as well.

Within an hour the comment dissipated in electron limbo.  I fail to understand why.

Was it the fact that I mistyped the phrase that time is ”treated as a numeric variable” (I meant to sat a “categorical” variable)?  Was it my observation that there were not a lot of technical contributions by the readers (there were 14 comments in the thread none of which contributed to the specific material being developed by the author)? Or maybe, my chiding remark that Tamino does not reveal his code or data (thereby preventing most of his readers from real participation)?  I honestly don’t know.

I should be pretty irritated but I take comfort in the fact that I am now in good company 🙂  – with people like Jeff Id and Lucia, both of whom I believe were summarily deleted from that blog at one time or another.

I had expected a real discussion regarding the methodology – that he would be gratified that someone actually expressed enough interest in the material to do further analysis (with some pretty interesting insights into his procedure), but I now suspect that he did not even understand exactly what I was telling him.    My guess would be that Tamino could actually have learned something from such a conversation, but it appears that he is more interested in appearing to be a mathematical “expert” to his small coterie of readers preferring to pontificate in one-sided expositions

I’ll also opine that the optimal method can definitely be called superior, both because it has no dependence on the order in which stations are processed, and because it simply minimizes the sum of squared differences between all pairs of station records.

rather than interacting with his audience in a meaningful and productive fashion to get a better grasp of the science behind it.

Open Mind? What a laugh!

28 Comments

Filed under Uncategorized

Met Office and the International Date Line

Since the release of the Met Office Subset recently, I have been looking further at the data trying to evaluate what exactly is contained therein.  One of the checks that one would like to do – determining how the global gridded data is calculated – is not really possible.  The data set is self-admittedly incomplete and, anyway, a list of which stations are used and which aren’t is not provided.  However, I decided to look at the grid cells information of the provided stations anyway to see what was going on.

Using another fairly extensive file of station information which I already possessed, ghcnd-stations.txt (which I believe to be downloaded from the previous incarnation of ClimateAudit.org), I did some checking of Met station information against the ghcn file.  Using only those that I could easily match up in the two data sets, I found differences in coordinates and elevations of a number of stations.  Although the heights (altitudes) need to be looked into further, because some of the differences can be quite large, the grid squares of most of the numerically different coordinates seemed   to be unaffected by those differences.  Except …

Continue reading

10 Comments

Filed under Uncategorized

Met Office and the Cru Data

Some days ago, the UK Met Office released a subset of the HadCrut3 data set at this web page.  I thought that it might be useful for people to be able to download the data and read it into the R statistical package for further analysis.  Along the way, I discovered, like Harry, that not everything is as quite as advertised so hopefully my observations on the differences will also prove useful.

The released data is available as a windows zipfile.   In the set, each station is available as a single text file (with no extension).  The files are contained in 85 directories (at this point in time) whose two digit names are the first two digits of the station numbers of the files in that directory.  It makes it easy to locate a station, but adds a bit of complication to trying to read the entire set simultaneously.  However, it turns out that  R actually can do it pretty easily J.  One word of warning:  When I downloaded the zipfile yesterday, I had found that new stations had been added to the set without any announcement of the changes on the web change that the set had been altered.  In the future, it is a good idea to keep an eye on the size of the file to determine if it is being altered.  Note to Met Office – it would be nice to have a date attached to the file indicating the most recent version.

Continue reading

10 Comments

Filed under Uncategorized

More Pictures

Here are four more pictures which are among my animal favorites.  Click on a picture to enlarge it:

Clicking on the enlarged picture opens a slightly larger version.

9 Comments

Filed under Photos

GHCN and Adjustment Trends

In his blog post, Giorgio Gilestro claims to show that the adjustments made by GHCN to temperature data do not induce  artificial trends into the data based mainly on this histogram of adjustments.

Arguments were made on the blog that the temporal order of the adjustments makes a difference and in fact these arguments seem to be correct.   I checked this through the following reasonably simple analysis:

Continue reading

96 Comments

Filed under Uncategorized