Weighted Seasonal Offsets

I have updated the R script for offset matching when combining multiple temperature time series so that it can now be used with different station weights.  This is useful when area weighting is desired in computing a mean grid cell series or when combining grid cells from different latitudes.  This method can now be used at every stage of determining a global series, from combining all sources at a single station to averaging mean latitudinal series to calculate a global mean – a good example of the adage “when all you have is a hammer, everything looks like a nail”.  :)

The properties for weighted offsets are similar to those seen in the equal weight case.  For example, as implemented in the equal weight situation, for any month when all of the stations have measured values, the combined result will be exactly equal to the simple average of the station values.  Only when one or more stations are missing will the results differ to accommodate possible biases due to systematic higher or lower levels that those missing stations may have in that particular month.  The same property is evident for the weighted case.  If all stations report a value, the combined value at the time will always be the weighted average of those temperatures.  Missing values will again produce a different result.

The script has is provided as a single R function with the same name as before:  monthly.offset.    The input consists of two variables, tsdat which is (as before) a multivariate temperature time series and wts, an optional vector of nonnegative weights with a single weight for each station.  If no weights are provided, the function will perform a default equal weight calculation.  The function returns a list with the same results as earlier:  the combined series (temps), predicted series (pred), residual series (residual) and monthly offsets (offsets) for each station.  It should be noted that the offsets for a given month are centered so that the weighted mean of the offsets is equal to zero.  This is an important part of making the method work properly.

Climate science statistics advances, one small step at a time…

Update:  Further analysis has shown that this method is NOT optimal.   please see Combining Stations – Plan C]

####NEW weighted offset function
###  tsdat = time series data matrix
###  wts = vector of non-negative weights of length ncol(tsdat)

monthly.offset =function(tsdat,wts=NULL) {
#pseudoinverse routine
psx.inv = function(mat,tol = NULL) {
if (NCOL(mat)==1) return( mat /sum(mat^2))
msvd = svd(mat)
dind = msvd$d if (is.null(tol)) {tol = max(NROW(mat),NCOL(mat))*max(dind)*.Machine$double.eps}
dind[dind<tol]=0
dind[dind>0] = 1/dind[dind>0]
inv = msvd$v %*% diag(dind, length(dind)) %*% t(msvd$u)
inv}
#offset calculation routine
calc.wt.offset = function(tdat,wts) {
delt.mat = array(rep(wts,each=nrow(tdat)),dim = dim(tdat))*(!is.na(tdat))
delt.vec = rowSums(delt.mat)
co.mat = diag(colSums(delt.vec*delt.mat)) - (t(delt.mat)%*%delt.mat)
co.vec = colSums(delt.vec*delt.mat*tdat,na.rm=T) - colSums(rowSums(delt.mat*tdat,na.rm=T)*delt.mat)
offs = psx.inv(co.mat)%*%co.vec
c(offs) - sum(wts*offs)/sum(wts)}
#main function
if (is.null(wts)) wts = rep(1,ncol(tsdat))
weights = array(rep(wts,each=nrow(tsdat)),dim = dim(tsdat))*(!is.na(tsdat))
dims = dim(tsdat)
off.mat = matrix(NA,12,dims[2])
dat.tsp = tsp(tsdat)
for (i in 1:12) {
dats = window(tsdat,start=c(dat.tsp[1],i), deltat=1)
offsets = calc.wt.offset(dats,wts)
off.mat[i,] = offsets }
colnames(off.mat) = colnames(tsdat)
rownames(off.mat) = month.abb
nr = dims[1]
nc = dims[2]
matoff = matrix(NA,nr,nc)
for (i in 1:nc) matoff[,i] = rep(off.mat[,i],length=nr)
temp = rowSums(weights*(tsdat-matoff),na.rm=T)/rowSums(weights)
pred =  c(temp) + matoff
residual = tsdat-pred
list(temps = ts(temp,start=c(dat.tsp[1],1),freq=12),pred =pred, residual = residual, offsets=off.mat) }



Filed under Uncategorized

17 responses to “Weighted Seasonal Offsets”

1. http://www.math.sdsu.edu/AMS_SIAM08Shen/Shen.pdf

i’m confused?

sorry OT but i thought u might like this

[Roman: It isn’t easy to do a more complete evaluation from a set of powerpoint slides, but there are several general things that I would tend to question here. The method is one which apparently seeks to weight station series by using principal components to optimally reduce the variability of the end result. This is apparently based on the assumption that there is a single “global value” to be estimated. However, IMO, this overlooks the non-random regional variation present in the system. By minimizing the variability, there is the danger of biasing the final estimator so that there is no gain (and possibly a greater error introduced) in the final result. I could be misinterpreting things from the sparse description in the pdf so I remain to be convinced by what they are doing.

Interestingly, there appear to be two Sam Shens, both working in climate science. One is at the University of Alberta and this one at San Diego State. This one is an engineer with a PhD in Applied Math. From personal experience, the math guys tend to overlook the nuances of Statistics ;) ]

• Kenneth Fritsch

Steve Mosher thanks for the link. I am doing a simple minded analysis of the temperature uncertainties for GHCN adjusted temperatures and I have been looking for papers that attack the same problem.

I would like to give a summary of my analysis and methods used for Roman to look at and tell me where my analysis might be in error. Working with the data gives someone with my statistical skills a feel for what I am looking at but unfortunately not the confidence or full set of tools to confidently make conclusions.

I have done a quick read of the Shen paper and I cannot say I understand exactly what they are doing or the assumptions that have used. It would appear to be a work in progress.

When Roman says, “However, IMO, this overlooks the non-random regional variation present in the system”, I have to say that is the most interesting aspect of my analysis of the GHCN data so far.

I am currently looking for some patterns in the data variances. The distribution of global station temperature trends certainly does not fit a normal distribution (no big surprise there) while some grids (with enough stations to make a reasonable estimate) appear to fit a normal distribution well (using Shapiro-Wilk test for normality) and others do not. I also see some correlations of trends and variances with latitude.

On the same OT subject, I need to ask Roman if he has read the paper by the NOAA scientists (I will link to it if you want) where they used Anthony Watts’ CRN data of USHCN stations for an analysis that concludes, more or less that station CRN ratings have little or no affect on temperature trends. I need to read it again and in more detail, but the first time through it did raise some questions for me.

It also reminds that I hope that Watts finds some good statistical support for a paper I hear he has in the works on the same subject.

• Thx Kenneth

I keep coming back to this question of “how many thermometers” in two contexts. In one context there are a few individuals who scream nyquist and say that there are not enough thermometers. I went looking for a good explanation for them ( ee types) The other context is the context of choosing between:

1. A few good thermometers
2. More worse thermometers
3. More worse thermometers made “better”
by proceedures like
B. In filling

That question is fun.

WRT Menne. I havent read it. I think there is a problem in thinkimg that CR1-5 is some kind of quantitative scale. thats why I looked at CRN12 versus CRN5. There I could find a signal.. I think people took the CRN5 = 5 degree error as a FACT when it is not.
So in my mind CRN1-5 is a qualitative scale that may not map uniformly to the number line.

2. Dear Sir

I did greatly appreciate the compact, precise, and concise ( ‘closed form’ ?)
symbolic expression in the original exposition of
‘Roman’s Holiday (From Suboptimal Combination Offset Artifacts)’.
It provides the benchmark by which the correctness of intended implementation can be judged, as well as the soundness of the approach.

In the same breath, I recognize the huge service of providing turnkey implementations of statistically-sophisticated functions in a generally available math automation tool.

If you are in the mood to deal with the necessary editor, I hope you will find the time to give an encore performance of that leadership behaviour,
i.e., write out the mathematical expression being implemented.

In this regard, you have set the bar that many in ‘climascientology’ cannot clear or ‘choose not to attempt';
Formal Expression of the underlying math.

As Yogi Berra never said, “It Ain’t Rocket Science Math we’re talkin’ about here “…

TIA
RR

• RomanM

I will indeed write up the math behind this at some point in time to be determined. I was waiting until Tamino’s paper has been submitted. ;)

Actually, one of my least favorite activities is typing up the text and equations – so far, I have been doing the math on a scrap paper the old fashioned way with a pen. The major part of the fun is in working out and understanding the problem and I get alittle lazy when it comes to the formal exposition.

• Kenneth Fritsch

That’s what happens when you retire and you have no one to impress but yourself. When you say in a time TBD that to me is also a good indication of the retirement independence kicking in.

I tell my wife that we have both been retired so long and we have become so independent that we are no longer fit to be employed by others.

3. Ya beat me to it. This is exactly what I wanted to do. Very nice.

4. In my spare minutes I was struggling with how to handle the weighting of the offset anomalies. I was imagining all these re-centering operations to insure I didn’t add steps in. Your solution is far cleaner than what I would have come up with.

• RomanM

Jeff, from looking at the math, it was apparent that the weights had to be incorporated into the initial sum of squares to be minimized. Initially, it was not obvious how that was to be done from looking at the SS formulation used by Tamino. However, if one looked at it in the context of the reformulation in my initial post on the topic, the actual implementation became relatively simple.

• You don’t mind if I keep posting on this do you?

• Wiki has a post on weighted least squares.

http://en.wikipedia.org/wiki/Linear_least_squares

Don’t konw if the math will copy correctly below.

\hat \boldsymbol \beta = (X^\top W X)^{-1} X^\top W \mathbf y. \,

[Roman: I will try to fix that:

$\hat \beta = \left( {X^T WX} \right)^{ - 1} X^T Wy$

I think that worked]

5. Kenneth Fritsch

I am imposing here with more off topic comments on the Menne paper analysis of the Watts CRN ratings, so Roman do with it what you will.

The link to the paper is:
ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/v2/monthly/menne-etal2010.pdf

And a blog response by Watts is here:
http://wattsupwiththat.com/?s=Surface+Station+Survey

On re-reading the paper I was reminded of the following:

Menne uses CRN12 versus CRN 345 and uses the time period of 1980-present for the comparison of trends. Why did he use CRN12 versus CRN345 as opposed to CRN123 versus CRN45. Why did he choose the time period 1980-present versus lets say 1920-present? Why did he not do sensitivity tests for groupings and time period selections? If a station evolved into hotter (colder) environment not related to the climate would not one considered that much of the change occurred some time ago and then reached a plateau more or less in the following years?

Menne spends time in the paper explaining a good correlation of a newer network of stations recording temperature to those for the USHCN whether “good” or “bad” stations. Then one realizes that these newer stations are only 5 years old and that the comparison is not temperature to temperature but temperature change to temperature change. Again the authors make no effort to point out that the changes in station CRN quality could have been “completed” before 5 years ago. I saw no statistical test in the paper for significance or what level of difference would be required to see significance given the small number of CRN 1 and 2 level stations in the study. While not relevant to the paper’s conclusion, I would think showing the large variability in station trends, whatever the CRN rating, would show the problem of a comparison with smaller sample sizes.

A paper like this one, in my mind, cries out for independent study and analysis and not ones by the owners of a data set. I saw in the Watts blog response that he has a paper in process with Pielke Sr analyzing the Watts CRN rated stations. I certainly hope that they get all the good statistical advice that they can acquire.

6. RomanM

Jeff:

The Wiki is mainly generalities and I don’t think it is particularly helpful for understanding what is going on here. It is more informative to look at the SS to see what is going on. I will probably post something on that when I get around to it.

• Thanks Roman. I’m having a hard time seeing the math from the code on this one. It’s only the calc.wt.offset which is difficult. I know what it does from your first post where you used ‘ln’ but don’t really see how.

Was it your intent to take the next step of weighting when you wrote this part out yourself? b/c that’s what got my head thinking.

BTW, I’ve been rewriting the global temp code again. Each time it gets simpler and cleaner. I think I’m going to redo the map plotting code which makes the whole thing look more complex than it is.

• RomanM

I am not sure what sort of an explanation on the code you are looking for here. Could you be a bit more specific?

Yes, it was always my intent to figure out how to implement weighting for area weighting, other reasons. In particular, I have a problem with the way grid values are calculated using only stations within the grid cell ignoring stations just over the boundary. I think it might be interesting to instead to estimate temperatures at fixed points using weights related to distance from surrounding stations (ignoring grid boundaries) and then averaging those to calculate a grid mean temperature. This gets around the problem of the grid cell shape influencing the result. However, that’s a biyt down the road.

I look forward to seeing your slimmer, faster code…s

7. Pingback: Proxy Hammer « the Air Vent