Will the Real Rapid City Please Stand Up!

This is a post about the quality of temperature data.  You can spend a lot of time finding methods to maximize the information you squeeze out of data, but unless the data itself is reliable, all of that effort is wasted.  Recently, I ran across an example which I found somewhat disconcerting.

I had been testing some methods for estimating the temperature at particular locations in a geographic grid cell, using the temperature data set released by the Met Office.  The grid cell was chosen on the basis that there was a reasonable collection of stations available for use in the procedure: 40 – 45 N by 100 – 105 W, in the north central region of the United States.  I chose a station with a long, fairly complete record; my intent was to look at distance-based weighting for estimating the temperature at that station site using the neighboring stations.  Then I could compare the actual measured temperature to the estimated temperature to evaluate how well I had done.  But my results seemed poorer than I had expected, so I thought that perhaps I should look more closely at the station record.
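The post does not show the exact weighting procedure, but the basic idea can be sketched as follows.  This is a minimal illustration, assuming a simple inverse-distance weighting scheme; the function name, the exponent, and the distances are all my own invented examples, not values from the post.

```r
# Sketch of inverse-distance weighting: estimate a target station's value
# as a weighted mean of its neighbours, with weights 1/distance^p.
# All names and numbers here are hypothetical.
idw.estimate = function(neighbor.temps, dists, p = 2) {
  # neighbor.temps: neighbour temperatures for one month (may contain NA)
  # dists: distances (km) from the target station to each neighbour
  w = 1 / dists^p
  sum(w * neighbor.temps, na.rm = TRUE) / sum(w[!is.na(neighbor.temps)])
}

# Three hypothetical neighbours at 50, 120 and 80 km
est = idw.estimate(c(10.2, 11.5, 9.8), dists = c(50, 120, 80))
```

Comparing `est` against the actual measured value at the target station, month by month, gives a direct check on how well the neighbours predict the site.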

The station I had chosen was Rapid City, South Dakota – ID number 726620 – with a current population close to 60,000 people according to Wikipedia.  For comparison purposes, I collected the same station’s records from a variety of other sources: GISS “raw” (same as “combined”) and homogenized, directly from the Gistemp web pages; GHCN “raw” and adjusted, from the v2 data set; and what were listed as the same two GHCN records from the Climate Explorer web site.  The subsequent analysis proved quite interesting.

Continue reading


Anomaly Regression – Do It Right!

I have been meaning to write this post for several years, but until now I had not found a particularly relevant occasion.  In his recent post at the Air Vent, Jeff Id makes the following statement:

Think about that.  We’re re-aligning the anomaly series with each other to remove the steps.  If we use raw data  (assuming up-sloping data), the steps in this case were positive with respect to trend, sometimes the steps can be negative.  If we use anomaly alone (assuming up-sloping data), the steps from added and removed series are always toward a reduction in actual trend.  It’s an odd concept, but the key is that they are NOT TRUE trend as the true trend, in this simple case, is of course 0.12C/Decade.

The actual situation is deeper than Jeff thinks.  The usual method used by climate scientists for monthly anomaly regression is wrong! Before you say, “Whoa! How can a consensus be wrong?”, let me first give an example, which I will follow up with the math to show you where the problem lies.
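The math behind the claim is behind the “Continue reading” link; for orientation, here is a sketch of the conventional recipe being critiqued: subtract each calendar month’s mean to form anomalies, then fit a single OLS trend line.  The synthetic data, seed, and names below are my own, chosen only so the true trend (0.12 C/decade, as in Jeff’s example) is known.

```r
# The "usual method": monthly anomalies first, then one OLS trend.
# Synthetic series: 20 years of monthly data with a seasonal cycle,
# a known trend of 0.012 C/yr (0.12 C/decade), and some noise.
set.seed(1)
yrs = 20
time = seq(0, by = 1/12, length.out = 12 * yrs)     # time in years
seas = rep(8 * sin(2 * pi * (1:12) / 12), yrs)      # seasonal cycle
x = 15 + seas + 0.012 * time + rnorm(12 * yrs, sd = 0.3)

mon = rep(1:12, yrs)
anom = x - tapply(x, mon, mean)[mon]                # subtract monthly means
fit = lm(anom ~ time)
trend.per.decade = 10 * coef(fit)[[2]]              # close to 0.12
```

On a single complete series this recipe recovers the trend; the post’s point concerns what happens when series with steps, gaps, and differing coverage are combined.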

Continue reading


Faster Version for Combining Series

I have written a faster version of the R function for combining temperature time series (allowing for weights).  I had hoped to post it along with a worked example, but I got sidetracked, so the example is not completed yet.  However, I am posting the new function so that Jeff can use it in his latest efforts.  This version runs a LOT faster and more efficiently than the one posted earlier.

As well, I have found a “bug” in that earlier version which causes the script to fail when any series is missing all of the values for some month.  When the data are run through the newer version, you get results for all months, but I don’t think the results are necessarily realistic for the month in question.  There is no way to infer what the values for that month might look like for that station without making further assumptions, so, at the moment, the best bet is to remove the offending series from the analysis and run the program without it.  I have included a short script to identify possible problem series.

Anyway, here is the updated version.  I might get the example done tomorrow if the nice weather we are currently having goes away. ;)

#### Function for combining series
# For even faster calculation, use all=F
# to speed up multiple grid calculations

temp.combine = function(tsdat, wts = NULL, all = T) {
  ##### version 2.0

  ### subfunction to compute a Moore-Penrose pseudoinverse via the SVD
  psx.inv = function(mat, tol = NULL) {
    if (NCOL(mat) == 1) return(mat / sum(mat^2))
    msvd = svd(mat)
    dind = msvd$d
    if (is.null(tol)) {
      tol = max(NROW(mat), NCOL(mat)) * max(dind) * .Machine$double.eps
    }
    dind[dind < tol] = 0
    dind[dind > 0] = 1 / dind[dind > 0]
    msvd$v %*% diag(dind, length(dind)) %*% t(msvd$u)
  }

  ### subfunction to calculate the offsets for one calendar month
  calcx.offset = function(tdat, wts) {
    ## new version
    nr = length(wts)               # number of series (columns of tdat)
    delt.mat = !is.na(tdat)        # indicator of available observations
    delt.vec = rowSums(delt.mat)   # number of series reporting each year
    row.miss = (delt.vec == 0)     # guard against division by zero
    delt2 = delt.mat / (delt.vec + row.miss)
    co.mat = diag(colSums(delt.mat)) - (t(delt.mat) %*% delt2)
    co.vec = colSums(delt.mat * tdat, na.rm = T) -
      colSums(rowSums(delt.mat * tdat, na.rm = T) * delt2)
    co.mat[nr, ] = wts             # last equation replaced by the
    co.vec[nr] = 0                 # weighted constraint on the offsets
    psx.inv(co.mat) %*% co.vec
  }

  ### main routine
  nr = nrow(tsdat)
  nc = ncol(tsdat)
  if (is.null(wts)) wts = rep(1, nc)
  wts = wts / sum(wts)
  dat.tsp = tsp(tsdat)
  # calculate a separate set of offsets for each calendar month
  off.mat = matrix(NA, 12, nc)
  for (i in 1:12) {
    off.mat[i, ] = calcx.offset(window(tsdat, start = c(dat.tsp[1], i), deltat = 1), wts)
  }
  colnames(off.mat) = colnames(tsdat)
  rownames(off.mat) = month.abb
  # subtract the offsets from each series and average across series
  matoff = matrix(NA, nr, nc)
  for (i in 1:nc) matoff[, i] = rep(off.mat[, i], length.out = nr)
  temp = rowMeans(tsdat - matoff, na.rm = T)
  pred = NULL
  residual = NULL
  if (all == T) {
    pred = c(temp) + matoff
    residual = tsdat - pred
  }
  list(temps = ts(temp, start = c(dat.tsp[1], 1), freq = 12),
       pred = pred, residual = residual, offsets = off.mat)
}


# Pick out those series which have at least nn + 1 observations in every month.
# Outputs a logical vector with TRUE indicating that the series is OK.
dat.check = function(tsdat, nn = 0) {
  good = rep(NA, ncol(tsdat))
  for (i in 1:ncol(tsdat)) {
    good[i] = (min(rowSums(!is.na(matrix(tsdat[, i], nrow = 12)))) > nn)
  }
  good
}
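To show how the check is meant to be used, here is a small self-contained sketch (the `dat.check` definition is repeated inside it so the example runs on its own; the synthetic data and seed are my own invention).  Three five-year monthly series are built, with the third missing every February:

```r
# Self-contained usage sketch for dat.check; definition repeated from above.
dat.check = function(tsdat, nn = 0) {
  good = rep(NA, ncol(tsdat))
  for (i in 1:ncol(tsdat)) {
    good[i] = (min(rowSums(!is.na(matrix(tsdat[, i], nrow = 12)))) > nn)
  }
  good
}

set.seed(2)
tsdat = ts(matrix(rnorm(60 * 3), 60, 3), start = c(1990, 1), freq = 12)
tsdat[seq(2, 60, by = 12), 3] = NA   # remove every February from series 3

good = dat.check(tsdat)
# good is c(TRUE, TRUE, FALSE); drop the offending series before combining:
tsdat.ok = tsdat[, good]
```

The filtered matrix `tsdat.ok` can then be passed to `temp.combine` without triggering the empty-month problem described above.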


Combining Stations (Plan C)

Sometimes a dog can be not only a “best friend”, but also a catalyst for solving statistical problems.  A case in point: last night at 2:30 AM, our dog, Mr. Chips, woke me up to let him out to do his business.  When I couldn’t go back to sleep immediately, I lay in the dark rethinking my current math problem, in this case, the “offset matching” on which I have already posted several times.

I had been concerned with the implementation of weights for combining series, and several things did not sit quite right in my mind.  So, as is good statistical procedure, I reformulated the problem, not from the point of view of Tamino’s sum of squared differences of pairs of offset station values, but from the consideration of the estimation problem itself.  That put everything into place for me.

Continue reading


Weighted Seasonal Offsets

I have updated the R script for offset matching when combining multiple temperature time series so that it can now be used with different station weights.  This is useful when area weighting is desired in computing a mean grid cell series or when combining grid cells from different latitudes.  This method can now be used at every stage of determining a global series, from combining all sources at a single station to averaging mean latitudinal series to calculate a global mean – a good example of the adage “when all you have is a hammer, everything looks like a nail”.  :)
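The post does not show how the weights are constructed; one common choice (an assumption on my part, not taken from the post) is to weight each grid cell by the cosine of its central latitude, which is proportional to the cell's area on the sphere.  The latitudes below are hypothetical examples:

```r
# A common area-weighting scheme for grid cells: weight by cos(latitude).
# Cell centre latitudes here are hypothetical.
cell.lats = c(42.5, 47.5, 52.5)        # degrees north
wts = cos(cell.lats * pi / 180)
wts = wts / sum(wts)                   # temp.combine normalizes wts anyway
```

A weight vector of this form can then be passed as the `wts` argument of `temp.combine` when averaging cells across latitudes.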

Continue reading


Comparing Single and Monthly Offsets

On the Open Mind blog, carrot eater posted the following comment with an inline response from Tamino:

Are the offsets mu specific by month? I’m curious at what point you take out the seasonal cycle.

[Response: There’s a single mu for a given station. Only after combining stations do I compute anomalies to remove the seasonal cycle. Computing a separate mu for each month is a viable alternative, but forces all station records to have the same average seasonal cycle as the reference station. If one wishes to study the seasonal cycle itself, it’s better to use a single offset for the station as a whole.]

So, is Tamino correct that the difference matters only with regard to the seasonal cycle, or is the situation more complicated than that?
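The core of the issue Tamino describes can be seen in a toy example (mine, not his code): two overlapping series with the same annual mean but different seasonal amplitudes.  A single offset per station cannot reconcile the differing cycles, while monthly offsets can:

```r
# Toy comparison of a single offset vs monthly offsets.
# Two fully overlapping 10-year series with different seasonal amplitudes.
mons = rep(1:12, 10)
s1 = 10 + 5 * sin(2 * pi * mons / 12)
s2 = 10 + 8 * sin(2 * pi * mons / 12)   # same mean, bigger seasonal swing

# Single offset: difference of the overall means
off.single = mean(s2) - mean(s1)
resid.single = s2 - off.single - s1     # still has a seasonal pattern

# Monthly offsets: one offset per calendar month
off.monthly = tapply(s2 - s1, mons, mean)
resid.monthly = s2 - off.monthly[mons] - s1   # essentially zero
```

The monthly-offset residual vanishes because the seasonal difference is absorbed into the twelve offsets; whether that absorption is desirable is exactly the question posed above.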

Continue reading


Combining Stations (Plan B)

In his post on combining station monthly temperature series, Tamino proposes a method for calculating an “offset” for each station.  This offset is intended to place the series for all of the stations at the same level.  Such an accommodation is necessary because there are often missing values in a given record – sometimes for complete years at a stretch.  Simple averages in such a situation can provide a distorted view of what the overall picture looks like.

In his proposed “optimal” method, Tamino suggests a procedure based on least squares to calculate a set of coefficients for shifting each station record up or down before calculating a combined series for the set of stations.  The starting point is a sum of squares (which, for calculational and discussion purposes, I have rewritten in a slightly different form): Continue reading
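The rewritten sum of squares is behind the “Continue reading” link, but the general shape of the criterion can be sketched numerically.  This is a hedged illustration of the idea, not Tamino's actual code: find offsets that minimize the summed squared differences between offset station values at each time step, with a constraint to pin down the free additive constant.  The data, seed, and the sum-to-zero constraint are my own choices.

```r
# Sketch: least-squares offsets minimizing pairwise squared differences.
# Synthetic data: 3 series, series 2 runs ~2 degrees high, some gaps.
set.seed(3)
x = matrix(rnorm(50 * 3), 50, 3)
x[, 2] = x[, 2] + 2
x[1:10, 3] = NA

# Sum over time of squared differences between all pairs of offset values
ss = function(mu) {
  tot = 0
  for (t in 1:nrow(x)) {
    v = x[t, ] - mu
    v = v[!is.na(v)]
    if (length(v) > 1) tot = tot + sum(outer(v, v, "-")^2) / 2
  }
  tot
}

# Pin down the free constant with sum(mu) = 0: optimize two offsets,
# with the third determined by the constraint.
fit = optim(c(0, 0), function(m) ss(c(m, -sum(m))))
mu = c(fit$par, -sum(fit$par))   # recovers the ~2 degree gap of series 2
```

The matrix formulation used in `temp.combine` above solves a closely related system in closed form rather than by numerical minimization.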
