Sometimes a dog can be not only a “best friend”, but also a catalyst for solving statistical problems. A case in point: last night at 2:30 AM, our dog, Mr. Chips, woke me up to let him out to do his business. When I couldn’t go back to sleep immediately, I lay in the dark rethinking my current math problem, in this case, the “offset matching” on which I have already posted several times.

I had been concerned with the implementation of weights for combining series and several things did not sit quite right in my mind. So, as is good statistical procedure, I reformulated the problem not from the point of view of Tamino’s sum of squared differences of pairs of offset station values, but rather from the consideration of the estimation problem itself. It put everything into place for me.

The Math

As in my initial post, we have r stations each recording monthly temperatures. We assume that the temperature at a station is some common temperature plus a seasonal (monthly) value that is characteristic of that station. A further “error term” is added to account for errors in measurement and local “weather” effects.

Thus

x_{i}(t) = τ(t) + µ_{i}(m(t)) + ε_{i}(t)

where,

x_{i}(t) = the observed temperature at station i at time t

τ(t) = “temperature of the region”

µ_{i}(m(t)) = µ_{i}(m), the monthly value for station i in month m

ε_{i}(t) = error term

i = 1, 2, …, r = the station identifiers for a set of r stations

t = 1, 2, …, N = the time (i.e. year and month) of the observation.

It must be realized that there is a “tradeoff” between the τ(t) and the µ_{i}(m)’s – adding a constant to τ(t) and subtracting that constant from each of the µ_{i}(m)’s leaves the above equation unchanged. However, we can deal with that issue in the following fashion:

Suppose we are given a set of station weights {ω_{i}} whose sum is equal to one, and we look at what happens when we calculate the weighted average of the *actual* (not observed) temperatures at each of the stations. The equation looks like:

Σ_{i} ω_{i} [τ(t) + µ_{i}(m)]

or

τ(t) + Σ_{i} ω_{i} µ_{i}(m)

If we shift the µ_{i}(m)’s so that the sum in the last equation is equal to zero, then τ(t) simply becomes the average value we are trying to estimate.

Standard least squares theory says that the optimum solution (when the error terms are equally variable) is given by minimizing the sum of squares:

SS = Σ_{t} Σ_{i} δ_{i}(t) [x_{i}(t) − τ(t) − µ_{i}(m(t))]²

where δ_{i}(t) = 0 or 1 depending on whether an observation is absent or present.

For the equal weight case, this result is different from that obtained by using Tamino’s “offset matching” approach. It turns out that minimizing the sum of squares (where W(t) is the number of stations available at time t)

SS_{W} = Σ_{t} W(t) Σ_{i} δ_{i}(t) [x_{i}(t) − τ(t) − µ_{i}(m(t))]²
produces the same set of “offsets” µ_{i}(m) as those obtained by minimizing the sum of squares that Tamino has proposed. Clearly, although Tamino’s solution is unbiased, it cannot be optimal. It will have larger variability and correspondingly larger error bars than the simple least squares estimates detailed above. Although “offset matching” seemed like a good idea (to me as well), from the equations above, it can be seen that the flaw is that individual temperatures from a station are given less weight in the estimation process when there are other stations which are not reporting.

I have revised one of my earlier scripts to implement the “least squares” solution in R. However, given my loss of three hours of sleep last night, I will not be responsible for any errors which may be in the script until I have had more time to look at it more thoroughly. Later, I will run some comparisons of the two methods to indicate that the differences in the results are not negligible.

anov.combine = function(ttemp, wts = NULL) {  # version 2.0 😉
  # function to restructure data for anova
  conv.anov = function(ind.var) {
    response = c(ind.var)
    rowvar = as.factor(c(row(ind.var)))
    colvar = as.factor(c(col(ind.var)))
    months = as.factor(rep(1:12, len = nrow(ind.var)))
    data.frame(response = response, rowvar = rowvar, colvar = colvar, months = months)
  }
  datanov = conv.anov(ttemp)
  # calculate anovas by month
  anovout = by(datanov, datanov$months,
               function(x) lm(response ~ 0 + rowvar + colvar, data = x))
  offs = matrix(NA, 12, ncol(ttemp))
  temp.ts = rep(NA, nrow(ttemp))
  for (i in 1:12) {
    testco = coef(anovout[[i]])
    whch = grep("rowvar", names(testco))
    temp.ts[as.numeric(substr(names(testco)[whch], 7, 11))] = testco[whch]
    offs[i, ] = c(0, testco[grep("colvar", names(testco))])
  }
  # center offsets and adjust time series
  nr = length(temp.ts)
  if (is.null(wts)) wts = rep(1, ncol(ttemp))
  adj = c(offs %*% (wts / sum(wts)))
  offs = offs - adj
  temp.ts = temp.ts + rep(adj, length = nr)
  # calculate predicted and residuals for data
  nc = ncol(offs)
  matoff = matrix(NA, nr, nc)
  for (i in 1:nc) matoff[, i] = rep(offs[, i], length = nr)
  preds = c(temp.ts) + matoff
  colnames(offs) = colnames(ttemp)
  rownames(offs) = month.abb
  list(temps = ts(temp.ts, start = c(tsp(ttemp)[1], 1), freq = 12),
       offsets = offs,
       residual = ttemp - preds,
       predicted = preds)
}

There is a lot to absorb in this post. I’m not as familiar with this math as you, and if I’m confused, there is usually at least one other reader who is. When you have tau plus mu, I assume mu must be an anomaly of the monthly delta rather than the measured temp. Is that what you mean by “seasonal” in the text? If so, perhaps some clarification would help; if not – definitely some clarification would help.

The transition from the function to be minimized to the code is still not obvious to me either. Of course further explanation may require that I pay by the credit hour.

Tau is the “climate signal” or “grid average” or whatever it is that you are trying to estimate. It would contain seasonal variations, trends and anything else that applies to the total group of stations. The mu’s are the monthly deviations from the overall pattern at a given station and are exactly what we were calling the “offsets” earlier. This reflects the difference from one station to another due to location (close to a large body of water, in a desert, or at a higher or lower altitude). However, we are not comparing them pairwise as in the Tamino formulation. The sum of these plus the “error term” is the actual measured temperature at a given station at a specific time. Seasonal here refers to the 12 months.

To get to a regression, we define N + 12*r indicator variables, each of them taking the value one or zero. N is the total common length of the combined records, r is the number of stations and 12*r is the number of offsets. For a given measurement, only two of these will be one and all the rest will be zero. The ones are the time (to pick the specific tau(t)) and the month/station combination (to pick the mu_i). Since the others are zero, you can write the sum of squares out simply by showing only the variables which have a one. The coefficients of the indicator variables are the actual values of the taus and mu’s that we are estimating.
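As a rough sketch of this indicator-variable setup (in Python/NumPy rather than R, with a small made-up noiseless example – the sizes and values here are mine, not from the post), the regression can be written out by building the 0/1 design matrix directly:

```python
import numpy as np

# Toy indicator-variable regression for a single month (hypothetical,
# noiseless data so the recovery is exact): 4 years, 3 stations.
n_years, n_stations = 4, 3
tau_true = np.array([10.0, 10.5, 9.8, 10.2])
mu_true = np.array([-1.0, 0.0, 1.0])      # offsets, already sum to zero

rows, y = [], []
for t in range(n_years):
    for i in range(n_stations):
        z = np.zeros(n_years + n_stations)
        z[t] = 1.0               # indicator picking out tau(t)
        z[n_years + i] = 1.0     # indicator picking out mu_i
        rows.append(z)
        y.append(tau_true[t] + mu_true[i])
X, y = np.array(rows), np.array(y)

# X is rank deficient (the tau/mu tradeoff described in the post), so
# use a pseudo-inverse, then shift so the mu's average to zero.
coef = np.linalg.pinv(X) @ y
tau_hat, mu_hat = coef[:n_years], coef[n_years:]
shift = mu_hat.mean()
tau_hat, mu_hat = tau_hat + shift, mu_hat - shift

print(np.allclose(tau_hat, tau_true), np.allclose(mu_hat, mu_true))
```

The final shift is one way of imposing the “mu’s sum to zero” convention from the post; any solution of the rank-deficient system differs from the true parameters only by such a constant.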

In practice, since the months don’t overlap, we are only dealing with r + N/12 of these at a time in separate regressions. This speeds up the procedure and keeps us from having memory problems in the computer calculation. For example, with 100 years and 20 stations, this is a total of 1200 + 12(20) = 1440. The matrix to be inverted in the complete regression would be about 1440 x 1440, or about 2,000,000 elements in size. However, when separated by month, it becomes 120 x 120, a much quicker job (although there would be 12 of these matrices in a given combining problem). This would qualify in statistics as a two-factor analysis of variance situation.

If there are no missing values, the solution becomes trivial. The taus are nothing more than the average of all of the station values at a given time and the mu’s are basically the average of the station values for a given month (shifted as you like to make things add up). It is only when missing values are present that the solution changes in a non-trivial fashion.
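The no-missing-values case can be checked numerically. A small Python sketch (with invented data, not from the post): the simple averages satisfy the least squares normal equations, i.e., the residuals sum to zero across both stations and times:

```python
import numpy as np

# Hypothetical complete data: x[t, i] for 5 times and 4 stations, built
# from a base level plus fixed station offsets plus noise.
rng = np.random.default_rng(1)
x = 10 + rng.normal(size=(5, 4)) + np.array([-2.0, -1.0, 1.0, 2.0])

# With no missing values, the least squares solution is just averages:
tau = x.mean(axis=1)                      # average over stations at each time
mu = (x - tau[:, None]).mean(axis=0)      # average station deviation

# Check: residuals sum to zero over each station and over each time,
# which is exactly the least squares normal equations.
resid = x - tau[:, None] - mu[None, :]
print(np.allclose(resid.sum(axis=0), 0), np.allclose(resid.sum(axis=1), 0))
```

Once observations go missing, these simple averages no longer solve the normal equations, which is why the regression machinery is needed.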

Regardless of what people like carrot snorter say, you can estimate the mean temperature of a region legitimately without resorting to anomalies because the mu’s are actually “anomalizers” of the monthly data of the stations. That way, if you wish to use anomalies, their calculation can be delayed until the combining has been carried out, so a single common period is not required for this procedure to work properly. This allows for the use of all the data without throwing away somewhat shorter series or series from disjoint time intervals. You can actually think of this procedure as similar to what is done with tree ring series, except that all parameters are being simultaneously estimated.

H and L did their stuff long ago when computers were less powerful and they were not statisticians, so they ended up with what I see as a clumsy way of dealing with the process. They did not realize that choosing the longest station will likely pick the same long-history station as a starting point for every grid in a 1200 km radius of that station just because it is a long series. Talk about the most important trees in the world – these become the most important stations for the temperature record. However, their attempt to divide the world up into zillions of little grid cells is similar to what I have suggested, except that I would not limit the choice of stations to use within the type of regions that they seem to have defined. However, that is a different story.

Hope it helps. I also hope that Tamino has not submitted his paper yet… 😉

By the way, I have written a script to do the math without the lm function so the calculation can be a little faster with fewer moving parts.

If I understand you Roman, this means that we don’t have to resort to the H&L reference station method. You also should not have to “infill” missing months… is that correct?

Further, there are ways this plays into the H&L urban adjustment, since they can only adjust stations if there is some arbitrary period of overlap. Also, Jones’ method required that a station have presence in the base period. He didn’t want to move the base period because he would lose many stations.

very cool.

No infilling, no reference stations, no taking stations in order – all of the parameters are estimated simultaneously. Yes, step changes can be accommodated by treating the two periods as separate stations.

It should be understood that I did not “invent” this method of combining stations. I am merely applying good existing statistical methodology to the problem. The method comes with known characteristics and tools for evaluating uncertainties, etc. In the script, I not only output estimates of the combined series (taus) and the offsets (mu’s), but I also give the predicted values (which can be used for infilling station series – for example, to calculate annual series) and residual values (how much does a station’s observed series differ from the predicted). The latter can be used to evaluate the assumptions and to look for specific station effects (such as step changes and UHI).

Dear Sir;

I hope someday to shake your hand, hopefully the ungloved one now that you have ‘thrown down the gauntlet’…

From my perspective, the weakest link in all of the climascientology numerology procedures to date, has been the absence of the Qualification Test Specification, and the detailed performance thereof.

Elegantly, you have left the verification of unbiasedness ‘to the reader’, but the challenge is there nonetheless: which method is optimally free of artifact from missing data, etc.?

To my short list of poster labels;

“Brilliant”, Boring, Moderate, and Moronic,

(where the “Brilliant” people self-identify by calling others moronic)

you’ve highlighted the need for another;

Inspired. (as differentiated from “Brilliant” by their humility).

OK, and an extra bone for Mr. Chips…

Carpe Dinero

RR

Since this is a linear model, the estimates of the parameters will be unbiased. This is easy to see from the math involved. It can be demonstrated to non-math people by doing examples with “noiseless” data such as was used in the earlier post, Comparing Single and Monthly Offsets, which showed that the single offset method used by Tamino was biased if the relationship between stations was not constant throughout the year.

Your list of poster labels brings to mind the self-description of a football player some time ago as “an ordinary superstar”. Perhaps a “simple genius” (not really an adjective) could replace “brilliant”.

Never name a dog “Mister” – it gives him an attitude. 😉

Dear Sir;

At some point, I wonder if you would be so kind as to discuss the relative statistical (im)propriety of ‘homogenizing’ data sets which are admittedly non-homogeneous.

Is ‘homogenization’ a sanctioned process in your field?

Are there validity metrics for such numerological ‘adjustment’ processes as practiced by Menne et al. in the creation of USHCNv2?

Herein you seem to combine directly without ‘homogenizing’ anything.

Perhaps it is my imagination that you eschew such machinations, but I see none.

RR

That’s great. It will really help to get the “adjustments” made for station changes off the table.

Hmm.

If a station goes through a step change… from an equipment change, say… can it just be handled as a new station?

Thanks Roman, Your answer is as long as your post, it was very clear and everything makes sense at this point. I’ve still not visited Open Mind– maybe later today.


Roman,

A bit off topic but I have been thinking about the uncertainty caused by empty grid cells. There are two issues with the data we have:

1) Large regions will have no coverage.

2) Anomalies have coherence over large areas (as seen in the satellite data).

I am thinking of a method that would estimate an absolute min/max:

1) For each latitude band, determine the min/max anomaly.

2) Use the max/min for each empty cell.

3) Calculate the gridded average with the min/max

4) Plot the average + min + max together.

This method would be used instead of extrapolation which GISS likes to do.

Does this method make any sense?

Yes, this is not really related to what we are talking about here and I have not put any thought into the general topics of lack of spatial coverage or the combined analysis of average, max and min simultaneously.

With regard to coverage, one would have to examine the relationship between stations at various distances as well as the effect of the local geography, a much more complex situation than what I wish to look at at the moment. However, uncertainty due to missing values is taken into account by the methods here.

Pingback: Improved Gridded Temperature Calculation « the Air Vent

Pingbacks! I typed the wrong title and updated it immediately. Give a monkey a keyboard.

It should say – Improved Offset Temperature Calculation

Pingback: Improved Offset Temperature Calculation « the Air Vent

“With regard to coverage, one would have to examine relationship between stations at various distances as well as the effect of the local geography, a much more complex situation than what I wish to look at at the moment. ”

Roman, I have been looking at the variation of temperature anomaly trends for stations, grids and latitudes using the GHCN adjusted temperature set and the more I look, the more complicated it gets. I have not reached the end of the line yet (given my limited tool set), but I am starting to wonder what an average global or even regional temperature trend means.

It is obviously not a simple matter, and maybe that is why these are interesting problems. My approach is to start as simply as possible and use more complex tools as they become necessary. Some of these ideas (such as using geography) would require painstaking effort and I don’t know if I would want to dedicate that much time to it. Where can you find a good graduate student when you need one? 🙂

What is the reasoning behind multiplying by W(t)? It seems at first glance that you would want to divide by W(t). If at t=1 you had one station reporting, and thus one term to sum, and at t=2 you had two stations, thus 2 terms to sum, why would you want to scale those two sums by an additional factor of W(2)=2, roughly as if 4 stations were reporting?

If you divide by W(t), at least you would work in mean squared error at each t.

I assume that you are referring to the equation:

SS_{W} = Σ_{t} W(t) Σ_{i} δ_{i}(t) [x_{i}(t) − τ(t) − µ_{i}(m(t))]²

That is not really the way the equation started out. It began life as a sum of squares in which the offset observations were paired and differenced:

SS = Σ_{t} Σ_{i<j} δ_{i}(t) δ_{j}(t) [(x_{i}(t) − µ_{i}(m)) − (x_{j}(t) − µ_{j}(m))]²

Tamino suggested that subsequently the available offset values be simply averaged to estimate what I call the taus. I was able to show that the solution obtained by that procedure gave exactly the same results as minimizing the SS with the W(t)s in it. What I have indicated in this post is that the procedure is less than optimal for the problem. So, you are right in that those W’s are not good. In fact, as long as the variability of the “error” term is assumed to be the same for all observations, it appears to me that it is better to have no W’s at all.

I see. That all makes sense to me. And I agree it would be better to leave out the W’s.

Not bad…………

Keep at it 🙂

The first,

Tim

I think I’ll go back to *clouds*, I got to say it clouded over during daylight then cleared at dark.

If it keeps this up we will need roman’s ability to tell us with math stats. how long we will be able to eat.

pumpkin shortage, and now 3X the price for tomatoes!

that’s all for now.

Hi Jeff & Roman,

“….from the equations above, it can be seen that the flaw is that individual temperatures from a station are given less weight in the estimation process when there are other stations which are not reporting.”

“….one would have to examine relationship between stations at various distances as well as the effect of the local geography, a much more complex situation than what I wish to look at at the moment. However, uncertainty due to missing value is taken into account by the methods here.”

In comparing stations at various distances (with weighting, usually) one likes to know that there is connectivity, so that the data from one station have some predictive power for another.

The problem has often arisen in estimation of mining grades and it led to a whole field of geostatistics. I won’t develop it here because you are probably well ahead of me. It can be 2-D or 3-D hence has potential for altitude complications.

If, however, you have not looked at the possibilities, I can suggest –

Introductory http://www.gaa.org.au/pdf/bok%20keynote.pdf

Not behind paywall, medium math complexity http://people.ku.edu/~gbohling/cpe940/Variograms.pdf

Quite severe criticism of geostats in several papers within http://www.geostatscam.com

Plenty of texts behind paywalls. The techniques are extensive and one needs to focus on the particular problem to get to a possible solution quickly.

I’m trying to suggest ways round known problems like having a station dominate calculations simply because it is complete for a long time and so is a pseudo reference. Also, I cannot come to grips with linkage extending over 1,000 km or so. The semivariogram can estimate the useful linkage distance.

Disclaimer: I am not proficient in statistics except for the very basics. i.e. industrial QC.

I like the idea of zillions of little grid cells. A fractal map if you will. That would avoid the necessity of smearing the data except for missing items. Of course as stations are added and subtracted from the record your map will change, but I think it would more accurately reflect reality than oceans translated to mountains.

So let me say this about that: the idea of fixed grid cells for temperature data is fundamentally wrong. Or at least get your fractal map and then project your grid cells on it after your fractal map is produced. Is that what you are doing?

I will say that coding such a device is probably a daunting exercise.

And of course another question is: is a linear function the “right” way to infill data? I would give much more weight to nearby stations. And my reason? Topography.

“Linear” actually means “additive” here. The procedure does not assume that there is any sort of specific linear relationship governing the actual temperatures.

What is actually assumed is that there is some sort of “average” temperature for the group of stations. Each station differs from that average by a fixed amount for each month; so, for example, station A might be 4 degrees higher in all Januarys, 3.6 in all Februarys, etc. All of these “offsets” are estimated from the data for each station. Then there is a third (random) quantity which takes into account measurement error, local weather and other unpredictable effects. These effects are combined by addition rather than by multiplication – hence the term additive model (where linear mathematical methods can be applied), as opposed to, say, a multiplicative model. A multiplicative model would not be appropriate for this type of data.
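A toy numeric sketch of that additive structure (in Python, with invented values – this is only an illustration, not the post’s script): a regional value, a fixed monthly station offset, and noise combine by addition, so a station’s average deviation from the regional value estimates its offset:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical regional January values over 30 years, plus a station
# that runs a fixed 4 degrees warmer every January, plus random noise.
tau_jan = 5.0 + rng.normal(scale=1.0, size=30)
offset_A_jan = 4.0
x_A_jan = tau_jan + offset_A_jan + rng.normal(scale=0.5, size=30)

# Because the pieces combine by addition, the station's mean deviation
# from the regional value is an unbiased estimate of its January offset.
est_offset = (x_A_jan - tau_jan).mean()
print(abs(est_offset - offset_A_jan) < 0.5)
```

In the real problem the taus are not known in advance, of course; the regression estimates the taus and offsets simultaneously.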

It should also be understood that we are not infilling any data. Rather, we are adjusting the result by taking into account how much each of the missing values typically differs from the ones we did measure in that given month and how much effect that might have on the situation. It could be used as a simple infilling tool if so desired. The whole idea is just a mathematical application of a simple common sense idea.

What is the adjustment weighted by? 1/distance? No?

I would weight it as 1/distance^2 or some such.

I don’t trust mountains to be representative of seashore.

Now if there were no mountains and no oceans….

I checked the math for myself. Very interesting how the relation between the two sums of squares works out like that, I certainly couldn’t see it without a bit of calculus. I’m a little surprised that Tamino wasn’t interested in the result, and deleted your comment. One could argue that your analysis provides some better justification for his method. Sure the implicit weighting in his approach is a little odd, but is that such a big deal?

I coded your approach in Matlab, using a Lagrange multiplier for the constraint rather than taking a pseudo-inverse. This works well, and is probably faster since Gaussian elimination can be used. I thought you might be interested if speed is a concern.

I was wondering about an alternative for the constraint. Using uniform weights, constrain the tau’s to add up to zero rather than the mu’s. You could then interpret the tau’s as being “regional anomalies” and the mu’s as baseline station temperatures. Any thoughts?

In any case, very nice work!

Nice to see somebody else interested in the math. The relationship is not obvious at a surface-level glance, but I had the advantage of experience in knowing that the variance of a set of values can be calculated by looking at the sum of squares of all pairwise differences of observations in a sample – a form very similar to the SS used by Tamino. I see Tamino’s SS as providing insight into the more appropriate anova model starting point which, unlike Tamino’s approach, spells out all of the parameters and how they relate to each other.

The “implicit weighting” is not a good thing. There is no good statistical (or common sense) reason to give less weight to equally valid measurements just because some other measurements are unavailable. The residuals do not add to zero, either for each series separately or for all combined, because of the weighting. His method is correctable by using a weight equal to 1/W(t) (where W(t) = the number of observations available at time t, as defined above) applied to each term in the SS – a property which I have used to speed up the calculations involved by a great deal and will post up shortly. My use of a pseudo-inverse is already pretty fast because it uses the R function, svd, which I understand is programmed pretty efficiently and, more importantly ( 😉 ), requires less work on my part. I would be interested in seeing your code to see what you have done.
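The pairwise-differences fact mentioned above can be verified numerically. A small Python check (my own illustration, with arbitrary numbers) of the identity that connects a paired sum of squares to a centered one:

```python
import numpy as np
from itertools import combinations

# Identity behind the equivalence: for any values a_1..a_n,
#   sum over pairs i<j of (a_i - a_j)^2  ==  n * sum of (a_i - mean)^2
a = np.array([3.1, -0.4, 2.2, 5.0, 1.7])   # arbitrary sample
n = len(a)
pairwise = sum((ai - aj) ** 2 for ai, aj in combinations(a, 2))
centered = n * ((a - a.mean()) ** 2).sum()
print(np.isclose(pairwise, centered))
```

With a_i taken as the offset station values x_i − µ_i at a fixed time t, the right-hand side is W(t) times the centered sum of squares, which is where the W(t) factor in the equivalent SS comes from.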

With regard to constraining the tau’s to be zero (within each monthly calculation), it will certainly produce the same anomalies as one would get by anomalizing after the fact. There is a drawback that it would have to be recalculated anyway if one wanted the anomaly period to be directly comparable to some other series which might use a different time period. The offsets would also be shifted up and down on a monthwise basis and in most cases the visual information on how the various series compare across the year would become garbled. I personally prefer to keep the issue of anomalizing separate. Besides one of the things I am somewhat curious about is how much the “average global temperature” oscillates throughout the year due to the fact that the insolation varies reasonably substantially due to the earth’s orbital path. This information is unavailable from solely examining anomalies and requires knowing absolute temperatures, something most of the existing methods do not provide.

A second thought with regard to generating the “anomalies” directly: in the examples I have been working with, there are months when all of the series are missing and no estimate of tau is possible for that month. In those cases the anomalies are relative to the months where data are available, and hence the results may not be comparable to results from another combined series. Thus, it is better to calculate estimates which represent average temperatures and leave the anomalies until they may be needed for another purpose. So no, I don’t think it is a good idea, in general, to calculate anomalies at that early stage.

I didn’t mean to suggest that the implicit weights were a good thing, I agree they aren’t. As you point out, your SS and his SS differ only in that the weights differ by a factor of W(t). I meant just that a different least squares weighting scheme is not such a great difference conceptually.

You make good points about constraining the mu’s. I can see why it wouldn’t fit your needs.

As for using the svd, I think it depends on how big your matrix is. If it works for you, then I see nothing wrong with it. For really big systems, I wouldn’t use it, or Gaussian elimination either. A conjugate gradient method would likely be better.

In any case, I posted my bit of Matlab code here:

http://tinyurl.com/yae3arz

It would have to be called separately for each month, with the weight set to zero for missing data points.

I took a look at your Matlab function and I see that it utilizes the very useful “\” operator to solve the linear equations. I don’t think such an operator is available in R so the last time I was translating a Matlab function to make it work in R, I cobbled up the pseudoinverse function to replace it. It works quite well and covers the situations where entire rows and/or columns are 0’s.
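The psx.inv source itself is not reproduced in this thread, but an SVD-based pseudo-inverse of the kind described might look like the following (a Python/NumPy sketch; the function name and the tolerance handling are my own assumptions, not necessarily Roman’s R code):

```python
import numpy as np

def psx_inv(a, tol=1e-10):
    """Moore-Penrose pseudo-inverse computed from the SVD.

    Singular values below tol (relative to the largest) are treated as
    zero, so rank-deficient matrices -- including ones with entire rows
    or columns of zeros -- are handled gracefully.
    """
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cutoff = tol * s.max() if s.size else 0.0
    s_inv = np.array([1.0 / x if x > cutoff else 0.0 for x in s])
    return vt.T @ (s_inv[:, None] * u.T)

# Used like Matlab's "\" for least squares: b = psx_inv(X) @ y.
X = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])  # rank-deficient rows
print(np.allclose(psx_inv(X), np.linalg.pinv(X)))
```

For well-conditioned square systems a direct solver (R’s solve, Matlab’s \) is faster, but the pseudo-inverse covers the singular cases in one code path, which matches the trade-off described above.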

With regard to speed, the function that I posted here using the R regression function “lm” is quite slow because it does a lot more things using larger matrices. The re-write, which I will do a post on soon, works very quickly. The matrices for which svd is used are square, of size (number of years + number of stations) in each dimension. This rarely gets as high as 200 and the R svd processes these very quickly. Twelve repetitions of this will do a grid square.

I ran psx.inv on 200 x 200 matrices with random elements and typical times on my computer were all less than 0.5 seconds. Even a 500 x 500 matrix ran in less than 7 seconds, so I am not unhappy with it.

The R function solve() works similarly to Matlab’s \. In fact, for a non-special matrix, both call the LAPACK function DGESV, so they are just different wrappers for the same code.

I re-wrote psx in Matlab, and tested the speed. The \ method is faster than the svd method, but it only becomes significant for bigger matrices:

20 stations, 30 years: .02 vs .07 seconds

80 stations, 120 years: .04 vs .32 seconds

240 stations, 360 years: .34 vs 10.20 seconds

So for 200×200, it’s 8 times faster, but if that doesn’t happen very often… no big deal.

Hey, I had fun writing the program and it works pretty well. When doing the analysis by month, the matrices are not huge and the script moves along at a good pace – I don’t even have time to pour another cup of coffee (or a beer) between runs. 😉 .

Have you tried the new version? I am now working on streamlining the error bars so there is always something new to do even though I’m an old retired guy.

I’m new here, so don’t really know what’s going on. 🙂 I don’t even have R installed, nor do I know where to get temperature data to play with.

I do have a question for you about absolute temperature measurements though. Everywhere I look for data on the web it seems to be anomalies, not original temps. Why? I mean, if you want anomalies, it’s easy to get from the original temps, right? How hard is it to calculate an average over some period and subtract? It’s trivial. However, if only anomalies are put out on the web, then clearly you can’t get back the original data. Why remove information before posting data on the web?

Roman,

I don’t want to bash a point to death, but there are some interesting papers about satellite orbit accuracy and hence relation to sea level at

See especially

Goodness-of-Fit Tests for Sequential Orbit Determination

John H. Seago – Analytical Graphics, Inc.; David A. Vallado – Center for Space Standards and Innovation

Goodness-of-fit tests – sometimes called consistency tests – are useful for investigating the lack of optimality of an estimator. Statistical hypothesis tests using observation residuals are often suggested as goodness-of-fit tests for the orbit-determination problem; however, the most-commonly recommended diagnostic tests tend to be somewhat restricted in their scope and utility. In this paper, less-familiar statistical tests are discussed that potentially extend consistency testing into both the time- and frequency-domains, and also apply to observation residuals that are not necessarily spaced evenly with time. These supplemental tests are used to assess sequential-estimation results from actual satellite tracking data and simulated data. (PDF available, gives estimates of error).

The link you gave is a program listing for a meeting and not particularly informative about what their tests are all about. I looked around the web a bit and couldn’t find a copy of their presentation, but I found a paper written by the pair earlier. A cursory glance at that paper suggested to me that the statistical work was not particularly impressive due to the apparent multitude of assumptions required of the data.

You gotta stop these OT comments… 😉

Pingback: Faster Version for Combining Series « Statistics and Other Things

Hey, you should download R. When you have used Matlab, the transition to R is not so difficult. Three years ago I sneered at one of my colleagues who praised its virtues – now, I am one of R’s biggest fans.

Many of the temperature series are anomalies because they are mainly there for graphic presentation purposes. Oscillatory monthly series are difficult to make heads or tails of when looking at trends (although they contain more information about possible trends at different times of the year). You may find anomalies simple to calculate, but that is not necessarily a skill that some others may have. I agree with you that stripping information (or altering it in other ways for spurious “homogenization” purposes) is definitely not the way to go.

Actual temperatures are available from numerous places. Some blogs have pages like this one at Lucia’s Blackboard. Try checking it out to see what’s there. In this marvelous age, everything is available on the web.

Pingback: Thermal Hammer « the Air Vent

Pingback: Roman M’s anomaly combination incorporated into R « the Air Vent

Pingback: Combining Stations (Plan B) | Statistics and Other Things

Pingback: Sampling Control with Semivariogram Analysis and Estimation Uncertanity | Tabagus - The Web and Data Portal