Category Archives: Uncategorized

Temporary Comment

[This was posted as a comment on Lucia's Blackboard blog but disappeared down a borehole. It is repeated here but may disappear if the original comment shows up]

No, I don’t have any references because I did not look for any. However, I can give you some visual examples of this unequal variability property.

For a while now, I have been studying daily station series, and it seems quite clear that winter temperatures are more variable than summer ones and that this effect is very strongly accentuated as one moves toward the poles. Recently, BEST published their newly minted temperature series and, on their web page, produced a gridded equal-area cell construction of 5498 monthly land temperature series.

I took a time-truncated subset of all of these series starting in February 1956 (chosen because it was the earliest date from which all of the cells had values for each month) and continuing to the most recent values. For each series, the standard deviation of the temperature was calculated separately for each month.
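For readers who want to try this themselves, the per-month calculation is straightforward in R. A minimal sketch, using artificial data rather than the BEST cells (the function name is my own):

```r
# Per-month standard deviations of a monthly temperature series.
# 'x' is a ts object with frequency 12; cycle(x) labels each value by month.
monthly.sd = function(x) tapply(as.numeric(x), cycle(x), sd, na.rm = TRUE)

# Artificial illustration: month 1 made three times as variable as the rest,
# mimicking the winter/summer contrast discussed above.
set.seed(42)
sd.pattern = rep(c(3, rep(1, 11)), 50)
x = ts(rnorm(600, sd = sd.pattern), start = c(1956, 1), frequency = 12)
sds = monthly.sd(x)   # vector of 12 monthly SD estimates
```

Applied to each gridded series in turn, this gives the SD-by-month values plotted against latitude and longitude below.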

The first pair of plots shows the SD by latitude for the months of February and August. The red lines are references for the tropics.

The second pair is a plot versus longitude, with points from the northern and southern hemispheres indicated by color.

Admittedly, I have not detrended the anomalies first, but the climate variation for a cell should not be large enough to create such large differences in the SDs between months.

Unequal variability would create a fairly substantial stumbling block for finding changepoints in series.


Filed under Uncategorized

Wegman and the Ankle-Biters

Since the initial publication of the “hockey sticks” by Michael Mann and members of his self-described “team,” there has been controversy over the methodology used in the studies and the reliability of the results.  The subsequent history is well known.

Papers were published by Steve McIntyre and Ross McKitrick showing that the centering methods used in the Mann papers produced statistically biased results.  Questions were raised about the proxies used and it became evident that one could produce similar “temperature” series by utilizing certain types of artificially generated random series rather than physical proxies which purportedly were related to the temperatures.  This eventually led to a congressional hearing to establish whether the M & M criticism was valid.  Prof. Edward Wegman was commissioned to produce a report examining the various claims.

For the discussion which follows, it is important to understand the makeup of the contents of this report:

Executive summary (5 pages)

Introduction  (3 pages)

Background on Paleoclimate Temperature Reconstruction (13 pages)

-  Includes Paleo Info, PCs, Social networks

Literature Review of Global Climate Change Research (5 pages)

Reconstructions And Exploration Of Principal Component Methodologies (10 pages)

Social Network Analysis Of Authorships In Temperature Reconstructions (10 pages)

Findings (3 pages)

Conclusions and Recommendations (2 pages)

Bibliography (7 pages)

Appendix (33 pages)

- Includes PCA math, Summaries of papers and Sundry

The three topics which have been most discussed on the web include the background explanatory material on proxies (~6 pages), the analysis of the Mann applications of PCA (~16 pages) and the social network analysis (~15 pages which include 9 figures, each of which occupies a major portion of a page).

It should also be understood that these three topics are stand-alone.  The material discussed within a topic along with any results obtained is independent of that in each of the others. Therefore, criticism of some aspect of a particular topic will have no bearing on the accuracy or correctness of any portion of the other topics.

The material on social networks was later used in preparing a publication applying social networking procedures to relationships among authors in the paleoclimate community.  The resulting paper was published in the journal Computational Statistics and Data Analysis about a year later under the authorship of Yasmin H. Said, Edward J. Wegman, Walid K. Sharabati, and John T. Rigsby.

The original report has been a thorn in the side of advocates of global warming since it was presented.  Many attempts have been made to discredit the report, including accusations of plagiarism of parts of a work by R. S. Bradley, who made the rather extraordinary demand that the entire report be withdrawn even though, as I point out above, any such use of material from his work had no impact on other portions of the report.

The concerted effort by the ankle-biters to discredit Prof. Wegman continued, and it seems to have resulted in the withdrawal of the social networking paper due to charges that some material in the paper was not properly referenced.  The quality of the paper was further discussed in USA Today in an email interview with a “well-established expert in network analysis”, Kathleen Carley of Carnegie Mellon.  It is this latter news article that I wish to discuss.

Q: Would you have recommended publication of this paper if you were asked to review it for regular publication — not as an opinion piece — in a standard peer-reviewed network analysis journal?

A: No – I would have given it a major revision needed.

Over the past thirty or so years, there has been a move in the statistical community to present a greater number of papers illustrating applications of newer statistical techniques.  These are not papers intended to “move the science forward” as much as to inform other statisticians about the use of the techniques and often to highlight an innovative application of the methodology.  As such, they would certainly not be the type of paper submitted to a journal which specialized in the subject matter.  This sort of paper may sometimes be written by students with their supervisor’s cooperation.

Note the letter Prof. Wegman sent to the editor of the journal:

Yasmin Said and I along with student colleagues are submitting a manuscript entitled “Social Network Analysis of Author-Coauthor Relationships.” This was motivated in part by our experience with Congressional Testimony last summer. We introduce the idea of allegiance as a way of clustering these networks. We apply these methods to the coauthor social networks of some prominent scholars and distinguish several fundamentally different modes of co-authorship relations. We also speculate on how these might affect peer review.

The indication is clear that the paper is intended to present a simple application of the methodology to the wider statistical audience.

Q: (How would you assess the data in this study?)

Data[sic]: Compared to many journal articles in the network area the description of the data is quite poor. That is the way the data was collected, the total number of papers, the time span, the method used for selecting articles and so on is not well described.

I agree that the data is not described in depth in this paper.  However, a better description of the data was given in the earlier report in which it was initially used.  That report was referenced in the submitted paper, but it is quite possible that it was not read by Prof. Carley.

It should be noted that the authors had decided not to mention any names of the subjects whose author network was analyzed. This would reduce both the type and the amount of information that could be included without violating that anonymity.

Q: (So is what is said in the study wrong?)

A: Is what is said wrong? As an opinion piece – not really.

Is what is said new results? Not really. Perhaps the main “novelty claim” are the definitions of the 4 co-authorship styles. But they haven’t shown what fraction of the data these four account for.

As I mention above, this is an expository presentation of the use of the methodology, NOT an “opinion piece” as characterized by Prof. Carley.  No “new results” as such are needed.  Furthermore, she does not indicate that the results in the presentation could be incorrect in any way.

There was one other paragraph in the article which caught my attention:

Carley is a well-established expert in network analysis. She even taught the one-week course that one of Wegman’s students took before 2006, making the student the “most knowledgeable” person about such analyses on Wegman’s team, according to a note that Wegman sent to CSDA in March.

Social network methodology is not rocket science. The average statistician could become reasonably proficient in applying the methodology in a relatively short period of time. Understanding the methodology sufficiently to “advance the science” would indeed require considerably more study and time to develop the skills needed. It is unfortunate that the journalist exposed his uninformed biases with a negative comment such as this.

There have been questions about the short review period before the paper was accepted.  Dr. Azen indicated in the USA Today article that he personally reviewed and accepted the paper, not a surprise if you take into account my earlier comments about the expository nature of the paper.  However, it is unfortunate that he did not demand a more comprehensive bibliography since it is indeed much too sparse.  If he had done so, we would likely not even be discussing this subject.

Nonetheless, nobody has demonstrated that the science in the paper is faulty and, regardless of its demise, the fact stands that the original report and its conclusions regarding the flawed hockey stick cannot be impacted by this in any way.

But the ankle-biters keep yapping…  



The Two-and-One-Half PC Solution

In his RealClimate post, West Antarctica Still Warming 2, Eric Steig discusses some of the criticisms made by Ryan O’Donnell et al. (Improved methods for PCA-based reconstructions: case study using the Steig et al. 2009 Antarctic temperature reconstruction) of his 2009 Nature paper, Warming of the Antarctic ice-sheet surface since the 1957 International Geophysical Year.

Second, that in doing the analysis, we retain too few (just 3) EOF patterns. These are decompositions of the satellite field into its linearly independent spatial patterns. In general, the problem with retaining too many EOFs in this sort of calculation is that one’s ability to reconstruct high order spatial patterns is limited with a sparse data set, and in general it does not make sense to retain more than the first few EOFs. O’Donnell et al. show, however, that we could safely have retained at least 5 (and perhaps more) EOFs, and that this is likely to give a more complete picture.
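For context, “retaining k EOFs” amounts to a truncated SVD of the centered space-time data matrix. A minimal R sketch of the idea, using random data in place of the actual satellite field:

```r
# Truncated EOF reconstruction of a space-time field via SVD.
# Rows = time, columns = grid points; 'k' = number of EOFs retained.
eof.truncate = function(X, k) {
  Xc = scale(X, center = TRUE, scale = FALSE)  # remove column (spatial) means
  s = svd(Xc)
  recon = s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*%
          t(s$v[, 1:k, drop = FALSE])
  sweep(recon, 2, attr(Xc, "scaled:center"), "+")  # add the means back
}

set.seed(1)
X = matrix(rnorm(40 * 10), 40, 10)  # stand-in for the satellite field
X3 = eof.truncate(X, 3)             # field reconstructed from 3 EOFs only
```

Retaining more EOFs keeps more of the high-order spatial structure; with k equal to the full rank the original field is recovered exactly.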


Continue reading



EIV/TLS Regression – Why Use It?

Over the last month or two, I have been looking at the response by Schmidt, Mann, and Rutherford to McShane and Wyner’s paper on the hockey stick.  In the process, I took a closer look at the total least squares (errors-in-variables or EIV) regression procedure which is an integral part of the methodology used by the hockey team in their paleo reconstructions.  Some of what I found surprised me.

A brief explanation of the difference between ordinary least squares (OLS) and EIV is in order.  Some further information can be found on the Wikipedia Errors-in-variables and Total least squares pages.  We will first look at the case where there is a single predictor.
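As a concrete preview (a sketch with simulated data, not the reconstruction code): for a single predictor, OLS minimizes vertical distances to the line, while TLS minimizes perpendicular distances; the TLS slope can be obtained from the first principal component of the centered (x, y) pairs.

```r
# TLS (EIV) slope for a single predictor: the direction of the first
# principal component of the centered (x, y) data.
tls.slope = function(x, y) {
  v = svd(cbind(x - mean(x), y - mean(y)))$v
  v[2, 1] / v[1, 1]
}

set.seed(7)
x.true = rnorm(200)
y = 2 * x.true + rnorm(200, sd = 0.5)  # response with noise
x = x.true + rnorm(200, sd = 0.5)      # predictor observed with error

b.ols = coef(lm(y ~ x))[[2]]  # attenuated toward zero by the error in x
b.tls = tls.slope(x, y)       # orthogonal fit; less attenuated here
```

With equal error variances in x and y, as simulated here, the TLS slope sits near the true value of 2 while OLS is pulled toward zero; which method is appropriate depends on assumptions about the errors, which is exactly the issue examined below.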

Continue reading



GHCN Twins

There has been a flurry of activity during the last several months in the area of constructing global temperature series. Although a variety of methods were used, there seemed to be a fair amount of similarity in the results.

Some people have touted this as a “validation” of the work performed by the “professional” climate agencies which have been creating the data sets and applying their sometimes obscure manipulations to the recorded temperatures obtained from the various national meteorological organizations that collected the data. I, for one, do not find the general agreement too surprising since most of us have basically used the same initial data sets for our calculations. I decided to take a closer look at the GHCN data since many of the reconstructions seem to use it.

Continue reading



2010 Spring Arctic Sea Ice Extent

The decline in the sea ice extent in May and June of 2010 appeared to be extremely fast. According to NSIDC,

Arctic sea ice extent averaged 13.10 million square kilometers (5.06 million square miles) for the month of May, 500,000 square kilometers (193,000 square miles) below the 1979 to 2000 average. The rate of ice extent decline for the month was -68,000 square kilometers (-26,000 square miles) per day, almost 50% more than the average rate of -46,000 square kilometers (-18,000 square miles) per day. This rate of loss is the highest for the month of May during the satellite record.

However, later on the same page, they also state under Conditions in Context:

As we noted in our May post, several regions of the Arctic experienced a late-season spurt in ice growth. As a result, ice extent reached its seasonal maximum much later than average, and in turn the melt season began almost a month later than average. As ice began to decline in April, the rate was close to the average for that time of year.

In sharp contrast, ice extent declined rapidly during the month of May. Much of the ice loss occurred in the Bering Sea and the Sea of Okhotsk, indicating that the ice in these areas was thin and susceptible to melt. Many polynyas, areas of open water in the ice pack, opened up in the regions north of Alaska, in the Canadian Arctic Islands, and in the Kara and Barents and Laptev seas.

This latter observation that the seasonal maximum was reached later in the season and the melt season started later is important. Regardless of specific annual weather conditions, May and June are melt season months in the Arctic. Furthermore, if there is more ice available, then it stands to reason that more melting will take place. What might be a better way to look at the data than simply plotting the total extent?

From the JAXA site: [graph of total Arctic sea ice extent]

Why not graph the rate of change, as well? In particular, because a wider extent will naturally imply a higher areal melt under the same melting conditions, it makes sense to look at the daily percentage change.

To do this, I downloaded the JAXA daily ice data into R (from 2002 to the present). For convenience purposes, December 31 was deleted from both 2004 and 2008 to reduce the number of days to 365. The percentage change was calculated for each day for which the corresponding data was available. No infilling was done for missing data. The data was plotted:


Here, all of the years prior to 2010 are plotted in gray and the current year in red. The plot gives graphic insight into the patterns of thawing and freezing: the thaw season goes from roughly mid-March to mid-September. The very high variability in October is likely due to a fairly similar annual speed of recovery being expressed as a percentage of the quite varied September minima.

How does 2010 compare in May and June? For May, it is somewhat toward the lower part of the combined record, but I would not classify it as extreme in any way. June was definitely below the other recent years during three periods of several days each. What will July and August look like? I guess we will have to wait and see…

The R script follows:

#get latest JAXA extent data

iceurl = url("")
latest = read.csv(iceurl,header=F,na.strings="-9999")
colnames(latest) = c("month","day","year","ext")

#remove Dec 31, 2004 and 2008 (extra leap year day) for convenience
#fill in with missing values for early part of 2002 (for convenience)

arc.ext = latest$ext
which((latest$month==12)&(latest$day==31)) # 214 579 945 1310 1675 2040 2406 2771 3136
arc.ext = arc.ext[-c(945,2406)]
arc.ext = c(rep(NA,365-214),arc.ext)

#length(arc.ext)/365 # 9

#calculate changes as % of current value
#form matrix with 9 columns (one for each year)

pct.change = matrix(100*c(diff(arc.ext),NA)/arc.ext,ncol=9)

#plot data
#years 2002 to 2009 as gray background
#year 2010 in red
#add month boundaries
modays = c(31,28,31,30,31,30,31,31,30,31,30,31)

matplot(pct.change[,1:9],type="l",main ="Arctic Ice Extent Change Relative to Area",xlab="Day",
ylab="Daily % Change", col=c(rep("grey",8),"red"),lty=1)
abline(v=c(0,cumsum(modays)), col="green")
text(x = 14+c(0,cumsum(modays)[-12]), y = c(rep(3,9),rep(-1,3)), labels =, col="blue")



Temporary Absence

I will be away from home (carrying out a study of the effect of Seasonal Local Warming on golf courses of the Dominican Republic) for a week so I will likely not reply to any comments during that time.  It’s a tough job, but somebody has to do it…



Will the Real Rapid City Please Stand Up!

This is a post about the quality of temperature data.  You can spend a lot of time to find methods to maximize the information you squeeze out of data, but unless the data itself is reliable, all of the effort is wasted.  Recently, I ran across an example which I found somewhat disconcerting.

I had been testing some methods for estimating the temperature at particular locations in a geographic grid cell from the temperature data set released by the Met Office.  The grid cell was chosen on the basis that there was a reasonable collection of stations available for use in the procedure: 40 – 45 N by 100 – 105 W in the north central region of the United States.  I chose a station with a longer, fairly complete record, and my intent was to look at distance-based weighting for estimating the temperature at that station site using the neighboring stations.  Then I could compare the actual measured temperature to the estimated temperature to evaluate how well I had done.  But my results seemed poorer than I had expected.  At that point, I thought that perhaps I should look more closely at the station record.
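The sort of distance-based weighting I was testing can be sketched as follows. This is plain inverse-distance weighting with made-up numbers, not the exact procedure or data I used:

```r
# Inverse-distance-weighted estimate of a target station's temperature
# from its neighbours. 'dist' in km, 'temps' the neighbours' values;
# 'p' controls how quickly the weight drops off with distance.
idw.est = function(temps, dist, p = 2) {
  w = 1 / dist^p
  sum(w * temps, na.rm = TRUE) / sum(w[!], na.rm = TRUE)
}

# Hypothetical example: three neighbours at 50, 120, and 200 km
est = idw.est(temps = c(10.2, 11.5, 9.8), dist = c(50, 120, 200))
```

The nearest neighbour dominates the estimate, which is what makes the quality of the nearby station records so critical.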

The station I had chosen was Rapid City, South Dakota – ID number 726620 – with a current population close to 60,000 people according to Wikipedia.  For comparison purposes, I collected the same station’s records from a variety of other sources: GISS “raw” (same as “combined”) and homogenized directly from the Gistemp web pages, GHCN “raw” and adjusted from the v2 data set, and what was listed as the same two GHCN records from the Climate Explorer web site.  The subsequent analysis proved quite interesting.

Continue reading



Anomaly Regression – Do It Right!

I have been meaning to do this post for several years, but until now I have not found a particularly relevant time to do it.  In his recent post at the Air Vent, Jeff Id makes the following statement:

Think about that.  We’re re-aligning the anomaly series with each other to remove the steps.  If we use raw data  (assuming up-sloping data), the steps in this case were positive with respect to trend, sometimes the steps can be negative.  If we use anomaly alone (assuming up-sloping data), the steps from added and removed series are always toward a reduction in actual trend.  It’s an odd concept, but the key is that they are NOT TRUE trend as the true trend, in this simple case, is of course 0.12C/Decade.

The actual situation is deeper than Jeff thinks.  The usual method used by climate scientists for doing monthly anomaly regression is wrong! Before you say, “Whoa! How can a consensus be wrong?”, let me first give an example which I will follow up with the math to show you what the problem is.
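To fix ideas before the math, here is a sketch (with artificial data) of the two approaches being compared: the usual method, which subtracts each month’s mean and then regresses the anomalies on time, versus estimating the trend and the twelve monthly offsets simultaneously in one regression:

```r
# Artificial monthly series: seasonal cycle plus a trend of 0.012 C/year
# (0.12 C/decade), with some values removed to unbalance the months.
set.seed(3)
n = 240
tt = (1:n) / 12                              # time in years
month = factor(rep(1:12, length.out = n))
y = 2 * sin(2 * pi * (1:n) / 12) + 0.012 * tt + rnorm(n, sd = 0.3)
y[sample(n, 30)] = NA

# Usual method: subtract each month's own mean, then regress anomalies on time
anom = y - ave(y, month, FUN = function(z) mean(z, na.rm = TRUE))
slope.anom = coef(lm(anom ~ tt))[[2]]

# Simultaneous fit: trend plus twelve monthly offsets in a single regression
slope.full = coef(lm(y ~ 0 + month + tt))[["tt"]]
```

With balanced data the two give identical slopes; the difference, and the problem with the usual method, shows up when observations are missing unevenly across months, as the math below demonstrates.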

Continue reading



Faster Version for Combining Series

I have written a faster version for combining temperature time series (allowing for weights) in R.  I had hoped to post it along with an example using it, but I got sidetracked so the example is not completed yet.  However, I am posting the new function so that Jeff can use it in his latest efforts.  This version runs a LOT faster and more efficiently than the one posted earlier.

As well, I have found a “bug” in that version which causes the script to fail when any series is missing all of the values for some month.  When the data is run in the newer version, you get results for all months, but I don’t think the results are necessarily realistic for the month in question.  There is no way to infer what the values for that month might look like for that station without making further assumptions, so, at the moment, the best bet is to remove the offending series from the analysis and run the program without it.  I have included a short program to identify possible problem series.

Anyway, here is the updated version.  I might get the example done tomorrow if the nice weather we are currently having goes away. ;)

####Function for combining series
# For even faster calculation, use all=F
# to speed up multiple grid calculations

temp.combine = function(tsdat, wts=NULL, all=T) {
##### version 2.0
### subfunction to calculate a pseudoinverse via SVD
psx.inv = function(mat, tol=NULL) {
 if (NCOL(mat)==1) return(mat/sum(mat^2))
 msvd = svd(mat)
 dind = msvd$d
 if (is.null(tol)) {tol = max(NROW(mat),NCOL(mat))*max(dind)*.Machine$double.eps}
 dind[dind<tol] = 0
 dind[dind>0] = 1/dind[dind>0]
 inv = msvd$v %*% diag(dind, length(dind)) %*% t(msvd$u)
 inv }
### subfunction to calculate the offsets for a single month
calcx.offset = function(tdat, wts) {
## new version
 nr = length(wts)
 delt.mat = !
 delt.vec = rowSums(delt.mat)
 row.miss = (delt.vec==0)
 delt2 = delt.mat/(delt.vec+row.miss)
 co.mat = diag(colSums(delt.mat)) - (t(delt.mat) %*% delt2)
 co.vec = colSums(delt.mat*tdat,na.rm=T) - colSums(rowSums(delt.mat*tdat,na.rm=T)*delt2)
 co.mat[nr,] = wts
 co.vec[nr] = 0   # constrain the weighted offsets to sum to zero
 psx.inv(co.mat) %*% co.vec }
### main routine
 nr = nrow(tsdat)
 nc = ncol(tsdat)
 dims = dim(tsdat)
 if (is.null(wts)) wts = rep(1,nc)
 off.mat = matrix(NA,12,nc)
 dat.tsp = tsp(tsdat)
 for (i in 1:12) off.mat[i,] = calcx.offset(window(tsdat,start=c(dat.tsp[1],i), deltat=1), wts)
 colnames(off.mat) = colnames(tsdat)
 rownames(off.mat) =
 matoff = matrix(NA,nr,nc)
 for (i in 1:nc) matoff[,i] = rep(off.mat[,i],length=nr)
 temp = rowMeans(tsdat-matoff,na.rm=T)
 if (all==T) { pred = c(temp) + matoff
  residual = tsdat - pred } else { pred = NULL
  residual = NULL }
 list(temps = ts(temp,start=c(dat.tsp[1],1),freq=12), pred = pred, residual = residual, offsets = off.mat) }

#pick out those series which have at least nn + 1 observations in every month
#outputs a logical vector with TRUE indicating that the series is OK
dat.check = function(tsdat, nn=0) { good = rep(NA,ncol(tsdat))
 for (i in 1:ncol(tsdat)) good[i] = (min(rowSums(![,i],nrow=12)))) > nn)
 good }

