<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Statistics and Other Things</title>
	<atom:link href="http://statpad.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://statpad.wordpress.com</link>
	<description></description>
	<lastBuildDate>Wed, 15 May 2013 19:42:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='statpad.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Statistics and Other Things</title>
		<link>http://statpad.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://statpad.wordpress.com/osd.xml" title="Statistics and Other Things" />
	<atom:link rel='hub' href='http://statpad.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Temporary Comment</title>
		<link>http://statpad.wordpress.com/2013/01/28/594/</link>
		<comments>http://statpad.wordpress.com/2013/01/28/594/#comments</comments>
		<pubDate>Mon, 28 Jan 2013 20:02:10 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=594</guid>
		<description><![CDATA[[This was posted as a comment on Lucia's Blackboard blog but disappeared down a borehole. It is repeated here but may disappear if the original comment shows up] No, I don&#8217;t have any references because I did not look for &#8230; <a href="http://statpad.wordpress.com/2013/01/28/594/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>[This was posted as a comment on Lucia's <em>Blackboard</em> blog but disappeared down a borehole.  It is repeated here but may disappear if the original comment shows up] </p>
<p>No, I don&#8217;t have any references because I did not look for any.  However, I can give you some visual examples of this unequal variability property.</p>
<p>For a while now, I have been studying daily station series and it seems quite clear that winter temperatures are more variable than summer ones and that this effect is very strongly accentuated as one moves toward the poles.  Recently BEST published their newly minted temperature series and, on their web page, they provided a gridded equal-area cell construction of 5498 monthly land temperature series.</p>
<p>I took a time-truncated subset of all of these series starting in February, 1956 (the date was chosen because it was the earliest date from which all of the cells had values for each month) and continuing to the most recent values.  For each series, the standard deviation of the temperature was calculated separately for each month.</p>
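<p>The per-month calculation described above can be sketched as follows.  This is only an illustration in Python, not the code used to make the plots: the function name <code>monthly_sds</code> and the (year, month, temperature) record layout are assumptions, and the numbers are made up.</p>

```python
from collections import defaultdict
from statistics import stdev


def monthly_sds(series):
    """Return {month: sample SD} for a list of (year, month, temperature) records."""
    by_month = defaultdict(list)
    for _year, month, temp in series:
        by_month[month].append(temp)
    # The sample SD (n - 1 denominator) needs at least two values per month.
    return {m: stdev(v) for m, v in sorted(by_month.items()) if len(v) >= 2}


# Made-up values for one hypothetical cell: February varies more than August.
cell = [
    (1956, 2, -12.0), (1957, 2, -6.0), (1958, 2, -15.0), (1959, 2, -9.0),
    (1956, 8, 14.0), (1957, 8, 15.0), (1958, 8, 14.5), (1959, 8, 15.5),
]
sds = monthly_sds(cell)
print(sds)
```

<p>Applying the same function to every cell and plotting the February and August values against latitude would give plots of the kind shown below.</p>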
<p>The first pair of plots shows the SD by latitude for the months of February and August.  The red lines are reference lines for the tropics.</p>
<p><img src="http://climateaudit.files.wordpress.com/2013/01/lat_sds.jpeg?w=500" alt="" /></p>
<p>The second pair plots the SD versus longitude, with points from the northern and southern hemispheres indicated by color.</p>
<p><img src="http://climateaudit.files.wordpress.com/2013/01/long_sds.jpeg?w=500" alt="" /></p>
<p>Admittedly, I have not detrended the anomalies first, but the climate variation for a cell should not be large enough to create such large differences in the SDs between months.</p>
<p>Unequal variability would create a fairly substantial stumbling block for finding changepoints in series.</p>
]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2013/01/28/594/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>

		<media:content url="http://climateaudit.files.wordpress.com/2013/01/lat_sds.jpeg" medium="image" />

		<media:content url="http://climateaudit.files.wordpress.com/2013/01/long_sds.jpeg" medium="image" />
	</item>
		<item>
		<title>Wegman and the Ankle-Biters</title>
		<link>http://statpad.wordpress.com/2011/05/21/wegman-and-the-ankle-biters/</link>
		<comments>http://statpad.wordpress.com/2011/05/21/wegman-and-the-ankle-biters/#comments</comments>
		<pubDate>Sat, 21 May 2011 19:56:09 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=498</guid>
		<description><![CDATA[Since the initial publication of the &#8220;hockey sticks&#8221; by Michael Mann and members of his self-described &#8220;team,&#8221; there has been controversy over the methodology used in the studies and the reliability of the results.  The subsequent history is well known. Papers &#8230; <a href="http://statpad.wordpress.com/2011/05/21/wegman-and-the-ankle-biters/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Since the initial publication of the &#8220;hockey sticks&#8221; by Michael Mann and members of his self-described &#8220;team,&#8221; there has been controversy over the methodology used in the studies and the reliability of the results.  The subsequent history is well known.</p>
<p>Papers were published by Steve McIntyre and Ross McKitrick showing that the centering methods used in the Mann papers produced statistically biased results.  Questions were raised about the proxies used and it became evident that one could produce similar &#8220;temperature&#8221; series by utilizing certain types of artificially generated random series rather than physical proxies which purportedly were related to the temperatures.  This eventually led to a congressional hearing to establish whether the M &amp; M criticism was valid.  Prof. Edward Wegman was commissioned to produce a <a href="http://www.uoguelph.ca/~rmckitri/research/WegmanReport.pdf">report</a> examining the various claims.</p>
<p>For further discussion, it is important to understand the makeup of this report:</p>
<blockquote><p>Executive summary (5 pages)</p>
<p>Introduction  (3 pages)</p>
<p>Background on Paleoclimate Temperature Reconstruction (13 pages)</p>
<p>-  Includes Paleo Info, PCs, Social networks</p>
<p>Literature Review of Global Climate Change Research (5 pages)</p>
<p>Reconstructions And Exploration Of Principal Component Methodologies (10 pages)</p>
<p>Social Network Analysis Of Authorships In Temperature Reconstructions (10 pages)</p>
<p>Findings (3 pages)</p>
<p>Conclusions and Recommendations (2 pages)</p>
<p>Bibliography (7 pages)</p>
<p>Appendix (33 pages)</p>
<p>- Includes PCA math, Summaries of papers and Sundry</p></blockquote>
<p>The three topics which have been most discussed on the web include the background explanatory material on proxies (~6 pages), the analysis of the Mann applications of PCA (~16 pages) and the social network analysis (~15 pages which include 9 figures, each of which occupies a major portion of a page).</p>
<p>It should also be understood that these three topics are stand-alone.  <em>The material discussed within a topic along with any results obtained is independent of that in each of the others.</em> Therefore, criticism of some aspect of a particular topic will have no bearing on the accuracy or correctness of any portion of the other topics.</p>
<p>The material on social networks was then used in preparing a publication applying social networking procedures to relationships among authors in the paleo-climate community.  The <a href="http://www.dean2016.com/wp-content/uploads/2011/05/wegman-retracted.pdf">resulting paper</a> was published in the journal <em>Computational Statistics and Data Analysis</em> about a year later under the authorship of Yasmin H. Said, Edward J. Wegman, Walid K. Sharabati, and John T. Rigsby.</p>
<p>The original report has been a thorn in the side of advocates of Global Warming since it was presented.  Many attempts have been made to discredit the report including accusations of plagiarism of parts of a work by R. S. Bradley who made the rather <em>extraordinary</em> demand that the entire report be withdrawn even though, as I point out above, any such use of the material from his work had no impact on other portions of the report.</p>
<p>The concerted effort by the <em>ankle-biters</em> to discredit Prof. Wegman continued until it seems to have resulted in the <a href="http://www.usatoday.com/weather/climate/globalwarming/2011-05-15-climate-study-plagiarism-Wegman_n.htm">withdrawal of the social networking paper</a> due to charges that some material in the paper was not properly referenced.  The quality of the paper was <a href="http://content.usatoday.com/communities/sciencefair/post/2011/05/retracted-climate-critics-study-panned-by-expert-/1">further discussed in USA Today</a> in an email interview with a “well-established expert in network analysis”, Kathleen Carley of Carnegie Mellon.  It is this latter news article that I wish to discuss.</p>
<blockquote><p>Q: Would you have recommended publication of this paper if you were asked to review it for regular publication &#8212; not as an opinion piece &#8212; in a standard peer-reviewed network analysis journal?</p>
<p>A: No &#8211; I would have given it a major revision needed.</p></blockquote>
<p>Over the past thirty or so years, there has been a move in the statistical community to present a greater number of papers illustrating applications of newer statistical techniques.  These are not papers intended to “move the science forward” as much as to inform other statisticians about the use of the techniques and often to highlight an innovative application of the methodology.  As such, they would certainly not be the type of paper submitted to a journal which specialized in the subject matter.  This sort of paper may sometimes be written by students with their supervisor&#8217;s cooperation.</p>
<p>Note the letter Prof. Wegman sent to the editor of the journal:</p>
<blockquote><p>Yasmin Said and I along with student colleagues are submitting a manuscript entitled “Social Network Analysis of Author-Coauthor Relationships.” This was motivated in part by our experience with Congressional Testimony last summer. We introduce the idea of allegiance as a way of clustering these networks. We apply these methods to the coauthor social networks of some prominent scholars and distinguish several fundamentally different modes of co-authorship relations. We also speculate on how these might affect peer review.</p></blockquote>
<p>The indication is clear that the paper is intended to present a simple application of the methodology to the wider statistical audience.</p>
<blockquote><p>Q: (How would you assess the data in this study?)</p>
<p>Data[sic]: Compared to many journal articles in the network area the description of the data is quite poor. That is the way the data was collected, the total number of papers, the time span, the method used for selecting articles and so on is not well described.</p></blockquote>
<p>I agree that the data is not described in depth in this paper.  However, a better description of the data was given in the earlier report in which it was initially used.  That report was referenced in the submitted paper, but it is quite possible that it was not read by Prof. Carley.</p>
<p>It should be noted that the authors had decided not to mention any names of the subjects whose author network was analyzed.  This would reduce both the type and the amount of information that could be included without violating that anonymity.</p>
<blockquote><p>Q: (So is what is said in the study wrong?)</p>
<p>A: Is what is said wrong? As an opinion piece &#8211; not really.</p>
<p>Is what is said new results? Not really. Perhaps the main &#8220;novelty claim&#8221; are the definitions of the 4 co-authorship styles. But they haven&#8217;t shown what fraction of the data these four account for.</p></blockquote>
<p>As I mention above, this is an expository presentation of the use of the methodology, NOT an “opinion piece” as characterized by Prof. Carley.  No “new results” as such are needed.  Furthermore, she does not indicate that the results in the presentation could be incorrect in any way.</p>
<p>There was one other paragraph in the article which caught my attention:</p>
<blockquote><p>Carley is a well-established expert in network analysis. She even taught the one-week course that one of Wegman&#8217;s students took before 2006, making the student the &#8220;most knowledgeable&#8221; person about such analyses on Wegman&#8217;s team, according to a note that Wegman sent to CSDA in March.</p></blockquote>
<p>Social network methodology is not rocket science. The average statistician could become reasonably proficient in applying the methodology in a relatively short period of time. Understanding the methodology sufficiently to “advance the science” would indeed require considerably more study and time to develop the skills needed. It is unfortunate that the journalist exposed his uninformed biases with a negative comment such as this.</p>
<p>There have been questions about the short review period before the paper was accepted.  Dr. Azen indicated in the USA Today article that he personally reviewed and accepted the paper, not a surprise if you take into account my earlier comments about the expository nature of the paper.  However, it <em>is</em> unfortunate that he did not demand a more comprehensive bibliography since it is indeed much too sparse.  If he had done so, we would likely not even be discussing this subject.</p>
<p>Nonetheless, nobody has demonstrated that the science in the paper is faulty and, regardless of its demise, the fact stands that the original report and its conclusions regarding the flawed <em>hockey stick</em> cannot be impacted by this in any way.</p>
<p><a href="http://deepclimate.org/2011/05/16/retraction-of-said-wegman-et-al-2008-part-2/">But the ankle-biters keep yapping…</a>  </p>
]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2011/05/21/wegman-and-the-ankle-biters/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>
	</item>
		<item>
		<title>The Two-and-One-Half PC Solution</title>
		<link>http://statpad.wordpress.com/2011/02/10/the-two-and-one-half-pc-solution/</link>
		<comments>http://statpad.wordpress.com/2011/02/10/the-two-and-one-half-pc-solution/#comments</comments>
		<pubDate>Thu, 10 Feb 2011 21:34:08 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=471</guid>
		<description><![CDATA[In his RealClimate post, West Antarctica Still Warming 2, Eric Steig discusses some of the criticisms made by Ryan O’Donnell et al. (Improved methods for PCA-based reconstructions: case study using the Steig et al. 2009 Antarctic temperature reconstruction) of his &#8230; <a href="http://statpad.wordpress.com/2011/02/10/the-two-and-one-half-pc-solution/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In his RealClimate post, <a href="http://www.realclimate.org/index.php/archives/2011/02/west-antarctica-still-warming-2/">West Antarctica Still Warming 2</a>, Eric Steig discusses some of the criticisms made by Ryan O’Donnell et al. (Improved methods for PCA-based reconstructions: case study using the Steig et al. 2009 Antarctic temperature reconstruction) of his 2009 Nature paper, Warming of the Antarctic ice-sheet surface since the 1957 International Geophysical Year.</p>
<blockquote><p>Second, that in doing the analysis, we retain too few (just 3) EOF patterns. These are decompositions of the satellite field into its linearly independent spatial patterns. In general, the problem with retaining too many EOFs in this sort of calculation is that one’s ability to reconstruct high order spatial patterns is limited with a sparse data set, and in general it does not makes sense to retain more than the first few EOFs. O’Donnell et al. show, however, that we could safely have retained at least 5 (and perhaps more) EOFs, and that this is likely to give a more complete picture.</p></blockquote>
<p>&nbsp;</p>
<p><span id="more-471"></span></p>
<p>Some background may be required for the non-statistically oriented reader.  A singular value decomposition allows one to take a set of numerical data sequences (usually arranged into a matrix array) and to decompose it into a new set of sequences (called Principal Components or Empirical Orthogonal Functions), a set of weights indicating the amount of variability of the original set accounted for by each of the PCs (called eigenvalues or singular values) and a set of coefficients which relate each PC to each of the original sequences.  To reconstruct the original sequences, one can take each PC, multiply it by its eigenvalue and then use the coefficients to reproduce the original data.</p>
<p>This decomposition has certain properties which can be very useful in understanding and analyzing the original data.  In particular, when there are strong relationships between the data sequences, several of the eigenvalues may be much larger than the rest.  Using only the PCs which belong to those eigenvalues can create a very good replica of the data, but with fewer “moving parts”.</p>
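<p>As a small illustration of the last point, the sketch below builds three made-up sequences that are nearly multiples of one common pattern and reconstructs them from a single PC.  To keep it dependency-free it finds only the leading singular triplet by power iteration rather than computing a full SVD; the function name <code>leading_triplet</code> and all of the numbers are assumptions made purely for the example.</p>

```python
import math


def leading_triplet(X, iters=200):
    """Power iteration for the largest singular triplet (u, s, v) of X.

    X is a list of rows (time steps); each column is one data sequence.
    """
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0] * n_cols  # arbitrary starting vector
    for _ in range(iters):
        # One step of power iteration on X^T X, followed by renormalisation.
        xv = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(X[i][j] * xv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    xv = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    s = math.sqrt(sum(c * c for c in xv))  # singular value
    u = [c / s for c in xv]                # the PC (unit length)
    return u, s, v


# Three strongly related sequences: a shared pattern plus a small wiggle.
pattern = [1.0, -2.0, 3.0, -1.0, 2.0, -3.0]
wiggle = [0.05, -0.03, 0.02, -0.04, 0.01, 0.03]
X = [[p + w, 2.0 * p - w, -0.5 * p + 0.5 * w] for p, w in zip(pattern, wiggle)]

u, s, v = leading_triplet(X)
# Replica built from one PC: each sequence is a multiple of the same u.
approx = [[s * u[i] * v[j] for j in range(3)] for i in range(6)]
total = sum(x * x for row in X for x in row)
resid = sum((X[i][j] - approx[i][j]) ** 2 for i in range(6) for j in range(3))
unexplained = resid / total
print(unexplained)
```

<p>The printed value is the fraction of the total sum of squares that the one-PC replica fails to reproduce; for data this strongly related it is very small.  Note also that every reconstructed sequence is just a multiple of the same PC, so a one-PC replica cannot distinguish the shapes of the sequences.</p>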
<p>In the Steig paper, the authors divided the Antarctic into 5509 grid cells.  They took a huge amount of satellite data and from it formed a monthly temperature sequence (from January 1982 to December 2006 – 300 months) for each of the grid cells.  The problem was to estimate the behavior of various regions of Antarctica during the longer period from January 1957 to the end of 2006 (a total of 600 months).  Since the data before the satellite era was sparse both geographically and temporally, it was decided to try to “extend” the satellite data to the earlier period by first relating it to the ground temperatures that were available and then using that relationship to guess what the satellite temperatures might have been prior to 1982.</p>
<p>This is a good idea, but, as always, the devil is in the details.  Using the totality of the available satellite sequences was unwieldy, both from a mathematical and statistical standpoint. This is where the decision to use a PC approach came in handy.  The satellite temperature sequences have a great deal of relationship within their structure.  For example, one would expect that geographically adjacent grid cells would have very similar behavior so it was apparent that this approach could reasonably produce something useful.</p>
<p>How many PCs should one use?  This is the specific disagreement mentioned in the above quote from the RC post.  Too many PCs mean a larger number of values to be estimated from the earlier station data (300 for each PC, which is where the “overfitting” claim arises).  On the other hand, too few PCs mean that the reconstruction will be unable to properly separate the spatial and temporal temperature patterns.  The temperatures from the peninsula can be “smeared” to West Antarctica, there could be no ability to make separate examinations of the temperatures during the various seasons, or any combination of these could occur.<br />
One could argue that the graph from O’Donnell et al. displayed in the RC post illustrates this difference:</p>
<p><a href="http://statpad.files.wordpress.com/2011/02/rc_ondonplot1.png"><img class="aligncenter size-full wp-image-477" title="RC_ondonplot1" src="http://statpad.files.wordpress.com/2011/02/rc_ondonplot1.png?w=500&#038;h=309" alt="" width="500" height="309" /></a></p>
<p>Too few PCs will produce a monotone colored plot:  With a single PC, every grid cell will have exactly the same characteristics since only a single multiplier and a single PC are available to reconstruct it.  With two PCs, only two coefficients are available to differentiate the entire cell sequence from all of the others, etc.  As more PCs are added, more variation in the coloring becomes possible.  Whether this represents a greater reality or not becomes an issue.</p>
<p>What method should be used to “establish the relationship” between the satellites and the ground and to extend the sequences to the pre-1982 era?  There are several of these that are available.  They have different properties and it is important to understand what the drawbacks of each can be.  The two mentioned in the RC post are TTLS (truncated total least squares &#8211; advocated by Mann and Steig) and ‘iridge’ (individual ridge regression – part of the methodology used by OD 2010).  Discussion of these is a complicated matter which is beyond the scope of this post.</p>
<p>The point of this post is to look at the specific 3 PCs which were used by the Steig paper.  One can download the Antarctic reconstruction from the paper’s <a href="http://faculty.washington.edu/steig/nature09data/data/ant_recon.txt">web site (warning – it is a VERY large text file)</a> and decompose it with the R svd function, which computes a singular value decomposition.  Since the means of the sequences are not all zero, the “PCs” are not orthogonal, but it does not impact the point being made here.  A plot of the 3 PCs used by Steig et al. (oriented so that all trends are positive) produces the following:</p>
<p><a href="http://statpad.files.wordpress.com/2011/02/steigpcs.jpeg"><img class="aligncenter size-full wp-image-478" title="steigpcs" src="http://statpad.files.wordpress.com/2011/02/steigpcs.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>The third PC looks somewhat different from the other two.  The (extended) portion prior to 1982 is pretty close to being identically zero.  Any reconstruction of the satellite data using these PCs becomes essentially a two PC reconstruction prior to 1982 and three PCs afterward.  The end effect of the third PC on the overall results is to put a bend in the trends at that point (upward if the coefficient is positive and down if negative).</p>
<p>However, can this reconstruction differentiate well between the Antarctic regions in the early temperature record?  I sincerely doubt it unless someone believes that the record is sufficiently homogeneous both spatially and temporally to justify that possibility.  Perhaps, the authors of Steig et al. could explain this in more detail &#8211; I presume that they would have seen the same graph when they were writing the paper.</p>
<p>Why did this occur?  My best guess is that it might be a result of <a href="http://statpad.wordpress.com/2010/12/19/eivtls-regression-why-use-it/">using the total least squares function</a> in the procedure.</p>
<p>I will give three more plots.  Each of these is a plot of the relative size of the grid cell coefficient for each of the PCs when combining them to create the three PC satellite series.  NOTE:  These are NOT the values of the trends.</p>
<p><a href="http://statpad.files.wordpress.com/2011/02/pccoef1.jpeg"><img class="aligncenter size-full wp-image-474" title="pccoef1" src="http://statpad.files.wordpress.com/2011/02/pccoef1.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>PC1 produces a general increasing trend throughout the continent.  This trend is somewhat more pronounced in the central area.</p>
<p><a href="http://statpad.files.wordpress.com/2011/02/pccoef2.jpeg"><img class="aligncenter size-full wp-image-475" title="pccoef2" src="http://statpad.files.wordpress.com/2011/02/pccoef2.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>PC2 is the main driver for the peninsula – West Antarctica relationship.</p>
<p><a href="http://statpad.files.wordpress.com/2011/02/pccoef3.jpeg"><img class="aligncenter size-full wp-image-476" title="pccoef3" src="http://statpad.files.wordpress.com/2011/02/pccoef3.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>The main effects of PC3 are felt after 1982 – it adds to the cooling in the lower right and the warming in the upper left.</p>
]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2011/02/10/the-two-and-one-half-pc-solution/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2011/02/rc_ondonplot1.png" medium="image">
			<media:title type="html">RC_ondonplot1</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2011/02/steigpcs.jpeg" medium="image">
			<media:title type="html">steigpcs</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2011/02/pccoef1.jpeg" medium="image">
			<media:title type="html">pccoef1</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2011/02/pccoef2.jpeg" medium="image">
			<media:title type="html">pccoef2</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2011/02/pccoef3.jpeg" medium="image">
			<media:title type="html">pccoef3</media:title>
		</media:content>
	</item>
		<item>
		<title>EIV/TLS Regression &#8211; Why Use It?</title>
		<link>http://statpad.wordpress.com/2010/12/19/eivtls-regression-why-use-it/</link>
		<comments>http://statpad.wordpress.com/2010/12/19/eivtls-regression-why-use-it/#comments</comments>
		<pubDate>Sun, 19 Dec 2010 17:19:55 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=440</guid>
		<description><![CDATA[Over the last month or two, I have been looking at the response by Schmidt, Mann, and Rutherford to McShane and Wyner’s paper on the hockey stick.  In the process, I took a closer look at the total least squares &#8230; <a href="http://statpad.wordpress.com/2010/12/19/eivtls-regression-why-use-it/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Over the last month or two, I have been looking at the <a href="http://www.e-publications.org/ims/submission/index.php/AOAS/user/submissionFile/8349?confirm=2d010a44">response by Schmidt, Mann, and Rutherford</a> to McShane and Wyner’s paper on the hockey stick.  In the process, I took a closer look at the total least squares (errors-in-variables or EIV) regression procedure which is an integral part of the methodology used by the hockey team in their paleo reconstructions.  Some of what I found surprised me.</p>
<p>A brief explanation of the difference between ordinary least squares (OLS) and EIV is in order.  Some further information can be found on the Wiki <a href="http://en.wikipedia.org/wiki/Errors-in-variables_models">Error-in-Variables</a> and <a href="http://en.wikipedia.org/wiki/Total_least_squares">Total least squares</a> pages.  We will first look at the case where there is a single predictor.</p>
<p><span id="more-440"></span> <strong>Univariate Case<br />
</strong></p>
<p>The OLS model for predicting a response Y from a predictor X through a linear relationship looks like:</p>
<p><img src='http://s0.wp.com/latex.php?latex=Y_k++%3D+%5Calpha++%2B+%5Cbeta+X_k++%2B+e_k+%2C%5Cquad+k+%3D+1%2C2%2C...%2Cn+&amp;bg=ffffff&amp;fg=333333&amp;s=1' alt='Y_k  = &#92;alpha  + &#92;beta X_k  + e_k ,&#92;quad k = 1,2,...,n ' title='Y_k  = &#92;alpha  + &#92;beta X_k  + e_k ,&#92;quad k = 1,2,...,n ' class='latex' /></p>
<p>α and β are the intercept and the slope of the relationship, e is the “random error” in the response variable due to sampling and/or other considerations and n is the sample size.  The model is fitted by choosing the linear coefficients which minimize the sum of the squared errors (which is also consistent with maximum likelihood estimation for a Normally distributed response):</p>
<p><img src='http://s0.wp.com/latex.php?latex=SS+%3D+%5Csum%5Climits_%7Bk+%3D+1%7D%5En+%7B%5Cleft%28+%7BY_k++-+%5Calpha++-+%5Cbeta+X_k+%7D+%5Cright%29%5E2+%7D+&amp;bg=ffffff&amp;fg=333333&amp;s=1' alt='SS = &#92;sum&#92;limits_{k = 1}^n {&#92;left( {Y_k  - &#92;alpha  - &#92;beta X_k } &#92;right)^2 } ' title='SS = &#92;sum&#92;limits_{k = 1}^n {&#92;left( {Y_k  - &#92;alpha  - &#92;beta X_k } &#92;right)^2 } ' class='latex' /></p>
<p>The problem is easily solved by using matrix algebra and estimation of uncertainties in the coefficients is relatively trivial.</p>
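<p>For the single-predictor case, the minimizing coefficients have a well-known closed form, which the following sketch computes.  The function names <code>ols_fit</code> and <code>sum_sq</code> and the data values are assumptions introduced only for this illustration.</p>

```python
def ols_fit(x, y):
    """Closed-form OLS estimates (alpha, beta) for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx                # slope
    alpha = y_bar - beta * x_bar    # intercept
    return alpha, beta


def sum_sq(x, y, alpha, beta):
    """The quantity SS above: the sum of squared vertical errors."""
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))


# Made-up data lying close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha, beta = ols_fit(x, y)
print(alpha, beta, sum_sq(x, y, alpha, beta))
```

<p>Moving either coefficient away from the fitted value can only increase SS, which is what makes these the least squares estimates.</p>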
<p>EIV regression attempts to solve the problem when there may also be “errors”, f, in the predictors themselves:</p>
<p><img src='http://s0.wp.com/latex.php?latex=Y_k++%3D+%5Calpha++%2B+%5Cbeta+%28X_k++%2B+f_k+%29+%2B+e_k+%2C%5Cquad+k+%3D+1%2C2%2C...%2Cn+&amp;bg=ffffff&amp;fg=333333&amp;s=1' alt='Y_k  = &#92;alpha  + &#92;beta (X_k  + f_k ) + e_k ,&#92;quad k = 1,2,...,n ' title='Y_k  = &#92;alpha  + &#92;beta (X_k  + f_k ) + e_k ,&#92;quad k = 1,2,...,n ' class='latex' /></p>
<p>The f-errors are usually assumed to be independent of the e-errors and the estimation of all the parameters is done by minimizing a somewhat different looking expression:</p>
<p><img src='http://s0.wp.com/latex.php?latex=SS+%3D+%5Csum%5Climits_%7Bk+%3D+1%7D%5En+%7B%5Cleft%28+%7BY_k++-+Y_k%5E%2A+%7D+%5Cright%29%5E2++%2B+%7D+%5Csum%5Climits_%7Bk+%3D+1%7D%5En+%7B%5Cleft%28+%7BX_k++-+X_k%5E%2A+%7D+%5Cright%29%5E2+%7D+&amp;bg=ffffff&amp;fg=333333&amp;s=1' alt='SS = &#92;sum&#92;limits_{k = 1}^n {&#92;left( {Y_k  - Y_k^* } &#92;right)^2  + } &#92;sum&#92;limits_{k = 1}^n {&#92;left( {X_k  - X_k^* } &#92;right)^2 } ' title='SS = &#92;sum&#92;limits_{k = 1}^n {&#92;left( {Y_k  - Y_k^* } &#92;right)^2  + } &#92;sum&#92;limits_{k = 1}^n {&#92;left( {X_k  - X_k^* } &#92;right)^2 } ' class='latex' /></p>
<p>under the condition</p>
<p><img src='http://s0.wp.com/latex.php?latex=Y_k%5E%2A++%3D+%5Calpha++%2B+%5Cbeta+X_k%5E%2A+%2C%5Cquad+k+%3D+1%2C2%2C...%2Cn+&amp;bg=ffffff&amp;fg=333333&amp;s=1' alt='Y_k^*  = &#92;alpha  + &#92;beta X_k^* ,&#92;quad k = 1,2,...,n ' title='Y_k^*  = &#92;alpha  + &#92;beta X_k^* ,&#92;quad k = 1,2,...,n ' class='latex' /></p>
<p>X<sup>*</sup> and Y<sup>*</sup> (often called scores) are the unknown actual values of X and Y. The minimization problem can be recognized as calculating the minimum total of the squared perpendicular distances from the data points to a line which contains the estimated scores. Mathematically, the problem of calculating the estimated coefficients of the line can be solved by using a principal components calculation on the data. It should be noted that the data (predictors and responses) should each be centered at zero beforehand.</p>
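<p>The principal components calculation can be sketched as follows. The function name <code>tls_fit</code> and the simulated data are hypothetical and are not taken from the script linked at the end of the post; the sketch also assumes that the fitted line is not vertical.</p>

```r
# Slope and intercept of the line minimizing the perpendicular squared distances
tls_fit = function(x, y) {
  xc = x - mean(x)                      # center both variables at zero
  yc = y - mean(y)
  v = prcomp(cbind(xc, yc), center=FALSE)$rotation[, 1]  # first principal direction
  slope = unname(v[2]/v[1])             # assumes v[1] is not zero
  c(intercept = mean(y) - slope*mean(x), slope = slope)
}

set.seed(1)
x = rnorm(100)
y = 3 - 2*x + rnorm(100)
tls_fit(x, y)
```

<p>The line passes through the point of means, and its direction is that of the first principal component of the centered data.</p>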
<p>The following graphs illustrate the difference in the two approaches:</p>
<p><a href="http://statpad.files.wordpress.com/2010/12/ols_eiv_plots.jpeg"><img class="aligncenter size-full wp-image-443" title="ols_eiv_plots" src="http://statpad.files.wordpress.com/2010/12/ols_eiv_plots.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>What could be better, you ask.   Well, all is not what it may seem at first glance.</p>
<p>First, you might have noticed that the orange lines connecting the data to the scores in the EIV plot are all parallel. The adept reader can see from considerations of similar triangles that the ratio of the estimated errors, e and f (the green lines plotted for one of the sample points), is a constant equal to minus one times the slope coefficient (or one over that coefficient, depending on which is the numerator term). The claim that somehow this regression properly takes into account the error uncertainty of the predictors seems spurious at best.</p>
<p>The second and considerably more important problematic feature is that, as the total least squares page of Wiki linked above states: “total least squares does not have the property of units-invariance (it is not scale invariant).” Simply put, if you rescale a variable (or express it in different units), you will NOT get the same result as for the unscaled case. Thus, if we are doing a paleo reconstruction and we decide to calibrate to temperature anomalies as F’s rather than C’s, we will end up with a different reconstruction. How much different will depend on the details of the data. However, the point is that we can get two different answers simply by using different units in our analysis. Since all sorts of rescaling can be done on the proxies, the end result is subject to the choices made.</p>
<p>To illustrate this point, we use the data from the example above.  The Y variable is multiplied by a scale factor ranging from .1 to 20.  The new slope is calculated and divided by the old EIV slope which has also been scaled by the same factor.</p>
<p><a href="http://statpad.files.wordpress.com/2010/12/univscaling.jpeg"><img class="aligncenter size-full wp-image-444" title="univscaling" src="http://statpad.files.wordpress.com/2010/12/univscaling.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>If the procedure were invariant under scaling (as OLS is), then the result should be equal to unity in all cases. Instead, one can see that for scale factors close to zero, the EIV behaves basically like OLS regression. As the scale factor increases, the result (after unscaling) looks like 1/the OLS slope with the X and Y variables switched.</p>
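<p>A rough sketch of this scaling experiment on simulated data follows. The data, the seed, the particular scale factors and the helper <code>tls_slope</code> are assumptions made for illustration; they are not the data or code behind the plot above.</p>

```r
set.seed(2)
x = rnorm(100)
y = 0.5*x + rnorm(100)

# EIV/TLS slope from the first principal component (prcomp centers by default)
tls_slope = function(x, y) {
  v = prcomp(cbind(x, y))$rotation[, 1]
  unname(v[2]/v[1])
}

scales = c(0.1, 0.5, 1, 2, 5, 20)
base = tls_slope(x, y)

# New slope divided by the old slope scaled by the same factor
ratio = sapply(scales, function(s) tls_slope(x, s*y)/(s*base))

ols_xy = unname(coef(lm(y ~ x))[2])     # OLS slope of Y on X
ols_yx = 1/unname(coef(lm(x ~ y))[2])   # 1/the OLS slope with X and Y switched

round(cbind(scale = scales, ratio = ratio), 3)
c(ols_over_eiv = ols_xy/base, inverse_ols_over_eiv = ols_yx/base)
```

<p>For a scale-invariant method every entry in the <code>ratio</code> column would be 1. Here the ratio for the smallest scale factor should sit near the first value printed on the last line and the ratio for the largest near the second.</p>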
<p>However, that is not the end of the story. What happens if both X and Y are each scaled to have standard deviation 1? This surprised me somewhat. The slope can only take the value +1 or -1 (except for some cases where the data form an exactly symmetric pattern for which ALL slopes produce exactly the same SS).</p>
<p>In effect, this would imply that, after unscaling, the EIV calculated slope = sd(Y) / sd(X), up to sign. To a statistician, this would be very disconcerting since this slope is not determined in any way, shape or form by any existing relationship between X and Y – this is the answer when the data points are in an exactly straight line or when they are uncorrelated. It is not affected by sample size so clearly large sample convergence results would not be applicable. On the other hand, the OLS slope = Corr(X,Y) * sd(Y) / sd(X) for the same case so that this criticism would not apply to that result.</p>
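<p>This is easy to check numerically. In the sketch below the simulated data and the helper <code>tls_slope</code> are illustrative assumptions; the sign of the standardized slope follows the sign of the sample correlation.</p>

```r
set.seed(3)
x = rnorm(200, sd=3)
y = 0.2*x + rnorm(200, sd=5)

# EIV/TLS slope from the first principal component (prcomp centers by default)
tls_slope = function(x, y) {
  v = prcomp(cbind(x, y))$rotation[, 1]
  unname(v[2]/v[1])
}

# Scale both variables to standard deviation 1 before fitting
zx = (x - mean(x))/sd(x)
zy = (y - mean(y))/sd(y)
std_slope = tls_slope(zx, zy)

c(standardized = std_slope,
  unscaled = std_slope*sd(y)/sd(x),
  sd_ratio = sd(y)/sd(x),
  ols = cor(x, y)*sd(y)/sd(x))
```

<p>With both variables at unit standard deviation, the covariance matrix has 1 on the diagonal and the correlation off it, so its eigenvectors are always proportional to (1, 1) and (1, -1) whatever the strength of the relationship.</p>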
<p><strong>Multilinear Case<br />
</strong></p>
<p>So far we have only dealt with the univariate case. Perhaps if there are more predictors, this would alleviate the problems we have seen here. All sorts of comparisons are possible, but to shorten the post, we will only look at the effect of rescaling all of the variables to unit variance.</p>
<p>Using R, we generate a sample of 5 predictors and a single response variable with 20000 values each.   The variables are generated “independently” (subject to the limits of a random number generator).  We calculate the slope coefficients for both the straight OLS regression and also for EIV/TLS:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top">Variable</td>
<td valign="top">OLS Reg</td>
<td valign="top">EIV-TLS</td>
</tr>
<tr>
<td valign="top">X1</td>
<td valign="top">0.005969581</td>
<td valign="top">1.9253757</td>
</tr>
<tr>
<td valign="top">X2</td>
<td valign="top">0.010657532</td>
<td valign="top">1.8661962</td>
</tr>
<tr>
<td valign="top">X3</td>
<td valign="top">-0.005656248</td>
<td valign="top">3.7607298</td>
</tr>
<tr>
<td valign="top">X4</td>
<td valign="top">-0.003537972</td>
<td valign="top">0.6509362</td>
</tr>
<tr>
<td valign="top">X5</td>
<td valign="top">0.003616522</td>
<td valign="top">4.4236177</td>
</tr>
</tbody>
</table>
<p>All of the theoretical coefficients are zero and, with 20000 observations, the estimates should not be far from that value. In fact, 95% confidence intervals for the OLS coefficients all contain the value 0. However, the EIV result is completely out to lunch. The response Y must be scaled down by about 20% to have all of the EIV coefficients become small enough to be inside the 95% CIs calculated by the OLS procedure.</p>
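<p>A sketch of this kind of comparison is given below. It is not the script that produced the table: the seed and the helper <code>tls_coef</code> are assumptions, so the numbers will differ. The TLS coefficients are taken from the right singular vector of the combined data matrix that has the smallest singular value, which is the standard way of solving the multi-predictor problem.</p>

```r
set.seed(4)
n = 20000
p = 5
X = scale(matrix(rnorm(n*p), n, p))   # independent predictors, centered, unit variance
y = as.numeric(scale(rnorm(n)))       # independent response, centered, unit variance

# OLS coefficients (no intercept is needed since everything is centered)
ols = unname(coef(lm(y ~ X - 1)))

# TLS coefficients from the singular vector with the smallest singular value
tls_coef = function(X, y) {
  p = ncol(X)
  v = svd(cbind(X, y))$v[, p + 1]
  -v[1:p]/v[p + 1]
}
tls = tls_coef(X, y)

round(cbind(OLS = ols, TLS = tls), 4)
```

<p>When the response is an exact linear combination of the predictors, <code>tls_coef</code> returns those coefficients, so the odd behaviour is a property of the criterion on noisy data and not of the arithmetic.</p>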
<p><strong>EIV on Simulated Proxy Data</strong></p>
<p>We give one more example of what the effect of applying EIV in the paleo environment can be.</p>
<p>As I mentioned earlier, I have been looking at the <a href="http://www.e-publications.org/ims/submission/index.php/AOAS/user/submissionFile/8349?confirm=2d010a44">response by Gavin and crew</a> to the M-W paper.   In their response, the authors use artificial proxy data to compare their EIV construct to other methods.   Two different climate models are used to generate a “temperature series” and proxies (which have auto-regressive errors) are provided.  I took the CSM model (time frame used 850 to 1980) with 59 proxy sequences as the data.  An EIV fit with these 59 predictors was carried out using the calibration period 1856 to 1980.  A simple reconstruction was calculated from these coefficients for the entire time range.</p>
<p>This reconstruction was done for each of the three cases:  (i) Temperature anomalies in C, (ii) Temperature anomalies in F, and (iii) Temperature anomalies scaled to unit variance during the  calibration period.  The following plots represent the difference in the resulting reconstructions:  (i) – (ii) and (i) – (iii):</p>
<p><a href="http://statpad.files.wordpress.com/2010/12/proxyrecons.jpeg"><img class="aligncenter size-full wp-image-445" title="proxyrecons" src="http://statpad.files.wordpress.com/2010/12/proxyrecons.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>The differences here are non-trivial. I realize that this is not a <em>reproduction</em> of the total method used by the Mann team. However, the EIV methodology is central to the current spate of their reconstructions so some effect must be there. How strong is it? I don’t know – maybe they can calculate the Fahrenheit version for us so we can all see it. Surely, you would think that they would be aware of all the features of a statistical method before deciding to use it. Maybe I missed their discussion of it.</p>
<p>A script for running the above analysis is available here (the file is labeled <em>.doc</em>, but it is a simple text file).  Save it and load into R directly: <a href="http://statpad.files.wordpress.com/2010/12/reivpost1.doc">Reivpost</a></p>
]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2010/12/19/eivtls-regression-why-use-it/feed/</wfw:commentRss>
		<slash:comments>27</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/12/ols_eiv_plots.jpeg" medium="image">
			<media:title type="html">ols_eiv_plots</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/12/univscaling.jpeg" medium="image">
			<media:title type="html">univscaling</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/12/proxyrecons.jpeg" medium="image">
			<media:title type="html">proxyrecons</media:title>
		</media:content>
	</item>
		<item>
		<title>GHCN Twins</title>
		<link>http://statpad.wordpress.com/2010/07/19/ghcn-twins/</link>
		<comments>http://statpad.wordpress.com/2010/07/19/ghcn-twins/#comments</comments>
		<pubDate>Mon, 19 Jul 2010 17:05:13 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=417</guid>
		<description><![CDATA[There has been a flurry of activity during the last several months in the area of constructing global temperature series. Although a variety of methods were used there seemed to be a fair amount of similarity in the results. Some &#8230; <a href="http://statpad.wordpress.com/2010/07/19/ghcn-twins/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=417&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://statpad.files.wordpress.com/2010/07/twins.jpg"><img class="alignleft size-full wp-image-426" title="Twins" src="http://statpad.files.wordpress.com/2010/07/twins.jpg?w=500" alt=""   /></a></p>
<p>There has been a flurry of activity during the last several months in the area of constructing global temperature series. Although a variety of methods were used, there seemed to be a fair amount of similarity in the results.</p>
<p>Some people have touted this as a “validation” of the work performed by the “professional” climate agencies which have been creating the data sets and working their sometimes obscure manipulation of the recorded temperatures obtained from the various national meteorological organizations that collected the data. I for one do not find the general agreement too surprising since most of us have basically used the same initial data sets for our calculations. I decided to take a closer look at the GHCN data since many of the reconstructions seem to use it.</p>
<p><span id="more-417"></span></p>
<p>At this point, I will look at some of the data which can be found in the <a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean.Z">GHCN v2.mean.Z</a> file. This represents monthly mean temperatures calculated for a collection of temperature stations around the globe. People tend to refer to these as “raw” data, but statistically this is not really the case. Each monthly record represents a value <em>calculated</em> from a larger set of data. The means can be calculated in different ways. Since there can be missing daily values due to equipment malfunctions and other reasons, decisions have been made and implemented on how to do the calculation in such cases. It is very possible that further &#8220;adjustments&#8221; may have also been made before the data reaches GHCN.</p>
<p>A description of the format of all the temperature datasets is given in the <a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.temperature.readme">readme file </a>at the same site. In particular, station series id formats are as follows:</p>
<blockquote><p>Each line of the data file has:</p>
<p>station number which has three parts:<br />
country code (3 digits)<br />
nearest WMO station number (5 digits)<br />
modifier (3 digits) (this is usually 000 if it is that WMO station)</p>
<p>Duplicate number:<br />
one digit (0-9). The duplicate order is based on length of data.<br />
Maximum and minimum temperature files have duplicate numbers but only one time series (because there is only one way to calculate the mean monthly maximum temperature). The duplicate numbers in max/min refer back to the mean temperature duplicate time series created by (Max+Min)/2.</p></blockquote>
<p>In this analysis, I will be referencing the station with a single id which is constructed from the station number by connecting it to the modifier with a period. The temperature data in the R program will consist of a list of (possibly multivariate) time series with each element of the list containing all of the “duplicates” for a particular station.</p>
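<p>One possible reading of that id construction is sketched below. It assumes that the identifier fields occupy the first 12 characters of each line in the order quoted above (3 + 5 + 3 + 1) and that the id joins the WMO number to the modifier. The function name and the sample lines are made up for illustration.</p>

```r
# Split the leading 12 characters of a data line into its id fields
parse_station_id = function(line) {
  country  = substr(line, 1, 3)     # country code (3 digits)
  wmo      = substr(line, 4, 8)     # nearest WMO station number (5 digits)
  modifier = substr(line, 9, 11)    # modifier (3 digits)
  dup      = as.integer(substr(line, 12, 12))   # duplicate number (1 digit)
  data.frame(country = country,
             station = paste(wmo, modifier, sep="."),
             dup = dup,
             stringsAsFactors = FALSE)
}

# Three made-up id strings: two duplicates of one station, one of another
ids = parse_station_id(c("123456780000", "123456780001", "123456780010"))
ids
table(table(ids$station))   # number of stations having 1, 2, ... duplicates
```

<p>Grouping the series by the <code>station</code> column then gives the list structure described above, with one element per station holding all of its duplicates.</p>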
<p>Some &#8220;quality control&#8221; has also been done by GHCN.  The earlier &#8220;readme&#8221; also explains the file <a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.mean.failed.qc.Z">v2.mean.failed.qc.Z</a>:</p>
<blockquote><p>Data that have failed Quality Control:<br />
We&#8217;ve run a Quality Control system on GHCN data and removed data points that we determined are probably erroneous. However, there are some cases where additional knowledge provides adequate justification for classifying some of these data as valid. For example, if an isolated station in 1880 was extremely cold in the month of March, we may have to classify it as suspect. However, a researcher with an 1880 newspaper article describing the first ever March snowfall in that area may use that special information to reclassify the extremely cold data point as good. Therefore, we are providing a file of the data points that our QC flagged as probably bad. We do not recommend that they be used without special scrutiny. And we ask that if you have corroborating evidence that any of the &#8220;bad&#8221; data points should be reclassified as good, please send us that information so we can make the appropriate changes in the GHCN data files. The data points that failed QC are in the files v2.m*.failed.qc. Each line in these files contains station number, duplicate number, year, month, and the value (again the value needs to be divided by 10 to get degrees C). A detailed description of GHCN&#8217;s Quality Control can be found through <a href="http://www.ncdc.noaa.gov/ghcn/ghcn.html" rel="nofollow">http://www.ncdc.noaa.gov/ghcn/ghcn.html</a>.</p></blockquote>
<p>I didn’t really find the “detailed description”, but a check of the file indicated that almost all of the entries in the file represented temperature values that had been removed from the data set (replaced by NAs). I could only find seven monthly temperatures where the original value was replaced by a new one. Without the necessary supplementary metadata, there is no sense in looking at that file any further.</p>
<p>The names of the stations and other geographic data for them can be found in the <a href="ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.temperature.inv">v2.temperature.inv </a>file. There are 7280 stations listed with 4495 unique WMO numbers. Each station can have one or more (up to a maximum of 10) “duplicates” so there are a total of 13486 temperature series in the data set. The duplicate counts look like:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top">Dups:</td>
<td valign="top">1</td>
<td valign="top">2</td>
<td valign="top">3</td>
<td valign="top">4</td>
<td valign="top">5</td>
<td valign="top">6</td>
<td valign="top">7</td>
<td valign="top">8</td>
<td valign="top">9</td>
<td valign="top">10</td>
</tr>
<tr>
<td valign="top">Freq:</td>
<td valign="top">4574</td>
<td valign="top">1109</td>
<td valign="top">601</td>
<td valign="top">502</td>
<td valign="top">269</td>
<td valign="top">111</td>
<td valign="top">56</td>
<td valign="top">44</td>
<td valign="top">12</td>
<td valign="top">2</td>
</tr>
</tbody>
</table>
<p> </p>
<p>Before the data can be used to construct the global record, it is necessary to somehow combine the information from the various duplicate versions into a single series.  One reasonably expects that the duplicates should be pretty much identical (with the occasional error) since they are supposedly different transcriptions of the same temperature series. The difficulty is that there are almost 13500 series which have to be looked at – not a simple matter.</p>
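<p>As a quick arithmetic check, the station and series totals quoted above can be recovered from the duplicate counts in the table:</p>

```r
# Number of duplicates per station and the number of stations with that many
dups = 1:10
freq = c(4574, 1109, 601, 502, 269, 111, 56, 44, 12, 2)

c(stations = sum(freq), series = sum(dups*freq))
```

<p>This gives 7280 stations and 13486 series, in agreement with the inventory file.</p>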
<p>The 4574 stations which were represented by a single series can be ignored for the moment – there is little that can be done to evaluate them – so, for simplicity, I decided to only look at the “twins”, i.e. those 1109 stations which have exactly two records. These were identified and the range for the simple difference between the two series was calculated. No heavy duty stats were necessary to take a look at the amount of agreement there was between them.</p>
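<p>The comparison for a single pair might be sketched as follows. This is not the linked script: the function name <code>twin_range</code> is hypothetical, the two series are made up, and “range” is assumed here to mean the largest difference minus the smallest difference over the common period.</p>

```r
# Range of the difference between two monthly series over their overlap
twin_range = function(a, b) {
  lo = max(tsp(a)[1], tsp(b)[1])     # later of the two start times
  hi = min(tsp(a)[2], tsp(b)[2])     # earlier of the two end times
  if (lo > hi) return(NA)            # the series do not overlap
  d = window(a, start=lo, end=hi) - window(b, start=lo, end=hi)
  d = as.numeric(d)
  d = d[!is.na(d)]                   # months missing in either series are skipped
  if (length(d) == 0) return(NA)
  max(d) - min(d)
}

# Two made-up "duplicates" with four months in common
a = ts(c(1, 2, 3, 4, 5, 6), start=c(2000, 1), frequency=12)
b = ts(c(3.5, 4, 5.5, 6), start=c(2000, 3), frequency=12)
twin_range(a, b)
```

<p>A pair with no common months returns <code>NA</code>, and identical transcriptions give a range of zero.</p>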
<p>I expected most of the stations to look like this:</p>
<p><a href="http://statpad.files.wordpress.com/2010/07/kobenhavn.jpg"><img class="aligncenter size-full wp-image-422" title="kobenhavn" src="http://statpad.files.wordpress.com/2010/07/kobenhavn.jpg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>However, there were others that looked like this one:</p>
<p><img title="kuska" src="http://statpad.files.wordpress.com/2010/07/kuska.jpg?w=500&#038;h=499" alt="" width="500" height="499" /></p>
<p>How many? Well that was the surprise! I graphed those which were not identical over their overlap periods and put the graphs into pdfs.</p>
<p>No overlap: 232 stations (no plots)<br />
Zero difference: 152 stations (no plots)<br />
Range between 0 and 1- : <a href="http://statpad.files.wordpress.com/2010/07/twins_1minus.pdf">233 stations </a>(4.9 MB pdf)<br />
Range between 1 and 3- : <a href="http://statpad.files.wordpress.com/2010/07/twins_1_3.pdf">321 stations </a>(6.9 MB pdf)<br />
Range between 3 and 12.9 : <a href="http://statpad.files.wordpress.com/2010/07/twins_3plus.pdf">171 stations </a>(4.1 MB pdf).</p>
<p>The latter two files are the more interesting ones.  “Duplicate” has taken on a whole new meaning for me.</p>
<p>If there are any errors in my results, R scripts or explanations of the phenomena in the plots, I would like to hear about them. </p>
<p>I have uploaded the R script as an ordinary text file called <a href="http://statpad.files.wordpress.com/2010/07/twin-analysis.doc">twin analysis.doc</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2010/07/19/ghcn-twins/feed/</wfw:commentRss>
		<slash:comments>42</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/07/twins.jpg" medium="image">
			<media:title type="html">Twins</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/07/kobenhavn.jpg" medium="image">
			<media:title type="html">kobenhavn</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/07/kuska.jpg" medium="image">
			<media:title type="html">kuska</media:title>
		</media:content>
	</item>
		<item>
		<title>2010 Spring Arctic Sea Ice Extent</title>
		<link>http://statpad.wordpress.com/2010/07/06/2010-spring-arctic-sea-ice-extent/</link>
		<comments>http://statpad.wordpress.com/2010/07/06/2010-spring-arctic-sea-ice-extent/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 16:22:55 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=402</guid>
		<description><![CDATA[The decline in the sea ice extent in May and June of 2010 appeared to be extremely fast. According to NSIDC, Arctic sea ice extent averaged 13.10 million square kilometers (5.06 million square miles) for the month of May, 500,000 &#8230; <a href="http://statpad.wordpress.com/2010/07/06/2010-spring-arctic-sea-ice-extent/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=402&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>The decline in the sea ice extent in May and June of 2010 appeared to be extremely fast.  According to <a href="http://nsidc.org/arcticseaicenews/2010/060810.html">NSIDC</a>, </p>
<blockquote><p>Arctic sea ice extent averaged 13.10 million square kilometers (5.06 million square miles) for the month of May, 500,000 square kilometers (193,000 square miles) below the 1979 to 2000 average. The rate of ice extent decline for the month was -68,000 kilometers (-26,000 square miles) per day, almost 50% more than the average rate of -46,000 kilometers (18,000 square miles) per day. This rate of loss is the highest for the month of May during the satellite record.
</p></blockquote>
<p>However, later on the same page, they also state under Conditions in Context:</p>
<blockquote><p>As we noted in our May post, several regions of the Arctic experienced a late-season spurt in ice growth. As a result, ice extent reached its seasonal maximum much later than average, and in turn the melt season began almost a month later than average. As ice began to decline in April, the rate was close to the average for that time of year.</p>
<p>In sharp contrast, ice extent declined rapidly during the month of May. Much of the ice loss occurred in the Bering Sea and the Sea of Okhotsk, indicating that the ice in these areas was thin and susceptible to melt. Many polynyas, areas of open water in the ice pack, opened up in the regions north of Alaska, in the Canadian Arctic Islands, and in the Kara and Barents and Laptev seas.</p></blockquote>
<p>This latter observation that the seasonal maximum was reached later in the season and the melt season started later is important. Regardless of specific annual weather conditions, May and June are melt season months in the Arctic. Furthermore, if there is more ice available, then it stands to reason that more melting will take place. What might be a better way to look at the data than simply plotting the total extent?</p>
<p>From the <a href="http://www.ijis.iarc.uaf.edu/en/home/seaice_extent.htm">JAXA site</a>:</p>
<p><a href="http://statpad.files.wordpress.com/2010/07/amsre_sea_ice_extent_july_5_2010.png"><img src="http://statpad.files.wordpress.com/2010/07/amsre_sea_ice_extent_july_5_2010.png?w=500&#038;h=312" alt="" title="AMSRE_Sea_Ice_Extent_July_5_2010" width="500" height="312" class="alignleft size-full wp-image-405" /></a></p>
<p>Why not graph the rate of change, as well?  In particular, because a wider extent will naturally imply a higher areal melt under the same melting conditions, it makes sense to look at the daily percentage change.</p>
<p>To do this, I downloaded the JAXA daily ice data into R (from 2002 to the present).  For convenience purposes, December 31 was deleted from both 2004 and 2008 to reduce the number of days to 365.  The percentage change was calculated for each day for which the corresponding data was available.  No infilling was done for missing data.  The data was plotted:</p>
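<p>On a toy series the daily percentage change works out as follows. The extent values are made up; the formula is the one used in the full script further down, where each day's change to the next day is expressed as a percentage of that day's extent.</p>

```r
# Made-up extents with one missing day
ext = c(14.0, 13.9, 13.7, NA, 13.4)

# Change to the next day as a percentage of the current day's extent
pct = 100*c(diff(ext), NA)/ext
round(pct, 3)
```

<p>Because no infilling is done, a missing day leaves both its own change and that of the day before undefined, and the last day has no change at all.</p>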
<p>(Click graph for larger version)</p>
<p><a href="http://statpad.files.wordpress.com/2010/07/arctic_seaice_pct_change.jpeg"><img src="http://statpad.files.wordpress.com/2010/07/arctic_seaice_pct_change.jpeg?w=500&#038;h=367" alt="" title="arctic_seaice_pct_change" width="500" height="367" class="alignleft size-full wp-image-406" /></a></p>
<p>Here, all of the years prior to 2010 are plotted in gray and the current year in red. The plot gives graphic insight into the patterns of thawing and freezing: the thaw season goes from roughly mid-March to mid-September. The very high variability in October is likely due to a reasonably similar annual speed of recovery which is expressed as a percentage of quite varied September minima.</p>
<p>How does 2010 compare in May and June? For May, it is somewhat toward the lower part of the combined record, but I would not classify it as extreme in any way. June was definitely below the other recent years during three periods of several days each. What will July and August look like? I guess we will have to wait and see…</p>
<p>The R script follows:</p>
<p><code><br />
#get latest JAXA extent data</p>
<p>iceurl = url("http://www.ijis.iarc.uaf.edu/seaice/extent/plot.csv")<br />
latest = read.csv(iceurl,header=FALSE,na.strings="-9999")<br />
colnames(latest) = c("month","day","year","ext")</p>
<p>#remove Dec 31, 2004 and 2008 (extra leap year day) for convenience<br />
#fill in with missing values for early part of 2002 (for convenience)</p>
<p>arc.ext = latest$ext<br />
which((latest$month==12)&amp;(latest$day==31)) # 214  579  945 1310 1675 2040 2406 2771 3136<br />
arc.ext = arc.ext[-c(945,2406)]<br />
arc.ext = c(rep(NA,365-214),arc.ext)</p>
<p>#length(arc.ext)/365 # 9</p>
<p>#calculate changes as % of current value<br />
#form matrix with 9 columns (one for each year)</p>
<p>pct.change = matrix(100*c(diff(arc.ext),NA)/arc.ext,ncol=9)</p>
<p>#plot data<br />
#years 2002 to 2009 as gray background<br />
#year 2010 in red<br />
#add month boundaries<br />
modays = c(31,28,31,30,31,30,31,31,30,31,30,31)</p>
<p>matplot(pct.change[,1:9],type="l",main ="Arctic ice Extent Change Relative to Area",xlab="Day",<br />
   ylab="Daily % Change", col=c(rep("grey",8),"red"),lty=1)<br />
 abline(h=0)<br />
 abline(v=c(0,cumsum(modays)), col="green")<br />
text(x =14+c(0,cumsum(modays)[-12]),y =c(rep(3,9),rep(-1,3)), labels=month.abb,col="blue")</p>
<p></code></p>
]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2010/07/06/2010-spring-arctic-sea-ice-extent/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/07/amsre_sea_ice_extent_july_5_2010.png" medium="image">
			<media:title type="html">AMSRE_Sea_Ice_Extent_July_5_2010</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/07/arctic_seaice_pct_change.jpeg" medium="image">
			<media:title type="html">arctic_seaice_pct_change</media:title>
		</media:content>
	</item>
		<item>
		<title>Temporary Absence</title>
		<link>http://statpad.wordpress.com/2010/04/05/temporary-absence/</link>
		<comments>http://statpad.wordpress.com/2010/04/05/temporary-absence/#comments</comments>
		<pubDate>Mon, 05 Apr 2010 16:30:24 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=399</guid>
		<description><![CDATA[I will be away from home (carrying out a study of the effect of Seasonal Local Warming on golf courses of the Dominican Republic) for a week so I will likely not reply to any comments during that time.  It`s &#8230; <a href="http://statpad.wordpress.com/2010/04/05/temporary-absence/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=399&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I will be away from home (carrying out a study of the effect of Seasonal Local Warming on golf courses of the Dominican Republic) for a week so I will likely not reply to any comments during that time.  It&#8217;s a tough job, but somebody has to do it&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2010/04/05/temporary-absence/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>
	</item>
		<item>
		<title>Will the Real Rapid City Please Stand UP!</title>
		<link>http://statpad.wordpress.com/2010/03/29/will-the-real-rapid-city-please-stand-up/</link>
		<comments>http://statpad.wordpress.com/2010/03/29/will-the-real-rapid-city-please-stand-up/#comments</comments>
		<pubDate>Mon, 29 Mar 2010 20:40:59 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=390</guid>
		<description><![CDATA[This is a post about the quality of temperature data.  You can spend a lot of time to find methods to maximize the information you squeeze out of data, but unless the data itself is reliable, all of the effort &#8230; <a href="http://statpad.wordpress.com/2010/03/29/will-the-real-rapid-city-please-stand-up/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=390&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>This is a post about the quality of temperature data.  You can spend a lot of time to find methods to maximize the information you squeeze out of data, but unless the data itself is reliable, all of the effort is wasted.  Recently, I ran across an example which I found somewhat disconcerting.</p>
<p>I had been testing some methods for estimating the temperature at particular locations in a geographic grid cell from the temperature data set released by the Met Office.  The grid cell was chosen on the basis that there was a reasonable collection of stations available for use in the procedure: 40 – 45 N by 100 – 105 W in the north central region of the United States.  I chose a station with a long, fairly complete record and my intent was to look at distance-based weighting for estimating the temperature at that station site using the neighboring stations.  Then I could compare the actual measured temperature to the estimated temperature to evaluate how well I had done.  But my results seemed poorer than I had expected. At that point, I thought that perhaps I should look more closely at the station record.</p>
<p>The station I had chosen was Rapid City, South Dakota – ID number 726620, with a current population close to 60,000 people according to Wikipedia.  For comparison purposes, I collected the same station’s records from a variety of other sources: GISS “raw” (same as “combined”) and homogenized directly from the <a href="http://data.giss.nasa.gov/gistemp/">Gistemp web pages</a>, GHCN “raw” and adjusted from the v2 data set, and what was listed as the same two GHCN records from the <a href="http://climexp.knmi.nl/">Climate Explorer web site</a>.  The subsequent analysis proved quite interesting.</p>
<p><span id="more-390"></span>To start with, the data from the Climate Explorer web site proved to be identical to the GHCN data that it purported to be.  A plot of the remaining five data sets looks like this:</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/rapid1.jpg"><img class="alignleft size-full wp-image-384" title="rapid1" src="http://statpad.files.wordpress.com/2010/03/rapid1.jpg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>At a glance, the records look quite similar with what appear to be some minor variations particularly in the later portions of the series.  Next, I compared the effects of the adjustments made in the case of GISS and GHCN.</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/rapid2.jpg"><img class="alignleft size-full wp-image-385" title="rapid2" src="http://statpad.files.wordpress.com/2010/03/rapid2.jpg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>This proved rather interesting.  The GISS homogenization was a simple increase of about 0.4 degrees in several stages, nothing we haven’t seen before.  However, the GHCN adjustment is quite complicated, so a further plot of the adjustments by month seemed to be a good idea.</p>
<p>Monthly GHCN Adjustments:</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/rapid3.jpg"><img class="alignleft size-full wp-image-386" title="rapid3" src="http://statpad.files.wordpress.com/2010/03/rapid3.jpg?w=500&#038;h=531" alt="" width="500" height="531" /></a></p>
<p>Now this I find difficult to understand.  The pattern of the adjustments differs substantially for the various months with a strongly induced increasing trend in the summer and fall.  Since it is unlikely that the station is moved each spring to another location (the weatherman’s summer residence? <img src='http://s1.wp.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />   ) and then back into town in the fall, I cannot find a reasonable explanation for either the type or the amounts of the adjustment.  However, it was downhill from here.</p>
<p>A comparison of GISS to GHCN:</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/rapid4.jpg"><img class="alignleft size-full wp-image-387" title="rapid4" src="http://statpad.files.wordpress.com/2010/03/rapid4.jpg?w=500&#038;h=531" alt="" width="500" height="531" /></a></p>
<p>None of the GISS series is even close to either of those from GHCN!  Yet they all purport to be the temperatures measured at a single site.  However, we have forgotten the Met series which initiated the entire exercise.  Met Office’s version compared to the other four:</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/rapid5.jpg"><img class="alignleft size-full wp-image-388" title="rapid5" src="http://statpad.files.wordpress.com/2010/03/rapid5.jpg?w=500&#038;h=473" alt="" width="500" height="473" /></a></p>
<p>Well, not only different, but in a strange way.  Starting in the early 1980’s, Met decides to go off on its own.  Where the other four series have missing values (consistent with each other), Met always has a measurement available.  As well, the difference between Met and the others at times becomes relatively large.</p>
<p>As a final step, I calculated the trends from 1970 onwards for each of the five.  On a decadal basis:</p>
<p>Met 0.021  C / decade</p>
<p>GISS  0.167  C / decade</p>
<p>Homogenized GISS 0.253 C / decade</p>
<p>GHCN  -0.185  C / decade</p>
<p>Adjusted GHCN 0.302 C / decade</p>
<p>Why are they all so different?  I haven’t got a clue!  I triple-checked the data sources, but couldn’t find any errors in my versions of the data.  Maybe someone out there can provide some enlightenment.</p>
<p>I hope the rest of the records are not like this&#8230;</p>
<p>The data I used are on the website and can be downloaded through the script below.</p>
<pre class="brush: css; title: ; notranslate">
#get data

rapiddat = url(&quot;http://statpad.files.wordpress.com/2010/03/rapidcity.doc&quot;)
rapidcityx = dget(rapiddat)

rapidcity = rapidcityx[,1:5]

plot(rapidcity) #rapid1

par(mfrow=c(2,1))

#giss is raw (and/or combined sources)
#rapid2
plot(rapidcity[,&quot;homgiss&quot;]-rapidcity[,&quot;giss&quot;], main = &quot;GISS ... Homogenized  - Raw&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot; )

plot(rapidcity[,&quot;adjghcn&quot;]-rapidcity[,&quot;ghcn&quot;],main = &quot;GHCN ... Adjusted  - Raw&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot;)

#monthly pattern of diffs for ghcn rapid3
par(mfrow=c(4,3))
for (i in 1:12) {
 plot(window(rapidcity[,&quot;adjghcn&quot;]-rapidcity[,&quot;ghcn&quot;], start=c(1888,i),deltat=1),ylim=c(-2,1),main = month.name[i],ylab =&quot;Degrees C&quot;,xlab=&quot;Year&quot;)
 abline(h=0,col=&quot;red&quot;)}

par(mfrow=c(2,2))

plot(rapidcity[,&quot;giss&quot;]-rapidcity[,&quot;ghcn&quot;], main = &quot;GISS - GHCN&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot; )
plot(rapidcity[,&quot;homgiss&quot;]-rapidcity[,&quot;ghcn&quot;],main = &quot;Homogenized GISS - GHCN&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot;)
plot(rapidcity[,&quot;giss&quot;]-rapidcity[,&quot;adjghcn&quot;], main = &quot;GISS - Adjusted GHCN&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot; )
plot(rapidcity[,&quot;homgiss&quot;]-rapidcity[,&quot;adjghcn&quot;],main = &quot;Homogenized GISS -  Adjusted GHCN&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot;)

par(mfrow=c(2,2))
plot(rapidcity[,&quot;met&quot;]-rapidcity[,&quot;ghcn&quot;], main = &quot;Met - GHCN&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot; )
 abline(h = 0,col=&quot;red&quot;)
plot(rapidcity[,&quot;met&quot;]-rapidcity[,&quot;adjghcn&quot;],main = &quot;Met - Adjusted GHCN&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot;)
 abline(h = 0,col=&quot;red&quot;)
plot(rapidcity[,&quot;met&quot;]-rapidcity[,&quot;giss&quot;], main = &quot;Met - GISS&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot; )
 abline(h = 0,col=&quot;red&quot;)
plot(rapidcity[,&quot;met&quot;]-rapidcity[,&quot;homgiss&quot;],main = &quot;Met - Homogenized GISS&quot;,xlab=&quot;Year&quot;,ylab = &quot;Degrees C&quot;)
 abline(h = 0,col=&quot;red&quot;)

#trends
rapid70 = window(rapidcity,start=c(1970,1))
trend = rep(NA,5)
mons = factor(cycle(rapid70))
tim = time(rapid70)
for (i in 1:5) trend[i] = lm(rapid70[,i]~0+mons+tim)$coe[13]
names(trend)=colnames(rapid70)
trend

#         met         giss      homgiss         ghcn      adjghcn
# 0.002050500  0.016706298  0.025326090 -0.018540570  0.030191120

</pre>
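<p>The coefficients printed at the end of the script are in degrees per year, because time() on a monthly series is measured in years.  Multiplying by ten gives the per-decade trends quoted in the post.  A quick check, with the printed values re-entered by hand (the name trend.yr is used only for this sketch):</p>
<pre class="brush: css; title: ; notranslate">
#slopes in degrees C per year, copied from the script output above
trend.yr = c(met = 0.002050500, giss = 0.016706298, homgiss = 0.025326090,
 ghcn = -0.018540570, adjghcn = 0.030191120)

#convert to degrees C per decade
round(10*trend.yr, 3)
</pre>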
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/statpad.wordpress.com/390/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/statpad.wordpress.com/390/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=390&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2010/03/29/will-the-real-rapid-city-please-stand-up/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/rapid1.jpg" medium="image">
			<media:title type="html">rapid1</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/rapid2.jpg" medium="image">
			<media:title type="html">rapid2</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/rapid3.jpg" medium="image">
			<media:title type="html">rapid3</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/rapid4.jpg" medium="image">
			<media:title type="html">rapid4</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/rapid5.jpg" medium="image">
			<media:title type="html">rapid5</media:title>
		</media:content>
	</item>
		<item>
		<title>Anomaly Regression – Do It Right!</title>
		<link>http://statpad.wordpress.com/2010/03/18/anomaly-regression-%e2%80%93-do-it-right/</link>
		<comments>http://statpad.wordpress.com/2010/03/18/anomaly-regression-%e2%80%93-do-it-right/#comments</comments>
		<pubDate>Thu, 18 Mar 2010 17:10:49 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=369</guid>
		<description><![CDATA[I have been meaning to do this post for several years, but until now I have not found a particularly relevant time to do it.  In his recent post at the Air Vent, Jeff Id makes the following statement: Think &#8230; <a href="http://statpad.wordpress.com/2010/03/18/anomaly-regression-%e2%80%93-do-it-right/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=369&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I have been meaning to do this post for several years, but until now I have not found a particularly relevant time to do it.  In his <a href="http://noconsensus.wordpress.com/2010/03/17/anomaly-aversion/">recent post at the Air Vent</a>, Jeff Id makes the following statement:</p>
<blockquote><p>Think about that.  We’re re-aligning the anomaly series with each other to remove the steps.  If we use raw data  (assuming up-sloping data), the steps <em><strong>in this case</strong></em><strong> </strong>were positive with respect to trend, sometimes the steps can be negative.  If we use anomaly alone (assuming up-sloping data), the steps from added and removed series are <em><strong>always</strong></em> toward a <strong>reduction in actual trend</strong>.  It’s an odd concept, but the key is that they are NOT TRUE trend as the true trend, in this simple case, is of course 0.12C/Decade.</p></blockquote>
<p>The actual situation is deeper than Jeff thinks.  <strong>The usual method used by climate scientists for doing monthly anomaly regression is wrong! </strong>Before you say, “Whoa! How can a consensus be <em>wrong</em>?”, let me first give an example which I will follow up with the math to show you what the problem is.</p>
<p><span id="more-369"></span>We first produce ten years’ worth of noiseless temperature data.  To make it look realistic (not really important) we will take a sinusoidal annual curve and superimpose an exact linear trend (of .2 degrees per year) on the data:</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/anomregfig1.jpeg"><img class="alignleft size-full wp-image-377" title="anomregfig1" src="http://statpad.files.wordpress.com/2010/03/anomregfig1.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>We now calculate the anomalies and fit a linear regression line to the anomalies.  The following is a plot of anomalies with a regression line:</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/anomregfig2.jpeg"><img class="alignleft size-full wp-image-378" title="anomregfig2" src="http://statpad.files.wordpress.com/2010/03/anomregfig2.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>The resulting trend of .198 does not match the actual trend in the data despite the fact that there is no noise present in the series.  Although the difference in this case is not large (a poor oft-repeated justification), this is an uncorrectable bias error.  What is not as obvious is that this will have an impact on the autocorrelation in the situation as evidenced by the ACF of the residuals – an effect due to the methodology and not inherent to the actual “error” sequence:</p>
<p><a href="http://statpad.files.wordpress.com/2010/03/anomregfig3.jpeg"><img class="alignleft size-full wp-image-379" title="anomregfig3" src="http://statpad.files.wordpress.com/2010/03/anomregfig3.jpeg?w=500&#038;h=499" alt="" width="500" height="499" /></a></p>
<p>But it gets worse. If we change our starting month (but still use ten full years of data), we get different slopes for each month:</p>
<table border="0" cellspacing="0" cellpadding="0" width="192">
<tbody>
<tr>
<td width="64" valign="bottom"><strong>Month</strong></td>
<td width="64" valign="bottom"><strong>Intercept</strong></td>
<td width="64" valign="bottom"><strong>Trend </strong></td>
</tr>
<tr>
<td width="64" valign="bottom">1</td>
<td width="64" valign="bottom">-1.27983</td>
<td width="64" valign="bottom">0.198014</td>
</tr>
<tr>
<td width="64" valign="bottom">2</td>
<td width="64" valign="bottom">-1.28521</td>
<td width="64" valign="bottom">0.198931</td>
</tr>
<tr>
<td width="64" valign="bottom">3</td>
<td width="64" valign="bottom">-1.28971</td>
<td width="64" valign="bottom">0.199681</td>
</tr>
<tr>
<td width="64" valign="bottom">4</td>
<td width="64" valign="bottom">-1.29331</td>
<td width="64" valign="bottom">0.200264</td>
</tr>
<tr>
<td width="64" valign="bottom">5</td>
<td width="64" valign="bottom">-1.29595</td>
<td width="64" valign="bottom">0.200681</td>
</tr>
<tr>
<td width="64" valign="bottom">6</td>
<td width="64" valign="bottom">-1.2976</td>
<td width="64" valign="bottom">0.200931</td>
</tr>
<tr>
<td width="64" valign="bottom">7</td>
<td width="64" valign="bottom">-1.29822</td>
<td width="64" valign="bottom">0.201014</td>
</tr>
<tr>
<td width="64" valign="bottom">8</td>
<td width="64" valign="bottom">-1.29775</td>
<td width="64" valign="bottom">0.200931</td>
</tr>
<tr>
<td width="64" valign="bottom">9</td>
<td width="64" valign="bottom">-1.29618</td>
<td width="64" valign="bottom">0.200681</td>
</tr>
<tr>
<td width="64" valign="bottom">10</td>
<td width="64" valign="bottom">-1.29344</td>
<td width="64" valign="bottom">0.200264</td>
</tr>
<tr>
<td width="64" valign="bottom">11</td>
<td width="64" valign="bottom">-1.2895</td>
<td width="64" valign="bottom">0.199681</td>
</tr>
<tr>
<td width="64" valign="bottom">12</td>
<td width="64" valign="bottom">-1.28431</td>
<td width="64" valign="bottom">0.198931</td>
</tr>
</tbody>
</table>
<p>The average <em>is</em> equal to .2 so all is not lost.   However, all of the errors are a result of how we chose to do our analysis and were not really present in the original data.</p>
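<p>As a quick check on that statement, the twelve slopes in the table can be averaged directly.  The values below are typed in from the table, so the result is only as exact as the six printed decimals; the name slopes is just for this sketch:</p>
<pre class="brush: css; title: ; notranslate">
#trend column of the table, starting months 1 to 12
slopes = c(0.198014, 0.198931, 0.199681, 0.200264, 0.200681, 0.200931,
 0.201014, 0.200931, 0.200681, 0.200264, 0.199681, 0.198931)

mean(slopes)         #average over the twelve starting months
range(slopes) - 0.2  #largest deviations below and above the true trend
</pre>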
<p>So what’s causing the problem?</p>
<p><strong> Here comes the math</strong></p>
<p>What most climate scientists still do not seem to understand is the need to spell out the parameters and how they relate to each other.  Unless such a “statistical model” exists, we do not have a good grasp of what is happening when we carry out an analysis, and we have a reduced ability to recognize when something is not quite right.  The implicit model in this situation is</p>
<p><img src='http://s0.wp.com/latex.php?latex=X%28t%29+%3D+%5Cmu+_m+%2B+%5Cbeta+t+%2B+%5Cvarepsilon+%28t%29+&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='X(t) = &#92;mu _m + &#92;beta t + &#92;varepsilon (t) ' title='X(t) = &#92;mu _m + &#92;beta t + &#92;varepsilon (t) ' class='latex' /></p>
<p>where</p>
<p>X(t) = temperature at time t</p>
<p>t = y + m, where y = year and m = month (with values 0/12, 1/12, …, 11/12)</p>
<p>µ<sub>m</sub> = mean of month m</p>
<p>β = annual trend</p>
<p>ε(t) = “error” (which in this case are all zeroes).</p>
<p>Now, what happens when we calculate anomalies?  If A(t) is the anomaly at time t,</p>
<p><img src='http://s0.wp.com/latex.php?latex=A%28t%29+%3D+X%28t%29+-+%5Cbar+X_m+&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='A(t) = X(t) - &#92;bar X_m ' title='A(t) = X(t) - &#92;bar X_m ' class='latex' /></p>
<p>where the “bar” represents averaging over that variable.</p>
<p>Here,</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cbar+X_m+%3D+%5Cmu+_m+%2B+%5Cbeta+%5Cbar+t+%2B+%5Cbar+%5Cvarepsilon+_m+%3D+%5Cmu+_m+%2B+%5Cbeta+%28%5Cbar+y+%2B+m%29+%2B+%5Cbar+%5Cvarepsilon+_m+&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bar X_m = &#92;mu _m + &#92;beta &#92;bar t + &#92;bar &#92;varepsilon _m = &#92;mu _m + &#92;beta (&#92;bar y + m) + &#92;bar &#92;varepsilon _m ' title='&#92;bar X_m = &#92;mu _m + &#92;beta &#92;bar t + &#92;bar &#92;varepsilon _m = &#92;mu _m + &#92;beta (&#92;bar y + m) + &#92;bar &#92;varepsilon _m ' class='latex' /></p>
<p>and if we substitute this in the original equation we get</p>
<p><img src='http://s0.wp.com/latex.php?latex=A%28y+%2B+m%29+%3D+-+%5Cbeta+%5Cbar+y+%2B+%5Cbeta+y+%2B+%28%5Cvarepsilon+%28y+%2B+m%29+-+%5Cbar+%5Cvarepsilon+_m+%29+&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='A(y + m) = - &#92;beta &#92;bar y + &#92;beta y + (&#92;varepsilon (y + m) - &#92;bar &#92;varepsilon _m ) ' title='A(y + m) = - &#92;beta &#92;bar y + &#92;beta y + (&#92;varepsilon (y + m) - &#92;bar &#92;varepsilon _m ) ' class='latex' /></p>
<p>The important thing to realize here is that the month no longer appears with the trend – only the year (and NOT time) should be used to get the correct trend.  By using time, we actually are fitting the line</p>
<p><img src='http://s0.wp.com/latex.php?latex=A%28y+%2B+m%29+%3D+-+%5Cbeta+%28%5Cbar+y+%2B+m%29+%2B+%5Cbeta+%28y+%2B+m%29+%2B+%28%5Cvarepsilon+%28y+%2B+m%29+-+%5Cbar+%5Cvarepsilon+_m+%29+&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='A(y + m) = - &#92;beta (&#92;bar y + m) + &#92;beta (y + m) + (&#92;varepsilon (y + m) - &#92;bar &#92;varepsilon _m ) ' title='A(y + m) = - &#92;beta (&#92;bar y + m) + &#92;beta (y + m) + (&#92;varepsilon (y + m) - &#92;bar &#92;varepsilon _m ) ' class='latex' /></p>
<p>for which the intercept should be different for each month.  Since the usual anomaly regression fits a single intercept, the resulting trend is incorrectly estimated.</p>
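<p>The algebra can be checked numerically.  The sketch below rebuilds ten years of noiseless data with the same seasonal curve and trend as the example, computes the monthly anomalies, and confirms that they equal β(y – ȳ), so that the month has dropped out.  The variable names (beta, yr, mon, tt, temp, anom) are mine and are not taken from the script further down.</p>
<pre class="brush: css; title: ; notranslate">
beta = 0.2
yr = rep(1:10, each = 12)          #year y
mon = rep((0:11)/12, times = 10)   #month m as a fraction of a year
tt = yr + mon                      #time t = y + m

#noiseless temperatures: monthly mean plus trend in t
temp = 5 + 10*sin(2*pi*mon) + beta*tt

#anomaly = temperature minus the mean of its calendar month
anom = temp - ave(temp, mon)

#the month has dropped out: anomalies depend on the year only
all.equal(anom, beta*(yr - mean(yr)))

#a single line in t gives the biased slope, a line in y gives beta
coef(lm(anom ~ tt))[2]
coef(lm(anom ~ yr))[2]
</pre>
<p>Regressing on tt forces one intercept on twelve parallel lines that are offset from each other, which is where the .198 comes from.</p>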
<p><strong>How can this be fixed?</strong></p>
<p>Several fixes are possible.  The simplest is to use year rather than time as the independent variable in the regression.  This may seem counterintuitive, but it takes the anomalies and lines them up vertically above each other, solving the problem that Jeff noticed.  However, this solution can still be problematic if the number of available anomalies differs from year to year.</p>
<p>A better solution is to do the anomalizing <em>and</em> the trend fit at the same time.  This corresponds to a single factor Analysis of Covariance where month is treated as a categorical factor with twelve levels and time (or year!) is treated as a numeric covariate.  This can be implemented in R using the function lm.</p>
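<p>A minimal sketch of such a combined fit, applied directly to the raw (not anomalized) series, might look like the following.  It assumes the same noiseless construction as the example above; the names temps.raw, mon.f, tim, fit and keep are illustrative only.  Dropping the common intercept with 0+ makes the first twelve coefficients the monthly offsets, and the thirteenth coefficient is the trend.</p>
<pre class="brush: css; title: ; notranslate">
#ten years of noiseless monthly data with a trend of .2 per year
k = 0:119
temps.raw = ts(5 + 10*sin(pi*k/6) + k/60, start = c(1,1), frequency = 12)

mon.f = factor(cycle(temps.raw))   #month as a factor with twelve levels
tim = as.numeric(time(temps.raw))  #time in years as the covariate
y = as.numeric(temps.raw)

#one intercept per month and a common slope
fit = lm(y ~ 0 + mon.f + tim)
coef(fit)[13]   #the trend

#nine full years starting in June of year 1 instead of January
keep = 6:113
coef(lm(y[keep] ~ 0 + mon.f[keep] + tim[keep]))[13]
</pre>
<p>In this noiseless case both fits should return the same slope, since the monthly intercepts absorb the offsets that bias the single-intercept anomaly regression.</p>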
<p>The script used in the post follows:</p>
<pre class="brush: css; title: ; notranslate">

#we construct 11 years of data for later use
seasonal = 5 + 10*sin(pi*(0:131)/6)
#add trend of .2 degrees per year
trend = (0:131)/60

temps = ts((seasonal+trend)[1:120],start=c(1,1),freq=12)
time.temp= time(temps)

plot(seasonal,type=&quot;l&quot;)

plot(temps, main=&quot;Plot of Monthly Temperature Time Series&quot;, ylab=&quot;Degrees C&quot;, xlab=&quot;Time&quot;) #anomregfig1.jpg

#function to calculate anomaly
anomaly.calc=function(tsdat,styr=1951,enyr=1980){
 tsmeans = rowMeans(matrix(window(tsdat,start=c(styr,1),end=c(enyr,12)),nrow=12))
 tsdat-rep(tsmeans,len=length(tsdat))}

anom = anomaly.calc(temps,1,10)

#anomaly regression
reg1 = lm(anom~time.temp) #  intercept =  -1.180, slope = 0.198

plot(anom, main = &quot;Anomalies&quot;,ylab = &quot;Degree C&quot;,xlab = &quot;Time&quot;) #anomregfig2.jpg
abline(reg = reg1,col=&quot;red&quot;)

#calculate autocorrelation of residuals
acf(residuals(reg1))  #anomregfig3.jpg

#set up revolving regressions
tempsx = ts(seasonal+trend,start=c(1,1),freq=12)
timex.temp= time(tempsx)
anomx.all = anomaly.calc(tempsx,1,11)

#function to cycle regression starting points and give coefficients
cycle.reg = function(dats) {
 coes = matrix(NA,12,2)
 coes[1,] = coef(lm(window(dats,start=c(1,1),end=c(10,12))~time(window(dats,start=c(1,1),end=c(10,12)))))
 for (i in 1:12) coes[i,] = coef(lm(window(dats,start=c(1,i),end=c(11,i-1))~time(window(dats,start=c(1,i),end=c(11,i-1)))))
coes}

(all.coef = cycle.reg(anomx.all))

#           [,1]      [,2]
# [1,] -1.279832 0.1980138
# [2,] -1.285205 0.1989305
# [3,] -1.289710 0.1996805
# [4,] -1.293305 0.2002639
# [5,] -1.295949 0.2006806
# [6,] -1.297599 0.2009306
# [7,] -1.298215 0.2010140
# [8,] -1.297754 0.2009306
# [9,] -1.296176 0.2006806
#[10,] -1.293437 0.2002639
#[11,] -1.289497 0.1996805
#[12,] -1.284314 0.1989305

#fit regression using year
year = floor(time.temp)
reg2 = lm(anom~year)
reg2 #[1]  -1.1          0.2

#setup anova
month = factor(rep(1:12,11))
time2 = time(anomx.all)
year=floor(time2)
dataf = data.frame(anomx.all,time2,year,month)

#function to test different month starting points
#using time as a covariate
#column 13 is trend

cyclex.reg = function(datfs) {
 coes = matrix(NA,12,13)
 for (i in 1:12) {dats = datfs[i:(119+i),]
 coes[i,] = coef(lm(anomx.all~0+month+time2,data=dats))} #line corrected
coes}

(anov1 = cyclex.reg(dataf))

#function to test different month starting points
#using year as covariate
cyclexx.reg = function(datfs) {
 coes = matrix(NA,12,13)
 for (i in 1:12) {dats = datfs[i:(119+i),]
 coes[i,] = coef(lm(anomx.all~0+month+year,data=dats))} #line corrected
coes}

(anov2 = cyclexx.reg(dataf))
</pre>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/statpad.wordpress.com/369/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/statpad.wordpress.com/369/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=369&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2010/03/18/anomaly-regression-%e2%80%93-do-it-right/feed/</wfw:commentRss>
		<slash:comments>57</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/anomregfig1.jpeg" medium="image">
			<media:title type="html">anomregfig1</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/anomregfig2.jpeg" medium="image">
			<media:title type="html">anomregfig2</media:title>
		</media:content>

		<media:content url="http://statpad.files.wordpress.com/2010/03/anomregfig3.jpeg" medium="image">
			<media:title type="html">anomregfig3</media:title>
		</media:content>
	</item>
		<item>
		<title>Faster Version for Combining Series</title>
		<link>http://statpad.wordpress.com/2010/03/13/faster-version-for-combining-series/</link>
		<comments>http://statpad.wordpress.com/2010/03/13/faster-version-for-combining-series/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 20:53:23 +0000</pubDate>
		<dc:creator>RomanM</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://statpad.wordpress.com/?p=362</guid>
		<description><![CDATA[I have written a faster version for combining temperature time series (allowing for weights) in R.  I had hoped to post it along with an example using it, but I got sidetracked so the example is not completed yet.  However, &#8230; <a href="http://statpad.wordpress.com/2010/03/13/faster-version-for-combining-series/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=362&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I have written a faster version for combining temperature time series (allowing for weights) in R.  I had hoped to post it along with an example using it, but I got sidetracked so the example is not completed yet.  However, I am posting the new function so that Jeff can use it in his latest efforts.  This version runs a LOT faster and more efficiently than the one <a href="http://statpad.wordpress.com/2010/03/08/combining-stations-plan-c/">posted earlier</a>.</p>
<p>As well, I have found a &#8220;bug&#8221; in that version which causes the script to fail when any series is missing all of the values for some  month.  When the data is run in the newer version, you get results for all months, but I don&#8217;t think the results are necessarily realistic for the month in question.  There is no way to infer what the values for that month might look like for that station without making further assumptions so, at the moment, the best bet is to remove the offending series from the analysis and run the program without it.  I have included a short program to identify possible problem series.</p>
<p>Anyway, here is the updated version.  I might get the example done tomorrow if the nice weather we are currently having goes away. <img src='http://s1.wp.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<pre class="brush: css; title: ; notranslate">
####Function for combining series
# For even faster calculation, use all=F
# to speed up multiple grid calculations

temp.combine = function(tsdat, wts=NULL, all=T) {
##### version2.0
### subfunction to do pseudoinverse
psx.inv = function(mat,tol = NULL) {
 if (NCOL(mat)==1) return( mat /sum(mat^2))
msvd = svd(mat)
 dind = msvd$d
if (is.null(tol)) {tol = max(NROW(mat),NCOL(mat))*max(dind)*.Machine$double.eps}
 dind[dind&lt;tol]=0
 dind[dind&gt;0] = 1/dind[dind&gt;0]
 inv = msvd$v %*% diag(dind, length(dind)) %*% t(msvd$u)
inv}
### subfunction to do offsets
calcx.offset = function(tdat,wts) {
## new version
 nr = length(wts)
 delt.mat = !is.na(tdat)
 delt.vec = rowSums(delt.mat)
 row.miss= (delt.vec ==0)
 delt2 = delt.mat/(delt.vec+row.miss)
 co.mat = diag(colSums(delt.mat)) - (t(delt.mat)%*% delt2)
 co.vec = colSums(delt.mat*tdat,na.rm=T) - colSums(rowSums(delt.mat*tdat,na.rm=T)*delt2)
 co.mat[nr,] = wts
 co.vec[nr]=0
 psx.inv(co.mat)%*%co.vec }
### main routine
 nr = nrow(tsdat)
 nc = ncol(tsdat)
 dims = dim(tsdat)
 if (is.null(wts)) wts = rep(1,nc)
 wts=wts/sum(wts)
 off.mat = matrix(NA,12,nc)
 dat.tsp = tsp(tsdat)
 for (i in 1:12) off.mat[i,] = calcx.offset(window(tsdat,start=c(dat.tsp[1],i), deltat=1), wts)
 colnames(off.mat) = colnames(tsdat)
 rownames(off.mat) = month.abb
 matoff = matrix(NA,nr,nc)
 for (i in 1:nc) matoff[,i] = rep(off.mat[,i],length=nr)
 temp = rowMeans(tsdat-matoff,na.rm=T)
 pred=NULL
 residual=NULL
 if (all==T) { pred =  c(temp) + matoff
 residual = tsdat-pred }
list(temps = ts(temp,start=c(dat.tsp[1],1),freq=12),pred =pred, residual = residual, offsets=off.mat) }


#pick out those series which have at least nn + 1 observations in every month
#Outputs a logical vector with TRUE indicating that that series is OK
dat.check = function(tsdat, nn=0) {  good = rep(NA,ncol(tsdat))
 for (i in 1:ncol(tsdat)) good[i]= (min(rowSums(!is.na(matrix(tsdat[,i],nrow=12))))&gt;nn)
 good }
</pre>
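<p>To illustrate what the check is looking for, here is a small self-contained sketch on made-up data; it does not call the functions above.  It assumes, as dat.check does, that each series starts in January and covers whole years.  The names vals, toy and obs.count are hypothetical.</p>
<pre class="brush: css; title: ; notranslate">
#two made-up series, three full years each
set.seed(1)
vals = matrix(rnorm(72), ncol = 2, dimnames = list(NULL, c('ok', 'bad')))
vals[seq(3, 36, by = 12), 'bad'] = NA   #blank out every March in the second series
toy = ts(vals, start = c(2000,1), frequency = 12)

#observations available per calendar month (rows) for each series (columns)
obs.count = sapply(1:ncol(toy), function(i) rowSums(!is.na(matrix(toy[,i], nrow = 12))))
dimnames(obs.count) = list(month.abb, colnames(toy))
obs.count

#TRUE for series that have at least one value in every month
apply(obs.count != 0, 2, all)
</pre>
<p>A series flagged FALSE here is the kind that would have to be dropped before running the combination.</p>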
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/statpad.wordpress.com/362/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/statpad.wordpress.com/362/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=statpad.wordpress.com&#038;blog=6398118&#038;post=362&#038;subd=statpad&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://statpad.wordpress.com/2010/03/13/faster-version-for-combining-series/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/e385885d1347e2524e835888c6d985ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">RomanM</media:title>
		</media:content>
	</item>
	</channel>
</rss>
