Faster Version for Combining Series

I have written a faster version for combining temperature time series (allowing for weights) in R.  I had hoped to post it along with an example using it, but I got sidetracked so the example is not completed yet.  However, I am posting the new function so that Jeff can use it in his latest efforts.    This version runs a LOT faster and more efficiently than the one posted earlier .

As well, I have found a “bug” in that version which causes the script to fail when any series is missing all of the values for some  month.  When the data is run in the newer version, you get results for all months, but I don’t think the results are necessarily realistic for the month in question.  There is no way to infer what the values for that month might look like for that station without making further assumptions so, at the moment, the best bet is to remove the offending series from the analysis and run the program without it.  I have included a short program to identify possible problem series.

Anyway, here is the updated version.  I might get the example done tomorrow if the nice weather we are currently having goes away. ;)

####Function for combining series
# For even faster calculation, use all=F
# to speed up multiple grid calculations

temp.combine = function(tsdat, wts=NULL, all=T) {
##### version2.0
### subfunction to do pseudoinverse
psx.inv = function(mat,tol = NULL) {
 if (NCOL(mat)==1) return( mat /sum(mat^2))
msvd = svd(mat)
 dind = msvd$d
if (is.null(tol)) {tol = max(NROW(mat),NCOL(mat))*max(dind)*.Machine$double.eps}
 dind[dind<tol]=0
 dind[dind>0] = 1/dind[dind>0]
 inv = msvd$v %*% diag(dind, length(dind)) %*% t(msvd$u)
inv}
### subfunction to do offsets
calcx.offset = function(tdat,wts) {
## new version
 nr = length(wts)
 delt.mat = !is.na(tdat)
 delt.vec = rowSums(delt.mat)
 row.miss= (delt.vec ==0)
 delt2 = delt.mat/(delt.vec+row.miss)
 co.mat = diag(colSums(delt.mat)) - (t(delt.mat)%*% delt2)
 co.vec = colSums(delt.mat*tdat,na.rm=T) - colSums(rowSums(delt.mat*tdat,na.rm=T)*delt2)
 co.mat[nr,] = wts
 co.vec[nr]=0
 psx.inv(co.mat)%*%co.vec }
### main routine
 nr = nrow(tsdat)
 nc = ncol(tsdat)
 dims = dim(tsdat)
 if (is.null(wts)) wts = rep(1,nc)
 wts=wts/sum(wts)
 off.mat = matrix(NA,12,nc)
 dat.tsp = tsp(tsdat)
 for (i in 1:12) off.mat[i,] = calcx.offset(window(tsdat,start=c(dat.tsp[1],i), deltat=1), wts)
 colnames(off.mat) = colnames(tsdat)
 rownames(off.mat) = month.abb
 matoff = matrix(NA,nr,nc)
 for (i in 1:nc) matoff[,i] = rep(off.mat[,i],length=nr)
 temp = rowMeans(tsdat-matoff,na.rm=T)
 pred=NULL
 residual=NULL
 if (all==T) { pred =  c(temp) + matoff
 residual = tsdat-pred }
list(temps = ts(temp,start=c(dat.tsp[1],1),freq=12),pred =pred, residual = residual, offsets=off.mat) }


#pick out those series with have at least nn + 1 observations in every month
#Outputs a logical vector with TRUE indicating that that sereis is OK
dat.check = function(tsdat, nn=0) {  good = rep(NA,ncol(tsdat))
 for (i in 1:ncol(tsdat)) good[i]= (min(rowSums(!is.na(matrix(tsdat[,i],nrow=12))))>nn)
 good }
About these ads

12 Comments

Filed under Uncategorized

12 responses to “Faster Version for Combining Series

  1. Agile Aspect

    Is there a way to download the R code other than the cut and paste to a clipboard Windows barbarism? On Linux the \cr\nl nonsense sucks.

    • RomanM

      I am not sure what can be done about it on your system. I am on windows and in Firefox, it is easy to copy the script. If you click inside the script box borders, a set of 4 icons appears at the top:”view source”, “copy to clipboard”, “print” and “?”. Clicking the second of these copies the entire scrip to the clipboard.

      WordPress does not allow a large variety of files to be uploaded to it by the user. Even txt files are verboten. I have fooled it by uploading doc files which were nothing but txt files and that works, but it is a pain in the butt. If you just enter it into the body of the post, the quotes get screwed up and then the copier needs to replace them all before the script can become useable.

  2. ha Roman,

    In looking at the meta data .. Thought you’d like to understand that bunches of stations are coastal..

    Interesting work for those grids that cover both land and sea.

    • RomanM

      Re: steven mosher ,

      Yes, that was particularly apparent from the offset plot in the single versus monthly post. The differences in pattern turned out to be coastal (and island) stations versus inland ones.

      I would like to get at the actual series from the water based atmosphere measurement sites to see what they look like. I suspect that there would be considerably less variation involved.

  3. I’m getting close to an operational version of this last software. It’s very cool as before, this time though my end is getting cleaner. Still not happy tho.

    Since you made the post titled ‘faster’, one suggestion for speed again would be to remove the nested functions. I’m not sure about the inner workings of R but as an interpretive rather than compiled language, there probably isn’t any logic to remove function definitions from a loop or function, therefore on each call your functions are probably re-defined.

    • Well,

      http://surfacestations.dabbledb.com/publish/surfacestations

      Started just a little database for metadata. Things I need to add:

      1. Brightness from GISS special inventory file.

      2. Population densities from GPW.

      3. More stuff.

      Hopefully others will see fit to post metadata files and I can just upload them. otherwise I will be forced to learn R and buy a new system. mac laptop sucks for this.

      Also, see the test on clearclimate code. Changing the smoothing to 250km really hit the trend on GISS.

      Now to recall hansen87 set the smoothing default at 1200km based on the fact that mid to high latitide correlations station to station averaged .5

      but the spread of correlation sucked. to use a precise term. in the southern altitude the correlation dropped to 33%.

      The coastal stations are especially tricky in northern latitudes.. see the giss versus cru treatment of the polar region.

      With coastal stations you have two types ( at least) those that have no ice by land ever. those that have ice by land all the time. and those that have ice by land some of the time. opps thats three. Tilo did a post about this on WUWT.
      Somebody could do a unique peice of work that would blend the work of cru with that of GISS

      • RomanM

        Re: steven mosher ,

        Good one, steven! I have been trying to find a good way of collecting GHCN metadata and this appears to help solve the problem. I will try to use it. Thanks!

        The 1200km. rule of Hansen has always bothered me as pretty extreme since it encompasses too many possible geographic environments simultaneously. What is needed is a fresh view of the whole situation. Having reliable descriptive information is a good start.

    • RomanM

      Re: Jeff Id ,

      The reason that I have “nested” the two functions inside the main function is purely for the convenience of having a single function one can load to do the job. If you wish to do it differently, simply cut the two internal functions out of there and run them separately first – the pieces come apart easily.

      From my understanding of R “environments” (and this could be wrong), I would guess that the way I have structured it would create a single instance of the two which would exist as long as the external wrapper function was running – in the actual flow of the script, the two definitions would each be encountered exactly once. This would not be the case if, for example, psx.inv was defined inside of calcx.offset, a situation I intentionally avoided.

      Of course, the difference between compiled and interpreted programs would exist whether the two functions were external or nested anyway, so it appears to me that that particular issue would be somewhat of a red herring.

      • I’ll quit bugging you about it.

        With each call of the main function, I would expect R to de-allocate and then re-allocate the functions. That is just my guess from a lot of programming and is not necessarily true. Since I’m still doing gridded data, this happens a couple thousand times, one call per grid. It’s not going to be a big speed difference anyway. Of course I’ll break them out, – already did.

        As far as compiled, I would expect the compiler to separate the functions automatically. There are a lot of smart functions which go on in a compiled program during build that would add time to a script style run. Not that it’s impossible.

      • RomanM

        Ok, I see what you are getting at. The main program will be called several thousand times and yes, the function will be redefined each time. This may add a second to the runtime. It is the cost of keeping down the clutter. ;)

  4. Ya roman the nice thing is you have your metadata in one place. You can just create a “view” of the data
    ( Say rural=true & coastal = true) and then that selection will be visible for anybody else to check or use. Plus it’s really handy to just have all the data in front of you to look at. Also, you can make it so that the location data when clicked will bring up the google map.

    Steve Mc has my email as does Jeff id. Send me mail I’ll add you as a user.

  5. Pingback: Anomaly Aversion « the Air Vent

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s