Some days ago, the UK Met Office released a subset of the HadCrut3 data set at this web page. I thought that it might be useful for people to be able to download the data and read it into the R statistical package for further analysis. Along the way, I discovered, like Harry, that not everything is quite as advertised, so hopefully my observations on the differences will also prove useful.
The released data is available as a Windows zipfile. In the set, each station is available as a single text file (with no extension). The files are contained in 85 directories (at this point in time) whose two-digit names are the first two digits of the station numbers of the files in that directory. This makes it easy to locate a station, but adds a bit of complication to reading the entire set at once. However, it turns out that R can actually do it pretty easily :). One word of warning: when I downloaded the zipfile yesterday, I found that new stations had been added to the set without any announcement on the web page that the set had been altered. In the future, it is a good idea to keep an eye on the size of the file to determine whether it is being altered. Note to Met Office: it would be nice to have a date attached to the file indicating the most recent version.
I tried to download the file and unzip it from within R, but there were problems, so the recommended action is to download it manually and unzip it to a new directory which contains no other files. Otherwise, the R script which I will provide will not work properly.
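As a rough illustration of how the two-digit directory layout can be walked in one go, here is a minimal sketch. It is not taken from my script: the function name `list_station_files` is made up for this example, and it assumes that each station file is named with its all-digit station number and sits in the directory named after the first two digits of that number.

```r
# Hypothetical helper (not part of the script described in this post).
# Assumes station files have all-digit names and live in a directory
# whose name equals the first two digits of the file name.
list_station_files <- function(root) {
  # recursive = TRUE descends into the two-digit subdirectories
  files <- list.files(root, recursive = TRUE, full.names = TRUE)
  station <- basename(files)
  folder <- basename(dirname(files))
  # drop anything that does not look like a station file in its own folder
  keep <- grepl("^[0-9]+$", station) & substr(station, 1, 2) == folder
  # return full paths, named by station number
  stats::setNames(files[keep], station[keep])
}
```

Filtering on the name pattern is only a safety net; unzipping into an otherwise empty directory, as recommended above, is still the simplest way to avoid surprises.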
Some other things to be aware of:
- The format of each station file is described in great detail on the Met Office web page linked above. Each file is supposed to start with 21 lines of information about the data, followed by the temperatures (not anomalies) in an array whose rows represent years and whose columns represent months. Well… not exactly. In fact, at the moment, the number of information lines varies from 13 to 23, so reading the information requires a slightly different approach. Again, the features and functions of R proved useful.
- The missing value designation is given as -99. That is correct for the temperatures. However, the missing value code for the station height (altitude) appears to be -999, which is not indicated on the web page.
- The longitude value for the station is backwards from what I have usually seen elsewhere. Negative values indicate East of Greenwich and positive values West. This was evident from a simple plot of the station locations. Not exactly wrong, but possibly misleading in an analysis.
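The three quirks listed above can each be handled in a few lines of R. The sketch below is illustrative only and is not the code in my script: the function names are invented for this example, and it assumes that every data row starts with a four-digit year followed by twelve monthly values, while information lines do not.

```r
# Hypothetical helpers, written for illustration only.

# Count the information lines by locating the first row that looks like data.
# Assumption: a data row is a four-digit year followed by 12 numeric values.
count_info_lines <- function(lines) {
  is_data <- grepl("^\\s*[0-9]{4}(\\s+-?[0-9.]+){12}\\s*$", lines, perl = TRUE)
  if (!any(is_data)) return(length(lines))
  which(is_data)[1] - 1L
}

# Read the temperature block into a year-by-month matrix.
# Assumes the file contains at least one data row.
read_station_temps <- function(lines) {
  n_info <- count_info_lines(lines)
  dat <- read.table(text = lines[seq_along(lines) > n_info],
                    header = FALSE, col.names = c("year", month.abb))
  temps <- as.matrix(dat[, -1])
  temps[temps == -99] <- NA          # -99 marks a missing temperature
  rownames(temps) <- dat$year
  temps
}

# Recode the station metadata quirks.
fix_station_meta <- function(long, height) {
  list(long_east_positive = -long,                        # flip the sign convention
       height = ifelse(height == -999, NA, height))       # -999 marks a missing height
}
```

Detecting the data rows by their shape, rather than skipping a fixed 21 lines, is what allows files with anywhere from 13 to 23 information lines to be read by the same code.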
There may be other things I have missed; however, what I have done seems to work reasonably well. My script contains the following functions:
- find.obs: given a set of file names, read the descriptive information to determine what is station info and what is data. Output is the number of “info” lines in each file.
- met.info: given a set of file names output from the previous function, read the descriptive information. Output includes a list of all station information plus the monthly temperature “normals” and monthly “standard deviations”. These come from various sources (indicated in the station information) and may not match the averages calculated in the reference time period.
- met.dat: given the same information as for met.info, read the data exactly in the format in the file.
- monthly.calc: calculate the “normals” for a given reference period
- anom.calc: calculate anomalies either using a pre-specified vector of normals or from scratch for a given time period. The output is a list of time series.
- annual.calc: calculate annual station means for a specified set of stations (set up for temperatures, not anomalies). Output is a matrix of time series.
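To make the idea behind the last three functions concrete, here is a simplified sketch of the same kinds of calculation for a single station. These are not the functions from my script: the names below are invented, the 1961 to 1990 default reference period is an assumption of this example, and the input is assumed to be a year-by-month matrix whose row names are consecutive years.

```r
# Hypothetical single-station versions, for illustration only.
# `temps` is a matrix with one row per year (rownames = years, assumed
# consecutive) and 12 monthly columns, with NA for missing values.

# Monthly "normals": the mean of each month over the reference period.
monthly_normals <- function(temps, start = 1961, end = 1990) {
  years <- as.integer(rownames(temps))
  in_ref <- years >= start & years <= end
  colMeans(temps[in_ref, , drop = FALSE], na.rm = TRUE)
}

# Anomalies: subtract the normals (supplied or computed from scratch)
# and return a monthly time series.
station_anomalies <- function(temps, normals = NULL, start = 1961, end = 1990) {
  if (is.null(normals)) normals <- monthly_normals(temps, start, end)
  anom <- sweep(temps, 2, normals, "-")
  years <- as.integer(rownames(temps))
  # t() puts the months in calendar order before flattening
  ts(as.vector(t(anom)), start = c(min(years), 1), frequency = 12)
}

# Annual means of temperatures; a year with any missing month gives NA.
annual_means <- function(temps) {
  years <- as.integer(rownames(temps))
  ts(rowMeans(temps), start = min(years), frequency = 1)
}
```

Leaving a year as NA when any month is missing is a deliberately conservative choice for this sketch, since averaging the remaining months of raw temperatures would bias the annual value toward whichever season is present.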
I hope there are no bugs in the script, which can be found here. Enjoy, Kenneth!