"I went to a statistician fight and a hockey stick broke out"
Part 2, DeepClimate throws McShane and Wyner into the penalty box: “This is a deeply flawed study”
Last week, the anti-science crowd were touting a couple of statisticians who had launched a slashing cross-check on a small portion of the scientific research supporting our scientific understanding that current warming is very likely unprecedented in the last thousand years.
As I discussed here, the McShane and Wyner analysis actually produced a Hockey Stick — and even as the climate science community ducked the blow, the Medieval Warm Period got hit in the head (see also Deltoid, who spruced up their graph):
But wait, the anti-science disinformers say, you left out the part of their analysis where they call into question all such graphs. And that was because I am on vacation and was waiting for the refs who I knew were busy reviewing the tapes before making their penalty call. In short, I knew that part of the analysis was deeply flawed.
I have been told that when McShane and Wyner is actually published it will be accompanied by several commentaries. I am confident they will identify rather significant shortcomings in the paper. So you may surmise that one reason you haven’t seen more definitive debunkings to date is that some people are holding off until those commentaries are published.
But in the meantime, you can read an evisceration of their analysis by Deep Climate — a terrific climate science blogger known for uncovering details of just how some of the most fraudulent charges against Mann and the Hockey Stick graph were trumped up by the anti-science crowd in the first place.
I reprint his post, “McShane and Wyner 2010,” in its entirety below, but you’ll also want to read the comments on his blog. Here is his conclusion:
McShane and Wyner’s background exposition of the scientific history of the “hockey stick” relies excessively on “grey” literature and is replete with errors, some of which appear to have been introduced through a misreading of secondary sources, without direct consultation of the cited sources. And the authors’ claims concerning the performance of “null” proxies are clearly contradicted by findings in two key studies cited at length, Mann et al 2008 and Ammann and Wahl 2007. These contradictions are not even mentioned, let alone explained, by the authors. In short, this is a deeply flawed study and if it were to be published as anything resembling the draft I have examined, that would certainly raise troubling questions about the peer review process at the Annals of Applied Statistics.
Note: Everything that follows is from Deep Climate. For ease of readability, I am not indenting this. The second half gets pretty deep into statistics.
Over at ClimateAudit and WUWT they’ve broken out the champagne and are celebrating (once again) the demise, nay, the shattering into 1209 tiny splinters, of the Mann et al “hockey stick” graph, both the 1998 and 2008 editions. The occasion of all the rejoicing is a new paper by statisticians Blakeley McShane and Abraham Wyner, entitled A Statistical Analysis of Multiple Temperature Proxies: Are Reconstructions of Surface Temperatures Over the Last 1000 Years Reliable? [PDF]. The paper, in press at the Annals of Applied Statistics, purports to demonstrate that randomly generated proxies of various kinds can produce temperature “reconstructions” that perform on validation tests as well as, or even better than, the actual proxies.
My discussion of McShane and Wyner is divided into two parts. First, I’ll look at the opening background sections. Here we’ll see that the authors have framed the issue in surprisingly political terms, citing a number of popular references not normally found in serious peer-reviewed literature. Similarly, the review of the “scientific literature” relies inordinately on grey literature such as Steve McIntyre and Ross McKitrick’s two Energy and Environment articles and the (non peer-reviewed) Wegman report. Even worse, that review contains numerous substantive errors, some of which appear to have been introduced by a failure to consult cited sources directly, notably in a discussion of a key quote from Edward Wegman himself.
With regard to the technical analysis, I have assumed that McShane and Wyner’s applications of statistical tests and calculations are sound. However, here too, there are numerous problems. The authors’ analysis of the performance of various randomly generated “pseudo proxies” is based on several questionable methodological choices. Not only that, but a close examination of the results shows clear contradictions with the findings in the key reconstruction studies cited. Yet the authors have not even mentioned these contradictions, let alone explained them.
In a late-breaking development, Blakeley McShane’s website advises that the paper has been accepted, but that the draft available at the Annals of Applied Statistics (AOAS) is not final. However, the available document does acknowledge the input of two anonymous reviewers as well as that of the AOAS editor Michael Stein (whose purview includes “physical science, computation, engineering, and the environment”). Thus, we can expect the final version to be reasonably close, but not identical, to the one available at the AOAS website. With those caveats in mind, let’s take a closer look.
The AOAS guidelines call for authors to introduce their topic in “as non-technical a manner as possible”. After a brief introduction of paleoclimatology concepts, McShane and Wyner have taken this directive to heart and framed the issue in terms not unlike those found on libertarian websites:
On the other hand, the effort of world governments to pass legislation to cut carbon to pre-industrial levels cannot proceed without the consent of the governed and historical reconstructions from paleoclimatological models have indeed proven persuasive and effective at winning the hearts and minds of the populace. … [G]raphs like those in Figures 1, 2, and 3 are featured prominently not only in official documents like the IPCC report but also in widely viewed television programs (BBC, September 14, 2008), in film (Gore, 2006), and in museum expositions (Rothstein, October 17, 2008), alarming both the populace and policy makers.
After a passing reference to three Wall Street Journal accounts of “climategate”, McShane and Wyner start in on the history of the controversy “as it unfolded in the academic and scientific literature.” With a backing citation of all three McIntyre and McKitrick papers (an obvious warning sign), the authors state categorically:
M&M observed that the original Mann et al. (1998) study … used only one principal component of the proxy record.
This is nonsense – the famous PC1 was the leading principal component of one proxy sub-network (North American tree rings) for one period of time (the 1400 step that represents the start of the original MBH98 reconstruction). And even for that sub-network, two PCs were used, not one.
Then we learn that this single principal component was a result of a “skew”-centred principal component analysis that “guarantees the shape” of the reconstruction. McShane and Wyner continue:
M&M made a further contribution by applying the Mann et al. (1998) reconstruction methodology to principal components computed in the standard fashion. The resulting reconstruction showed a rise in temperature in the medieval period, thus eliminating the hockey stick shape.
In fact, the alternative reconstruction from M&M started in 1400 and showed a clearly spurious spike in that century (normally considered well after the medieval period).
Mann and his colleagues vigorously responded to M&M to justify the hockey stick (Mann et al., 2004). They argued that one should not limit oneself to a single principal component as in Mann et al. (1998), but, rather, one should select the number of retained principal components through crossvalidation on two blocks of heldout instrumental temperature records (i.e., the first fifty years of the instrumental period and the last fifty years). When this procedure is followed, four principal components are retained, and the hockey stick re-emerges even when the PCs are calculated in the standard fashion.
Mann et al (2004) is the Corrigendum to MBH98 which fixed some data listing errors (without affecting the actual data or findings). But there was no reference to a changed PCA methodology; there could not have been as the Corrigendum was issued in March 2004, while the differing centering conventions were only identified much later that year! But there was a further explanation of the original PCA methodology, whereby the number of PCs retained for each proxy sub-network at each “step” interval was based on objective criteria combining “modified Preisendorfer Rule N and scree test”. (In fact, Mann’s methodology involved rebuilding the network with fewer and fewer proxies as one goes back, requiring recomputation of PCA for each large sub-network at each interval).
The account then skips ahead to the Barton-Whitfield investigation of Mann and his co-authors, followed by the Wegman report:
Their Congressional report (Wegman et al., 2006) confirmed M&M’s finding regarding skew-centered principal components (this finding was yet again confirmed by the National Research Council (NRC, 2006)).
Actually, the NRC report preceded Wegman by three months and was much more comprehensive, rendering the Wegman report superfluous.
But the biggest shocker is Wegman’s supposed excoriation of Mann et al 2004 for “adding principal components” after the “spurious results” of the previous de-centered method were revealed. In support of this assertion, the authors quote from Wegman’s supplementary congressional testimony:
In the MBH original, the hockey stick emerged in PC1 from the bristlecone/foxtail pines. If one centers the data properly the hockey stick does not emerge until PC4. Thus, a substantial change in strategy is required in the MBH reconstruction in order to achieve the hockey stick, a strategy which was specifically eschewed in MBH … a cardinal rule of statistical inference is that the method of analysis must be decided before looking at the data. The rules and strategy of analysis cannot be changed in order to obtain the desired result. Such a strategy carries no statistical integrity and cannot be used as a basis for drawing sound inferential conclusions.
If I’ve learned one thing in following McIntyre and his acolyte auditors, it’s to always check the ellipsis. Space does not permit showing the full passage from Wegman’s testimony (given in reply to written supplementary question from Rep. Bart Stupak). But even a little of the omitted text is very revealing:
… in the MBH original, the hockey stick emerged in PC1 from the bristlecone/foxtail pines. If one centers the data properly the hockey stick does not emerge until PC4. Thus, a substantial change in strategy is required in the MBH reconstruction in order to achieve the hockey stick, a strategy which was specifically eschewed in MBH. In Wahl and Ammann’s own words, the centering does significantly affect the results.
Ans: Yes, we were aware of the Wahl and Ammann simulation … Wahl and Ammann reject this criticism of MM based on the fact that if one adds enough principal components back into the proxy, one obtains the hockey stick shape again. This is precisely the point of contention. It is a point we made in our testimony and that Wahl and Ammann make as well. A cardinal rule of statistical inference is that the method of analysis must be decided before looking at the data. The rules and strategy of analysis cannot be changed in order to obtain the desired result. Such a strategy carries no statistical integrity and cannot be used as a basis for drawing sound inferential conclusions.
So this passage has nothing whatsoever to do with the Mann corrigendum, but rather is a discussion of a subsequent paper by Wahl and Ammann. At the same time it reveals some pretty shocking sleight-of-hand by Wegman himself, even leaving aside the snarky dismissal of a paper as “unpublished” when in fact it was peer-reviewed and had been in press for almost four months at the time of Wegman’s response.
Here Wegman is attempting to claim that Wahl and Ammann acknowledge that the differing number of principal components is itself a “change in strategy”. But this is a gross misrepresentation of Wahl and Ammann’s point, which was that an objective criterion is required to determine the number of PCs to be retained and that this number will vary by sub-network and period, as well as by centering convention. M&M arbitrarily selected only two because that’s what Mann had done at that particular step and network. They failed to implement Mann’s criterion (as noted previously), or indeed any criterion, and thus produced a deeply flawed reconstruction.
Wahl and Ammann demonstrated that when an objective and reasonable criterion for PC retention is used, a validated reconstruction very similar to the original results. That Mann used such a criterion in the original MBH98 is obvious from an examination of the various PC networks engendered from the NOAMER tree-ring network, as the number of retained PCs varied greatly.
Wahl and Ammann also pointed out that a differing number of PCs should be retained even if centred PCA were used; if the proxies were standardized, Wahl and Ammann’s simple objective criterion would still call for two PCs, rather than five.
None of this should be construed as an endorsement of Mann’s original “decentered” PCA, which was criticized by statisticians. But it does demonstrate that its effect on the final reconstruction was minimal.
I’ll return to the exclusion of Wahl and Ammann from consideration in the Wegman report another time. For now, I’ll merely note that the situation was the reverse of that claimed by Wegman, and by extension McShane and Wyner; clearly, it was M&M who had no “strategy” for retention of PCs.
I’ll turn now to undoubtedly the most controversial part of the paper, namely the analysis of “null” randomly generated pseudo-proxies used to assess the significance of reconstruction. From the abstract:
In this paper, we assess the reliability of such reconstructions and their statistical significance against various null models. We find that the proxies do not predict temperature significantly better than random series generated independently of temperature.
This section, like the subsequent reconstruction of section 4, is based on the same set of 1209 proxies as the landmark Mann et al 2008 PNAS study. That study used two methodologies that should be kept in mind:
- CPS (composite-plus-scale), based on regression of screened proxies against local grid-cell instrumental temperature.
- EIV (Errors-in-Variables), based on the same set of proxies (screened or not) and temperature series, but taking into account wider spatio-temporal correlations (so-called “tele-connections”). EIV is a variation on the RegEM methodology and can be likened to a PCA approach.
As McShane and Wyner explain, temperature reconstructions are typically evaluated by checking a “held back” window within the instrumental-proxy overlap period, which in the case of Mann et al was from 1850-1995. In Mann et al 2008, two mini-reconstructions are performed, an “early window” and a “late window”. For example, the “early window” validation attempts to “reconstruct” temperature from 1850-1895, based on a calibration of proxies to temperature from 1896-1995.
The fidelity of the “window” reconstruction, as measured by RE (reduction of error), is compared to the corresponding reconstruction obtained using a “null” set of proxies. In the case of Mann et al, the “null” proxy sets consisted of randomly-generated AR1 “red noise” series with an autocorrelation factor of 0.4.
Of course, there is only one real proxy set, but many “null” proxy sets are generated so as to create a distribution of RE statistics, permitting the setting of a 95% or 90% significance level. (The details are not so important, as long as one keeps in mind the basic idea of comparing a real-proxy reconstruction to an ensemble created from random “null” proxies.)
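The benchmarking idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure Mann et al actually ran: each “null” reconstruction here is a one-variable least-squares fit to a single AR1 series rather than a full multi-proxy CPS or EIV reconstruction, the temperature series is synthetic, and the function names are made up for this example.

```python
import random
import statistics

def ar1_series(n, phi, rng):
    """Generate n values of AR(1) red noise: x[t] = phi * x[t-1] + e[t]."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x

def fit_line(xs, ys):
    """Ordinary least squares fit of ys = a + b * xs; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def reduction_of_error(obs, pred, calib_mean):
    """RE = 1 - SSE(prediction) / SSE(calibration-period mean)."""
    sse_pred = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sse_ref = sum((o - calib_mean) ** 2 for o in obs)
    return 1.0 - sse_pred / sse_ref

def null_re_threshold(temps, n_verif, phi=0.4, n_sims=1000, level=0.95, seed=0):
    """RE score that only (1 - level) of the AR(1) null proxies exceed."""
    rng = random.Random(seed)
    # "Early window": hold back the first n_verif values, calibrate on the rest.
    verif, calib = temps[:n_verif], temps[n_verif:]
    calib_mean = statistics.fmean(calib)
    scores = []
    for _ in range(n_sims):
        proxy = ar1_series(len(temps), phi, rng)
        a, b = fit_line(proxy[n_verif:], calib)      # calibrate
        pred = [a + b * p for p in proxy[:n_verif]]  # "reconstruct" the window
        scores.append(reduction_of_error(verif, pred, calib_mean))
    scores.sort()
    return scores[round(level * n_sims) - 1]

# Synthetic stand-in for an 1850-1995 instrumental series (146 values):
# a gentle warming trend plus noise. This is NOT real data.
demo_rng = random.Random(1)
temps = [0.005 * t + demo_rng.gauss(0.0, 0.1) for t in range(146)]
threshold = null_re_threshold(temps, n_verif=46, n_sims=500)
print(f"95% null RE threshold: {threshold:.2f}")
```

In this setup a real-proxy reconstruction would count as significant at the 95% level only if its verification RE exceeded the printed threshold.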
McShane and Wyner set out to evaluate the performance of the Mann et al 2008 proxy set against a number of different “null” proxy types. In doing so, they employed a similar “validation window”, but with a number of methodological differences:
- The simpler Lasso L1 multivariate methodology was used, instead of the Mann et al screening/reconstruction methodologies.
- The 46-year validation window was shortened to 30 years.
- The instrumental calibration period was expanded to end in 1998.
- A sliding interpolated series of windows was used, instead of the two fixed “early” and “late” windows (which then become the two extreme points of a range of verification windows).
- Finally, the hemispheric average temperature series was used for calibration, rather than the gridded temperature.
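To make the sliding-window scheme concrete, here is a minimal sketch that enumerates the holdout blocks. It takes the 1850-1998 calibration period and 30-year blocks from the list above, and further assumes that the block advances one year at a time, which is a guess at the step size rather than something stated here. The function name holdout_blocks is hypothetical.

```python
def holdout_blocks(first_year, last_year, block_len):
    """Yield (holdout_years, calibration_years) for each contiguous block."""
    years = list(range(first_year, last_year + 1))
    for start in range(len(years) - block_len + 1):
        holdout = years[start:start + block_len]
        # Everything outside the held-out block is used for calibration.
        calibration = years[:start] + years[start + block_len:]
        yield holdout, calibration

blocks = list(holdout_blocks(1850, 1998, 30))

# The two extreme blocks play the role of the fixed "early" and "late" windows.
early_holdout, _ = blocks[0]
late_holdout, _ = blocks[-1]
print(len(blocks), early_holdout[0], early_holdout[-1],
      late_holdout[0], late_holdout[-1])
```

Under these assumptions only the first and last blocks require pure extrapolation beyond the calibration years; every interior block is bracketed by calibration data on both sides.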
Each of these is worth discussing, but first let’s look at the results. The following chart shows the spread of the performance of the real proxies and the “null” proxies over the range of validation windows. Figure 9 shows the RMSE (root mean square error) for each proxy type. (By the way, the NRC Report Box 9.1 is a good summary of the RMSE and RE measures discussed here).
According to this chart, none of the “null” proxies are significantly worse than the real proxies (lower RMSE representing a better fit in the verification window). And some even perform better overall (for example AR1 Empirical).
Let’s look a little more closely at the AR1(.4) “null” proxy, which was also used in Mann et al 2008.
Earlier, the use of the first and last blocks had been characterized as open to “data snooping” abuse:
A final serious problem with validating on only the front and back blocks is that the extreme characteristics of these blocks are widely known; it can only be speculated as to what extent the collection, scaling, and processing of the proxy data as well as modeling choices have been affected by this knowledge.
Now, McShane and Wyner make much of the fact that the real proxy set (in red) only performs better than AR1 (in black) in some early and late blocks.
Hence, the proxies perform statistically significantly better on very few holdout blocks, particularly those near the beginning of the series and those near the end. This is a curious fact because the “front” holdout block and the “back” holdout block are the only two which climate scientists use to validate their models. Insofar as this front and back performance is anomalous, they may be overconfident in their results.
But not so fast. The proper comparison is really with the very first and very last blocks – the ones actually used in climate studies. And those two blocks tell a very different story.
In fact, the very first block (“early window”) and the very last block (“late window”) actually show almost no difference between the two, with Mann et al’s real proxy data set firmly in the middle of the distribution of the “null” AR1 proxy performance spread. The fact that nearby “interpolated” windows show better performance is irrelevant, as these are not actually used in real climate studies.
So that begs the question – what was the performance actually shown for these two verification windows by Mann et al? After all, exactly the same real proxies and “null” proxy type were used. Here is Fig. 1C:
Both the early and late validation windows look very good, at least when based on the full proxy set (plotted in red)!
Let’s take a close-up view of the RE (reduction-of-error) statistics (in general, lower RMSE results in higher RE):
At almost 0.9, the RE score is well above the 95% significance level, which is only 0.4 for the “null” proxies. Recall the definition of RE (courtesy of the NRC):
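$$\mathrm{RE} = 1 - \frac{\mathrm{MSE}(\hat{y})}{\mathrm{MSE}(\bar{y})}$$
where $\mathrm{MSE}(\bar{y})$ is the mean squared error of using the sample average temperature over the calibration period (a constant, $\bar{y}$) to predict temperatures during the period of interest …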
For some reason, McShane and Wyner did not report the widely used RE statistic. But clearly, it is very unlikely that such similar RMSE scores could result in such widely divergent RE scores.
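The arithmetic behind this point is easy to check. RE is one minus the ratio of the reconstruction’s mean squared error to that of the calibration-mean benchmark, so two reconstructions scored on the same verification window share the same benchmark and their RE scores fix the ratio of their RMSEs. The sketch below plugs in the approximate values quoted above (RE of about 0.9 for the real proxies, 0.4 for the “null” threshold); the function names are illustrative only.

```python
import math

def re_from_rmse(rmse, rmse_ref):
    """RE = 1 - MSE / MSE_ref, written in terms of root mean square errors."""
    return 1.0 - (rmse / rmse_ref) ** 2

def rmse_ratio(re_a, re_b):
    """RMSE_b / RMSE_a for two reconstructions scored on the same window."""
    return math.sqrt((1.0 - re_b) / (1.0 - re_a))

ratio = rmse_ratio(0.9, 0.4)
print(f"RMSE ratio implied by RE 0.9 vs RE 0.4: {ratio:.2f}")  # about 2.45
```

In other words, on a given window an RE near 0.9 and an RE of 0.4 would correspond to RMSEs differing by a factor of roughly two and a half, not to nearly identical RMSEs.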
So one should look elsewhere for an explanation of this stunningly wide discrepancy. Perhaps the Lasso methodology results in inappropriate screening or overfitting. And the thirty year window might also help the cause of the “null” proxies. Indeed, it is curious that this window has been shortened, given the authors’ complaints about already short verification periods.
Whatever the reason, it should be crystal clear that McShane and Wyner’s simulation has failed to capture the actual behaviour of the study that inspired so much of their work. And the failure to actually cite the comparable results from Mann et al is puzzling indeed.
Now, I’ll turn to the “empirical AR1” proxy, claimed to outperform the real proxies. Ammann and Wahl 2007 noted that this proxy is problematic:
To generate “random” noise series, MM05c apply the full autoregressive structure of the real world proxy series. In this way, they in fact train their stochastic engine with significant (if not dominant) low frequency climate signal rather than purely non-climatic noise and its persistence.
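A toy example shows why fitting the noise model to the proxies themselves matters. Below, the lag-1 autocorrelation is estimated for pure white noise and for the same noise with a slow trend added as a stand-in for low-frequency climate signal. The trend size, series length and function name are arbitrary choices for illustration, not values taken from MM05c or from Ammann and Wahl.

```python
import random
import statistics

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation coefficient of a series."""
    mean = statistics.fmean(series)
    dev = [x - mean for x in series]
    num = sum(a * b for a, b in zip(dev[:-1], dev[1:]))
    den = sum(d * d for d in dev)
    return num / den

rng = random.Random(42)
n = 500

# Pure white noise: no persistence at all by construction.
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

# The same noise plus a slow linear trend (hypothetical "climate signal").
with_signal = [e + 5.0 * t / (n - 1) for t, e in enumerate(noise)]

print(f"lag-1, noise only:     {lag1_autocorrelation(noise):.2f}")
print(f"lag-1, noise + signal: {lag1_autocorrelation(with_signal):.2f}")
```

The series containing the trend yields a much larger estimated lag-1 coefficient, so a “null” generator trained on it inherits persistence that really comes from the signal rather than from non-climatic noise.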
McShane and Wyner attempt to rebut this, in a way that some may find unconvincing:
The proxy record has to be evaluated in terms of its innate ability to reconstruct historical temperatures (i.e., as opposed to its ability to “mimic” the local time dependence structure of the temperature series). Ammann and Wahl (2007) wrongly attribute reconstructive skill to the proxy record which is in fact attributable to the temperature record itself.
However, Ammann and Wahl also state that, despite these problems, the real proxies hold up quite well:
Furthermore, the MM05c proxy-based threshold analysis only evaluates the verification-period RE scores, ignoring the associated calibration-period performance. However, any successful real-world verification should always be based on the presumption that the associated calibration has been meaningful as well (in this context defined as RE >0), and that the verification skill is generally not greatly larger than the skill in calibration. When the verification threshold analysis is modified to include this real-world screening and generalized to include all proxies in each of the MBH reconstruction segments – even under the overly-conservative conditions discussed above – previous MBH/WA results can still be regarded as valid, contrary to MM05c. Ten of the eleven MBH98 reconstruction segments emulated in WA are significant above the 95% level (probability of Type I error below 5%) when using a conservative minimum calibration/verification RE ratio of 0.75, i.e. accepting poorer relative calibration performance than the lowest seen in the WA reconstructions (0.82 for the MBH 1400-network).
Again, this real-world result flies in the face of McShane and Wyner’s findings. And the omission of any reference to the contradictory results in Ammann and Wahl, let alone rebuttal thereof, is especially curious, as the above paragraph follows immediately after the passage discussed in detail by McShane and Wyner.
Frankly, after this exhaustive (and exhausting) examination of Section 3, I’m sure readers will understand (and even be grateful) if I do not enter a detailed discussion of Section 4, which contains an actual temperature reconstruction. I’ll simply note that the McShane and Wyner millennial reconstruction has a pronounced hockey stick shape, albeit with a higher Medieval Warm Period and wider error bars than the norm seen in various spaghetti graphs (apparently attributable to the Bayesian “path” approach). Here too, there are several questionable methodological choices, including a simplistic principal component approach that is almost certainly overfitting (for example, 10 principal components to represent 90-plus proxies seems excessive on its face).
So there you have it. McShane and Wyner’s background exposition of the scientific history of the “hockey stick” relies excessively on “grey” literature and is replete with errors, some of which appear to have been introduced through a misreading of secondary sources, without direct consultation of the cited sources. And the authors’ claims concerning the performance of “null” proxies are clearly contradicted by findings in two key studies cited at length, Mann et al 2008 and Ammann and Wahl 2007. These contradictions are not even mentioned, let alone explained, by the authors.
In short, this is a deeply flawed study and if it were to be published as anything resembling the draft I have examined, that would certainly raise troubling questions about the peer review process at the Annals of Applied Statistics.
Reference list to come.