
Burning Down the House

Nowhere is the challenge of getting people to understand how to use data better illustrated than in the methodology wars being fought in the discipline of psychology. If you haven’t heard of the methodology wars, be assured that the battlefields – studies in psychological research – are being fought over like blocks of Stalingrad; and, like that famous battle, not much is left standing in the aftermath.

I’m not sure exactly how the methodology wars started. Somehow, somewhere, someone decided to actually re-test a “classic” study in psychology. A study that’s been accepted into the core of the discipline – that established somebody’s reputation, made somebody a career. Only it didn’t replicate. They re-did the experiment as carefully as they could and it didn’t show the same result. Didn’t, usually, show any result at all. Increase the sample size to fix the problem and the signal becomes even clearer. Alas, the signal always seems to be that there is no signal.

Pretty soon people started calling into question – and re-testing – nearly every psych study done over the last fifty years. And many – and I mean many – have failed.

Slate’s article (and it’s really good – giving a great overview of the issue – so give it a read) recounts the latest block to burn down in psychology’s methodology wars. The research in question centered on the idea that our facial states feed back into our emotions. If we smile (even inadvertently), we will feel happier.

It’s an interesting idea – intuitively plausible – and apparently widely supported by a huge variety of studies in the field. It’s an idea which strikes me as perfectly reasonable and in which I have zero vested interest one way or the other.

But when it was submitted to rigorous simultaneous validation in a number of different labs, it failed. Completely.

The original test involved 32 participants and a change in average subjective scoring between the two groups of 4.4 to 5.5. That means each group had sixteen participants. The improved second test had 92 total participants and showed a scoring difference of 4.3 to 5.1.

That’s a pretty small sample.

Especially for something that became received wisdom. A classic.

So how would people react if it turned out to be wrong?

Well, the Slate article answers that question pretty definitively. Because when the multi-lab tests came back, here’s what happened. Seventeen different labs replicated the experiment with nearly 2000 subjects. In half the participating labs, participants who smiled recorded a slightly higher average on the resulting happiness test (though the difference was much smaller than in the original experiment). In the other half, it went the other way.

Net, net, there was no correlation at all. Zero.

Okay, so far you have just another sad story of a small sample size failure.

That’s not what really attracted my attention. Nope. What really made me laugh in utter disbelief was the comment of the “scientist” who had done the original research. Here it is, and I quote in full lest you think I’m about to exaggerate:

“Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.

Two years ago, while the replication of his work was underway, Strack wrote a takedown of the skeptics’ project with the social psychologist Wolfgang Stroebe. Their piece, called “The Alleged Crisis and the Illusion of Exact Replication,” argued that efforts like the RRR reflect an “epistemological misunderstanding,” since it’s impossible to make a perfect copy of an old experiment. People change, times change, and cultures change, they said. No social psychologist ever steps in the same river twice. Even if a study could be reproduced, they added, a negative result wouldn’t be that interesting, because it wouldn’t explain why the replication didn’t work.

So when Strack looks at the recent data he sees not a total failure but a set of mixed results. Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed? Maybe there’s a reason why half the labs could not elicit the effect.”

[Bolding is mine]

So here’s a “scientist” who, despite presumably being familiar with the extensive literature on statistics and the methodology wars, somehow believes that because half the labs reported a slightly higher average the key thing to look at is why the other half didn’t. Apparently, the only thing that would satisfy him is if all the labs had reported exactly the opposite result. Which, presumably, would result in a new classic paper showing that frowning makes you happier!

Ring, Ring!

Clue Phone.

It’s random variation calling for you, Dr. Strack!

’Cause here’s the thing… you’d expect about half the labs to show a positive result when there is no correlation at all. If far fewer than half had reported a positive result, the correlation would pretty much have to be negative, right?

This isn’t, as he appears to believe, half-corroboration. It’s the way every null result ever found actually looks out here in the real world. I’d advise him to try flipping a coin 100 times, repeatedly, and see how often it comes out exactly 50 heads and 50 tails. He might be surprised to learn that about half the time this test will yield more heads than tails. That does not mean that heads is more likely than tails, and it does not suggest that researchers should focus on why some trials yielded more heads and other trials yielded more tails.
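In case anyone wants to actually run that exercise, here’s a minimal sketch in Python (the number of trials and flips are just illustrative choices on my part): simulate many runs of 100 fair coin flips and count how often heads outnumber tails, tails outnumber heads, or the two tie exactly.

```python
# A minimal sketch of the coin-flip exercise described above: repeat runs of
# 100 fair flips and see how often heads outnumber tails. The trial counts
# here are arbitrary illustrative choices.
import random

def run_trials(num_trials=10_000, flips_per_trial=100):
    more_heads = more_tails = exact_tie = 0
    for _ in range(num_trials):
        heads = sum(random.random() < 0.5 for _ in range(flips_per_trial))
        tails = flips_per_trial - heads
        if heads > tails:
            more_heads += 1
        elif tails > heads:
            more_tails += 1
        else:
            exact_tie += 1
    print(f"More heads than tails: {more_heads / num_trials:.1%}")
    print(f"More tails than heads: {more_tails / num_trials:.1%}")
    print(f"Exactly 50/50:         {exact_tie / num_trials:.1%}")

if __name__ == "__main__":
    run_trials()
```

In a typical run, roughly 46% of trials come out heads-heavy, roughly 46% tails-heavy, and only about 8% land on an exact 50/50 split – lopsided trials are exactly what “no effect” looks like.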


Okay, I get it. You published a study. You made a career out of it. It’s embarrassing that it turns out to be wrong. But it’s hard to know in this case which explanation for his response is worse – intellectual dishonesty or sheer stupidity. Frankly, I think it’s the latter. Because I don’t care how dishonest you are, some explanations should be too embarrassing to try on for size. And the idea that the right interpretation of these results would be to look for why some labs had slightly different results than others clearly belongs in that category.

I find the defense based on the difficulties of true replication more respectable. And yet, what are we to make of an experiment so delicate that it can’t be replicated AT ALL even with the most careful controls? How important can any inference we make from such an experiment plausibly be? By definition, it could only fit the most narrow range of cases imaginable. And the idea that replication of an experiment doesn’t matter seems…you know…a tad unscientific.

From my perspective, it isn’t the original study that illustrates the extraordinary problem we have getting people to use data well. Yes, over-reliance on small sample sizes is all too common and all too easy. That’s unfortunate, not shameful. But the deeper problem is that even when data is used well, a lethal combination of self-interest and a near total lack of understanding of basic statistics makes it all too possible for people to ignore the data whenever they wish.

As Simon and Garfunkel plaintively observed, “A man hears what he wants to hear, and disregards the rest”.

If it wasn’t so sad, it would be funny.

Dammit. It is funny.

For it’s easy to see that in this version of the psych methodology wars, the defenders have their own unique version of a foxhole – with posterior high in the air and head firmly planted in the sand.

[Getting close to the Digital Analytics Hub. If you love talking analytics, check it out. Would be great to see you there!]

Seven Pillars of Statistical Wisdom

I don’t review a lot of business books on my blog…mostly because I don’t like a lot of business books. A ridiculous percentage of business books seem to me either to be one-trick ponies (a good idea that could be expressed fully in a magazine article, expanded to book length) or thinly veiled self-help books (self-help books with ties, as described in this spot-on Slate article). I HATE self-help books. Grit, Courage, Indecisiveness. It’s all the same to me.

On the other hand, The Seven Pillars of Statistical Wisdom isn’t really a business book. It’s a short (200 small pages), crisp, philosophical exploration of what makes statistics interesting. Written by a Univ. of Chicago Professor and published by Harvard University Press, it’s the best quasi-business book I’ve read in a long time.

I say quasi-business book because I’m not really sure who the intended audience is. It’s not super technical (thank god – you can read it knowing very little math), but it sometimes veers into explanations that assume a fairly deep understanding of statistics. Deeper, at least, than I have – though I am most certainly not a formally trained statistician.

What Seven Pillars does extraordinarily well is examine a small core set of statistical ideas, explicate their history, and show why they are important, fundamental, and, in some cases, still controversial. In doing this, Seven Pillars provides a profound introduction to how to think statistically – not how to do statistics. Instead of focusing on how specific methods work, on definitions of statistical methods, or on specific issues in modern statistics (like big data), Seven Pillars tries to define what makes statistics an important way to think.

To give you a sense of this, here are the seven pillars:

Aggregation: Probably the core concept at the heart of all statistical thinking is the idea that you can sometimes GAIN insight while losing data. Stigler delves into basic concepts like the mean, shows how they evolved over the centuries (and that evolution did take centuries), and explains why this fundamental insight is so important. It’s a brilliant discussion.

Information: If we gain information by losing data, how do we know how much information we’ve gained? Or how much data we need? With this pillar, Stigler lays out why more is sometimes less and how the marginal value of additional observations usually declines sharply. Another terrific discussion around a fundamental insight that comes from statistics but is constantly under siege from folk common sense.
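To make that diminishing-returns point concrete, here’s a tiny sketch (my own illustration, not an example from the book) of the familiar root-n relationship: the standard error of a sample mean shrinks only with the square root of the sample size, so each doubling of the data buys less and less additional precision.

```python
# Illustration of diminishing returns from additional observations: the
# standard error of a sample mean falls like 1/sqrt(n). The unit noise
# level (sigma = 1.0) is an assumed value chosen for the example.
import math

sigma = 1.0  # assumed standard deviation of a single observation
for n in (10, 20, 100, 200, 1000, 2000):
    standard_error = sigma / math.sqrt(n)
    print(f"n = {n:5d}   standard error of the mean = {standard_error:.3f}")
```

Going from 10 to 20 observations cuts the uncertainty by about 30%; going from 1,000 to 2,000 costs a hundred times more data for exactly the same proportional gain.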

Likelihood: In this section, Stigler tackles how the concepts around confidence levels and estimation of likelihood evolved over time. This section contains an amusing and historically interesting discussion on arguments for and against the likelihood of miracles!

Intercomparison: Stigler’s fourth pillar is the idea that we can use measurements interior to the data itself (there’s an excellent discussion of the historical derivation of the standard deviation, for example) to understand it. This section includes a superb discussion of the pitfalls of purely internal comparison and the tendency of humans to find patterns – and of data to exhibit patterns – that are not meaningful.

Regression: The idea of regression to the mean is fundamental to statistical thinking. It’s an amazingly powerful but consistently non-intuitive concept. Stigler uses a genetics example (and a really cool Quincunx visualization) to help explain the concept. This is one of the best discussions in a very fine book. On the other hand, the last part of this section, which covers multivariate and Bayesian developments, is less wonderful. If you don’t already understand these concepts, I’m not sure Stigler’s discussion is going to help.
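Because the concept is so resistant to intuition, here’s a small simulation sketch in the spirit of (though not copied from) the Galton-style example Stigler uses: each individual’s two measurements share a stable component plus independent noise, and the group that looks extreme on the first measurement drifts back toward the average on the second.

```python
# A sketch of regression to the mean: two measurements of the same "trait"
# share a stable signal but have independent noise. Individuals who look
# extreme the first time are, on average, less extreme the second time.
# All numbers here are invented purely for illustration.
import random

random.seed(42)
pairs = []
for _ in range(50_000):
    signal = random.gauss(0, 1)            # stable component shared by both measurements
    first = signal + random.gauss(0, 1)    # first measurement = signal + noise
    second = signal + random.gauss(0, 1)   # second measurement = same signal, fresh noise
    pairs.append((first, second))

# Restrict attention to individuals whose first measurement was extreme.
extremes = [(f, s) for f, s in pairs if f > 2.0]
avg_first = sum(f for f, _ in extremes) / len(extremes)
avg_second = sum(s for _, s in extremes) / len(extremes)
print(f"Average first score among the extreme group: {avg_first:.2f}")
print(f"Average second score for the same group:     {avg_second:.2f}")
```

The second average comes out at roughly half the first – not because anything “acted” on the group, but because part of what made their first scores extreme was noise that doesn’t repeat.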

Design: The next pillar is all about experimental design – surely a concept that is fundamental not just to statistics but to our everyday practical application of it. I found the discussion of randomization in this section particularly interesting and thought-provoking.

Residual: Pillar seven is, appropriately enough, about what’s left over. Stigler is concerned here to show how examining the unexplained part of the analysis leads to a great deal of productive thinking in science and elsewhere. The idea of nested models is introduced and this section somehow transitions into a discussion of data visualization with illustrations from Florence Nightingale (apparently a mean hand with a chart). I’m not sure this transition made perfect sense in the context of the chapter, but the discussion is fascinating, enjoyable and pointed enough to generate some real insight.
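As a small illustration of the residual idea (my own sketch, with invented data and numpy as an assumed dependency, not an example from the book): fit a deliberately too-simple straight line to gently curved data, and the curvature the line misses shows up as systematic structure in the residuals – exactly the “what’s left over” that invites a better model.

```python
# Sketch of learning from residuals: fit a straight line to data that
# actually curves, then look for structure in what the line leaves behind.
# The data here is invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 0.3 * x**2 + rng.normal(0.0, 1.0, size=x.size)  # true curve + noise

slope, intercept = np.polyfit(x, y, deg=1)   # deliberately too-simple model
residuals = y - (slope * x + intercept)

# Systematic structure in the residuals (positive at the ends, negative in
# the middle here) is the hint that a richer model is worth pursuing.
thirds = [("left", x < 3.33), ("middle", (x >= 3.33) & (x < 6.67)), ("right", x >= 6.67)]
for name, mask in thirds:
    print(f"mean residual, {name:6s} third: {residuals[mask].mean():+.2f}")
```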

Stigler concludes with some thoughts around whether and where an eighth pillar might arise. There’s some interesting stuff here that’s highly appropriate to anyone in digital trying to extend analytics into high-dimensional, machine-learning spaces. The discussion is (too) brief but I think intentionally so.


Seven Pillars isn’t quite a great book, and I mean that as high praise. I don’t read many books that I could plausibly describe as almost great. The quality of the explanations is extremely high. But it does a better job explicating the intellectual basis behind simpler statistical concepts than more complicated ones, and there are places where I think it’s insufficiently forceful in illuminating the underlying ways of thinking, not just the statistical methods. Perhaps that’s inevitable, but greatness isn’t easy!

I do think the book occasionally suffers from a certain ambiguity around its audience. Is it intended as a means to get deep practitioners thinking about more fundamental concepts? I don’t think so – too many of the explanations are historical and basic.

Is it intended for a lay audience? Please.

I think it fits two audiences very well, but perhaps neither perfectly.

First, there are folks like me who use statistics and statistical thinking on an everyday basis but are not formally trained. I’m assuming that’s also a pretty broad swath of my readers. I know I found it both useful and enlightening, with only a few spots where the discussion became obscure and overly professional.

The second audience is students and potential students of statistics who need something that pulls them away from the trenches (here’s how you do a regression) and gets them to think about what their discipline actually does. For that audience, I think the book is consistently brilliant.

If there’s a better short introduction to the intellectual basis and foundation of statistical thinking, I don’t know it. And for those who confuse statistical thinking with the ability to calculate a standard deviation or run a regression, Seven Pillars is a heady antidote.