Two classic psychology studies failed the reproducibility test

In 1988, a groundbreaking study found that the more we smile, the happier we are. Ten years later, another study showed that a person’s willpower can be worn out over time. Although both are cornerstone studies, widely covered in the media, their findings might not hold up.


Studies – even highly revered studies – are not necessarily flawless. It’s not necessarily that the scientists have done anything wrong; it’s just that it can be hard to control all the external variables. Psychology is especially vulnerable to this because it’s hard to create similar conditions for similar people, and every tiny detail can lead to significant differences. Asking questions in a slightly different way, or running the study in one type of room rather than another – anything can throw things off. The setup has to be extremely carefully thought out, and this makes reproducibility difficult.

Reproducibility is the ability of an entire experiment or study to be duplicated, either by the same researcher or by someone else working independently. Reproducible research is key to science, and researchers working in psychology have been aware for some time of reproducibility issues within the field. A 2006 study found that of 141 authors of empirical articles published by the American Psychological Association (APA), 103 (73%) did not share their data when requested over a six-month period. Even two famous studies – findings you most likely already know – seem to fail the reproducibility test.

Study 1: Smile and you’ll be happier

The 1988 study concluded that our facial expressions have an impact on our mood – the more you smile, the happier you’ll be. In the original paper, German researchers asked participants to read The Far Side comics by artist Gary Larson. They asked them to hold a pen either between their teeth (thus forcing a smile) or between their lips (thus forcing a pout). They found that those who were smiling found the comics funnier than those who made a pout and therefore concluded that changing our facial expressions changes our mood. This is called the facial feedback hypothesis.


However, when a team from the University of Amsterdam in the Netherlands replicated the study, they didn’t get the same results, even when they used the same comics.

“Overall, the results were inconsistent with the original result,” the team concludes in Perspectives on Psychological Science – a separate paper from the ego depletion replication, but also due to be published in a few weeks.

Study 2: Human willpower can be worn out

The 1998 study, led by Roy Baumeister of Case Western Reserve University, established what is called ego depletion – the wearing out of human willpower. Of course, this raised tremendous interest and several follow-up studies were conducted. Most notably, Martin Hagger from Curtin University in Australia had researchers in 24 labs recreate the original study, and they found no significant results. That’s right: the famous ego depletion theory, which has huge real-life implications and is often referenced in popular culture, was not successfully replicated. The results were published in the journal Perspectives on Psychological Science.

What this means

This doesn’t necessarily mean that the studies are wrong. The first study, for instance, was replicated in 17 Dutch labs. Nine labs reported similar results to the original study, but the others didn’t – and when all the results were pooled, the combined effect showed no significant similarity to the original study. This is where psychology’s inherent replicability problems emerge.

It’s really, really hard to replicate psychology studies. Perhaps humor has changed and people just don’t find those comics funny anymore, or can’t relate to them. Furthermore, the participants in this study were psychology students, who might have been aware of the original study, or perhaps are not representative of the general population.

“It shows how much effort and attention has gone towards improving the accuracy of the knowledge produced,” John Ioannidis, a Stanford University researcher who led a 2005 reproducibility study, told Olivia Goldhill at Quartz.

“Psychology is a discipline that has always been very strong methodologically and was at the forefront at describing various biases and better methods. Now they are again taking the lead in improving their replication record.”

In a strange way, this is not necessarily a bad thing. From this reproducibility crisis, science will emerge stronger and more accurate, though we’re not yet sure how; and as Brian Nosek, who leads the Reproducibility Project that repeated 100 experiments, says, science isn’t about truth and falsity – it’s about reducing uncertainty.

“Really, this whole project is science on science: researchers doing what science is supposed to do, which is be skeptical of our own process, procedure, methods, and look for ways to improve.”

One thought on “Two classic psychology studies failed the reproducibility test”

  1. stevendeedon

    Very old news, maybe three years old. These issues are pervasive in the special sciences, esp. in biomedical research, but also, e.g. in economics, genetics, neuroscience, randomized trials. See the work of Ioannidis at Stanford. Too bad the author of this short article doesn't know, but the underlying problems are very fixable, and in the field of psychology the "fix is on." It is impossible for someone without specialized education or training in methodology to assess the validity of scientific papers, as in Science, Nature, Psychological Science, Cell, etc. The statistics and other methodologies at work are just too complex. Common sense isn't adequate, and in psychology is often shown to be wrong.

    At the top of the list of these problems are three issues. 1. Publishers have been disinterested in publishing attempts at replication, or experiments that fail to support one's initial hypothesis. This is changing. 2. The statistical groundwork for most science is something like this: I will allow that 5% of the time, the results of my experiment could happen by chance. If this turns out not to be the case, as measured statistically, I can then say that my (original) hypothesis has not been rejected. (Science experiments actually test the *opposite* of one's initial hypothesis.) The consensus among scientists now is that the .05 "p-value" is not adequate, and that we should use "effect size." 3. Sample sizes (investigating some persons randomly selected from a larger population) have been too small. Psychologists have been especially fortunate in that they can now do both surveys and experiments in online labor markets like Amazon Mechanical Turk, which are much cheaper than lab experiments and easily allow for much bigger samples. New software (G*Power) enables experimenters to quickly determine what size sample they need for an adequate level of "statistical power," i.e. the probability that their findings will be statistically significant.

    Additionally, science is quickly becoming more transparent. It is more common to share one's data, so others can assess it or use it to attempt a replication. And it is becoming more common to "register" what you're going to study in advance, so that from the beginning a journal editor can track the work. Some journals will now even publish failed experiments, as long as this process of registering and showing work in progress is followed.
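The commenter’s points about p-values, sample size, and statistical power can be made concrete with a small simulation. The sketch below is illustrative only (it is not from the article, and planning tools like G*Power use exact noncentral-t calculations rather than simulation): it estimates, by Monte Carlo, how often a two-sample test at the .05 level detects a “medium” effect of d = 0.5 with 64 participants per group – a sample size conventionally associated with roughly 80% power.

```python
import math
import random

def estimate_power(effect_size, n_per_group, n_sims=2000, seed=42):
    """Monte Carlo estimate of two-sample t-test power at alpha = .05.

    Simulates two normal groups whose means differ by `effect_size`
    standard deviations, runs Welch's t-test on each simulated dataset,
    and counts how often the null hypothesis is rejected.
    Illustrative sketch only; uses the large-sample z critical value.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided critical value at alpha = .05 (z approximation)
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        mean_a = sum(a) / n_per_group
        mean_b = sum(b) / n_per_group
        var_a = sum((x - mean_a) ** 2 for x in a) / (n_per_group - 1)
        var_b = sum((x - mean_b) ** 2 for x in b) / (n_per_group - 1)
        se = math.sqrt(var_a / n_per_group + var_b / n_per_group)
        t = (mean_b - mean_a) / se
        if abs(t) > z_crit:
            rejections += 1
    return rejections / n_sims

# A "medium" effect (d = 0.5) with 64 participants per group:
# analytically, power is about 0.80 -- the conventional target.
print(round(estimate_power(0.5, 64), 2))
```

Halving the per-group sample size drops the estimated power well below 0.8 – which is the commenter’s point: underpowered studies frequently miss real effects, and their occasional “significant” results replicate poorly.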
