There has been a lot of discussion recently about replication in psychological science. One major issue facing the discipline is that the way institutions and incentives in the field are currently structured (e.g., what gets published in leading journals, what is valued by tenure committees, etc.) leads to far too little replication of the phenomena that we study. There are no easy solutions to this problem, but various promising attempts to address it are underway, including Brian Nosek’s Open Science Framework, the psychfiledrawer website (@psychfiledrawer), and proposals from Sanjay Srivastava (@hardsci), Henry Roediger, and Matt Lieberman (@social_brains), among others.
Last Thursday I posted my commentary from The Psychologist’s special issue on replication (PDF). I addressed the question of conceptual replication – in brief, where direct replication attempts to reproduce existing results by running the same experiment again, conceptual replication aims to test the underlying hypothesis in a novel way. I first brought up the distinction between direct and conceptual replication because it seemed like people were talking past each other. I argued that while direct replication is essential, conceptual replication is also an integral part of psychological science, especially because psychologists aren’t typically dealing with objectively defined quantities.
Another reason conceptual replication is so important is that if the field relies exclusively on direct replication then it risks replicating the same mistakes as well. Today I wanted to illustrate this risk by looking back at the history of one of social psychology’s most influential theories: cognitive dissonance. The richness and depth of Cognitive Dissonance Theory are a result of dozens of conceptual replications. I suggest that, had it not been for conceptual replication – had dissonance only been tested and re-tested in the original paradigm (Brehm’s Free Choice Paradigm) – the theory may not have stood up to recent criticisms directed at that particular paradigm.
The 1950s were a bleak time if you were a social psychologist interested in the empirical study of thoughts and feelings and how they affect human behavior. At that time, experimental psychology was dominated by behaviorism, an approach which focused exclusively on observable behavior, exiling ephemeral concepts like beliefs and emotions outside the boundaries of proper science. But things were about to change.
The Theory of Cognitive Dissonance, published by Leon Festinger in 1957, was one of those things. The theory was based on the simple idea that when a person simultaneously holds two conflicting beliefs he will experience a feeling of discomfort – cognitive dissonance – and that he will be motivated to end that discomfort by reducing the conflict between the beliefs, often by changing one of them.
Today, the term cognitive dissonance has entered our vernacular and the idea that we change or discard beliefs that don’t suit us seems like common sense. Research on how people rationalize their beliefs has spread to political science, medicine, neuroscience, and the law, and is one of the cornerstones of our understanding of human psychology. But in 1957, at a time when the field of psychology was dominated by behaviorism, the notion was far more controversial. Luckily, Leon Festinger and his colleagues and students conducted numerous experiments that tested predictions derived from Cognitive Dissonance Theory that could not be accounted for by behaviorist principles.
One of my favorites among these experiments (PDF), published by Elliot Aronson and Judson Mills in 1959, had college women reading obscene words out loud (words so obscene that I don’t feel comfortable writing them here myself, but the F word is in there, as is a four-letter word that also means rooster, and remember, this was 1959!). The women were reading these words as an initiation to get into a discussion group about the psychology of sex – they had to prove they were not going to be too embarrassed to take part in the conversation. This was the "severe initiation" condition. Another group of women recited a milder list of words (e.g., prostitute, virgin); this was the "mild initiation" condition. The women then heard a recording of a discussion by the group to which they had gained entry – as it turned out, the discussion was, according to the study’s authors, “one of the most worthless and uninteresting discussions imaginable.” The question was which group of women would like the psychology of sex discussion group more, the ones who had to undergo the severe initiation or the mild one?
To a behaviorist, the prediction is obvious: the more negative the experience the women go through to join the group, the less they should like it. But Cognitive Dissonance Theory predicted the opposite: the worse the initiation, the more you should like the group (this is what fraternities count on). Why? Because believing that you put yourself through a severe initiation just to join a boring group causes dissonance. By convincing yourself that the group was actually interesting, thereby justifying why you went through the severe initiation, you get rid of the dissonance. And this is exactly what they found! In their experiment, compared to women who had to undergo a mild initiation (or no initiation at all) the women who had to undergo the severe initiation reported liking the discussion group more. Cognitive Dissonance: 1, Behaviorism: 0.
In the early days of dissonance research Festinger and his colleagues produced demonstrations of dissonance at work in several different experimental paradigms. Through conceptual replication they wove a tapestry of converging evidence, all providing support for the idea that when people simultaneously held two conflicting beliefs, one of them would change. For example, in the “induced compliance” paradigm, they found that when people agreed to a request to write an essay supporting something they actually opposed (like cutting funding for wheelchair access or raising tuition), they would then come to support that idea more. Had they only replicated the same thing repeatedly, they would have had a much less compelling case. And nowhere is this more apparent than in the case of the first experimental study of dissonance, the Free Choice Paradigm.
One of the earliest experiments supporting Cognitive Dissonance Theory predictions was Jack Brehm’s 1956 paper (PDF) on dissonance in the Free Choice Paradigm, an experiment in which he discovered that after choosing between two similarly-liked options, people increased their liking for the chosen object and decreased their liking for the unchosen object. Why? According to Dissonance Theory, after making a choice between two similarly-liked options people might feel regret – did I choose the right one; what if the other one was better? By adjusting their preferences to make the chosen object the clear winner, this problem is averted and dissonance is gone.
This basic finding, known as the “spreading of alternatives,” has been replicated numerous times. It was recently even found to occur in monkeys. So that’s a major success for science – you can get the basic finding to replicate reliably even when the study is conducted by independent experimenters, with different subjects, across different decades, and even across different species. But not so fast! It turns out that Brehm’s original experiment – and every subsequent replication for the next 50 years – had a very important flaw. According to recent work (PDF) by Keith Chen and Jane Risen (full disclosure, I’m married to one of these people), Brehm made one crucial mistake: he violated the rules of random assignment. The full argument is a little complicated – for a more accessible account, check out this version (PDF) by Risen and Chen – but the main problem is as follows.
In Brehm’s experiment, people first rated various items based on how much they liked them (in Brehm’s experiment these were kitchen appliances, but the results have been replicated with all sorts of things, like CDs and art posters). Next, the experimenter gave people a choice between two similarly rated items, like a toaster and a blender, which would be theirs to keep. Finally, people rated all the items again. And, as I mentioned earlier, you get a “spreading of alternatives” – in their second set of ratings the chosen item gets rated higher and the unchosen item gets rated lower than in the first set of ratings. So far so good. If you liked the toaster a little better than the blender originally, after choosing the toaster it becomes the clear favorite and any need for second guessing is averted. But here comes Brehm’s mistake: he expected, not unreasonably, that people would choose the item they originally rated higher. It turns out, however, that while most people did choose the higher rated item, about a quarter of them chose the other item and so they were “eliminated from consideration.”
This is a problem. To understand why, let’s think back to the initiation study for a minute. Imagine that a quarter of the women who were asked to read the obscene words off the flashcards had refused to do so and were eliminated from consideration. In other words, imagine we kept only those women who were so interested in the psychology of sex discussion group that they were willing to undergo a severe initiation and we got rid of those who weren’t. Of course it wouldn’t be surprising that the remaining women were particularly interested in the discussion group! That’s why it was critical that in the actual study all the women who participated were willing to undergo the initiation; otherwise the results wouldn’t have told us anything about dissonance.
Now, returning to Brehm’s study, we can see that a very similar problem arises. Brehm eliminated anyone who chose the lower-rated item (let’s say the blender) rather than the item they had rated higher (let’s say the toaster). It’s not all that surprising that once you eliminate all the blender-lovers you’re left with a group that is very partial to toasters.
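For readers who like to see this kind of selection effect in numbers, here is a rough simulation sketch in Python. It is my own toy illustration, not the analysis from Chen and Risen's papers, and everything in it is an assumption: the function name simulate_free_choice, the normally distributed "true liking" scores, the noise level, and the rule that people choose according to their true liking rather than their first rating. The one thing it is designed to guarantee is that nobody's true liking ever changes, so any spreading that shows up cannot be due to dissonance.

```python
import random
import statistics


def simulate_free_choice(n_people=20000, noise_sd=1.0, max_gap=0.5, seed=0):
    """Toy free-choice study in which true preferences never change.

    All parameters are hypothetical and not calibrated to Brehm's data.
    Returns (mean spread for everyone, mean spread for kept participants,
    fraction of participants kept).
    """
    rng = random.Random(seed)
    everyone, kept = [], []
    while len(everyone) < n_people:
        # Fixed true liking for each item; nothing below modifies these.
        true_toaster = rng.gauss(0.0, 1.0)
        true_blender = rng.gauss(0.0, 1.0)
        # First ratings are noisy readings of the true liking.
        r1_toaster = true_toaster + rng.gauss(0.0, noise_sd)
        r1_blender = true_blender + rng.gauss(0.0, noise_sd)
        gap = r1_toaster - r1_blender
        # Only closely rated pairs are offered; by convention the toaster
        # is the item that was rated slightly higher the first time.
        if not 0.0 < gap <= max_gap:
            continue
        # Assumption: the choice follows true liking, not the noisy rating.
        chose_toaster = true_toaster > true_blender
        # Second ratings: same true liking, fresh noise, no attitude change.
        r2_toaster = true_toaster + rng.gauss(0.0, noise_sd)
        r2_blender = true_blender + rng.gauss(0.0, noise_sd)
        # Spread = rise in the toaster's rating minus rise in the blender's.
        spread = (r2_toaster - r1_toaster) - (r2_blender - r1_blender)
        everyone.append(spread)
        if chose_toaster:
            # Keep only people who chose the item they had rated higher.
            kept.append(spread)
    return statistics.mean(everyone), statistics.mean(kept), len(kept) / n_people


all_mean, kept_mean, kept_fraction = simulate_free_choice()
print(f"mean spread, everyone:        {all_mean:.2f}")
print(f"mean spread, toaster-choosers: {kept_mean:.2f}")
print(f"fraction kept:                 {kept_fraction:.2f}")
```

Under these made-up settings, the average spread across everyone should sit close to zero, while the average among the people who are kept should come out clearly positive, even though the simulated people never changed their minds about anything. The exact numbers depend entirely on the assumed noise and cutoff, so treat them as an illustration of the logic and not as an estimate of how large the artifact was in any real study.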
The problem is actually a little more complicated than that, so if you’re interested please check out the original papers here and here, but what’s most important is that even though Brehm’s finding was replicated many times, until very recently every replication made more or less the same mistake. And therein lies the problem of a science that relies exclusively on direct replication: you may have a finding that can be replicated reliably, but if there’s a flaw in the original finding then you may just be replicating the same mistake.