replications | Štěpán Bahník

Many recent studies have shown that psychology has a replicability problem. When you try to replicate a study using the original materials, there is a good chance that you will obtain different results. More often than not, the effect sizes in the replication will be smaller and nonsignificant. As if this were not enough, there is another, even more insidious problem that has not been given much attention. Even when a study replicates successfully, it does not mean that the results actually support the general effect they are supposed to demonstrate. The issue has been raised before; however, it does not seem that people take the warnings seriously. One possible reason is that people do not appreciate how serious the problem is unless they see it demonstrated in practice. Our study, which just came out in Psychological Science, will hopefully help by convincingly demonstrating how using only a fixed set of stimuli might lead to misleading research findings.

A study by Hyunjin Song and Norbert Schwarz showed that people judge food additives with hard-to-pronounce names as more risky than additives with relatively easy-to-pronounce names. The study was published in 2009 in Psychological Science and has been cited 201 times according to Google Scholar. Song and Schwarz asked their participants to imagine that they were reading names of food additives on a food label and then to evaluate the dangerousness of the additives based on their names. In our study, we initially tried to build on their findings and test a possible moderator of the effect; however, after a few hundred participants and four studies with mixed results, it seemed that the effects we observed strongly depended on the specific stimuli that were used.

While we were able to repeatedly replicate the results of Song and Schwarz, we worried that the problem might affect the original effect as well. We therefore conducted a study in which we used newly created stimuli alongside the stimuli used by Song and Schwarz. The result supported our hunch – we again observed the effect when we analyzed only the original items, but there was no effect for the newly created stimuli.

How is this possible? A simple answer is that we cannot know for sure. The problem might have been caused partly by treating stimuli as a fixed factor. It is possible that the original results would not have been significant if Song and Schwarz had conducted their analysis correctly, treating stimuli as a random factor. Psychologists have been warned about this mistake in the past and a couple of times recently as well. When you treat stimuli as a fixed factor, you limit your claim about the existence of the effect only to the particular stimuli used in the experiment. The effect of this analysis choice is clear from the comparison of results of the two possible analyses in the first four studies in our paper. While the analysis using fixed factors yields, for the same materials, enigmatic effects in opposite directions in studies 2 and 3, the effects disappear when stimuli are treated as a random factor.
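The general point can be illustrated with a small simulation. The sketch below is not the analysis from our paper and uses no real data. It assumes made-up numbers (30 participants, 5 hard and 5 easy names, normally distributed item quirks and rating noise) and no true effect of pronounceability at all. The function names are hypothetical, and a simple by-item t-test stands in for a proper random-factor analysis. Participant intercepts are left out because they cancel in both comparisons.

```python
import random
from math import sqrt
from statistics import fmean, stdev, variance

N_PARTICIPANTS = 30   # assumed sample size
N_ITEMS = 5           # assumed number of names per condition
# Two-sided 5% critical values of t; they match the sizes above.
T_CRIT_BY_PARTICIPANT = 2.045  # df = 29
T_CRIT_BY_ITEM = 2.306         # df = 8


def simulate_once(rng, item_sd=1.0, noise_sd=1.0):
    """One experiment with NO true effect of pronounceability."""
    # Every name has its own quirk; conditions do not differ on average.
    hard = [rng.gauss(0, item_sd) for _ in range(N_ITEMS)]
    easy = [rng.gauss(0, item_sd) for _ in range(N_ITEMS)]
    items = hard + easy
    # Each participant rates every name.
    ratings = [[item + rng.gauss(0, noise_sd) for item in items]
               for _ in range(N_PARTICIPANTS)]

    # Stimuli as fixed: average over names within each participant,
    # then test the hard-minus-easy differences across participants.
    diffs = [fmean(row[:N_ITEMS]) - fmean(row[N_ITEMS:]) for row in ratings]
    t_fixed = fmean(diffs) / (stdev(diffs) / sqrt(N_PARTICIPANTS))

    # Stimuli as random (by-item stand-in): average over participants
    # within each name, then compare hard names with easy names.
    item_means = [fmean(column) for column in zip(*ratings)]
    hard_m, easy_m = item_means[:N_ITEMS], item_means[N_ITEMS:]
    pooled_var = (variance(hard_m) + variance(easy_m)) / 2
    t_random = (fmean(hard_m) - fmean(easy_m)) / sqrt(pooled_var * 2 / N_ITEMS)

    return abs(t_fixed) > T_CRIT_BY_PARTICIPANT, abs(t_random) > T_CRIT_BY_ITEM


def false_positive_rates(n_sims=500, seed=1):
    """Share of null experiments declared significant by each analysis."""
    rng = random.Random(seed)
    fixed_hits = random_hits = 0
    for _ in range(n_sims):
        fixed_sig, random_sig = simulate_once(rng)
        fixed_hits += fixed_sig
        random_hits += random_sig
    return fixed_hits / n_sims, random_hits / n_sims


if __name__ == "__main__":
    fixed_rate, random_rate = false_positive_rates()
    print(f"stimuli as fixed:  {fixed_rate:.2f}")
    print(f"stimuli as random: {random_rate:.2f}")
```

Under these assumptions, the analysis that treats stimuli as fixed should flag an "effect" in a large share of the simulated experiments, because the accidental difference between the two sets of names is the same for every participant. The by-item analysis should stay close to the nominal 5%.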

However, simply treating stimuli as a random factor in a statistical analysis does not magically guarantee that significant results are really generalizable. When we analyzed our replications of Song and Schwarz’s effect, the results remained significant even when treating stimuli as a random factor. The problem here is probably deeper and more serious than just not using a correct statistical method: People usually use convenience samples of stimuli in their studies, without any attempt to define the underlying population of stimuli. They may pick the first stimuli that come to mind or stimuli that they believe are likely to produce the desired results. Treating stimuli as a random factor helps with generalizability only if the stimuli used are representative of the population of stimuli that are of interest. However, it cannot by itself remedy the cases where bias crept in during the stimulus-selection procedure. We cannot be sure why the effect of pronounceability on perceived risk exists only for the original stimuli used by Song and Schwarz. It is entirely possible that they just had (bad) luck and selected hard-to-pronounce names that were somehow related to danger purely by accident.

Nevertheless, the moral of the story is clear – it is important to have a systematic procedure for generating stimuli and to treat the stimuli as a random factor in analysis. Otherwise, we might end up with highly replicable studies that won’t give us any generalizable knowledge about the world.

This post was written together with Marek Vranka.

Consider a somewhat ridiculous example. You want to study whether political attitudes are stable, or whether they are determined to a large extent by random influences. To study this you design a simple experiment. You have half of the participants drink sauerkraut juice and the other half orange juice. Then, you measure their political attitudes. If people have stable political attitudes, it should not matter what juice you give them. The political attitudes should stay the same. But if they are determined to a large extent by random influences, it may matter, and you might find an effect of juice flavor.

To express this formally, we will label the stable attitudes theory as T_S and the random influences theory as T_R. Since the juice flavor should probably have no effect (labeled as E₀) if T_S holds, the probability of E₀ conditional on T_S is very high, say P(E₀|T_S) = 0.99. There is still a slight possibility that there will be an experimenter’s effect or that the effect might operate through some unknown mechanism that is, nevertheless, compatible with the theory. Be that as it may, the probability that there is an effect (E₁) if T_S holds is very low, P(E₁|T_S) = 0.01. While the random influences theory seems to be more in line with the juice flavor effect, it does not really rest on it. It is always possible to say that the random influences are something else, that the juice would have an effect under different circumstances, with different participants, etc. Consequently, P(E₁|T_R) is higher than P(E₁|T_S), but it is still low. Say, P(E₁|T_R) = 0.10, and thus P(E₀|T_R) = 0.90. It is important to note that we are talking here about predictions of theories and not about results of an experiment. For simplification, we also just use a binary effect-or-no-effect distinction, but the reasoning would hold even if we were talking about effect sizes.

Now, what happens if we find an effect of juice flavor? We should update our beliefs by the likelihood ratio, which is P(E₁|T_R) / P(E₁|T_S) = 0.10 / 0.01 = 10. That is, the experiment shows strong evidence for the random influences theory and it makes sense to publish it – the study is informative. What if we find no effect? We should again update our beliefs, but the likelihood ratio is in this case P(E₀|T_S) / P(E₀|T_R) = 0.99 / 0.90 = 1.1, which is hardly informative, and you will have great trouble publishing this study.
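The arithmetic can be checked with a few lines of code. This is only a sketch: the function names are made up for illustration, and the 50:50 prior is an assumption added here, since the argument itself only needs the likelihood ratios.

```python
def likelihood_ratio(p_result_given_a, p_result_given_b):
    """How strongly a result favors theory A over theory B."""
    return p_result_given_a / p_result_given_b


def posterior_probability(prior, ratio):
    """Update the prior probability of the favored theory by the ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * ratio
    return posterior_odds / (1 + posterior_odds)


# Predictions of the two theories, taken from the text.
P_E1_GIVEN_TS, P_E0_GIVEN_TS = 0.01, 0.99  # stable attitudes theory
P_E1_GIVEN_TR, P_E0_GIVEN_TR = 0.10, 0.90  # random influences theory

PRIOR = 0.5  # assumption: both theories equally plausible beforehand

# Finding an effect favors T_R over T_S.
lr_effect = likelihood_ratio(P_E1_GIVEN_TR, P_E1_GIVEN_TS)
# Finding no effect favors T_S over T_R, but only barely.
lr_null = likelihood_ratio(P_E0_GIVEN_TS, P_E0_GIVEN_TR)

if __name__ == "__main__":
    print(f"effect found:    LR = {lr_effect:.1f}, "
          f"P(T_R) = {posterior_probability(PRIOR, lr_effect):.2f}")
    print(f"no effect found: LR = {lr_null:.1f}, "
          f"P(T_S) = {posterior_probability(PRIOR, lr_null):.2f}")
```

With the assumed equal prior, finding an effect moves the probability of T_R from 0.50 to about 0.91, while finding no effect moves the probability of T_S from 0.50 to only about 0.52.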

The argument does not depend on statistical power. You may have an infinite sample size, and the conclusions will be the same. The problem is in the design of the study. The study was tailored to verify the random influences theory and it cannot falsify it – by design. There are a lot of studies like this in psychology these days. People are trying to show sexy effects and not to test well-defined theories. Even without ill intent, this leads to publication bias and all the hurly-burly we are currently in.

What can be done about this? Primarily, we should design studies that are able to test theories. We should design studies that are publishable no matter what the result is. An ideal study would therefore test opposing predictions of different theories. “But wait, Štěpán, theories in psychology don’t give clear predictions!”, you might object if you felt brave enough to try to pronounce my name. Unfortunately, you would be right. The problem lies a bit deeper. Theories in psychology are usually very vaguely defined. It would therefore also help if psychological theories actually tried to make some strong predictions.

Note: The idea presented here is related to the difference between conceptual and direct replications. A conceptual replication is often intended to verify a hypothesis. If it finds a null effect, our perception of the original study does not change as much as it would if the replication were direct. Direct replications are usually better suited to falsify hypotheses. Conceptual replications are important and they may be more valuable than direct replications under certain conditions. However, they are more likely to be associated with publication bias. A null effect found in a conceptual replication is often not really informative, and it is therefore more likely to stay in the file drawer.


If it’s easy to replicate, it might still not be true

The problem of verification