If it’s easy to replicate, it might still not be true

A lot of recent studies have shown that psychology has a replicability problem. When you try to replicate a study using the original materials, there is a good chance that you will obtain different results. More often than not, the effect sizes in the replication will be smaller and nonsignificant. As if this were not enough, there is another, even more insidious problem that has received little attention. Even when a study replicates successfully, that does not mean the results actually support the general effect they are supposed to demonstrate. The issue has been raised before; however, the warnings do not seem to be taken seriously. One possible reason is that people do not appreciate how serious the problem is until they see it demonstrated in practice. Our study, which just came out in Psychological Science, will hopefully help by convincingly demonstrating how using only a fixed set of stimuli can lead to misleading research findings.

A study by Hyunjin Song and Norbert Schwarz showed that people judge food additives with hard-to-pronounce names as riskier than additives with relatively easy-to-pronounce names. The study was published in 2009 in Psychological Science and has been cited 201 times according to Google Scholar. Song and Schwarz asked their participants to imagine that they were reading the names of food additives on a food label and then to evaluate the dangerousness of the additives based on their names. In our study, we initially tried to build on their findings and test a possible moderator of the effect; however, after a few hundred participants and four studies with mixed results, it seemed that the effects we observed depended strongly on the specific stimuli that were used.

While we were able to repeatedly replicate the results of Song and Schwarz, we worried that the problem might affect the original effect as well. We therefore conducted a study in which we used newly created stimuli alongside the stimuli used by Song and Schwarz. The result supported our hunch – we again observed the effect when we analyzed only the original items, but there was no effect for the newly created stimuli.

How is this possible? The simple answer is that we cannot know for sure. The problem might have been caused partly by treating stimuli as a fixed factor. It is possible that the original results would not have been significant if Song and Schwarz had conducted their analysis correctly, treating stimuli as a random factor. Psychologists have been warned about this mistake in the past, and a couple of times recently as well. When you treat stimuli as a fixed factor, you limit your claim about the existence of the effect to the particular stimuli used in the experiment. The effect of this analysis choice is clear from comparing the results of the two possible analyses in the first four studies in our paper: while the fixed-factor analysis yields, for the same materials, enigmatic effects in opposite directions in Studies 2 and 3, the effects disappear when stimuli are treated as a random factor.


However, simply treating stimuli as a random factor in a statistical analysis does not magically guarantee that significant results are really generalizable. When we analyzed our replications of Song and Schwarz’s effect, the results remained significant even when stimuli were treated as a random factor. The problem here is probably deeper and more serious than just not using the correct statistical method: people usually use convenience samples of stimuli in their studies, without any attempt to define the underlying population of stimuli. They may pick the first stimuli that come to mind, or stimuli that they believe are likely to produce the desired results. Treating stimuli as a random factor helps with generalizability only if the stimuli used are representative of the population of stimuli that are of interest. However, it cannot by itself remedy cases where bias crept in during the stimulus-selection procedure. We cannot be sure why the effect of pronounceability on perceived risk exists only for the original stimuli used by Song and Schwarz. It is entirely possible that they just had (bad) luck and selected hard-to-pronounce names that were somehow related to danger purely by accident.
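To see why the fixed-versus-random choice matters, here is a minimal simulation sketch. Everything in it is our illustrative assumption, not the paper’s analysis: there is no true pronounceability effect, but each item carries its own idiosyncratic riskiness, and a by-item aggregation serves as a crude stand-in for treating stimuli as a random factor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: 20 "easy" and 20 "hard" additive names, NO true
# pronounceability effect, but each name has its own idiosyncratic riskiness.
n_items, n_raters = 20, 30
easy = rng.normal(0, 1, n_items)[:, None] + rng.normal(0, 1, (n_items, n_raters))
hard = rng.normal(0, 1, n_items)[:, None] + rng.normal(0, 1, (n_items, n_raters))

def t_stat(a, b):
    """Welch t statistic for two independent samples."""
    se2 = a.var(ddof=1) / a.size + b.var(ddof=1) / b.size
    return (a.mean() - b.mean()) / np.sqrt(se2)

# "Stimuli as a fixed factor": pool every rating, pretending the
# 600 ratings per condition are independent observations.
t_fixed = t_stat(hard.ravel(), easy.ravel())

# Crude stand-in for "stimuli as a random factor": aggregate to one mean
# per item, so the 20 items per condition are the units of analysis.
t_by_item = t_stat(hard.mean(axis=1), easy.mean(axis=1))

# The pooled analysis exaggerates the evidence: |t_fixed| > |t_by_item|,
# because it ignores the item-to-item variance entirely.
```

In a real analysis one would fit a mixed-effects model with crossed random effects for participants and items rather than aggregate, but the aggregation makes the inflation easy to see.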

Nevertheless, the moral of the story is clear – it is important to have a systematic procedure for generating stimuli and to treat the stimuli as a random factor in analysis. Otherwise, we might end up with highly replicable studies that won’t give us any generalizable knowledge about the world.

This post was written together with Marek Vranka.

The problem of verification

Consider a somewhat ridiculous example. You want to study whether political attitudes are stable or whether they are determined to a large extent by random influences. To study this, you design a simple experiment: you have half of the participants drink sauerkraut juice and the other half orange juice, and then you measure their political attitudes. If people have stable political attitudes, it should not matter which juice you give them; the attitudes should stay the same. But if attitudes are determined to a large extent by random influences, it may matter, and you might find an effect of juice flavor.

To express this formally, we will label the stable-attitudes theory TS and the random-influences theory TR. Since the juice flavor should probably have no effect (labeled E0) if TS holds, the probability of E0 conditional on TS is very high, say P(E0|TS) = 0.99. There is still a slight possibility of an experimenter effect, or the effect might operate in some unknown way that is nevertheless compatible with the theory. Be that as it may, the probability that there is an effect (E1) if TS holds is very low: P(E1|TS) = 0.01. While the random-influences theory seems more in line with a juice-flavor effect, it does not really rest on it. It is always possible to say that the random influences are something else, that the juice would have an effect under different circumstances, in different participants, and so on. Consequently, P(E1|TR) is higher than P(E1|TS), but it is still low. Say P(E1|TR) = 0.10, and thus P(E0|TR) = 0.90. It is important to note that we are talking here about the predictions of theories, not about the results of an experiment. For simplicity, we also use a binary true-or-false effect, but the reasoning would hold even if we were talking about effect sizes.

Now, what happens if we find an effect of juice flavor? We should update our beliefs by the likelihood ratio, which is P(E1|TR) / P(E1|TS) = 0.10 / 0.01 = 10. That is, the experiment provides strong evidence for the random-influences theory and it makes sense to publish it – the study is informative. What if we find no effect? We should again update our beliefs, but the likelihood ratio in this case is P(E0|TS) / P(E0|TR) = 0.99 / 0.90 = 1.1, which is hardly informative, and you will have huge trouble publishing this study.
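The arithmetic is easy to check; here is a small sketch (the 50:50 prior odds used for the posterior probabilities are our added assumption, not part of the argument above):

```python
# Conditional probabilities of the two possible outcomes under each theory,
# taken directly from the text above.
p_e1_ts, p_e0_ts = 0.01, 0.99   # P(E1|TS), P(E0|TS)
p_e1_tr, p_e0_tr = 0.10, 0.90   # P(E1|TR), P(E0|TR)

lr_effect = p_e1_tr / p_e1_ts   # likelihood ratio if an effect IS found  (= 10)
lr_null = p_e0_ts / p_e0_tr     # likelihood ratio if NO effect is found (= 1.1)

# Assuming even prior odds on the two theories (our assumption), finding an
# effect moves the probability of TR from 0.5 to about 0.91, while a null
# result moves the probability of TS only to about 0.52.
p_tr_after_effect = lr_effect / (1 + lr_effect)
p_ts_after_null = lr_null / (1 + lr_null)
```

The asymmetry between the two posterior shifts is the whole point: one outcome is publishable evidence, the other barely changes anyone’s mind.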

The argument does not depend on statistical power. You may have an infinite sample size, and the conclusions will be the same. The problem is in the design of the study. The study was tailored to verify the random-influences theory and it cannot falsify it — by design. There are a lot of studies like this in psychology these days. People are trying to show sexy effects rather than to test well-defined theories. Even without ill intent, this leads to publication bias and all the hurly-burly we are currently in.

What can be done about this? Primarily, we should design studies that are able to test theories – studies that are publishable no matter what the result is. An ideal study would therefore test opposing predictions of different theories. “But wait, Štěpán, theories in psychology don’t give clear predictions!”, you might object, if you felt brave enough to try to pronounce my name. Unfortunately, you would be right. The problem lies a bit deeper: theories in psychology are usually very vaguely defined. It would therefore also help if psychological theories actually tried to make some strong predictions.

Note: The idea presented here is related to the difference between conceptual and direct replications. A conceptual replication is often intended to verify a hypothesis. If it finds a null effect, our perception of the original study does not change as much as it would if the replication were direct. Direct replications are usually better suited to falsifying hypotheses. Conceptual replications are important, and under certain conditions they may be more valuable than direct replications. However, they are more likely to be associated with publication bias. A null effect found in a conceptual replication is often not really informative, and it is therefore more likely to stay in the file drawer.

Good things about pre-registration

In the last year or so, talk about pre-registration has become increasingly frequent in psychology. Some see it as one possible remedy for the problems negatively affecting the trustworthiness of published results, but not everyone is convinced that the benefits of pre-registration outweigh its disadvantages. Here, we focus on one critique of pre-registration that is sometimes raised: namely, that pre-registration precludes exploration of data and therefore prevents important serendipitous findings. This would in turn slow down the scientific process, which, these critics argue, is just not worth it.

However, we believe that such critiques are groundless and stem from a misunderstanding of how pre-registration works in practice. For this reason, we want to share our experience with conducting and publishing a pre-registered study in which we tested our main hypothesis first and then explored the heck out of our data.

It is true that if all analyses were required to be pre-registered, exploratory analysis would not be possible. But that is not the case – you are supposed to pre-register only your main hypothesis and how you are going to test it. And we can all agree that experiments should test a specific hypothesis, and that researchers should know in advance which specific manipulation is supposed to influence which specific outcome. If your experiment does not do that, you should think about its design a bit more. But if you have a specific hypothesis and a way to test it in mind, there is no reason why you cannot commit to that hypothesis by putting your whole design in a time-stamped document (e.g. on OSF).

After you commit to your study design and hypothesis, there is little possibility of fooling yourself (and others) into believing that you predicted an outcome that you in fact did not. However, this does not mean that you cannot go crazy analyzing your data after you do the initial test of the main hypothesis. The only thing to watch for is that you keep your confirmatory and exploratory findings separated, not only in the results section but also in the discussion. It may be tempting to focus on interpreting the exploratory findings, especially when the data do not support your original hypothesis. However, that would defeat the purpose of pre-registration, which lies in making the results of hypothesis testing trustworthy and reliable.

All this writing about the possibility of doing exploratory analysis after a pre-registered confirmatory analysis was a bit abstract. Fortunately, we can illustrate the idea with a concrete example from our own study. We were interested in a moderator influencing the effectiveness of a positive psychology exercise. In particular, the “Three good things in life” exercise asks you to write down, each day, three good things that happened to you during that day. Initial research suggested that this exercise may improve happiness and decrease depressive symptoms. However, the effect did not appear immediately, but only after some time had passed. Now, recalling three things that went well during a day may not be easy for some people. And there is a literature on processing fluency which suggests that any difficulty encountered during the recollection may be interpreted as a sign that your day was not that good. Recalling the things probably gets easier with practice, and the effect of the exercise may thus occur only after a certain time. Which leads us to our study.

We recruited 204 students who did the exercise on our website for two weeks. However, not all students wrote three good things: they were randomly assigned to write between one and ten things each day. Our hypothesis was simple – we expected that writing more things would lead to a smaller increase in life satisfaction from a pre-exercise to a post-exercise measurement. It did not.
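As a rough sketch of what such a confirmatory test might look like, here is a simulation in the spirit of the design. The data, variable names, and the simple linear-fit analysis are entirely our illustrative assumptions; the actual study’s measures and analysis may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for the study: 204 participants, each randomly assigned
# to write 1-10 things per day; the change in life satisfaction is generated
# to be unrelated to the assignment, mirroring the null result reported above.
n = 204
n_things = rng.integers(1, 11, size=n)        # random assignment: 1..10
pre = rng.normal(5.0, 1.0, size=n)            # pre-exercise life satisfaction
post = pre + rng.normal(0.3, 0.8, size=n)     # small improvement, unrelated to n_things

change = post - pre
slope, intercept = np.polyfit(n_things, change, 1)
# The hypothesis predicts slope < 0; with these null data the fitted slope
# just hovers around zero.
```

The pre-registered part is exactly this one test of one slope; everything after it, on the same data set, is exploration.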

After testing the pre-registered primary hypothesis, it is possible to ask a lot of additional questions about the exercise. And this is where exploratory analysis comes into play. Did the number of good things influence life satisfaction a week, or six weeks, after the exercise? No. Did it influence positive or negative affect instead of life satisfaction? No, it did not. Could it be that participants did not find the recollection hard because they did not follow the instructions of the exercise? It does not seem so. Did the participants writing more things actually consider the exercise harder? A little.

We asked other questions in the exploratory analysis as well, but it should already be clear that pre-registration did not in any way stop us from exploring.

Hopefully, we have shown that pre-registration in no way precludes exploration of data. But all exploration takes place in the “garden of forking paths”, where decisions about how to proceed with the analysis are contingent on the data at hand. In this sense, pre-registration does not take anything away – it just makes truly confirmatory hypothesis testing possible by making it clear that it is not conducted within the garden of forking paths.

Furthermore, pre-registration has other benefits: because it forces you to think much more deeply in advance about your hypothesis, the current literature, study design, sample size, and exclusion criteria, pre-registered studies tend to be more thought through. And as Anna van’t Veer noted in her post – when conducting a pre-registered study, you do not have to do more work, you just do it in a different order. Of course, that is true only if you are not used to collecting large amounts of data on many different hypotheses and then writing up only those that “worked out”. In this way, pre-registration could improve publication (as in “making something public”) of null results and help to decrease the high proportion of false positives in the psychology knowledge base.

This post was written together with Marek Vranka.