[Table legend: # cherry picking, * p hacking, ^ HARKing; data from Agnoli et al. 2017]
Null Hypothesis Significance Testing (NHST)—discussed above—is a commonly diagnosed cause of the current replication crisis (see Szucs & Ioannidis 2017). The ubiquitous nature of NHST in the life and behavioural sciences is well documented, most recently by Cristea and Ioannidis (2018). This is an important precondition for establishing its role as a cause, since it could not be a cause if its actual use were rare. The dichotomous nature of NHST facilitates publication bias (Meehl 1967, 1978). For example, the language of accept and reject in hypothesis testing maps conveniently onto the acceptance and rejection of manuscripts, a fact that led Rosnow and Rosenthal (1989) to decry that “surely God loves the .06 nearly as much as the .05” (1989: 1277). Techniques that do not enshrine a dichotomous threshold would be harder to employ in service of publication bias. For example, a case has been made that estimation using effect sizes and confidence intervals (introduced above) would be less prone to being used in service of publication bias (Cumming 2012, Cumming and Calin-Jageman 2017).
As already mentioned, average statistical power in various disciplines is low. Not only is power often low, it is virtually never reported: fewer than 10% of published studies in psychology report statistical power, and even fewer in ecology do (Fidler et al. 2006). Explanations for the widespread neglect of statistical power often highlight the many common misconceptions and fallacies associated with p values (e.g., Haller & Krauss 2002; Gigerenzer 2018). For example, the inverse probability fallacy [1] has been used to explain why so many researchers fail to calculate and report statistical power (Oakes 1986).
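To illustrate the kind of calculation that such reporting would involve, the sketch below (a minimal illustration using a normal approximation; the effect size and sample size are hypothetical, not taken from any study discussed here) computes the approximate power of a two-sided, two-sample t-test:

```python
from scipy.stats import norm

def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test via the normal
    approximation: the test statistic is roughly Normal(d * sqrt(n/2), 1)
    under the alternative hypothesis."""
    z_crit = norm.ppf(1 - alpha / 2)                # critical value for the two-sided test
    noncentrality = d * (n_per_group / 2) ** 0.5    # expected value of the test statistic
    return norm.cdf(noncentrality - z_crit) + norm.cdf(-noncentrality - z_crit)

# A hypothetical study: standardised effect size d = 0.5, 20 participants per group
print(round(approx_power_two_sample(0.5, 20), 2))   # about 0.35
```

Under these hypothetical conditions power is only about 0.35, well below the conventional 0.8 benchmark; this is precisely the kind of information that, according to Fidler et al. (2006), most published studies never report.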
In 2017, a group of 72 authors proposed in a Nature Human Behaviour paper that the alpha level in statistical significance testing be lowered to 0.005 (from the current standard of 0.05) to improve the reproducibility rate of published research (Benjamin et al. 2018). A reply from a different set of 88 authors was published in the same journal, arguing against this proposal and stating instead that researchers should justify their alpha level based on context (Lakens et al. 2018). Several other replies have followed, including a call from Andrew Gelman and colleagues to abandon statistical significance altogether (McShane et al. 2018, Other Internet Resources). The exchange has become known on social media as the Alpha Wars (e.g., in the Barely Significant blog, Other Internet Resources). Independently, the American Statistical Association released a statement on the use of p values for the first time in its history, cautioning against their overinterpretation and pointing out the limits of the information they offer about replication (Wasserstein & Lazar 2016), and devoted a 2017 symposium to the theme “Scientific Method for the 21st Century: A World Beyond \(p < 0.05\)” (see Other Internet Resources).
A number of recent high-profile cases of scientific fraud have contributed considerably to the press around the reproducibility crisis in science. These cases (e.g., Diederik Stapel in psychology) are often used as a hook for media coverage, even though the crisis itself has very little to do with scientific fraud. (Note also that the Questionable Research Practices above are not typically counted as “fraud” or even “scientific misconduct”, despite their ethically dubious status.) For example, Fang, Grant Steen, and Casadevall (2012) estimated that 43% of retracted articles in biomedical research are withdrawn because of fraud. However, roughly half a million biomedical articles are published annually and only 400 of those are retracted (Oransky 2016, founder of the website RetractionWatch), so retractions amount to a very small proportion of the literature (approximately 0.1%), and fraudulent retractions to a smaller proportion still. There are, of course, cases of pharmaceutical companies exercising financial pressure on scientists and the publishing industry, which raise speculation about how many undetected (or unretracted) cases there may be in the literature. Having said that, there is widespread consensus amongst scientists in the field that the main cause of the current reproducibility crisis is the current incentive structure in science (publication bias, publish or perish, non-transparent statistical reporting, lack of rewards for data sharing). Whilst this incentive structure can push some researchers to scientific fraud, such cases appear to make up a very small proportion.
Many scientists believe that replication is epistemically valuable in some way; that is to say, replication serves a useful function in enhancing our knowledge, understanding or beliefs about reality. This section first discusses a problem about the epistemic value of replication studies—called the “experimenters’ regress”—then considers the claim that replication plays an epistemically valuable role in distinguishing scientific inquiry from other activities, and lastly examines a recent attempt to formalise the logic of replication in a Bayesian framework.
Collins (1985) articulated a widely discussed problem that is now known as the experimenters’ regress . He initially lays out the problem in the context of measurement (Collins 1985: 84). Suppose a scientist is trying to determine the accuracy of a measurement device and also the accuracy of a measurement result. Perhaps, for example, a scientist is using a thermometer to measure the temperature of a liquid, and it delivers a particular measurement result, say, 12 degrees Celsius.
The problem arises because of the interdependence of the accuracy of the measurement result and the accuracy of the measurement device: to know whether a particular measurement result is accurate, we need to test it against a measurement result that is previously known to be accurate, but to know that the result is accurate, we need to know that it has been obtained via an accurate measuring device, and so on. This, according to Collins, creates a “circle” which he refers to as the “experimenters’ regress”.
Collins extends the problem to scientific replication more generally. Suppose that an experiment B is a replication study of an initial experiment A, and that B’s result apparently conflicts with A’s result. This seeming conflict may have one of two interpretations: either (1) B was a competent replication and its conflicting result is evidence that A’s original result was mistaken, or (2) B was not a proper replication of A, in which case its result does not count against A’s.
The regress poses a problem about how to choose between these interpretations, a problem which threatens the epistemic value of replication studies if there are no rational grounds for choosing in a particular way. Determining whether one experiment is a proper replication of another is complicated by the facts that scientific writing conventions often omit precise details of experimental methodology (Collins 2016), and, furthermore, much of the knowledge that scientists require to execute experiments is tacit and “cannot be fully explicated or absolutely established” (Collins 1985: 73).
In the context of experimental methodology, Collins wrote:
To know an experiment has been well conducted, one needs to know whether it gives rise to the correct outcome. But to know what the correct outcome is, one needs to do a well-conducted experiment. But to know whether the experiment has been well conducted…! (2016: 66; ellipses original)
Collins holds that in such cases where a conflict of results arises, scientists tend to split into two groups, each holding opposing interpretations of the results. According to Collins, where such groups are “determined” and the “controversy runs deep” (Collins 2016: 67), the dispute between the groups cannot be resolved via further experimentation, for each additional result is subject to the problem posed by the experimenters’ regress. [2] In such cases, Collins claims that particular non-epistemic factors will partly determine which interpretation becomes the lasting view:
the career, social, and cognitive interests of the scientists, their reputations and that of their institutions, and the perceived utility for future work. (Franklin & Collins 2016: 99)
Franklin was the most vociferous opponent of Collins, although recent collaboration between the two has fostered some agreement (Collins 2016). Franklin presented a set of strategies for validating experimental results, all of which relate to “rational argument” on epistemic grounds (Franklin 1989: 459; 1994). Examples include appealing to experimental checks on measurement devices or eliminating potential sources of error in the experiment (Franklin & Collins 2016). He claimed that the fact that such strategies were evidenced in scientific practice “argues against those who believe that rational arguments plays little, if any, role” in such validation (Franklin 1989: 459), with Collins being an example. He interprets Collins as suggesting that the strategies for resolving debates about the validation of results are social factors or “culturally accepted practices” (Franklin 1989: 459) which do not provide reasons to underpin rational belief about results. Franklin (1994) further claims that Collins conflates the difficulty of successfully executing experiments with the difficulty of demonstrating that experiments have been properly executed, with Feest (2016) interpreting him to say that although such execution requires tacit knowledge, one can nevertheless appeal to strategies to demonstrate the validity of experimental findings.
Feest (2016) examines a case study involving debates about the Mozart effect in psychology (roughly speaking, the effect whereby listening to Mozart beneficially affects some aspect of intelligence or brain structure). Like Collins, she agrees that there is a problem in determining whether conflicting results suggest a putative replication experiment is not a proper replication attempt, in part because there is uncertainty about whether scientific concepts such as the Mozart effect have been appropriately operationalised in earlier or later experimental contexts. Unlike Collins (on her interpretation), however, she does not think that this uncertainty arises because scientists have inescapably tacit knowledge of the linguistic rules governing the meaning and application of concepts like the Mozart effect. Rather, the uncertainty arises because such concepts are themselves still developing, and because of assumptions about the world that are required to successfully draw inferences from experimental results. Experimental methodology then serves to reveal previously tacit assumptions about the application of concepts and the legitimacy of inferences, assumptions which are then susceptible to scrutiny.
For example, in her study of the Mozart effect, she notes that replication studies of the Mozart effect failed to find that Mozart music had a beneficial influence on spatial abilities. Rauscher, who was the first to report results supporting the Mozart effect, suggested that the later studies were not proper replications of her study (Rauscher, Shaw, and Ky 1993, 1995). She clarified that the Mozart effect applied only to a particular category of spatial abilities (spatio-temporal processes) and that the later studies operationalised the Mozart effect in terms of different spatial abilities (spatial recognition). Here, then, there was a difficulty in determining whether to interpret failed replication results as evidence against the initial results or rather as an indication that the replication studies were not proper replications. Feest claims this difficulty arose because of tacit knowledge or assumptions: assumptions about the application of the Mozart effect concept to different kinds of spatial abilities, about whether the world is such that Mozart music has an effect on such abilities and about whether the failure of Mozart to impact other kinds of spatial abilities warrants the inference that the Mozart effect does not exist. Contra Collins, however, experimental methodology enabled the explication and testing of these assumptions, thus allowing scientists to overcome the interpretive impasse.
Against this background, her overall argument is that scientists often are, and should be, sceptical towards each other’s results. However, this is not because of inescapably tacit knowledge and the inevitable failure of epistemic strategies for validating results. Rather, it is at least in part because of varying tacit assumptions that researchers hold about the meaning of concepts, about the world, and about what inferences to draw from experimental results. Progressive experimentation serves to reveal these tacit assumptions, which can then be scrutinised, leading to the accumulation of knowledge.
There is also other philosophical literature on the experimenters’ regress, including Teira’s (2013) paper arguing that particular experimental debiasing procedures are defensible against the regress from a contractualist perspective, according to which self-interested scientists have reason to adopt good methodological standards.
There is a widespread belief that science is distinct from other knowledge accumulation endeavours, and some have suggested that replication distinguishes (or is at least essential to) science in this respect. (See also the entry on science and pseudo-science.) According to the Open Science Collaboration, “Reproducible research practices are at the heart of sound research and integral to the scientific method” (OSC 2015: 7). Schmidt echoes this theme: “To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception” (2009: 90). Braude (1979) goes so far as to say that reproducibility is a “demarcation criterion between science and nonscience” (1979: 2). Similarly, Nosek, Spies, and Motyl state that:
[T]he scientific method differentiates itself from other approaches by publicly disclosing the basis of evidence for a claim…. In principle, open sharing of methodology means that the entire body of scientific knowledge can be reproduced by anyone. (2012: 618)
If replication played such an essential or distinguishing role in science, we might expect it to be a prominent theme in the history of science. Steinle (2016) considers the extent to which it is such a theme. He presents a variety of cases from the history of science where replication played very different roles, although he understands “replication” narrowly, to refer to cases where an experiment is re-run by different researchers. He claims that the role and value of replication in experimental practice is “much more complex than easy textbook accounts make us believe” (2016: 60), particularly since each scientific inquiry is always tied to a variety of contextual considerations that can affect the importance of replication. Such considerations include the relationship between experimental results and the background of accepted theory at the time, the practical and resource constraints on pursuing replication, and the perceived credibility of the researchers. These contextual factors, he claims, mean that replication was a key or even overriding determinant of the acceptance of research claims in some cases, but not in others.
For example, sometimes replication was sufficient for embracing a research claim, even if it conflicted with the background of accepted theory and left theoretical questions unresolved. A case of this is high-temperature superconductivity, the effect whereby an electric current can pass with zero resistance through a conductor at relatively high temperatures. In 1986, physicists Georg Bednorz and Alex Müller reported finding a material which acted as a superconductor at 35 kelvin (−238 degrees Celsius). Scientists around the world successfully replicated the effect, and Bednorz and Müller were awarded a Nobel Prize in Physics a year after their announcement. This case is remarkable since not only did the effect contradict the accepted physical theory at the time, but there is still no extant theory that adequately explains the effects which they reported (Di Bucchianico 2014).
As a contrasting example, however, sometimes claims were accepted without any replication. In the 1650s, German scientist Otto von Guericke designed and operated the world’s first vacuum pump that could visibly suck air out of a larger space. He performed experiments with his device before various audiences. Yet the replication of his experiments by others would have been very difficult, if not impossible: not only was Guericke’s pump both expensive and complicated to build, but it was also unlikely that his descriptions of it sufficed to enable anyone else to build the pump and consequently replicate his findings. Despite this, Steinle claims that “no doubts were raised about his results”, probably as a result of his “public performances that could be witnessed by a large number of participants” (2016: 55).
Steinle takes such historical cases to provide normative guidance for understanding the epistemic value of replication as context-sensitive: whether replication is necessary or sufficient for establishing a research claim will depend on a variety of considerations, such as those mentioned earlier. He consequently eschews wide-reaching claims, such as “it’s all about replicability” or “replicability does not decide anything” (2016: 60).
Earp and Trafimow (2015) attempt to formalise the way in which replication is epistemically valuable, and they do this using a Bayesian framework to explicate the inferences drawn from replication studies. They present the framework in a context similar to that of Collins (1985), noting that “it is well-nigh impossible to say conclusively what [replication results] mean” (Earp & Trafimow, 2015: 3). But while replication studies are often not conclusive, they do believe that such studies can be informative , and their Bayesian framework depicts how this is so.
The framework is set out with an example. Suppose an aficionado of Researcher A is highly confident that anything said by Researcher A is true. Some other researcher, Researcher B, then attempts to replicate an experiment by Researcher A, and Researcher B finds results that conflict with those of Researcher A. Earp and Trafimow claim that the aficionado might continue to be confident in Researcher A’s findings, but the aficionado’s confidence is likely to decrease slightly. As the number of failed replication attempts increases, the aficionado’s confidence accordingly decreases, eventually falling below 50% and thereby placing more confidence in the replication failures than in the findings initially reported by Researcher A.
Here, then, suppose we are interested in the probability that the original result reported by Researcher A is true given Researcher B’s first replication failure. Earp and Trafimow represent this probability with the notation \(p(T\mid F)\), where p is a probability function, T represents the proposition that the original result is true and F represents Researcher B’s replication failure. According to Bayes’s theorem below, this probability is calculable from the aficionado’s degree of confidence that the original result is true prior to learning of the replication failure \(p(T)\), their degree of expectation of the replication failure on the condition that the original result is true \(p(F\mid T)\), and the degree to which they would unconditionally expect a replication failure prior to learning of the replication failure \(p(F)\):

\[ p(T\mid F) = \frac{p(F\mid T)\, p(T)}{p(F)} \tag{1}\]
Relatedly, we could instead be interested in the confidence ratio that the original result is true or false given the failure to replicate. This ratio is representable as \(\frac{p(T\mid F)}{p(\neg T\mid F)}\), where \(\neg T\) represents the proposition that the original result is false. According to the standard Bayesian probability calculus, this ratio is in turn related to a product of two ratios: the prior confidence ratio that the original result is true rather than false, \(\frac{p(T)}{p(\neg T)}\), and the ratio of the expectations of a replication failure conditional on the original result being true or false, \(\frac{p(F\mid T)}{p(F\mid \neg T)}\).
This relation is expressed in the equation:

\[ \frac{p(T\mid F)}{p(\neg T\mid F)} = \frac{p(T)}{p(\neg T)} \times \frac{p(F\mid T)}{p(F\mid \neg T)} \tag{2}\]
Now Earp and Trafimow assign some values to the terms on the right-hand side of equation (2). Supposing that the aficionado is confident in the original results, they set the ratio \(\frac{p(T)}{p(\neg T)}\) to 50, meaning that the aficionado is initially fifty times more confident that the results are true than that they are false.
They also set the ratio \(\frac{p(F\mid T)}{p(F\mid \neg T)}\) concerning the conditional expectation of a replication failure to 0.5, meaning that the aficionado is considerably less confident that there will be a replication failure if the original result is true than if it is false. They point out that the extent to which the aficionado is less confident depends on the quality of so-called auxiliary assumptions about the replication experiment. Here, auxiliary assumptions are assumptions which enable one to infer that particular things should be observable if the theory under test is true. The intuitive idea is that the higher the quality of the assumptions about a replication study, the more one would expect to observe a successful replication if the original result were true. While they do not specify precisely what makes such auxiliary assumptions high “quality” in this context, presumably this quality concerns the extent to which the assumptions are probably true and the extent to which the replication experiment is an appropriate test of the veracity of the original results if the assumptions are true.
Once the ratios on the right-hand side of equation (2) are set in this way, one can see that a replication failure would reduce one’s confidence in the original results:

\[ \frac{p(T\mid F)}{p(\neg T\mid F)} = 50 \times 0.5 = 25 \]
Here, then, a replication failure would reduce the aficionado’s confidence that the original result was true, so that the aficionado would be only 25 times more confident that the result is true given a failure (as per \(\frac{p(T\mid F)}{p(\neg T\mid F)}\)) rather than 50 times more confident that it is true (as per \(\frac{p(T)}{p(\neg T)}\)).
Nevertheless, the aficionado may still be confident that the original result is true, but we can see how such confidence would decrease with successive replication failures. More formally, let \(F_N\) be the last replication failure in a sequence of N replication failures \(\langle F_1,F_2,\ldots,F_N\rangle\). Then, assuming the failures are conditionally independent given the truth or falsity of the original result, the aficionado’s confidence in the original result given the N replication failures is expressible in the equation: [3]

\[ \frac{p(T\mid F_1,\ldots,F_N)}{p(\neg T\mid F_1,\ldots,F_N)} = \frac{p(T)}{p(\neg T)} \times \prod_{i=1}^{N} \frac{p(F_i\mid T)}{p(F_i\mid \neg T)} \]
For example, suppose there are 10 replication failures, and so \(N=10\). Suppose further that the confidence ratio for each replication failure is set, as above, to 0.5, so that:

\[ \frac{p(T\mid F_1,\ldots,F_{10})}{p(\neg T\mid F_1,\ldots,F_{10})} = 50 \times 0.5^{10} \approx 0.05 \]
Here, then, the aficionado’s confidence in the original result decreases so that they are more confident that it was false than that it was true. Hence, on Earp and Trafimow’s Bayesian account, successive replication failures can progressively erode one’s confidence that an original result is true, even if one was initially highly confident in the original result and even if no single replication failure by itself was conclusive. [ 4 ]
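The arithmetic of this example can be made explicit in a short sketch (in Python; the prior odds of 50 and the per-failure likelihood ratio of 0.5 come from the example above, while the treatment of each failure as an equally diagnostic, independent piece of evidence is an assumption made for illustration):

```python
def posterior_odds(prior_odds, likelihood_ratio, n_failures):
    """Odds that the original result is true after n_failures replication failures,
    assuming each failure multiplies the odds by the same likelihood ratio
    p(F | T) / p(F | not-T)."""
    return prior_odds * likelihood_ratio ** n_failures

prior_odds = 50         # the aficionado starts 50 times more confident the result is true
likelihood_ratio = 0.5  # a failure is half as expected if the result is true as if it is false

for n in (1, 2, 5, 10):
    odds = posterior_odds(prior_odds, likelihood_ratio, n)
    print(f"after {n:2d} failure(s): odds = {odds:7.3f}, p(T | failures) = {odds / (1 + odds):.3f}")
```

On these assumptions the odds fall from 50 to 25 after one failure and drop below 1 (even odds) after six failures, matching the trajectory described above: no single failure is decisive, but the failures are cumulatively informative.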
Some putative merits of Earp and Trafimow’s account, then, are that it provides a formalisation whereby replication attempts are informative even if they are not conclusive, and that the formalisation provides a role both for the quantity of replication attempts and for auxiliary assumptions about the replications.
The aforementioned meta-science has unearthed a range of problems which give rise to the reproducibility crisis, and the open science movement has proposed or promoted various solutions—or reforms—for these problems. These reforms can be grouped into four categories: (a) methods and training, (b) reporting and dissemination, (c) peer review processes, and (d) evaluating new incentive structures (loosely following the categories used by Munafò et al. 2017 and Ioannidis et al. 2015). In subsections 4.1–4.4 below, we present a non-exhaustive list of initiatives in each of the above categories. These initiatives are reflections of various values and norms that are at the heart of the open science movement, and we discuss these values and norms in 4.5.
There has long been philosophical debate about what role values do and should play in science (Churchman 1948; Rudner 1953; Douglas 2016), and the reproducibility crisis is intimately connected to questions about the operations of, and interconnections between, such values. In particular, Nosek et al. (2017) argue that there is a tension between truth and publishability. More specifically, for reasons discussed in section 2 above, the accuracy of scientific results is compromised by the value which journals place on novel and positive results and, consequently, by scientists who, valuing career success, seek to publish such results exclusively in these journals. Many others in addition to Nosek et al. (Hackett 2005; Martin 1992; Sovacool 2008) have also taken issue with the value which journals and funding bodies place on novelty.
Some might interpret this tension as a manifestation of how epistemic values (such as truth and replicability) can be compromised by (arguably) non-epistemic values, such as the value of novel, interesting or surprising results. Epistemic values are typically taken to be values that, in the words of Steel, “promote the acquisition of true beliefs” (2010: 18; see also Goldman 1999). Canonical examples of epistemic values include the predictive accuracy and internal consistency of a theory. Epistemic values are often contrasted with putative non-epistemic or non-cognitive values, which include ethical or social values like, for example, the novelty of a theory or its ability to improve well-being by lessening power inequalities (Longino 1996). Of course, there is no complete consensus as to precisely what counts as an epistemic or non-epistemic value (Rooney 1992; Longino 1996). Longino, for example, claims that, other things being equal, novelty counts in favour of accepting a theory, and convincingly argues that, in some contexts, it can serve as a “protection against unconscious perpetuation of the sexism and androcentrism” in traditional science (1997: 22). However, she does not discuss novelty specifically in the context of the reproducibility crisis.
Giner-Sorolla (2012), however, does discuss novelty in the context of the crisis, and he offers another perspective on its value. He claims that one reason novelty has been used to define what is publishable or fundable is that it is relatively easy for researchers to establish and for reviewers and editors to detect. Yet, Giner-Sorolla argues, novelty for its own sake perhaps should not be valued, and should in fact be recognized as merely an operationalisation of a deeper concept, such as “ability to advance the field” (567). Giner-Sorolla goes on to point out how such shallow operationalisations of important concepts often lead to problems, for example, using statistical significance to measure the importance of results, or measuring the quality of research by how well outcomes fit with experimenters’ prior expectations.
Values are closely connected to discussions about norms in the open science movement. Vazire (2018) and others invoke norms of science—communality, universalism, disinterestedness and organised skepticism—in setting the goals for open science, norms originally articulated by Robert Merton (1942). Each such norm arguably reflects a value which Merton advocated, and each norm may be opposed by a counternorm which denotes behaviour in conflict with the norm. For example, the norm of communality (which Merton called “communism”) reflects the value of collaboration and the common ownership of scientific goods, since the norm recommends such collaboration and common ownership. Advocates of open science see such norms, and the values which they reflect, as an aim for open science. For example, the norm of communality is reflected in sharing and making data open, and in open access publishing. In contrast, the counternorm of secrecy is associated with a closed, for-profit publishing system (Anderson et al. 2010). Likewise, assessing scientific work on its merits upholds the norm of universalism—that the evaluation of research claims should not depend on the socio-demographic characteristics of the proponents of such claims. In contrast, assessing work by the age, status or institution of the researcher, or by the metrics of the journal it is published in, reflects a counternorm of particularism.
Vazire (2018) and others have argued that, at the moment, scientific practice is dominated by counternorms and that a move to Mertonian norms is a goal of the open science reform movement. In particular, self-interestedness, as opposed to the norm of disinterestedness, motivates p-hacking and other Questionable Research Practices. Similarly, a desire to protect one’s professional reputation motivates resistance to having one’s work replicated by others (Vazire 2018). This in turn reinforces a counternorm of organized dogmatism rather than organized skepticism, which, according to Merton, involves the “temporary suspension of judgment and the detached scrutiny of beliefs” (Merton 1973).
Anderson et al.’s (2010) focus groups and surveys of scientists suggest that scientists do want to adhere to Merton’s norms but that the current incentive structure of science makes this difficult. Changing the structure of penalty and reward systems within science to promote communality, universalism, disinterestedness and organized skepticism instead of their counternorms is an ongoing challenge for the open science reform movement. As Pashler and Wagenmakers (2012) have said:
replicability problems will not be so easily overcome, as they reflect deep-seated human biases and well-entrenched incentives that shape the behavior of individuals and institutions. (2012: 529)
The effort to promote such values and norms has generated heated controversy. Some early responses to the Reproducibility Project: Psychology and the Many Labs projects were highly critical, not just of the substance of the work but of its nature and process. Calls for openness were interpreted as reflecting mistrust, and attempts to replicate others’ work as personal attacks (e.g., Schnall 2014 in Other Internet Resources). Nosek, Spies, and Motyl (2012) argue that calls for openness should not be interpreted as mistrust:
Opening our research process will make us feel accountable to do our best to get it right; and, if we do not get it right, to increase the opportunities for others to detect the problems and correct them. Openness is not needed because we are untrustworthy; it is needed because we are human. (2012: 626)
Exchanges related to this have become known as the tone debate.
The subject of reproducibility is associated with a turbulent period in contemporary science. This period has called for a re-evaluation of the values, incentives, practices and structures which underpin scientific inquiry. While the meta-science has painted a bleak picture of reproducibility in some fields, it has also inspired a parallel movement to strengthen the foundations of science. However, more progress remains to be made, especially in understanding the solutions to the reproducibility crisis. In this regard, there are fruitful avenues for future research, including a deeper exploration of the role that epistemic and non-epistemic values can or should play in scientific inquiry.
The massive project shows that reproducibility problems plague even top scientific journals
Brian Handwerk, Science Correspondent
Academic journals and the press regularly serve up fresh helpings of fascinating psychological research findings. But how many of those experiments would produce the same results a second time around?
According to work presented today in Science, fewer than half of 100 studies published in 2008 in three top psychology journals could be replicated successfully. The international effort included 270 scientists who re-ran other people's studies as part of The Reproducibility Project: Psychology, led by Brian Nosek of the University of Virginia.
The eye-opening results don't necessarily mean that those original findings were incorrect or that the scientific process is flawed. When one study finds an effect that a second study can't replicate, there are several possible reasons, says co-author Cody Christopherson of Southern Oregon University. Study A's result may be false, or Study B's results may be false—or there may be some subtle differences in the way the two studies were conducted that impacted the results.
“This project is not evidence that anything is broken. Rather, it's an example of science doing what science does,” says Christopherson. “It's impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.”
Across the sciences, research is considered reproducible when an independent team can conduct a published experiment, following the original methods as closely as possible, and get the same results. It's one key part of the process for building evidence to support theories. Even today, 100 years after Albert Einstein presented his general theory of relativity, scientists regularly repeat tests of its predictions and look for cases where his famous description of gravity does not apply.
"Scientific evidence does not rely on trusting the authority of the person who made the discovery," team member Angela Attwood , a psychology professor at the University of Bristol, said in a statement "Rather, credibility accumulates through independent replication and elaboration of the ideas and evidence."
The Reproducibility Project, a community-based crowdsourcing effort, kicked off in 2011 to test how well this measure of credibility applies to recent research in psychology. Scientists, some recruited and some volunteers, reviewed a pool of studies and selected one for replication that matched their own interest and expertise. Their data and results were shared online and reviewed and analyzed by other participating scientists for inclusion in the large Science study.
To help improve future research, the project analysis attempted to determine which kinds of studies fared the best, and why. They found that surprising results were the hardest to reproduce, and that the experience or expertise of the scientists who conducted the original experiments had little to do with successful replication.
The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is statistically significant or plausibly due to chance. A higher value means a result is more likely to be a fluke, while a lower value means the result is considered statistically significant.
The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated.
But Christopherson suspects that most of his co-authors would not want the study to be taken as a ringing endorsement of P values, because they recognize the tool's limitations. And at least one P value problem was highlighted in the research: The original studies had relatively little variability in P value, because most journals have established a cutoff of 0.05 for publication. The trouble is that value can be reached by being selective about data sets, which means scientists looking to replicate a result should also carefully consider the methods and the data used in the original study.
It's also not yet clear whether psychology might be a particularly difficult field for reproducibility—a similar study is currently underway on cancer biology research. In the meantime, Christopherson hopes that the massive effort will spur more such double-checks and revisitations of past research to aid the scientific process.
“Getting it right means regularly revisiting past assumptions and past results and finding new ways to test them. The only way science is successful and credible is if it is self-critical,” he notes.
Unfortunately there are disincentives to pursuing this kind of research, he says: “To get hired and promoted in academia, you must publish original research, so direct replications are rarer. I hope going forward that the universities and funding agencies responsible for incentivizing this research—and the media outlets covering them—will realize that they've been part of the problem, and that devaluing replication in this way has created a less stable literature than we'd like.”
Chapter 5: Replicability
Replication is one of the key ways scientists build confidence in the scientific merit of results. When the result from one study is found to be consistent by another study, it is more likely to represent a reliable claim to new knowledge. As Popper (2005 , p. 23) wrote (using “reproducibility” in its generic sense):
We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence,’ but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.
However, a successful replication does not guarantee that the original scientific results of a study were correct, nor does a single failed replication conclusively refute the original claims. A failure to replicate previous results can be due to any number of factors, including the discovery of an unknown effect, inherent variability in the system, inability to control complex variables, substandard research practices, and, quite simply, chance. The nature of the problem under study and the prior likelihoods of possible results in the study, the type of measurement instruments and research design selected, and the novelty of the area of study and therefore lack of established methods of inquiry can also contribute to non-replicability. Because of the complicated relationship between replicability and its variety of sources, the validity of scientific results should be considered in the context of an entire body of evidence, rather than an individual study or an individual replication. Moreover, replication may be a matter of degree, rather than a binary result of “success” or “failure.” 1 We explain in Chapter 7 how research synthesis, especially meta-analysis, can be used to evaluate the evidence on a given question.
How does one determine the extent to which a replication attempt has been successful? When researchers investigate the same scientific question using the same methods and similar tools, the results are not likely to be identical—unlike in computational reproducibility in which bitwise agreement between two results can be expected (see Chapter 4 ). We repeat our definition of replicability, with emphasis added: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.
Determining consistency between two different results or inferences can be approached in a number of ways ( Simonsohn, 2015 ; Verhagen and Wagenmakers, 2014 ). Even if one considers only quantitative criteria for determining whether two results qualify as consistent, there is variability across disciplines ( Zwaan et al., 2018 ; Plant and Hanisch, 2018 ). The Royal Netherlands Academy of Arts and Sciences (2018 , p. 20) concluded that “it is impossible to identify a single, universal approach to determining [replicability].” As noted in Chapter 2 , different scientific disciplines are distinguished in part by the types of tools, methods, and techniques used to answer questions specific to the discipline, and these differences include how replicability is assessed.
___________________
1 See, for example, the cancer biology project in Table 5-1 in this chapter.
Acknowledging the different approaches to assessing replicability across scientific disciplines, however, we emphasize eight core characteristics and principles:
2 Cova et al. (2018, fn. 3) discuss the challenge of defining sufficiently similar as well as the interpretation of the results:
In practice, it can be hard to determine whether the ‘sufficiently similar’ criterion has actually been fulfilled by the replication attempt, whether in its methods or in its results ( Nakagawa and Parker 2015 ). It can therefore be challenging to interpret the results of replication studies, no matter which way these results turn out ( Collins, 1975 ; Earp and Trafimow, 2015 ; Maxwell et al., 2015 ).
3 See Table 5-1 , for an example of this in the reviews of a psychology replication study by Open Science Collaboration (2015) and Patil et al. (2016) .
to “Are Results A and B sufficiently divergent (given their proximity and uncertainty) so as to qualify as a non-replication ?” It may be advantageous, in assessing degrees of replicability, to define a relatively high threshold of similarity that qualifies as “replication,” a relatively low threshold of similarity that qualifies as “non-replication,” and the intermediate zone between the two thresholds that is considered “indeterminate.” If a second study has low power and wide uncertainties, it may be unable to produce any but indeterminate results.
The final point above is reinforced by a recent special edition of the American Statistician in which the use of a statistical significance threshold in reporting is strongly discouraged due to overuse and wide misinterpretation (Wasserstein et al., 2019). A figure from Amrhein et al. (2019b) also demonstrates this point, as shown in Figure 5-1.
One concern voiced by some researchers about using a proximity-uncertainty attribute to assess replicability is that such an assessment favors studies with large uncertainties; the potential consequence is that many researchers would choose to perform low-power studies to increase the replicability chances ( Cova et al., 2018 ). While two results with large uncertainties and within proximity, such that the uncertainties overlap with each other, may be consistent with replication, the large uncertainties indicate that not much confidence can be placed in that conclusion.
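As a rough illustration of a proximity-uncertainty comparison (a sketch of one possible criterion with hypothetical numbers, not a method prescribed by the committee), the snippet below checks whether a replication's effect estimate falls inside the original estimate's 95 percent confidence interval and whether the two intervals overlap at all:

```python
def ci95(estimate, std_error):
    """Approximate 95% confidence interval using the normal 1.96 multiplier."""
    return (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

def compare_results(orig, orig_se, rep, rep_se):
    """Two coarse proximity-uncertainty checks for a replication attempt."""
    orig_lo, orig_hi = ci95(orig, orig_se)
    rep_lo, rep_hi = ci95(rep, rep_se)
    inside = orig_lo <= rep <= orig_hi                  # replication estimate within original CI
    overlap = rep_lo <= orig_hi and orig_lo <= rep_hi   # the two intervals overlap at all
    return inside, overlap

# Hypothetical standardised effect sizes and standard errors
print(compare_results(orig=0.60, orig_se=0.15, rep=0.25, rep_se=0.20))
# -> (False, True): the replication estimate falls outside the original CI,
#    but the wide, overlapping intervals leave the comparison indeterminate.
```

With the wide intervals in this hypothetical example the two checks disagree, which illustrates why large uncertainties can leave a comparison indeterminate rather than yielding a clear replication or non-replication.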
CONCLUSION 5-1: Different types of scientific studies lead to different or multiple criteria for determining a successful replication. The choice of criteria can affect the apparent rate of non-replication, and that choice calls for judgment and explanation.
CONCLUSION 5-2: A number of parametric and nonparametric methods may be suitable for assessing replication across studies. However, a restrictive and unreliable approach would accept replication only when the results in both studies have attained “statistical significance,” that is, when the p-values in both studies have fallen below a selected threshold. Rather, in determining replication, it is important to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, standard deviations (uncertainties), and additional metrics tailored to the subject matter.
The committee was asked to assess what is known and, if necessary, identify areas that may need more information to ascertain the extent of non-replicability in scientific and engineering research. The committee examined current efforts to assess the extent of non-replicability within several fields, reviewed literature on the topic, and heard from expert panels during its public meetings. We also drew on the previous work of committee members and other experts in the field of replicability of research.
Some efforts to assess the extent of non-replicability in scientific research directly measure rates of replication, while others examine indirect measures to infer the extent of non-replication. Approaches to assessing non-replicability rates include
This section discusses each of these lines of evidence.
The most direct method to assess replicability is to perform a study following the original methods of a previous study and to compare the new results to the original ones. Some high-profile replication efforts in recent years include studies by Amgen, which showed low replication rates in biomedical research ( Begley and Ellis, 2012 ), and work by the Center for Open Science on psychology ( Open Science Collaboration, 2015 ), cancer research ( Nosek and Errington, 2017 ), and social science ( Camerer et al., 2018 ). In these examples, a set of studies was selected and a single replication attempt was made to confirm results of each previous study, or one-to-one comparisons were made. In other replication studies, teams of researchers performed multiple replication attempts on a single original result, or many-to-one comparisons (see e.g., Klein et al., 2014 ; Hagger et al., 2016 ; and Cova et al., 2018 in Table 5-1 ).
Other measures of replicability include assessments that can provide indicators of bias, errors, and outliers, including, for example, computational data checks of reported numbers and comparison of reported values against a database of previously reported values. Such assessments can identify data that are outliers to previous measurements and may signal the need for additional investigation to understand the discrepancy. 4 Table 5-1 summarizes the direct and indirect replication studies assembled by the committee. Other sources of non-replicabilty are discussed later in this chapter in the Sources of Non-Replicability section.
4 There is risk of missing a new discovery by rejecting data outliers without further investigation.
Many direct replication studies are not reported as such. Replication—especially of surprising results or those that could have a major impact—occurs in science often without being labelled as a replication. Many scientific fields conduct reviews of articles on a specific topic—especially on new topics or topics likely to have a major impact—to assess the available data and determine which measurements and results are rigorous (see Chapter 7 ). Therefore, replicability studies included as part of the scientific literature but not cited as such add to the difficulty in assessing the extent of replication and non-replication.
One example of this phenomenon relates to research on hydrogen storage capacity. The U.S. Department of Energy (DOE) issued a target storage capacity in the mid-1990s. One group using carbon nanotubes reported surprisingly high values that met DOE’s target ( Hynek et al., 1997 ); other researchers who attempted to replicate these results could not do so. At the same time, other researchers were also reporting high values of hydrogen capacity in other experiments. In 2003, an article reviewed previous studies of hydrogen storage values and reported new research results, which were later replicated ( Broom and Hirscher, 2016 ). None of these studies was explicitly called an attempt at replication.
Based on the content of the collected studies in Table 5-1 , one can observe that the
The replication studies such as those shown in Table 5-1 are not necessarily indicative of the actual rate of non-replicability across science for a number of reasons: the studies to be replicated were not randomly chosen, the replications had methodological shortcomings, many replication studies are not reported as such, and the reported replication studies found widely varying rates of non-replication ( Gilbert et al., 2016 ). At the same time, replication studies often provide more and better-quality evidence than most original studies alone, and they highlight such methodological features as high precision or statistical power, preregistration, and multi-site collaboration ( Nosek, 2016 ). Some would argue that focusing on replication of a single study as a way to improve the efficiency of science is ill-placed. Rather, reviews of cumulative evidence on a subject, to gauge both the overall effect size and generalizability, may be more useful ( Goodman, 2018 ; and see Chapter 7 ).
Apart from specific efforts to replicate others’ studies, investigators will typically confirm their own results, as in a laboratory experiment, prior to publication.
TABLE 5-1 Examples of Replication Studies
Field and Author(s) | Description | Results | Type of Assessment |
---|---|---|---|
Experimental Philosophy ( ) | A group of 20 research teams performed replication studies of 40 experimental philosophy studies published between 2003 and 2015 | 70% of the 40 studies were replicated by comparing the original effect size to the confidence interval (CI) of the replication. | Direct |
Behavioral Science, Personality Traits Linked to Life Outcomes ( ) | Performed replications of 78 previously published associations between the Big Five personality traits and consequential life outcomes | 87% of the replication attempts were statistically significant in the expected direction, and effects were typically 77% as strong as the corresponding original effects. | Direct |
Behavioral Science, Ego-Depletion Effect ( ) | Multiple laboratories (23 in total) conducted replications of a standardized ego-depletion protocol based on a sequential-task paradigm by | Meta-analysis of the studies revealed that the size of the ego-depletion effect was small with 95% CI that encompassed zero (d = 0.04, 95% CI [−0.07, 0.15]). | |
General Biology, Preclinical Animal Studies ( ) | Attempt by researchers from Bayer HealthCare to validate data on potential drug targets obtained in 67 projects by copying models exactly or by adapting them to internal needs | Published data were completely in line with the results of the validation studies in 20%-25% of cases. | Direct |
Oncology, Preclinical Studies ( ) | Attempt by Amgen team to reproduce the results of 53 “landmark” studies | Scientific results were confirmed in 11% of the studies. | Direct |
Genetics, Preclinical Studies ( ) | Replication of data analyses provided in 18 articles on microarray-based gene expression studies | Of the 18 studies, 2 analyses (11%) were replicated; 6 were partially replicated or showed some discrepancies in results; and 10 could not be replicated. | Direct |
Experimental Psychology ( ) | Replication of 13 psychological phenomena across 36 independent samples | 77% of phenomena were replicated consistently. | Direct |
Experimental Psychology, Many Labs 2 ( ) | Replication of 28 classic and contemporary published studies | 54% of replications produced a statistically significant effect in the same direction as the original study, 75% yielded effect sizes smaller than the original ones, and 25% yielded larger effect sizes than the original ones. | Direct |
Experimental Psychology ( ) | Attempt to independently replicate selected results from 100 studies in psychology | 36% of the replication studies produced significant results, compared to 97% of the original studies. The mean effect sizes were halved. | Direct |
Experimental Psychology ( ) | Using reported data from the replication study in psychology, reanalyzed the results | 77% of the studies replicated by comparing the original effect size to an estimated 95% CI of the replication. | Direct |
Experimental Psychology ( ) | Attempt to replicate 21 systematically selected experimental studies in the social sciences published in and in 2010-2015 | Found a significant effect in the same direction as the original study for 62% (13 of 21) studies, and the effect size of the replications was on average about 50% of the original effect size. | Direct |
Empirical Economics ( ) | 2-year study that collected programs and data from authors and attempted to replicate their published results on empirical economic research | Two of nine replications were successful, three “near” successful, and four unsuccessful; findings suggest that inadvertent errors in published empirical articles are a commonplace rather than a rare occurrence. | Direct |
Economics ( ) | Progress report on the number of journals with data sharing requirements and an assessment of 167 studies | 10 journals explicitly note they publish replications; of 167 published replication studies, approximately 66% were unable to confirm the original results; 12% disconfirmed at least one major result of the original study, while confirming others. | N/A |
Economics ( ) | An effort to replicate 18 studies published in the and the from 2011-2014 | Significant effect in the same direction as the original study found for 11 replications (61%); on average, the replicated effect size was 66% of the original. | Direct |
Chemistry ( ; ) | Collaboration with National Institute of Standards and Technology (NIST) to check new data against NIST database, 13,000 measurements | 27% of papers reporting properties of adsorption had data that were outliers; 20% of papers reporting carbon dioxide isotherms as outliers. | Indirect |
Chemistry ( ) | Collaboration with NIST, Thermodynamics Research Center (TRC) databases, prepublication check of solubility, viscosity, critical temperature, and vapor pressure | 33% experiments had data problems, such as uncertainties too small, reported values outside of TRC database distributions. | Indirect |
Biology Reproducibility Project: Cancer Biology | Large-scale replication project to replicate key results in 29 cancer papers published in , , and other high-impact journals | The first five articles have been published; two replicated important parts of the original papers, one did not replicate, and two were uninterpretable. | Direct |
Psychology, Statistical Checks ( ) | Statcheck tool used to test statistical values within psychology articles from 1985-2013 | 49.6% of the articles with null hypothesis statistical test (NHST) results contained at least one inconsistency (8,273 of the 16,695 articles), and 12.9% (2,150) of the articles with NHST results contained at least one gross inconsistency. | Indirect |
Engineering, Computational Fluid Dynamics ( ) | Full replication studies of previously published results on bluff-body aerodynamics, using four different computational methods | Replication of the main result was achieved in three out of four of the computational efforts. | Direct |
Psychology, Many Labs 3 ( ) | Attempt to replicate 10 psychology studies in one online session | 3 of 10 studies replicated at p < 0.05. | Direct |
Psychology ( ) | Argued that one of the failed replications in Ebersole et al. was due to changes in the procedure. They randomly assigned participants to a version closer to the original or to Ebersole et al.’s version. | The original study replicated when the original procedures were followed more closely, but not when the Ebersole et al. procedures were used. | Direct |
Psychology ( ) | 17 different labs attempted to replicate one study on facial feedback by . | None of the studies replicated the result at p < 0.05. | Direct |
Psychology ( ) | Pointed out that all of the studies in the replication project changed the procedure by videotaping participants. Conducted a replication in which participants were randomly assigned to be videotaped or not. | The original study was replicated when the original procedure was followed (p = 0.01); the original study was not replicated when the video camera was present (p = 0.85). | Direct |
Psychology ( ) | 31 labs attempted to replicate a study by Schooler and Engstler-Schooler (1990). | Replicated the original study. The effect size was much larger when the original study was replicated more faithfully (the first set of replications inadvertently introduced a change in the procedure). | Direct |
NOTES: Some of the studies in this table also appear in Table 4-1 as they evaluated both reproducibility and replicability. N/A = not applicable.
a From Cova et al. (2018 , p. 14): “For studies reporting statistically significant results, we treated as successful replications for which the replication 95 percent CI [confidence interval] was not lower than the original effect size. For studies reporting null results, we treated as successful replications for which original effect sizes fell inside the bounds of the 95 percent CI.”
b From Soto (2019 , p. 7, fn. 1): “Previous large-scale replication projects have typically treated the individual study as the primary unit of analysis. Because personality-outcome studies often examine multiple trait-outcome associations, we selected the individual association as the most appropriate unit of analysis for estimating replicability in this literature.”
publication. More generally, independent investigators may replicate prior results of others before conducting, or in the course of conducting, a study to extend the original work. These types of replications are not usually published as separate replication studies.
Several experts who have studied replicability within and across fields of science and engineering provided their perspectives to the committee. Brian Nosek, cofounder and director of the Center for Open Science, said there was “not enough information to provide an estimate with any certainty across fields and even within individual fields.” In a recent paper discussing scientific progress and problems, Richard Shiffrin, professor of psychology and brain sciences at Indiana University, and colleagues argued that there are “no feasible methods to produce a quantitative metric, either across science or within the field” to measure the progress of science ( Shiffrin et al., 2018 , p. 2632). Skip Lupia, now serving as head of the Directorate for Social, Behavioral, and Economic Sciences at the National Science Foundation, said that there is not sufficient information to be able to definitively answer the extent of non-reproducibility and non-replicability, but there is evidence of p- hacking and publication bias (see below), which are problems. Steven Goodman, the codirector of the Meta-Research Innovation Center at Stanford University (METRICS), suggested that the focus ought not be on the rate of non-replication of individual studies, but rather on cumulative evidence provided by all studies and convergence to the truth. He suggested the proper question is “How efficient is the scientific enterprise in generating reliable knowledge, what affects that reliability, and how can we improve it?”
Surveys of scientists about issues of replicability or on scientific methods are indirect measures of non-replicability. For example, Nature published the results of a survey in 2016 in an article titled “1,500 Scientists Lift the Lid on Reproducibility” (Baker, 2016); 5 this article reported that a large percentage of researchers who responded to an online survey believe that replicability is a problem. The article has been widely cited by researchers studying subjects ranging from cardiovascular disease to crystal structures (Warner et al., 2018; Ziletti et al., 2018). Surveys and studies have also assessed the prevalence of specific problematic research practices, such as a 2018 survey about questionable research practices in ecology and evolution
5 Nature uses the word “reproducibility” to refer to what we call “replicability.”
( Fraser et al., 2018 ). However, many of these surveys rely on poorly defined sampling frames to identify populations of scientists and do not use probability sampling techniques. The fact that nonprobability samples “rely mostly on people . . . whose selection probabilities are unknown [makes it] difficult to estimate how representative they are of the [target] population” ( Dillman, Smyth, and Christian, 2014 , pp. 70, 92). In fact, we know that people with a particular interest in or concern about a topic, such as replicability and reproducibility, are more likely to respond to surveys on the topic ( Brehm, 1993 ). As a result, we caution against using surveys based on nonprobability samples as the basis of any conclusion about the extent of non-replicability in science.
High-quality researcher surveys are expensive and pose significant challenges, including constructing exhaustive sampling frames, reaching adequate response rates, and minimizing other nonresponse biases that might differentially affect respondents at different career stages or in different professional environments or fields of study ( Corley et al., 2011 ; Peters et al., 2008 ; Scheufele et al., 2009 ). As a result, the attempts to date to gather input on topics related to replicability and reproducibility from larger numbers of scientists ( Baker, 2016 ; Boulbes et al., 2018 ) have relied on convenience samples and other methodological choices that limit the conclusions that can be made about attitudes among the larger scientific community or even for specific subfields based on the data from such surveys. More methodologically sound surveys following guidelines on adoption of open science practices and other replicability-related issues are beginning to emerge. 6 See Appendix E for a discussion of conducting reliable surveys of scientists.
Retractions of published articles may be related to their non-replicability. As noted in a recent study on retraction trends ( Brainard, 2018 , p. 392), “Overall, nearly 40% of retraction notices did not mention fraud or other kinds of misconduct. Instead, the papers were retracted because of errors, problems with reproducibility [or replicability], and other issues.” Overall, about one-half of all retractions appear to involve fabrication, falsification, or plagiarism. Journal article retractions in biomedicine increased from 50-60 per year in the mid-2000s, to 600-700 per year by the mid-2010s ( National Library of Medicine, 2018 ), and this increase attracted much commentary and analysis (see, e.g., Grieneisen and Zhang, 2012 ). A recent comprehensive review of an extensive database of 18,000 retracted papers
6 See https://cega.berkeley.edu/resource/the-state-of-social-science-betsy-levy-paluck-bitssannual-meeting-2018 .
dating back to the 1970s found that while the number of retractions has grown, the rate of increase has slowed; approximately 4 of every 10,000 papers are now retracted ( Brainard, 2018 ). Overall, the number of journals that report retractions has grown from 44 journals in 1997 to 488 journals in 2016; however, the average number of retractions per journal has remained essentially flat since 1997.
These data suggest that more journals are attending to the problem of articles that need to be retracted rather than a growing problem in any one discipline of science. Fewer than 2 percent of authors in the database account for more than one-quarter of the retracted articles, and the retractions of these frequent offenders are usually based on fraud rather than errors that lead to non-replicability. The Institute of Electrical and Electronics Engineers alone has retracted more than 7,000 abstracts from conferences that took place between 2009 and 2011, most of which had authors based in China ( McCook, 2018 ).
The body of evidence on the extent of non-replicability gathered by the committee is not a comprehensive assessment across all fields of science, nor even within any given field of study. Such a comprehensive effort would be daunting due to the vast amount of research published each year and the diversity of scientific and engineering fields. Among studies of replication that are available, there is no uniform approach across scientific fields to gauge replication between two studies. The experts who contributed their perspectives to the committee all question the feasibility of such a science-wide assessment of non-replicability.
While the evidence base assessed by the committee may not be sufficient to permit a firm quantitative answer on the scope of non-replicability, it does support several findings and a conclusion.
FINDING 5-1: There is an uneven level of awareness of issues related to replicability across fields and even within fields of science and engineering.
FINDING 5-2: Efforts to replicate studies aimed at discerning the effect of an intervention in a study population may find a similar direction of effect, but a different (often smaller) size of effect.
FINDING 5-3: Studies that directly measure replicability take substantial time and resources.
FINDING 5-4: Comparing results across replication studies may be compromised because different replication studies may test different study attributes and rely on different standards and measures for a successful replication.
FINDING 5-5: Replication studies in the natural and clinical sciences (general biology, genetics, oncology, chemistry) and social sciences (including economics and psychology) report frequencies of replication ranging from fewer than one out of five studies to more than three out of four studies.
CONCLUSION 5-3: Because many scientists routinely conduct replication tests as part of a follow-on work and do not report replication results separately, the evidence base of non-replicability across all science and engineering research is incomplete.
Non-replicability can arise from a number of sources. In some cases, non-replicability arises from the inherent characteristics of the systems under study. In others, decisions made by a researcher or researchers in study execution that reasonably differ from the original study, such as judgment calls on data cleaning or the selection of parameter values within a model, may also result in non-replication. Other sources of non-replicability arise from conscious or unconscious bias in reporting, mistakes and errors (including misuse of statistical methods), and problems in study design, execution, or interpretation in either the original study or the replication attempt. In many instances, non-replication between two results could be due to a combination of multiple sources, but it is not generally possible to identify the source without careful examination of the two studies. Below, we review these sources of non-replicability and discuss how researchers’ choices can affect each. Unless otherwise noted, the discussion below focuses on the non-replicability between two results (i.e., a one-to-one comparison) when assessed using proximity and uncertainty of both results.
Non-replicability is a normal part of the scientific process and can be due to the intrinsic variation and complexity of nature, the scope of current scientific knowledge, and the limits of current technologies. Highly surprising and unexpected results are often not replicated by other researchers. In other instances, a second researcher or research team may purposefully make decisions that lead to differences in parts of the study. As long as these differences are reported with the final results, these may be reasonable actions to take yet result in non-replication. In scientific reporting, uncertainties within the study (such as the uncertainty within measurements, the potential interactions between parameters, and the variability of the
system under study) are estimated, assessed, characterized, and accounted for through uncertainty and probability analysis. When uncertainties are unknown and not accounted for, this can also lead to non-replicability. In these instances, non-replicability of results is a normal consequence of studying complex systems with imperfect knowledge and tools. When non-replication of results due to sources such as those listed above are investigated and resolved, it can lead to new insights, better uncertainty characterization, and increased knowledge about the systems under study and the methods used to study them. See Box 5-1 for examples of how investigations of non-replication have been helpful to increasing knowledge.
The susceptibility of any line of scientific inquiry to sources of non-replicability depends on many factors, including factors inherent to the system under study, such as the complexity and controllability of that system.
Studies that pursue lines of inquiry that are able to better estimate and analyze the uncertainties associated with the variables in the system and control the methods that will be used to conduct the experiment are more replicable. On the other end of the spectrum, studies that are more prone to non-replication often involve indirect measurement of very complex systems (e.g., human behavior) and require statistical analysis to draw conclusions. To illustrate how these characteristics can lead to results that are more or less likely to replicate, consider the attributes of complexity and controllability. The complexity and controllability of a system contribute to the underlying variance of the distribution of expected results and thus the likelihood of non-replication. 7
7 Complexity and controllability in an experimental system affect its susceptibility to non-replicability independently from the way prior odds, power, or p- values associated with hypothesis testing affect the likelihood that an experimental result represents the true state of the world.
The systems that scientists study vary in their complexity. Although all systems have some degree of intrinsic or random variability, some systems are less well understood, and their intrinsic variability is more difficult to assess or estimate. Complex systems tend to have numerous interacting components (e.g., cell biology, disease outbreaks, friction coefficient between two unknown surfaces, urban environments, complex organizations and populations, and human health). Interrelations and interactions among multiple components cannot always be predicted and neither can the resulting effects on the experimental outcomes, so an initial estimate of uncertainty may be an educated guess.
Systems under study also vary in their controllability. If the variables within a system can be known, characterized, and controlled, research on such a system tends to produce more replicable results. For example, in social sciences, a person’s response to a stimulus (e.g., a person’s behavior when placed in a specific situation) depends on a large number of variables—including social context, biological and psychological traits, verbal and nonverbal cues from researchers—all of which are difficult or impossible to control completely. In contrast, a physical object’s response to a physical stimulus (e.g., a liquid’s response to a rise in temperature) depends almost entirely on variables that can either be controlled or adjusted for, such as temperature, air pressure, and elevation. Because of these differences, one expects that studies that are conducted in the relatively more controllable systems will replicate with greater frequency than those that are in less controllable systems. Scientists seek to control the variables relevant to the system under study and the nature of the inquiry, but when these variables are more difficult to control, the likelihood of non-replicability will be higher. Figure 5-2 illustrates the combinations of complexity and controllability.
Many scientific fields have studies that span these quadrants, as demonstrated by the following examples from engineering, physics, and psychology. Veronique Kiermer, PLOS executive editor, in her briefing to the committee noted: “There is a clear correlation between the complexity of the design, the complexity of measurement tools, and the signal to noise ratio that we are trying to measure.” (See also Goodman et al., 2016 , on the complexity of statistical and inferential methods.)
Engineering . Aluminum-lithium alloys were developed by engineers because of their strength-to-weight ratio, primarily for use in aerospace engineering. The process of developing these alloys spans the four quadrants. Early generation of binary alloys was a simple system that showed high replicability (Quadrant A). Second-generation alloys had higher amounts of lithium and resulted in lower replicability that appeared as failures in manufacturing operations because the interactions of the elements were not understood (Quadrant C). The third-generation alloys contained less
lithium and higher relative amounts of other alloying elements, which made it a more complex system but better controlled (Quadrant B), with improved replicability. The development of any alloy is subject to a highly controlled environment. Unknown aspects of the system, such as interactions among the components, cannot be controlled initially and can lead to failures. Once these are understood, conditions can be modified (e.g., heat treatment) to bring about higher replicability.
Physics. In physics, measurement of the electronic band gap of semiconducting and conducting materials using scanning tunneling microscopy is a highly controlled, simple system (Quadrant A). The searches for the Higgs boson and gravitational waves were separate efforts, and each required the development of large, complex experimental apparatus and careful characterization of the measurement and data analysis systems (Quadrant B). Some systems, such as radiation portal monitors, require setting thresholds for alarms without knowledge of when or if a threat will ever pass through them; the variety of potential signatures is high and there is little controllability of the system during operation (Quadrant C). Finally, a simple system with little controllability is that of precisely predicting the path of a feather dropped from a given height (Quadrant D).
Psychology. In psychology, Quadrant A includes studies of basic sensory and perceptual processes that are common to all human beings, such
as the Purkinje shift (i.e., a change in sensitivity of the human eye under different levels of illumination). Quadrant D includes studies of complex social behaviors that are influenced by culture and context; for example, a study of the effects of a father’s absence on children’s ability to delay gratification revealed stronger effects among younger children (Mischel, 1961).
Inherent sources of non-replicability arise in every field of science, but they can vary widely depending on the specific system undergoing study. When the sources are knowable, or arise from experimental design choices, researchers need to identify and assess these sources of uncertainty insofar as they can be estimated. Researchers also need to report the steps that were intended to reduce uncertainties inherent in the study or that differ from the original study (e.g., data cleaning decisions that resulted in a different final dataset). The committee agrees with those who argue that the testing of assumptions and the characterization of the components of a study are as important to report as the ultimate results of the study (Plant and Hanisch, 2018), including for studies using statistical inference and reporting p-values (Boos and Stefanski, 2011). Every scientific inquiry encounters an irreducible level of uncertainty, whether this is due to random processes in the system under study, limits to researchers’ understanding or ability to control that system, or limitations of the ability to measure. If researchers do not adequately consider and report these uncertainties and limitations, this can contribute to non-replicability.
RECOMMENDATION 5-1: Researchers should, as applicable to the specific study, provide an accurate and appropriate characterization of relevant uncertainties when they report or publish their research. Researchers should thoughtfully communicate all recognized uncertainties and estimate or acknowledge other potential sources of uncertainty that bear on their results, including stochastic uncertainties and uncertainties in measurement, computation, knowledge, modeling, and methods of analysis.
Non-replicability can also be the result of human error or poor researcher choices. Shortcomings in the design, conduct, and communication of a study may all contribute to non-replicability.
These defects may arise at any point along the process of conducting research, from design and conduct to analysis and reporting, and errors may be made because the researcher was ignorant of best practices, was sloppy in carrying out research, made a simple error, or had unconscious bias toward a specific outcome. Whether arising from lack of knowledge, perverse incentives, sloppiness, or bias, these sources of non-replicability
warrant continued attention because they reduce the efficiency with which science progresses, and time spent resolving non-replicability issues that are caused by these sources does not add to scientific understanding. That is, they are unhelpful in making scientific progress. We consider here a selected set of such avoidable sources of non-replication: publication bias, misaligned incentives, inappropriate statistical inference, poor study design, errors, and incomplete reporting of a study. We will discuss each source in turn.
Both researchers and journals want to publish new, innovative, ground-breaking research. The publication preference for statistically significant, positive results produces a biased literature through the exclusion of statistically nonsignificant results (i.e., those that do not show an effect that is sufficiently unlikely if the null hypothesis is true). As noted in Chapter 2 , there is great pressure to publish in high-impact journals and for researchers to make new discoveries. Furthermore, it may be difficult for researchers to publish even robust nonsignificant results, except in circumstances where the results contradict what has come to be an accepted positive effect. Replication studies and studies with valuable data but inconclusive results may be similarly difficult to publish. This publication bias results in a published literature that does not reflect the full range of evidence about a research topic.
One powerful example is a set of clinical studies performed on the effectiveness of tamoxifen, a drug used to treat breast cancer. In a systematic review (see Chapter 7 ) of the drug’s effectiveness, 23 clinical trials were reviewed; the statistical significance of 22 of the 23 studies did not reach the criterion of p < 0.05, yet the cumulative review of the set of studies showed a large effect (a reduction of 16% [±3] in the odds of death among women of all ages assigned to tamoxifen treatment [ Peto et al., 1988 , p. 1684]).
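The statistical point here, that several individually nonsignificant trials can add up to strong cumulative evidence, can be sketched with a toy fixed-effect meta-analysis. The effect estimates and standard errors below are invented for illustration (they are not the tamoxifen trial data), and inverse-variance pooling is only one of several ways such evidence can be combined.

```python
import numpy as np
from scipy import stats

# Invented effect estimates and standard errors for several small trials;
# each is individually nonsignificant, but the pooled estimate is not.
estimates = np.array([-0.20, -0.15, -0.25, -0.10, -0.18, -0.22, -0.12, -0.17])
std_errs  = np.array([ 0.15,  0.14,  0.16,  0.13,  0.15,  0.16,  0.14,  0.15])

individually_significant = np.abs(estimates / std_errs) > 1.96
print("Trials individually significant:",
      int(individually_significant.sum()), "of", len(estimates))

# Fixed-effect (inverse-variance) pooling
weights = 1.0 / std_errs**2
pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
z = pooled / pooled_se
p = 2 * stats.norm.sf(abs(z))
print(f"Pooled estimate = {pooled:.2f} (SE {pooled_se:.2f}), p = {p:.4f}")
```

With these made-up numbers, none of the eight trials reaches significance on its own, yet the pooled estimate is precise and clearly nonzero, which is the pattern the tamoxifen review illustrates.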
Another approach to quantifying the extent of non-replicability is to model the false discovery rate—that is, the number of research results that are expected to be “false.” Ioannidis (2005) developed a simulation model to do so for studies that rely on statistical hypothesis testing, incorporating the pre-study (i.e., prior) odds, the statistical tests of significance, investigator bias, and other factors. Ioannidis concluded, and used as the title of his paper,
that “most published research findings are false.” Some researchers have criticized Ioannidis’s assumptions and mathematical argument ( Goodman and Greenland, 2007 ); others have pointed out that the takeaway message is that any initial results that are statistically significant need further confirmation and validation.
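To make the logic of such models concrete, the short sketch below (in Python, with illustrative numbers of our own choosing, and omitting the bias and multiple-team terms of Ioannidis's full model) computes the positive predictive value of a statistically significant finding from the pre-study odds, the significance level, and the statistical power; when pre-study odds and power are both low, most "discoveries" are expected to be false.

```python
def positive_predictive_value(prior_odds, alpha=0.05, power=0.80):
    """Probability that a statistically significant result reflects a true effect.

    prior_odds: pre-study odds R of a true relationship (true : false).
    Simplified, bias-free form of the Ioannidis (2005) model:
        PPV = power * R / (power * R + alpha)
    """
    return (power * prior_odds) / (power * prior_odds + alpha)

# Illustrative scenarios (the numbers are hypothetical, not taken from any study):
for r, power in [(1.0, 0.80), (0.25, 0.80), (0.25, 0.20), (0.05, 0.20)]:
    ppv = positive_predictive_value(r, power=power)
    print(f"R={r:<5} power={power:<4} -> PPV={ppv:.2f}")
```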
Analyzing the distribution of published results for a particular line of inquiry can offer insights into potential bias, which can relate to the rate of non-replicability. Several tools are being developed to compare a distribution of results to what that distribution would look like if all claimed effects were representative of the true distribution of effects. Figure 5-3 shows how publication bias can result in a skewed view of the body of evidence when only positive results that meet the statistical significance threshold are reported. When a new study fails to replicate the previously published results—for example, if a study finds no relationship between variables when such a relationship had been shown in previously published studies—it appears to be a case of non-replication. However, if the published literature is not an accurate reflection of the state of the evidence because only positive results are regularly published, the new study could actually have replicated previous but unpublished negative results. 8
Several techniques are available to detect and potentially adjust for publication bias, all of which are based on the examination of a body of research as a whole (i.e., cumulative evidence), rather than individual replication studies (i.e., one-on-one comparison between studies). These techniques cannot determine which of the individual studies are affected by bias (i.e., which results are false positives) or identify the particular type of bias, but they arguably allow one to identify bodies of literature that are likely to be more or less accurate representations of the evidence. The techniques, discussed below, are funnel plots, the p-curve, the test of excess significance, and assessing unpublished literature.
Funnel Plots. One of the most common approaches to detecting publication bias involves constructing a funnel plot that displays each effect size against its precision (e.g., sample size of study). Asymmetry in the plotted values can reveal the absence of studies with small effect sizes, especially in studies with small sample sizes—a pattern that could suggest publication/selection bias for statistically significant effects (see Figure 5-3 ). There are criticisms of funnel plots, however; some argue that the shape of a funnel plot is largely determined by the choice of method ( Tang and Liu, 2000 ),
8 Earlier in this chapter, we discuss an indirect method for assessing non-replicability in which a result is compared to previously published values; results that do not agree with the published literature are identified as outliers. If the published literature is biased, this method would inappropriately reject valid results. This is another reason for investigating outliers before rejecting them.
and others maintain that funnel plot asymmetry may not accurately reflect publication bias ( Lau et al., 2006 ).
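To illustrate what funnel-plot asymmetry looks like, the sketch below simulates a body of studies in which only statistically significant, positive results are "published" and then plots each published effect size against its standard error. The true effect size, the sample-size range, and the selection rule are all invented for illustration; nothing here reproduces an actual meta-analysis.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_effect = 0.2
published_effects, published_ses = [], []

for _ in range(500):
    n = rng.integers(10, 200)            # per-group sample size
    se = np.sqrt(2.0 / n)                # rough SE of a standardized mean difference
    est = rng.normal(true_effect, se)    # observed effect size for this study
    if est / se > 1.96:                  # crude publication filter: significant, positive only
        published_effects.append(est)
        published_ses.append(se)

plt.scatter(published_effects, published_ses, s=10)
plt.gca().invert_yaxis()                 # precise (small-SE) studies at the top, as in a funnel plot
plt.axvline(true_effect, linestyle="--")
plt.xlabel("Effect size")
plt.ylabel("Standard error")
plt.title("Simulated funnel plot under selective publication")
plt.show()
```

The absent cloud of small, imprecise, nonsignificant studies on the lower left of such a plot is the asymmetry that funnel-plot methods try to detect.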
P -Curve. One fairly new approach is to compare the distribution of results (e.g., p- values) to the expected distributions (see Simonsohn et al., 2014a , 2014b ). P- curve analysis tests whether the distribution of statistically significant p- values shows a pronounced right-skew, 9 as would be expected when the results are true effects (i.e., the null hypothesis is false), or whether the distribution is not as right-skewed (or is even flat, or, in the most extreme cases, left-skewed), as would be expected when the original results do not reflect the proportion of real effects ( Gadbury and Allison, 2012 ; Nelson et al., 2018 ; Simonsohn et al., 2014a ).
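A greatly simplified version of this idea can be sketched in a few lines: among results that are statistically significant, p-values should be uniform between 0 and .05 if there is no true effect (so only about half fall below .025), but strongly right-skewed if the effect is real. The simulation below uses invented study parameters and is only a crude skew check, not the full p-curve procedure of Simonsohn and colleagues.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def significant_pvalues(effect, n=30, studies=2000):
    """Two-sample t-test p-values that came out significant (p < .05)."""
    pvals = []
    for _ in range(studies):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:
            pvals.append(p)
    return np.array(pvals)

for label, effect in [("null effect", 0.0), ("true effect d = 0.5", 0.5)]:
    p = significant_pvalues(effect)
    share_small = np.mean(p < 0.025)   # right skew: most significant p-values are very small
    print(f"{label}: {len(p)} significant results, {share_small:.0%} with p < .025")
```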
Test of Excess Significance. A closely related statistical idea for checking publication bias is the test of excess significance. This test evaluates whether the number of statistically significant results in a set of studies is improbably high given the size of the effect and the power to test it in the set of studies ( Ioannidis and Trikalinos, 2007 ), which would imply that the set of results is biased and may include exaggerated results or false positives. When there is a true effect, one expects the proportion of statistically significant results to be equal to the statistical power of the studies. If a researcher designs her studies to have 80 percent power against a given effect, then, at most, 80 percent of her studies would produce statistically significant results if the effect is at least that large (fewer if the null hypothesis is sometimes true). Schimmack (2012) has demonstrated that the proportion of statistically significant results across a set of psychology studies often far exceeds the estimated statistical power of those studies; this pattern of results that is “too good to be true” suggests that the results were not obtained following the rules of statistical inference (i.e., conducting a single statistical test that was chosen a priori), that not all of the studies attempted were reported (i.e., there is a “file drawer” of statistically nonsignificant studies that do not get published), or that the results were p-hacked or cherry picked (see Chapter 2 ).
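The arithmetic behind the test can be sketched as follows: sum the estimated power of each study to get the expected number of significant results, then ask how surprising the observed count would be. The power values and outcomes below are hypothetical, and using a single average power in a binomial test is a simplification of the Ioannidis and Trikalinos procedure.

```python
from scipy import stats

# Hypothetical set of studies: estimated power against the effect of interest,
# and whether each study reported a statistically significant result.
powers      = [0.35, 0.40, 0.30, 0.45, 0.50, 0.35, 0.40, 0.30, 0.45, 0.50]
significant = [1,    1,    1,    1,    0,    1,    1,    1,    1,    1   ]

observed = sum(significant)
expected = sum(powers)            # expected number of significant results if the effects are real
mean_power = expected / len(powers)

# Probability of seeing this many (or more) significant results given the average power
p_excess = stats.binomtest(observed, n=len(powers), p=mean_power,
                           alternative="greater").pvalue
print(f"observed {observed}/{len(powers)}, expected about {expected:.1f}; "
      f"excess-significance p = {p_excess:.3f}")
```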
In many fields, the proportion of published papers that report a positive (i.e., statistically significant) result is around 90 percent ( Fanelli, 2012 ). This raises concerns when combined with the observation that most studies have far less than 90 percent statistical power (i.e., would only successfully detect an effect, assuming an effect exists, far less than 90% of the time) ( Button et al., 2013 ; Fraley and Vazire, 2014 ; Szucs and Ioannidis, 2017 ; Yarkoni, 2009 ; Stanley et al., 2018 ). Some researchers believe that the
9 Distributions that have more p -values of low value than high are referred to as “right-skewed.” Similarly, “left-skewed” distributions have more p -values of high than low value.
publication of false positives is common and that reforms are needed to reduce this. Others believe that there has been an excessive focus on Type I errors (i.e., false positives) in hypothesis testing at the possible expense of an increase in Type II errors (i.e., false negatives, or failing to confirm true hypotheses) ( Fiedler et al., 2012 ; Finkel et al., 2015 ; LeBel et al., 2017 ).
Assessing Unpublished Literature. One approach to countering publication bias is to search for and include unpublished papers and results when conducting a systematic review of the literature. Such comprehensive searches are not standard practice. For medical reviews, one estimate is that only 6 percent of reviews included unpublished work ( Hartling et al., 2017 ), although another found that 50 percent of reviews did so ( Ziai et al., 2017 ). In economics, there is a large and active group of researchers collecting and sharing “grey” literature, research results outside of peer reviewed publications ( Vilhuber, 2018 ). In psychology, an estimated 75 percent of reviews included unpublished research ( Rothstein, 2006 ). Unpublished but recorded studies (such as dissertation abstracts, conference programs, and research aggregation websites) may become easier for reviewers to access with computerized databases and with the availability of preprint servers. When a review includes unpublished studies, researchers can directly compare their results with those from the published literature, thereby estimating file-drawer effects.
Academic incentives—such as tenure, grant money, and status—may influence scientists to compromise on good research practices ( Freeman, 2018 ). Faculty hiring, promotion, and tenure decisions are often based in large part on the “productivity” of a researcher, such as the number of publications, number of citations, and amount of grant money received ( Edwards and Roy, 2017 ). Some have suggested that these incentives can lead researchers to ignore standards of scientific conduct, rush to publish, and overemphasize positive results ( Edwards and Roy, 2017 ). Formal models have shown how these incentives can lead to high rates of non-replicable results ( Smaldino and McElreath, 2016 ). Many of these incentives may be well intentioned, but they could have the unintended consequence of reducing the quality of the science produced, and poorer quality science is less likely to be replicable.
Although it is difficult to assess how widespread these unhelpful sources of non-replicability are, factors such as publication bias toward results qualifying as “statistically significant” and misaligned incentives on academic scientists create conditions that favor the publication of non-replicable results and inferences.
Confirmatory research is research that starts with a well-defined research question and a priori hypotheses before collecting data; confirmatory research can also be called hypothesis testing research. In contrast, researchers pursuing exploratory research collect data and then examine the data for potential variables of interest and relationships among variables, forming a posteriori hypotheses; as such, exploratory research can be considered hypothesis generating research. Exploratory and confirmatory analyses are often described as two different stages of the research process. Some have distinguished between the “context of discovery” and the “context of justification” (Reichenbach, 1938), while others have argued that the distinction is on a spectrum rather than categorical. Regardless of the precise line between exploratory and confirmatory research, researchers’ choices between the two affect how they and others interpret the results.
A fundamental principle of hypothesis testing is that the same data that were used to generate a hypothesis cannot be used to test that hypothesis ( de Groot, 2014 ). In confirmatory research, the details of how a statistical hypothesis test will be conducted must be decided before looking at the data on which it is to be tested. When this principle is violated, significance testing, confidence intervals, and error control are compromised. Thus, it cannot be assured that false positives are controlled at a fixed rate. In short, when exploratory research is interpreted as if it were confirmatory research, there can be no legitimate statistically significant result.
Researchers often learn from their data, and some of the most important discoveries in the annals of science have come from unexpected results that did not fit any prior theory. For example, Arno Allan Penzias and Robert Woodrow Wilson found unexpected noise in data collected in the course of their work on microwave receivers for radio astronomy observations. After attempts to explain the noise failed, the “noise” was eventually determined to be cosmic microwave background radiation, and these results helped scientists to refine and confirm theories about the “big bang.” While exploratory research generates new hypotheses, confirmatory research is equally important because it tests the hypotheses generated and can give valid answers as to whether these hypotheses have any merit. Exploratory and confirmatory research are essential parts of science, but they need to be understood and communicated as two separate types of inquiry, with two different interpretations.
A well-conducted exploratory analysis can help illuminate possible hypotheses to be examined in subsequent confirmatory analyses. Even a stark result in an exploratory analysis has to be interpreted cautiously, pending further work to test the hypothesis using a new or expanded dataset. It is often unclear from publications whether the results came from an
exploratory or a confirmatory analysis. This lack of clarity can misrepresent the reliability and broad applicability of the reported results.
In Chapter 2 , we discussed the meaning, overreliance, and frequent misunderstanding of statistical significance, including misinterpreting the meaning and overstating the utility of a particular threshold, such as p < 0.05. More generally, a number of flaws in design and reporting can reduce the reliability of a study’s results.
Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives ( John et al., 2012 ; Munafo et al., 2017 ). A study from the late-1980s gives a striking example of how such post hoc analysis can be misleading. The International Study of Infarct Survival was a large-scale, international, randomized trial that examined the potential benefit of aspirin for patients who had had a heart attack. After data collection and analysis were complete, the publishing journal asked the researchers to do additional analysis to see if certain subgroups of patients benefited more or less from aspirin. Richard Peto, one of the researchers, refused to do so because of the risk of finding invalid but seemingly significant associations. In the end, Peto relented and performed the analysis, but with a twist: he also included a post hoc analysis that divided the patients into the twelve astrological signs, and found that Geminis and Libras did not benefit from aspirin, while Capricorns benefited the most ( Peto, 2011 ). This obviously spurious relationship illustrates the dangers of analyzing data with hypotheses and subgroups that were not prespecified.
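A small simulation makes the point about unplanned subgroups: if a dataset with no true effect is split into twelve subgroups and each is tested separately, some "significant" subgroup effect turns up in roughly 40 to 50 percent of datasets purely by chance. The sample sizes and number of simulations below are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_datasets, n_per_group, n_subgroups = 2000, 50, 12
datasets_with_spurious_hit = 0

for _ in range(n_datasets):
    any_significant = False
    for _ in range(n_subgroups):                     # e.g., the twelve astrological signs
        treated = rng.normal(0.0, 1.0, n_per_group)  # no true treatment effect anywhere
        control = rng.normal(0.0, 1.0, n_per_group)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            any_significant = True
    datasets_with_spurious_hit += any_significant

print(f"At least one 'significant' subgroup in "
      f"{datasets_with_spurious_hit / n_datasets:.0%} of null datasets")
```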
Little information is available about the prevalence of such inappropriate statistical practices as p- hacking, cherry picking, and hypothesizing after results are known (HARKing), discussed below. While surveys of researchers raise the issue—often using convenience samples—methodological shortcomings mean that they are not necessarily a reliable source for a quantitative assessment. 10
P- hacking and Cherry Picking. P- hacking is the practice of collecting, selecting, or analyzing data until a result of statistical significance is found. Different ways to p- hack include stopping data collection once p ≤ 0.05 is reached, analyzing many different relationships and only reporting those for which p ≤ 0.05, varying the exclusion and inclusion rules for data so that p ≤ 0.05, and analyzing different subgroups in order to get p ≤ 0.05. Researchers may p- hack without knowing or without understanding the consequences ( Head et al., 2015 ). This is related to the practice of cherry picking, in which researchers may (unconsciously or deliberately) pick
10 For an example of one study of this issue, see Fraser et al. (2018) .
through their data and results and selectively report those that meet criteria such as meeting a threshold of statistical significance or supporting a positive result, rather than reporting all of the results from their research.
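One of the p-hacking strategies listed above, stopping data collection as soon as p ≤ 0.05 is reached, can be simulated directly. The hedged sketch below (with arbitrary group sizes and peeking intervals) shows the false positive rate climbing well above the nominal 5 percent even though there is no true effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def false_positive_rate(max_n=100, check_every=10, sims=2000):
    """Share of null experiments declared 'significant' when the researcher
    peeks at the data every `check_every` observations and stops at p <= .05."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0, 1, max_n)   # two groups with no true difference
        b = rng.normal(0, 1, max_n)
        for n in range(check_every, max_n + 1, check_every):
            if stats.ttest_ind(a[:n], b[:n]).pvalue <= 0.05:
                hits += 1
                break
    return hits / sims

print(f"False positive rate with optional stopping: {false_positive_rate():.0%} "
      f"(nominal rate: 5%)")
```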
HARKing. Confirmatory research begins with identifying a hypothesis based on observations, exploratory analysis, or building on previous research. Data are collected and analyzed to see if they support the hypothesis. HARKing applies to confirmatory research that incorrectly bases the hypothesis on the data collected and then uses that same data as evidence to support the hypothesis. It is unknown to what extent inappropriate HARKing occurs in various disciplines, but some have attempted to quantify the consequences of HARKing. For example, a 2015 article compared hypothesized effect sizes against non-hypothesized effect sizes and found that effects were significantly larger when the relationships had been hypothesized, a finding consistent with the presence of HARKing ( Bosco et al., 2015 ).
Before conducting an experiment, a researcher must make a number of decisions about study design. These decisions—which vary depending on type of study—could include the research question, the hypotheses, the variables to be studied, avoiding potential sources of bias, and the methods for collecting, classifying, and analyzing data. Researchers’ decisions at various points along this path can contribute to non-replicability. Poor study design can include not recognizing or adjusting for known biases, not following best practices in terms of randomization, poorly designing materials and tools (ranging from physical equipment to questionnaires to biological reagents), confounding in data manipulation, using poor measures, or failing to characterize and account for known uncertainties.
In 2010, economists Carmen Reinhart and Kenneth Rogoff published an article that showed that if a country’s debt exceeds 90 percent of the country’s gross domestic product, economic growth slows and declines slightly (–0.1%). These results were widely publicized and used to support austerity measures around the world ( Herndon et al., 2013 ). However, in 2013, with access to Reinhart and Rogoff’s original spreadsheet of data and analysis (which the authors had saved and made available for the replication effort), researchers reanalyzing the original studies found several errors in the analysis and data selection. One error was an incomplete set of countries used in the analysis that established the relationship between debt and economic growth. When data from Australia, Austria, Belgium, Canada,
and Denmark were correctly included, and other errors were corrected, the economic growth in the countries with debt above 90 percent of gross domestic product was actually +2.2 percent, rather than –0.1. In response, Reinhart and Rogoff acknowledged the errors, calling it “sobering that such an error slipped into one of our papers despite our best efforts to be consistently careful.” Reinhart and Rogoff said that while the error led to a “notable change” in the calculation of growth in one category, they did not believe it “affects in any significant way the central message of the paper.” 11
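The arithmetic of the spreadsheet error can be illustrated with a toy example. The numbers below are entirely made up and are not Reinhart and Rogoff's data (their analysis also involved weighting and transcription issues not shown here); the snippet only shows how dropping a few rows from a category can flip the sign of that category's average growth.

```python
import pandas as pd

# Hypothetical growth figures for countries in the high-debt (>90% of GDP) category.
# Countries A-E stand in for the five countries named above; all values are invented.
growth = pd.DataFrame({
    "country":        ["A", "B", "C", "D", "E", "F", "G"],
    "gdp_growth_pct": [3.0, 3.5, 2.8, 3.3, 3.0, 1.9, -2.1],
})

accidentally_dropped = ["A", "B", "C", "D", "E"]   # rows excluded by the spreadsheet error

mean_with_error = growth.loc[~growth["country"].isin(accidentally_dropped),
                             "gdp_growth_pct"].mean()
mean_corrected  = growth["gdp_growth_pct"].mean()

print(f"Average growth with the five countries dropped: {mean_with_error:+.1f}%")
print(f"Average growth with all countries included:     {mean_corrected:+.1f}%")
```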
The Reinhart and Rogoff error was fairly high profile and a quick Internet search would let any interested reader know that the original paper contained errors. Many errors could go undetected or are only acknowledged through a brief correction in the publishing journal. A 2015 study looked at a sample of more than 250,000 p- values reported in eight major psychology journals over a period of 28 years. The study found that many of the p- values reported in papers were inconsistent with a recalculation of the p- value and that in one out of eight papers, this inconsistency was large enough to affect the statistical conclusion ( Nuijten et al., 2016 ).
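The kind of consistency check that statcheck-style tools automate can be sketched simply: recompute the p-value implied by a reported test statistic and its degrees of freedom, and flag the result if it disagrees with the p-value the paper reports. The reported values and the tolerance below are invented; real tools use rounding-aware rules rather than a fixed tolerance.

```python
from scipy import stats

def check_reported_t(t_value, df, reported_p, tol=0.01, two_sided=True):
    """Recompute the p-value implied by a reported t statistic and compare it
    with the p-value the paper reported (the core of a statcheck-style check)."""
    recomputed = stats.t.sf(abs(t_value), df)
    if two_sided:
        recomputed *= 2
    inconsistent = abs(recomputed - reported_p) > tol
    return recomputed, inconsistent

# Invented example: a paper reports t(28) = 2.10, p = .02
recomputed_p, flagged = check_reported_t(t_value=2.10, df=28, reported_p=0.02)
print(f"recomputed p = {recomputed_p:.3f}; flagged as inconsistent: {flagged}")
```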
Errors can occur at any point in the research process: measurements can be recorded inaccurately, typographical errors can occur when inputting data, and calculations can contain mistakes. If these errors affect the final results and are not caught prior to publication, the research may be non-replicable. Unfortunately, these types of errors can be difficult to detect. In the case of computational errors, transparency in data and computation may make it more likely that the errors can be caught and corrected. For other errors, such as mistakes in measurement, errors might not be detected until and unless a failed replication that does not make the same mistake indicates that something was amiss in the original study. Errors may also be made by researchers despite their best intentions (see Box 5-2 ).
During the course of research, researchers make numerous choices about their studies. When a study is published, some of these choices are reported in the methods section. A methods section often covers what materials were used, how participants or samples were chosen, what data collection procedures were followed, and how data were analyzed. The failure to report some aspect of the study—or to do so in sufficient detail—may make it difficult for another researcher to replicate the result. For example, if a researcher only reports that she “adjusted for comorbidities” within the study population, this does not provide sufficient information about how
11 See https://archive.nytimes.com/www.nytimes.com/interactive/2013/04/17/business/17economixresponse.html .
exactly the comorbidities were adjusted, and it does not give enough guidance for future researchers to follow the protocol. Similarly, if a researcher does not give adequate information about the biological reagents used in an experiment, a second researcher may have difficulty replicating the experiment. Even if a researcher reports all of the critical information about the conduct of a study, other seemingly inconsequential details that have an effect on the outcome could remain unreported.
Just as reproducibility requires transparent sharing of data, code, and analysis, replicability requires transparent sharing of how an experiment was conducted and the choices that were made. This allows future researchers, if they wish, to attempt replication as close to the original conditions as possible.
At the extreme, sources of non-replicability that do not advance scientific knowledge—and do much to harm science—include misconduct and fraud in scientific research. Instances of fraud are uncommon, but can be sensational. Despite fraud’s infrequent occurrence and regardless of how
highly publicized cases may be, the fact that it is uniformly bad for science means that it is worthy of attention within this study.
Researchers who knowingly use questionable research practices with the intent to deceive are committing misconduct or fraud. It can be difficult in practice to differentiate between honest mistakes and deliberate misconduct because the underlying action may be the same while the intent is not.
Reproducibility and replicability emerged as general concerns in science around the same time as research misconduct and detrimental research practices were receiving renewed attention. Interest in both reproducibility and replicability as well as misconduct was spurred by some of the same trends and a small number of widely publicized cases in which discovery of fabricated or falsified data was delayed, and the practices of journals, research institutions, and individual labs were implicated in enabling such delays ( National Academies of Sciences, Engineering, and Medicine, 2017 ; Levelt Committee et al., 2012 ).
In the case of Anil Potti at Duke University, a researcher using genomic analysis on cancer patients was later found to have falsified data. This experience prompted the study and the report, Evolution of Translational Omics: Lessons Learned and the Way Forward ( Institute of Medicine, 2012 ), which in turn led to new guidelines for omics research at the National Cancer Institute. Around the same time, in a case that came to light in the Netherlands, social psychologist Diederik Stapel had gone from manipulating to fabricating data over the course of a career with dozens of fraudulent publications. Similarly, highly publicized concerns about misconduct by Cornell University professor Brian Wansink highlight how consistent failure to adhere to best practices for collecting, analyzing, and reporting data—intentional or not—can blur the line between helpful and unhelpful sources of non-replicability. In this case, a Cornell faculty committee concluded that Wansink had committed “academic misconduct in his research and scholarship, including misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship.” 12
A subsequent report, Fostering Integrity in Research ( National Academies of Sciences, Engineering, and Medicine, 2017 ), emerged in this context, and several of its central themes are relevant to questions posed in this report.
According to the definition adopted by the U.S. federal government in 2000, research misconduct is fabrication of data, falsification of data, or plagiarism “in proposing, performing, or reviewing research, or in reporting research results” ( Office of Science and Technology Policy, 2000 , p. 76262). The federal policy requires that research institutions report all
12 See http://statements.cornell.edu/2018/20180920-statement-provost-michael-kotlikoff.cfm .
allegations of misconduct in research projects supported by federal funding that have advanced from the inquiry stage to a full investigation, and to report on the results of those investigations.
Other detrimental research practices (see National Academies of Sciences, Engineering, and Medicine, 2017 ) include failing to follow sponsor requirements or disciplinary standards for retaining data, authorship misrepresentation other than plagiarism, refusing to share data or methods, and misleading statistical analysis that falls short of falsification. In addition to the behaviors of individual researchers, detrimental research practices also include actions taken by organizations, such as failure on the part of research institutions to maintain adequate policies, procedures, or capacity to foster research integrity and assess research misconduct allegations, and abusive or irresponsible publication practices by journal editors and peer review.
Just as information on rates of non-reproducibility and non-replicability in research is limited, knowledge about research misconduct and detrimental research practices is scarce. Reports of research misconduct allegations and findings are released by the National Science Foundation Office of Inspector General and the Department of Health and Human Services Office of Research Integrity (see National Science Foundation, 2018d ). As discussed above, new analyses of retraction trends have shed some light on the frequency of occurrence of fraud and misconduct. Allegations and findings of misconduct increased from the mid-2000s to the mid-2010s but may have leveled off in the past few years.
Analysis of retractions of scientific articles in journals may also shed some light on the problem ( Steen et al., 2013 ). One analysis of biomedical articles found that misconduct was responsible for more than two-thirds of retractions ( Fang et al., 2012 ). As mentioned earlier, a wider analysis of all retractions of scientific papers found about one-half attributable to misconduct or fraud ( Brainard, 2018 ). Others have found some differences according to discipline ( Grieneisen and Zhang, 2012 ).
One theme of Fostering Integrity in Research is that research misconduct and detrimental research practices are a continuum of behaviors ( National Academies of Sciences, Engineering, and Medicine, 2017 ). While current policies and institutions aimed at preventing and dealing with research misconduct are certainly necessary, detrimental research practices likely arise from some of the same causes and may cost the research enterprise more than misconduct does in terms of resources wasted on the fabricated or falsified work, resources wasted on following up this work, harm to public health due to treatments based on acceptance of incorrect clinical results, reputational harm to collaborators and institutions, and others.
No branch of science is immune to research misconduct, and the committee did not find any basis to differentiate the relative level of occurrence
in various branches of science. Some but not all researcher misconduct has been uncovered through reproducibility and replication attempts, which are the self-correcting mechanisms of science. From the available evidence, documented cases of researcher misconduct are relatively rare, as suggested by a rate of retractions in scientific papers of approximately 4 in 10,000 ( Brainard, 2018 ).
CONCLUSION 5-4: The occurrence of non-replicability is due to multiple sources, some of which impede and others of which promote progress in science. The overall extent of non-replicability is an inadequate indicator of the health of science.
One of the pathways by which the scientific community confirms the validity of a new scientific discovery is by repeating the research that produced it. When a scientific effort fails to independently confirm the computations or results of a previous study, some fear that it may be a symptom of a lack of rigor in science, while others argue that such an observed inconsistency can be an important precursor to new discovery.
Concerns about reproducibility and replicability have been expressed in both scientific and popular media. As these concerns came to light, Congress requested that the National Academies of Sciences, Engineering, and Medicine conduct a study to assess the extent of issues related to reproducibility and replicability and to offer recommendations for improving rigor and transparency in scientific research.
Reproducibility and Replicability in Science defines reproducibility and replicability and examines the factors that may lead to non-reproducibility and non-replicability in research. Unlike the typical expectation of reproducibility between two computations, expectations about replicability are more nuanced, and in some cases a lack of replicability can aid the process of scientific discovery. This report provides recommendations to researchers, academic institutions, journals, and funders on steps they can take to improve reproducibility and replicability in science.
By Stanley E. Lazic ( @StanLazic )
Replication is a key idea in science and statistics, but is often misunderstood by researchers because they receive little education or training on experimental design. Consequently, the wrong entity is replicated in many experiments, leading to pseudoreplication or the “unit of analysis” problem [1,2]. This results in exaggerated sample sizes and a potential increase in both false positives and false negatives – the worst of all possible worlds.
Replication can mean many things
Replication is not always easy to understand because many parts of an experiment can be replicated, and a non-exhaustive list includes: (1) taking multiple measurements on the same sample or animal, (2) using multiple samples or animals within a single experiment, (3) repeating the whole experiment in independent runs, (4) running the experiment in multiple labs or centres, and (5) conducting multiple independent studies.
To add to the confusion, terms with related meanings exist, such as repeatability, reproducibility, and replicability. Furthermore, the reasons for having or increasing replication are diverse and include a need to increase statistical power, a desire to make the results more generalisable, or the result of a practical constraint, such as an inability to recruit enough patients in one centre and so multiple centres are needed.
Requirements for genuine replication
How do you design an experiment to have genuine replication and not pseudoreplication? First, ensure that replication is at the level of the biological question or scientific hypothesis. For example, to test the effectiveness of a drug in rats, give the drug to multiple rats, and compare the result with other rats that received a control treatment (corresponding to example 2 above). Multiple measurements on each rat (example 1 above) do not count towards genuine replication.
To test if a drug kills proliferating cells in a well compared to a control condition, you will need multiple drug and control wells, since the drug is applied on a per-well basis. But you may worry that the results from a single experimental run will not generalise – even if you can perform a valid statistical test – because results from in vitro experiments can be highly variable. You could then repeat the experiment four times (corresponding to example 3 above), and the sample size is now four, not the total number of wells that were used across all of the experimental runs. This second option requires more work, will take longer, and will usually have lower power, but it provides a more robust result because the experimenter’s ability to reproduce the treatment effect across multiple experimental runs has been replicated.
To test if pre-registered studies report different effect sizes from traditional studies that are not pre-registered, you will need multiple studies of both types (corresponding to example 5 above). The number of subjects in each of these studies is irrelevant for testing this study-level hypothesis.
Replication at the level of the question or hypothesis is a necessary but not sufficient condition for genuine replication – three criteria must be satisfied [1,3]: the replicates must be independently randomised to the treatment conditions; the treatment must be applied independently to each replicate; and the replicates must not influence each other, especially on the measured outcome.
It follows that cells in a well or neurons in a brain or slice culture can rarely be considered genuine replicates because the above criteria are unlikely to be met, whereas fish in a tank, rats in a cage, or pigs in a pen could be genuine replicates in some cases but not in others. If the criteria are not met, the solution is to replicate one level up in the biological or technical hierarchy. For example, if you’re interested in the effect of a drug on cells in an in vitro experiment, but cannot use cells as genuine replicates, then the number of wells can be the replicates, and the measurements on cells within a well can be averaged so that the number of data points corresponds to the number of wells, that is, the sample size. (Hierarchical or multi-level models can also be used and don’t require values to be averaged because they take the structure of the data into account, but they are harder to implement and interpret compared with averaging followed by simpler statistical methods.) Similarly, if rats in a cage cannot be considered genuine replicates, then calculating a cage-averaged value and using cages as genuine replicates is an appropriate solution (or a multi-level model).
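As a rough illustration of the averaging approach described above, the sketch below simulates cells nested within wells (so that cells sharing a well are correlated), then analyses the data twice: once treating every cell as a replicate and once, correctly, using well means so that the sample size equals the number of wells. All group sizes and variance values are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
wells_per_group, cells_per_well = 6, 50

def simulate_group(group_effect):
    """Cells nested in wells: each well has its own offset, so cells in a well are correlated."""
    well_means = group_effect + rng.normal(0.0, 1.0, wells_per_group)   # between-well variation
    cells = well_means[:, None] + rng.normal(0.0, 0.5, (wells_per_group, cells_per_well))
    return cells

drug, control = simulate_group(0.5), simulate_group(0.0)

# Pseudoreplication: treating every cell as an independent replicate (n = 300 per group)
p_cells = stats.ttest_ind(drug.ravel(), control.ravel()).pvalue

# Genuine replication: average within wells first, then compare well means (n = 6 per group)
p_wells = stats.ttest_ind(drug.mean(axis=1), control.mean(axis=1)).pvalue

print(f"p using every cell as a replicate: {p_cells:.4f}")
print(f"p using well means (correct unit): {p_wells:.4f}")
```

Because cells within a well share that well's offset, the per-cell analysis treats 300 correlated values as if they were independent and will often return p-values far smaller than the design justifies.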
If genuine replication is too low, the experiment may be unable to answer any scientific questions of interest. Therefore, issues about replication must be resolved when designing an experiment, not after the data have been collected. For example, if cages are the genuine replicates and not the rats, then putting fewer rats in a cage and having more cages will increase power; power is maximised with one rat per cage, but this may be undesirable for other reasons.
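The power claim can be given a back-of-envelope form using the standard design effect for clustered data: with m animals per cage and an intra-cage correlation ρ, a total of n animals carries roughly n / (1 + (m − 1)ρ) animals' worth of independent information. The snippet below is an illustration with an assumed ρ; it is not taken from the cited references.

```python
# Effective sample size of 48 rats housed in cages of various sizes,
# under an assumed intra-cage correlation (ICC) of 0.3.
def effective_n(n_total: int, cage_size: int, icc: float) -> float:
    """Design-effect correction: n_eff = n_total / (1 + (cage_size - 1) * icc)."""
    return n_total / (1 + (cage_size - 1) * icc)

n_total, icc = 48, 0.3
for cage_size in (1, 2, 4, 8, 12):
    n_cages = n_total // cage_size
    print(f"{n_cages:2d} cages of {cage_size:2d} -> effective n ≈ "
          f"{effective_n(n_total, cage_size, icc):.1f}")
# 48 cages of 1 rat give the full effective n of 48;
# 4 cages of 12 rats are worth only about 11 independent rats.
```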
Confusing pseudoreplication for genuine replication reduces our ability to learn from experiments, understand nature, and develop treatments for diseases. It is also easily fixed. The requirements for genuine replication, like the definition of a p-value, are often misunderstood by researchers, despite many papers on the topic. An open-access overview is provided in reference [1], and reference [3] has a detailed discussion along with analysis options for many experimental designs.
[1] Lazic SE, Clarke-Williams CJ, Munafo MR (2018). What exactly is “N” in cell culture and animal experiments? PLoS Biol 16(4):e2005282. https://doi.org/10.1371/journal.pbio.2005282
[2] Lazic SE (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience 11:5. https://doi.org/10.1186/1471-2202-11-5
[3] Lazic SE (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge University Press, Cambridge, UK. https://www.cambridge.org/Lazic
Stanley E. Lazic is Co-founder and Chief Scientific Officer at Prioris.ai Inc.
Prioris.ai, Suite 459, 207 Bank Street, Ottawa ON, K2P 2N2, Canada.
What is science: repeat and replicate
In the scientific process, we should not rely on the results of a single test. Instead, we should perform the test over and over. Why? If it works once, shouldn't it work the same way every time? Yes, it should, so if we repeat the experiment and get a different result, then we know that there is something about the test that we are not considering.
In studying the processes of science, you will often run into two words that seem similar: repetition and replication.
Sometimes it is a matter of random chance, as in the case of flipping a coin. Just because it comes up heads the first time does not mean that it will always come up heads. By repeating the experiment over and over, we can see if our result really supports our hypothesis, or if it was just random chance.
Sometimes the result might be due to some variable that you have not recognized. In our example of flipping a coin, the individual's technique for flipping the coin might influence the results. To take that into consideration, we repeat the experiment over and over with different people, looking closely for any results that don't fit into the idea we are testing.
Results that don't fit are important! Figuring out why they do not fit our hypothesis can give us an opportunity to learn new things, and get a better understanding of the idea we are testing.
Once we have repeated our testing over and over, and think we understand the results, then it is time for replication. That means getting other scientists to perform the same tests, to see whether they get the same results. As with repetition, the most important things to watch for are results that don't fit our hypothesis, and for the same reason. Those different results give us a chance to discover more about our idea. The different results may be because the person replicating our tests did something different, but they also might be because that person noticed something that we missed.
If we did miss something, it is OK, as long as we performed our tests honestly and scientifically. Science is not about proving that "I am right!" Instead, it is a process for trying to learn more about the universe and how it works. It is usually a group effort, with each scientist adding her own perspective to the idea, giving us a better understanding and often raising new questions to explore.
Hannah Fraser
1 School of BioSciences, University of Melbourne, Parkville VIC, Australia
Timothy H. Parker
2 Biology Department, Whitman College, Walla Walla WA, USA
3 School of BioSciences, School of Historical and Philosophical Studies, University of Melbourne, Parkville VIC, Australia
All data and analysis code are available at https://osf.io/bqc74/ with a stable https://doi.org/10.17605/OSF.IO/BQC74 .
Recent large‐scale projects in other disciplines have shown that results often fail to replicate when studies are repeated. The conditions contributing to this problem are also present in ecology, but there have not been any equivalent replication projects. Here, we survey ecologists' understanding of and opinions about replication studies. The majority of ecologists in our sample considered replication studies to be important (97%), not prevalent enough (91%), worth funding even given limited resources (61%), and suitable for publication in all journals (62%). However, there is a disconnect between this enthusiasm and the prevalence of direct replication studies in the literature, which is much lower (0.023%: Kelly 2019) than our participants' median estimate of 10%. This may be explained by the obstacles our participants identified, including the difficulty of conducting replication studies and of funding and publishing them. We conclude by offering suggestions for how replications could be better integrated into ecological research.
Repeating (replicating) studies can tell you how valid and generalizable the findings are. Ecologists think that replicating studies is important and valuable but a very small proportion of the ecology and evolutionary biology literature is replicated.
While replication is often upheld as a cornerstone of scientific methodology, attempts to replicate studies appear rare, at least in some disciplines. Studies looking at the prevalence of self‐identified “replication studies” in the literature find rates of 0.023% in ecology (Kelly, 2019 ), 0.1% in education (Makel & Plucker, 2014 ), and 1% in psychology (Makel, Plucker, & Hegarty, 2012 ). These figures reflect the rate of direct replications where the method from the original study is repeated as closely as possible. Of course, the feasibility of direct replication studies in many areas of ecology is limited by factors such as the challenge of conducting research in originally studied ecosystems which may be remote from potential replicators, the large spatial and temporal scales of many ecological studies, and the dynamic nature of ecosystems (Schnitzer & Carson, 2016 ; Shavit & Ellison, 2017 ). However, some subfields, such as behavioral ecology, suffer less from these restrictions and direct (or at least close replications) are more feasible (Nakagawa & Parker, 2015 ).
In the current study, we are concerned with how researchers think about replication, whether they consider it important, and what epistemic role they believe replication plays in the formulation of scientific evidence.
Different kinds of replication studies fulfill different epistemic functions, or purposes. It is common to distinguish between “direct” and “conceptual” replications, where direct replications repeat an original study using methods, instruments, and sampling procedures as close to the original as possible (recognizing that exact replications are largely hypothetical constructs in most disciplines) and conceptual replications make deliberate variations. The dichotomy between direct and conceptual is an oversimplification of a noisy continuum, and many more fine‐grained typologies exist (for a summary see Fidler & Wilcox, 2018), including ecology and evolutionary biology‐specific ones (Kelly, 2006; Nakagawa & Parker, 2015). Broadly speaking, replication studies at the “direct” end of the continuum assess the “conclusion” validity of the original findings (whether the originally observed relationship between measured variables is reliable). Those original findings might be invalid because sampling error led to a misleading result, or because of questionable research practices or even fraud. Replication studies at the “conceptual” end of the continuum test generalizability and robustness; this includes what has previously been termed “quasireplication”, where studies are replicated in different species or ecosystems. Where a replication study is placed on the direct‐conceptual continuum and what epistemic function it fulfils depends on the scope of the claim in the original study and how the replication study conforms to or differs from that. For example, imagine I am conducting research in the Great Barrier Reef, and I collect data from some locations in the northern part of the reef. If, after analyzing my results, I make explicit inferences to the Great Barrier Reef as a whole, then studies anywhere along the reef employing the same methods and protocols as the original could reasonably be considered direct replications (within reasonable time constraints, of course). However, if I had constrained my inference to just the northern reef, it would not be reasonable to consider new studies sampling other locations direct replications. Replications beyond the Great Barrier Reef, for instance on coral reefs in the Red Sea, would be conceptual replications in both cases. In Table 1, we illustrate how varying different elements of a study while holding others constant can allow us to interrogate different aspects of its conclusion. However, as the example of the reef demonstrates, whether any given replication is considered direct or conceptual is intrinsically tied to the scope of the inference in the original research claim.
Direct and conceptual replications in ecology. “S” means that the study element in the replication study is similar enough to the original study that it would be considered a fair test of the original hypothesis, and “D” means that the study element is distinctly different in original and replication studies, testing beyond the original hypothesis
| Location | Environmental conditions | Study system | Variables | Epistemic function |
|---|---|---|---|---|
| S | S | S | S | Controls for result being driven by sampling error, QRPs, mistakes, fraud |
| D | S | S | S | Controls for result being driven by its specific location within the stated scope of the study |
| S | D | S | S | Controls for result depending on the particular environmental conditions at the time of study |
| S | S | S | D | Controls for result being an artifact of how the research question was operationalized |
| S | S | D | S | Investigates whether the result generalizes to new study systems (often called “quasireplication”) |
| S/D | S/D | S/D | S/D | Investigates the generalizability and robustness of the result to multiple simultaneous changes in study design, and potential new interactions |
It is worth noting in advance of the next section that the large‐scale replication studies from other disciplines we describe there, and their associated replication success rates, refer exclusively to direct replication studies.
Over the last 8–10 years, concern over a “replication crisis” in science has mounted. The basis of this concern comes from large‐scale direct replication projects in several fields which found low rates of successful replication. Studies included in these projects all attempted fair tests of the original hypothesis, and most were conducted with advice from the original authors. This may mean that the location or time of the replication study differed from the original, but only in cases where location was not specified as being part of the scope of the claim in the original study.
Rates of successful direct replications range from 36% to 62% in psychology, (Camerer et al., 2018 ; Open Science Collaboration, 2015 ), from 11% to 49% in preclinical biomedicine (Freedman, Cockburn, & Simcoe, 2015 ), and from 67% to 78% in economics research (Camerer et al., 2016 ) depending on the study, and the measure of “successful” used (see Fidler et al., 2017 for a summary).
Low rates of successful replication are usually attributed to poor reliability because of low statistical power in the original studies (Maxwell, Lau, & Howard, 2015 ); publication bias toward statistically significant results (Fanelli, 2010 , 2012 ; Franco et al., 2014 ); and the use of questionable research practices (e.g., selectively reporting statistically significant variables, hypothesizing after results known: Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli, 2017 ; Fraser, Parker, Nakagawa, Barnett, & Fidler, 2018 ; John, Loewenstein, & Prelec, 2012 ).
So far, there have been no equivalent, large‐scale replication projects in ecology or related fields. However, meta‐analytic studies have shown that several classic behavioral ecology findings do not reliably replicate (Sánchez‐Tójar et al., 2018 ; Seguin & Forstmeier, 2012 ; Wang et al., 2018 ). In addition, all of the conditions expected to drive low rates of replication mentioned above appear common in ecology and evolution (Fidler et al., 2017 ; Parker et al., 2016 ): low power (Jennions & Moller, 2000 ), publication bias (Cassey, Ewen, Blackburn, & Moller, 2004 ; Fanelli, 2012 ; Franco et al., 2014 ; Jennions & Moller, 2002 ; Murtaugh, 2002 ), and prevalence of questionable research practices (Fraser et al., 2018 ).
In the late 1980s, sociologists of science Mulkay and Gilbert interviewed a sample of biochemists about their replication practices. In particular, they were interested in whether these scientists replicated others' work. Most reported that they did not. And yet, the scientists uniformly claimed that their own work had been independently replicated by others. This seems to suggest an implausible state of affairs where everyone's work is replicated but no one is doing replicating (Box 1 ).
Interviewer: Does this imply that you don't repeat other people's experiments?
Respondent: Never
Interviewer: Does anyone repeat yours?
Respondent: Oh. Does anybody repeat my experiments? Yes, they do. I have read where people have purified rat liver enzyme from other sources. They get basically the same sub‐unit composition. I'm always happy, by the way, when I see that somebody has done something and repeated some of our work, because I always worry…
Mulkay and Gilbert's explanation of this potential contradiction rested on the notion of “conceptual slippage.” That is, the definition of “replication” that researchers bring to mind when asked about replicating others' work was narrow, centering around direct or exact replication. When considering whether their own work had been replicated by others, they broadened their definition of replication, allowing conceptual replication (different operationalizations and measurements, extensions, etc.). Mulkay and Gilbert referred to the former as “mere replication” and report that it was rarely valued by the scientists in their interview sample. For example, one interviewee referring to another laboratory that is known to replicate studies said: “They actually take pride in the fact they are checking papers that have been published by others, with the result that a great deal of confirmatory work precludes their truly innovative contribution to the literature” (Mulkay & Gilbert, 1991 , p. 155).
Dismissal of the value of direct replication research is echoed in Madden, Easley, and Dunn's (1995) survey of 107 social and natural science journal editors, aimed at discovering how journal editors view replication research. Comments from two natural science editors exemplify this: “Our attention is focused on avoiding replication! There are so many interesting subjects which have not been studied that it is a stupid thing to make the same work again” and “Why do you want to replicate already published work? If there is some interest [sic] puzzle, of course, but replication for its own sake is never encouraged”. Similarly, Ahadi, Hellas, Ihantola, Korhonen, and Petersen (2016) found a correlation between the perceived value of publishing original research and the perception that replication studies are less valuable in terms of obtaining citations and grant funding.
This negative stigma feeds into the difficulty of publishing replication studies. Ahadi et al. (2016) found that only 10% of computer education researchers who found the same result as the original study, and 8% of those who found a different result, were able to publish their replication studies. Baker and Penny (2016) examined the rate of publishing psychology replication studies and found that it was around 12% for replication studies that found the same result and 10% for replication studies that found a different result to the original. This is compounded by the fact that very few people submit replication studies in the first place (Baker & Penny, 2016).
Our goal here is to document and evaluate researchers' self‐reported understanding of, attitudes toward, and (where applicable) objections and obstacles to engaging in replication studies.
The current work investigates Kelly's (2006) argument that there exists in ecology “a general disdain by thesis committees… and journal editors for nonoriginal research” (p. 232). Echoing findings by Ahadi et al. (2016), Kelly proposed that replication studies may be hard to publish when they agree with the original findings because they do not add anything novel to the literature, and also when they disagree with the original findings because the evidence from the original study is given greater weight than the refuting evidence. The current project is, in the broadest sense, an empirical investigation of these issues.
2.1 Survey participants
We distributed paper and online versions of our anonymous survey (created in Qualtrics, Provo, UT, USA; a pdf of the survey is available at https://osf.io/bqc74/) at the Ecological Society of America (ESA) 2017 conference (4,500+ attendees) and EcoTas 2017 (joint conference for the Australian and New Zealand Ecological Societies, 350–450 attendees), in line with ethics approval from the University of Melbourne Human Research Ethics Committee (Ethics ID 1749316.1). We set up a booth in the conference hall at ESA and actively approached passers‐by, asking them to take part in our survey. At EcoTas, we distributed the survey by roaming the conference on foot and announcing the survey in conference sessions. Participants at EcoTas were offered the opportunity to enter a draw to win a piece of artwork representing their research. We promoted the survey on Twitter at both conferences. In total, ecologists returned 439 surveys, 218 from ESA, and 221 from EcoTas. Our sample comprises ecologists mostly from Australia, New Zealand, and North America. We have no reason to expect these populations to differ from other populations of ecologists in their opinions regarding replication. However, replication studies in other locations would be needed to assess the generalizability of our results.
Our survey included multiple‐choice questions about the following:
The code and data required to computationally reproduce our results and qualitative responses are available from https://osf.io/bqc74/ . For each of the multiple‐choice questions, we plotted the proportion (with 95% Confidence Intervals, CIs) of researchers who selected each of the options (e.g., the proportion of researchers who indicated that replication was “Very Important,” “Somewhat Important,” or “Not Important” in ecology) using ggplot2 (Valero‐Mora, 2015 , version 3.2.1) in R (R Development Core Team, 2018 , version 3.5.1). All 95% CIs are Wilson Score Intervals calculated in binom (Dorai‐Raj, 2014 , version 1.1) except for those calculated for the estimate of the prevalence of replication studies in ecology which were generated using parametric assumptions in Rmisc (Hope, 2013 , version 1.5).
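For readers who want to check the interval arithmetic without installing the R packages, the Wilson score interval has a simple closed form. The snippet below is an illustration added here (not the authors' code); run on the headline result of 425 “important” responses out of 437 it should reproduce the reported 95%–98% interval.

```python
# Wilson score interval for a proportion (the interval type reported in the paper).
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

lo, hi = wilson_ci(425, 437)  # "replication studies are important": 425 of 437 participants
print(f"{425/437:.0%} (95% CI {lo:.0%}-{hi:.0%})")  # expect roughly 97% (95%-98%)
```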
Our sample of ecologists' median estimate of the proportion of replicated studies was 10% (mean 22%, 95% CIs 20%–24%, n = 393). A high proportion of ecologists were very positive about replication. The vast majority (97%, 95%CI: 95%–98%, n = 425 of 437 participants) of ecologists answering our survey stated that replication studies are (very or somewhat) important (Figure 1a), and 91% (95% CI: 88%–93%, n = 385 of 424 participants) agreed that they would like to see more (or much more) replication taking place in ecology (Figure 1b). Many also agreed that it is “crucial” (61%, 95%CI: 56%–65%, n = 261 of 428 participants, Figure 1c) and that replication studies should be published in all journals (63%, 95%CI: 58%–67%, n = 269 of 427 participants, Figure 1d).
Proportion of participants (with 95%CIs) selecting each option for the following questions: (a) how important is replication in ecology ( n = 437 participants), (b) does enough replication take place ( n = 424 participants), (c) do you consider replication studies to be a good use of resources in ecology ( n = 437 participants), and (d) how often should replication studies be published ( n = 443 responses from 427 participants)
Around a third of our sample agreed that replication is important with caveats, suggesting that given limited funding, the focus should remain on novel research (37%, 95%CI: 32%–41%, n = 157 of 428 participants, Figure 1c ) or that they should only be published in special editions or specific journals (30%, 95%CI: 25%–34%, n = 126 of 427 participants). We specifically worded these response items (i.e., pointing to funding scarcity, and publishing only in special issues) to mitigate demand characteristics, that is, undue influence to provide a positive answer to a survey question.
Very few ecologists expressed an overall negative perspective of replication studies; 1% (95%CI: 0.6%–3.0%, n = 6 of 437 participants, Figure 1a ) agreed that they were not important, 1% (95%CI: 0.5%–2.7%, n = 5 of 424 participants, Figure 1b ) indicated that there should be “less” or “much less” replication conducted, 0.5% (95%CI: 0.1%–1.7%, n = 2 of 428 participants, Figure 1c ) agreed that replication studies are a waste of time and money, 6% (95%CI: 4%–9%, n = 27 of 427 participants, Figure 1d ) indicated that replication studies should only be published if the results differ from those in the original study, and 0.23% indicated that replications should never be published (95%CI: 0.04%–1.3%, n = 1 of 427 participants, Figure 1d ).
When asked “does an effect or phenomenon need to be successfully replicated before you believe or trust it,” 43% (95%CI 38%–48%, n = 188 of 437 participants) said “yes,” 11% (95%CI: 9%–15%, n = 50 of 437 participants) said “no,” and 46% (95%CI: 41%–50%, n = 199 of 437 participants) said maybe. This leaves open the question of what participants do use to determine the epistemic value of a finding. Fortunately, 395 (of the total 437) participants provided free text responses when asked what, aside from replication, made an effect or phenomenon more believable or trustworthy (Table 2 ).
Researchers' ( n = 395) free‐text responses to a question asking “Is there anything else [aside from replication studies] that you consider to be especially important in determining believability or trustworthiness?” We show summary level results only, with illustrative quotations
| | Study design | Open science practices | Reputation | Consistency of current finding with existing knowledge | Statistical qualities of the results |
|---|---|---|---|---|---|
| Number of comments | 242 | 68 | 66 | 61 | 53 |
| Indicative quotes | “Sound methodology… appropriate controls, using different approaches/method to prove the same hypothesis” “Temporal consistency of relationships. Test of consistency across environmental contexts” | “Open, publicly available data and code!” “whether raw data/analysis is presented in published paper supplements or hidden away” | “Sound scientific history of publications. Well regarded in academic or practitioner community” “Reputation of journals (sometimes, but sometimes reputable journals publish crap.)” | “theoretical validity (ie is it biologically supportable through established knowledge or does it severely contradict established theory)” “Are results consistent with similar research? If not, the new research is revolutionary and has a higher bar to convince me” | “degree to which data build the case for the claim (i.e., different approaches (e.g., experimental and observational, different experimental approaches), sites, length of the study) all are useful” “Sample size, power, strength of the effect, how much the findings can be generalised” |
| Topics covered | scale of the study, sample size, use of controls, statistical approach, confounding factors | transparent methods, analysis code available, data available, study preregistered | funding source, conflicts of interest, reputation of journal, institution, or researcher | consistent with: reader's understanding, prior literature, existing theory | large effect size, small p‐value, result supported by multiple tests, validity of the data |
We asked how often participants checked for replication studies when they came across an effect or phenomenon that was plausible versus implausible. Very few participants (9%, 95%CI: 7%–12%, n = 39 of 429 participants) “almost always” checked whether a study was replicated if they thought the result was plausible. Participants were more likely to check for replication studies if they found the effect implausible but even then, only 27% (23%–31%, n = 116 of 429 participants) of participants said that they “almost always” checked (Figure 2).
Percentage of participants reporting that they check for replications at different frequencies if the original study seemed plausible versus implausible. Error bars are 95% Wilson confidence intervals ( n = 429 participants)
In order to get a picture of what our sampled ecologists consider to be replication studies, we asked participants to select as many options as they wanted from Table 3 . The top four options represent the spectrum of replication studies from most direct (first option) to most conceptual (fourth option). The number of participants who considered the options to be replication studies decreased with decreasing similarity between original and replication study. Options 5 and 6 in Table 3 are related to computationally reproducing the results by reanalyzing a study's data. Computational reproducibility is a related concept to replication and has similar, if more limited, epistemic purpose: If the analysis is kept the same, it can detect mistakes and inconsistencies in the original analysis (Table 3 , option 5) and, if the analysis is altered, it can test the sensitivity of the findings to alternate modeling decisions (Table 3 , option 6).
Statements of different types of variations a new study might make to an original, and the percentage of total participants ( n = 430) who considered each type of variation a “replication study.” Also shown is the mean estimate of the replication rate in ecology, calculated separately for participants who indicated that the option constituted a “replication study.”
| Variation to the original study | Percentage of participants choosing this response (95% CI) | Mean estimate of replication rate in ecology (95% CI) |
|---|---|---|
| Redoing an experiment or study as closely as possible to the original (e.g., with same methods and in the same context, region, or species) | 90% (87–92) | 21% (19–24) |
| Redoing an experiment or study with same (or similar) methods in a new context (region or species, etc.) | 73% (69–77) | 24% (21–26) |
| Redoing an experiment or study with different methods in the same context (region or species, etc.) | 38% (34–43) | 23% (20–27) |
| Redoing an experiment or study with different methods in a different context (region or species, etc.) | 14% (11–18) | 19% (13–25) |
| Re‐analyzing previously collected data with the same statistical methods/models | 41% (37–46) | 21% (18–25) |
| Re‐analyzing previously collected data with different statistical methods/models | 36% (32–41) | 21% (17–24) |
| None of the above | 1% (0–2) | NA |
We tested whether different understandings of the definition or scope of replication produced different estimates of the rate of replication studies. We divided participants' estimates of replication rates according to which types of study included in Table 3 each participant considered a type of replication. The estimated replication rate was similar in all subsets.
When asked to comment on the obstacles to replication, 407 participants provided free‐text responses, giving insight into why the replication rate might be low (Table 4 ).
Summary of free‐text responses to the question “In your opinion, what are the main obstacles to replication?”
| | Difficulty funding and publishing | Academic culture | Logistical constraints | Environmental variability |
|---|---|---|---|---|
| Number of comments | 332 | 121 | 81 | 36 |
| Indicative quotes | “Given competitive landscape in academia, replication studies hold little reward for researcher‐i.e. no funding/hard to publish/not seen as novel so don't frame you as a research leader in any field” “Hard to publish…very limited resources for biodiversity/ecology research anyway.” | “I think most scientists want to be known for original work, not for doing ‘some else's’ science.” “Too many things to do, not enough ecologists.” “Lack of emphasis on its importance. funding tends to favour new/novel research. Stigma ‐ people may dislike others who try to replicate their studies. People may consider it ‘lesser or easier science’ replicating.” | “$$ and availability of research sites. When doing field ecology, it can be extremely difficult to replicate sites” “Logistics! Field/experiments can be expensive and time consuming ‐ also in small populations!” “Hard to find the detailed information necessary for proper replication in original study” | “Long term replication studies are vital to ecology however the problem is climate and habitat loss etc all of which can make it very hard to replicate experiments over time” “Unique attributes of year‐to‐year variability and the challenges that presents ‐ at least for field‐based work for other settings (lab/greenhouse) it seems much more reasonable/worthwhile” |
| Topics covered | difficulty funding, short duration of funding, difficulty publishing, expected low citation rate, not “novel” | bad for career advancement, prioritizing important novel work, replications not interesting to do | not enough time, insufficient transparency of methods, difficulty accessing original data, few candidate sites/populations/individuals | influence of: climate change, landscape level changes (e.g., caused by clearing or agriculture), year on year variation in climate |
4.1 Importance of replication
The overwhelming majority of the ecologists in our study were very positive about replication studies. They considered replication studies to be important, wanted to see more of them in the literature, and supported publishing them (Figure 1a‐d). Enthusiasm for replication studies is further underlined by the sheer quantity of free‐text comments our participants gave ( https://osf.io/bqc74 ). Although we did not give participants a free‐text question about their perspectives on the role of replication studies, some expressed their views about this in the general comments section at the end of the survey. Some evocative examples include:
Ecological replication studies should be necessary where results are applied directly to ecosystem management beyond the local/target species context of the study.
Replication means different things in different fields. In biodiversity research replication of studies/phenomena, typically with different settings, species, regions etc., is absolutely essential. The question is when there is enough evidence, i.e. when to stop. There is little point in replicating the study EXACTLY (cf. your question 9 above). In molecular biology or e.g. ecotoxicology it seems that doing the latter actually makes more sense.
Different labs should span together and run the same experiment in parallel to eventually publish together.
I think journals should automatically publish replications (or failures to replicate) if they published the original study.
I would also be interested in how microbiology vs other biology fields replicate results.
However, there is a disconnect between this message of support for replication studies expressed in portions of our survey and the data on how researchers publish, use, and prioritize replications. First, the best available estimate is that only 0.023% of ecology studies are identified by their authors as replications (Kelly, 2019). This is tiny compared to our participants' median estimate of 10% replication. The disconnect is evident even within our survey, where only a minority of respondents claimed to “almost always” check for replications when investigating a finding (Figure 2), despite emphasizing the importance of replication in other questions and free responses. Similarly, around a third of participants also indicated that, given limited funding, the focus should continue to be on novel research (Figure 1c) and that replication studies should only be published in special editions or dedicated replication journals, or only if the results differ (Figure 1d). This, combined with comments such as “People often want to research something novel, I think there's a mental block among scientists when it comes to replication; most recognize it's necessary, but most aren't particularly interested in doing it themselves,” suggests a gap between the perceived value of replication studies and the impetus to perform them. Comments such as this expose the mistake of assuming replication work—even direct replication—cannot make a novel contribution. For example, working out which aspects of a study are intrinsic to its conclusion and should not be varied in a replication is itself a substantial intellectual contribution (Nosek & Errington, 2017).
This disconnect may be explained by the obstacles identified in this paper, chief among them (a) researchers are, perhaps rightly (Ahadi et al., 2016 ; Asendorpf & Conner, 2012 ; Baker & Penny, 2016 ), concerned that they would have trouble publishing or funding replication studies, (b) conducting replication studies can be logistically problematic, (c) environmental variation makes conducting and interpreting the results of replication studies difficult (Shavit & Ellison, 2017 ), and (d) researchers are unwilling to conduct replication studies because they believe they are boring and less likely to provide prestige than novel research (Ahadi et al., 2016 ; Kelly, 2006 ).
There is movement toward making replication studies more feasible and publishable in other fields, with the inclusion of a criterion describing journals' stance on accepting replication studies as part of the TOP guidelines (Nosek et al., 2015 ; to which over 5,000 journals are signatories) and the advent of Registered Replication Reports (Simons, Holcombe, & Spellman, 2014 ) at several psychology journals. Similarly, initiatives like the many laboratories projects (e.g., Klein et al., 2014 ), StudySwap ( https://osf.io/9aj5g/ ) and the psychological science accelerator ( https://psysciacc.org/ ) build communities that may help overcome the logistical difficulties with replication studies as well as increasing the interest and prestige associated with conducting replication studies. Although no initiatives to directly replicate previously published studies yet exist in ecology, there is a growing movement to improve assessment of generality of hypotheses through collaborations across large numbers of laboratories, implementing identical experiments in different systems (Borer et al., 2014 ; Knapp et al., 2017 ; Peters, Loescher, Sanclements, & Havstad, 2014 ; Verheyen et al., 2016 , 2017 ). The success of these “distributed experiments” suggests that ecologists may be open to forms of collaborations designed to replicate published work.
As in Mulkay and Gilbert (1991), we find evidence of conceptual slippage between different types of replication study. We asked participants whether they consider different types of potential studies “replication studies.” Participants were able to select multiple options. We expected that participants who include conceptual replications in their definition of replication studies would provide higher estimates for the percentage of ecological studies that are replicated. However, there was little difference in participants' estimates of the replication rate regardless of how permissive their definition of replication was (Table 3). This suggests that ecologists have a fluid definition of what a “replication study” is. Similarly, the majority of surveys were distributed by hand, and early in the data collection, it became evident that some participants were thinking about replicates within a study (i.e., samples) rather than replication of the whole study. As soon as this became evident, we informed each new participant that we were interested in repeating whole studies, not replicates or samples within a study. The effect of this confusion on our results is likely to be minimal, because virtually all ecology studies contain within‐study replicates, yet only 36 of 439 participants (8%) gave answers higher than 50% for the question “What percentage of studies do you think are replicated in ecology?”. This 8% presumably captures all the participants who were answering about “replicates” as well as some who have a very broad definition of what constitutes a replication study.
We found a very high level of agreement (90%) that “redoing an experiment or study as closely as possible to the original” (i.e., a direct replication) should be considered a replication study. Most ecologists had a view of replication studies that is much broader than direct replication, to the extent that 38% considered “redoing an experiment or study with different methods in the same context” and 14% considered “redoing an experiment or study with different methods in a different context” to be replication studies. This permissive definition of a replication study may be driven by the strong influence of environmental variability on the results of ecological research. It is also consistent with Kelly's (2006) observation that conceptual and quasireplication are common in behavioral ecology. Conceptual and quasireplications are required to extend spatial, temporal, and taxonomic generalizability in a field with multitudes of study systems, all of which are strongly influenced by their environment.
Many participating ecologists commented that direct replications may be difficult or impossible in ecology due to the strong influence of environmental variability and the need for long‐term studies, concerns that are also voiced by Kelly (2006), Nakagawa and Parker (2015), Kress (2017), and Schnitzer and Carson (2016). Schnitzer and Carson (2016) propose putting more resources into ensuring that new studies are conducted over large spatial and temporal scales, which would perform a similar epistemic function to certain types of replication study. Nakagawa and Parker (2015) suggest that the impact of environmental variability can be overcome by conducting multiple robust replications (inevitably in different environmental conditions) and evaluating the overall trends using meta‐analysis. In contrast, Kelly (2006) advocates pairing direct and conceptual replications within a single study, providing insights about both the validity and generalizability of the results and increasing the chance of publication (when compared to a direct replication alone). These suggestions have the potential to make replication studies in ecology more feasible and thereby improve the reliability of the ecology literature. Emphasizing the importance of conceptual replications may also make it easier to build a research culture that is more accepting of replication studies.
Conceptual replications may already be common in ecology and evolutionary biology, but presumably because of the desire to appear novel, such studies are almost never identified as replication. Kelly ( 2006 ) found that even though direct replications were absent from a sample of studies in three animal behavior journals, more than a quarter of these studies could be classified as conceptual replications with the same study species, and most of the rest were “quasireplications” in which a previously tested hypothesis was studied in a new taxon. It seems therefore that testing previously tested hypotheses is the norm. We just do not notice because researchers explicitly distinguish their work from previously published research rather than calling attention to the ways in which their studies are replications. In fact, almost none of these conceptual or quasireplications are identified as replications by their authors (Kelly, 2019 ). This brings up two shortcomings of the current system. First, as pointed out earlier, researchers almost never conduct direct replications, and so the benefits of direct replication in terms of convincing tests of internal validity, are nearly absent. Second, even when researchers conduct conceptual or quasireplications, if they are reluctant to call their work replication, some of the inferential value of their work in testing for generality may be missed. In fact, anecdotally, it seems that inconsistency among conceptual replications is often attributed to biological variation and that this is typically interpreted as meaning that the hypothesized process is more complex or contingent on other factors than originally thought. The generality of the original hypothesis is often not directly challenged.
Most of our participating ecologists agreed that replication studies are important; however, some responses are suggestive of ambivalence toward conducting them. Convincing editors to accept Registered Replication Reports, emphasizing the value of less direct, more conceptual replication, and beginning grassroots replication initiatives (inspired by StudySwap, psychological science accelerator, the many laboratories projects, and existing distributed experiments in ecology) in ecology and related fields may combat ecologists' reluctance to conduct replication studies. Beyond that, we believe that the best approach to replication studies in ecology is to:
The authors have no conflicts of interest.
Hannah Fraser: Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (supporting); Project administration (equal); Visualization (lead); Writing‐original draft (lead); Writing‐review & editing (lead). Ashley Barnett: Conceptualization (equal); Data curation (supporting); Formal analysis (supporting); Methodology (supporting). Timothy H. Parker: Conceptualization (equal); Writing‐original draft (supporting); Writing‐review & editing (supporting). Fiona Fidler: Conceptualization (equal); Funding acquisition (lead); Investigation (supporting); Methodology (supporting); Project administration (supporting); Resources (lead); Supervision (lead); Writing‐original draft (supporting); Writing‐review & editing (supporting).
This article has been awarded Open Data and Open Materials Badges. All materials and data are publicly accessible via the Open Science Framework at https://doi.org/10.17605/OSF.IO/BQC74 .
Franca Agnoli provided feedback that improved the manuscript and 439 anonymous ecologists generously gave their time to fill in our survey.
Fraser H, Barnett A, Parker TH, Fidler F. The role of replication studies in ecology. Ecol Evol. 2020;10:5197–5207. https://doi.org/10.1002/ece3.6330
This research was funded by the Australian Research Council Future Fellowship FT150100297s.
Scientific findings often fail to be replicated, researchers say.
Shankar Vedantam
A massive effort to test the validity of 100 psychology experiments finds that more than 50 percent of the studies fail to replicate. This is based on a new study published in the journal "Science."
What scientists learn from failed replications: how to do better science.
by Brian Resnick
One of the cornerstone principles of science is replication. This is the idea that experiments need to be repeated to find out if the results will be consistent. The fact that an experiment can be replicated is how we know its results contain a nugget of truth. Without replication, we can’t be sure.
For the past several years, social scientists have been deeply worried about the replicability of their findings. Incredibly influential, textbook findings in psychology — like the “ ego depletion” theory of willpower, or the “ marshmallow test ” — have been bending or breaking under rigorous retests. And the scientists have learned that what they used to consider commonplace methodological practices were really just recipes to generate false positives. This period has been called the “ replication crisis ” by some.
And the reckoning is still underway. Recently, a team of social scientists — spanning psychologists and economists — attempted to replicate 21 findings published in the most prestigious general science journals: Nature and Science. Some of the retested studies have been widely influential in science and in pop culture, like a 2011 paper on whether access to search engines hinders our memories, or whether reading books improves a child’s theory of mind (meaning their ability to understand that other people have thoughts and intentions different from their own).
On Monday, they’re publishing their results in the journal Nature Human Behavior. Here’s their take-home lesson: Even studies that are published in the top journals should be taken with a grain of salt until they are replicated. They’re initial findings, not ironclad truth. And they can be really hard to replicate, for a variety of reasons.
The scientists who ran the 21 replication tests didn’t just repeat the original experiments — they made them more rigorous. In some cases, they increased the number of participants by a factor of five, and preregistered their study and analysis designs before a single participant was brought into the lab.
All the original authors (save for one group that couldn't be reached) signed off on the study designs too. Preregistering is like making a promise to not deviate from a plan and inject bias into the results.
Here are the results: 13 of the 21 results replicated. But perhaps just as notable: Even among the studies that did pass, the effect sizes (that is, the difference between the experimental group and the control group in the experiment, or the size of the change the experimental manipulation made) decreased by around half, meaning that the original findings likely overstated the power of the experimental manipulation.
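To make the parenthetical definition concrete: a standardized effect size such as Cohen's d divides the difference between group means by the pooled standard deviation, so “decreased by around half” means the replication's standardized difference was roughly half the original's. The numbers below are invented purely to illustrate the calculation; they are not the project's data.

```python
# Toy illustration of a standardized effect size (Cohen's d) shrinking on replication.
from math import sqrt

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    pooled_sd = sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

print(cohens_d(5.6, 5.0, 1.0, 1.0, 30, 30))    # hypothetical original study: d = 0.6
print(cohens_d(5.3, 5.0, 1.0, 1.0, 150, 150))  # hypothetical larger replication: d = 0.3
```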
“Overall, our study shows statistically significant scientific findings should be interpreted rather cautiously until they have been replicated, even if they have been published in the most renowned journals,” Felix Holzmeister, an Austrian economist and one of the study co-authors, says.
Many of the papers that were retested contained multiple experiments. Only one experiment from each paper was tested. So these failed replications don’t necessarily mean the theory behind the original findings is totally bunk.
For instance, the famous “Google Effects on Memory” paper — which found that we often don’t remember things as well when we know we can search for them online — did not replicate in this study. But the experiment chosen was a word-priming task (i.e., does thinking about the internet make it harder to retrieve information), and not the more real-world experiment that involved actually answering trivia statements. And other research since has bolstered that paper’s general argument that access to the internet is shifting the relationship we have with, and the utility of, our own memories.
There could be a lot of reasons a result doesn’t replicate. One is that the experimenters doing the replication messed something up.
Another reason can be that the study stumbled on a false positive.
One of the experiments that didn’t replicate was from University of Kentucky psychologist Will Gervais. The experiment tried to see if getting people to think more rationally would make them less willing to report religious belief.
“In hindsight, our study was outright silly,” Gervais says. They had people look at a picture of Rodin’s The Thinker or another statue. They thought The Thinker would nudge people to think harder.
“When we asked them a single question on whether they believe in God, it was a really tiny sample size, and barely significant ... I’d like to think it wouldn’t get published today,” Gervais says. (And know, this study was published in Science a top journal.)
In other cases, a study may not replicate because the target — the human subjects — has changed. In 2012, MIT psychologist David Rand published a paper in Nature on human cooperation. The experiment involved online participants playing an economics game. He argues that a lot of online study participants have since grown familiar with this game, which makes it a less useful tool to probe real-life behaviors. His experiment didn’t replicate in the new study.
Finding out why a study didn’t replicate is hard work. But it’s exactly the type of work, and thinking, that scientists need to be engaged in. The point of this replication project, and others like it , is not to call out individual researchers. “It’s a reminder of our values,” says Brian Nosek, a psychologist and the director of the Center for Open Science , who collaborated on the new study. Scientists who publish in top journals should know their work may be checked up on. It’s also important, he notes, to know that social science’s inability to be replicable is in itself a replicable finding.
Often, when studies don’t replicate, it’s not that the effort totally disproves the underlying hypothesis. And it doesn’t mean the original study authors were frauds. But replication results do often significantly change the story we tell about the experiment .
For instance, I recently wrote about a replication effort of the famous “marshmallow test” studies, which originally showed that the ability to delay gratification early in life is correlated with success later on. A new paper found this correlation, but when the authors controlled for factors like family background, the correlation went away.
Here’s how the story changed: Delay of gratification is not a unique lever to pull to positively influence other aspects of a person’s life. It’s a consequence of bigger-picture, harder-to-change components of a person.
In science, too often, the first demonstration of an idea becomes the lasting one. Replications are a reminder that in science, this isn’t supposed to be the case. Science ought to embrace and learn from failure.
The “replication crisis” in psychology, as it is often called, started around 2010 , when a paper using completely accepted experimental methods was published purporting to find evidence that people were capable of perceiving the future, which is impossible. This prompted a reckoning : Common practices like drawing on small samples of college students were found to be insufficient to find true experimental effects.
Scientists thought if you could find an effect in a small number of people, that effect must be robust. But often, significant results from small samples turn out to be statistical flukes. (For more on this, read our explainer on p-values.)
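The “fluke” problem has a simple mechanical explanation that a few lines of simulation can show: when the true effect is modest and the groups are small, only experiments that happen to observe an unusually large difference cross the p < 0.05 line, so the significant results that get published systematically overstate the effect. The sketch below is an illustration with an assumed true effect, added for this discussion; it is not part of the original reporting.

```python
# Winner's curse sketch: among small studies, the significant ones overestimate the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n = 0.2, 20                       # assumed modest true effect, 20 subjects per group
significant_effects = []

for _ in range(5000):
    a = rng.normal(true_d, 1, n)          # treatment group
    b = rng.normal(0.0, 1, n)             # control group
    if stats.ttest_ind(a, b).pvalue < 0.05:
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        significant_effects.append((a.mean() - b.mean()) / pooled_sd)

print("Share of experiments reaching p < 0.05:", len(significant_effects) / 5000)
print("True effect size:", true_d)
print("Mean observed effect among significant results:",
      round(float(np.mean(significant_effects)), 2))  # far larger than 0.2
```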
The crisis intensified in 2015 when a group of psychologists, which included Nosek, published a report in Science with evidence of an overarching problem: When 270 psychologists tried to replicate 100 experiments published in top journals, only around 40 percent of the studies held up. The remainder either failed or yielded inconclusive data. And again, the replications that did work showed weaker effects than the original papers. The studies that tended to replicate had more highly significant results compared to the ones that just barely crossed the threshold of significance.
Another important reason to do replications, Nosek says, is to get better at understanding what types of studies are most likely to replicate, and to sharpen scientists’ intuitions about what hypotheses are worthy of testing and which are not.
As part of the new study, Nosek and his colleagues added a prediction component. A group of scientists took bets on which studies they thought would replicate and which they thought wouldn’t. The bets largely tracked with the final results.
As you can see in the chart below, the yellow dots are the studies that did not replicate, and they were all unfavorably ranked by the prediction market survey.
“These results suggest [there’s] something systematic about papers that fail to replicate,” Anna Dreber, a Stockholm-based economist and one of the study co-authors, says.
One thing that stands out: Many of the papers that failed to replicate sound a little too good to be true. Take this 2010 paper that finds simply washing hands negates a common human hindsight bias . When we make a tough choice, we often look back on the choice we passed on unfavorably and are biased to find reasons to justify our decision. Washing hands in an experiment “seems to more generally remove past concerns, resulting in a metaphorical ‘clean slate’ effect,” the study’s abstract stated .
It all sounds a little too easy, too simple — and it didn’t replicate.
All that said, there are some promising signs that social science is getting better. More and more scientists are preregistering their study designs . This prevents them from cherry-picking results and analyses that are more favorable to their favored conclusions. Journals are getting better at demanding larger subject pools in experiments and are increasingly insisting that scientists share all the underlying data of their experiments for others to assess.
“The lesson out of this project,” Nosek says, “is a very positive message of reformation. Science is going to get better.”
CONCLUSION 5-3: Because many scientists routinely conduct replication tests as part of a follow-on work and do not report replication results separately, the evidence base of non-replicability across all science and engineering research is incomplete. ... Experiments conducted under the same conditions may run the risk of finding "truths ...
Being able to replicate scientific findings is crucial for scientific progress1-15. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and ...
Replication is a key idea in science and statistics, but is often misunderstood by researchers because they receive little education or training on experimental design. Consequently, the wrong entity is replicated in many experiments, leading to pseudoreplication or the "unit of analysis" problem [1,2]. This results in exaggerated sample ...
The trick, he says, is in seeing replication not as an end in itself but as a means for acquainting yourself with the methods used in a study, the original author's line of thinking, the complications he or she must have faced, and the solutions they devised to those problems. Replicating an experiment, or even the whole study, can be useful ...
Replication. Once we have repeated our testing over and over, and think we understand the results, then it is time for replication. That means getting other scientists to perform the same tests, to see whether they get the same results. As with repetition, the most important things to watch for are results that don't fit our hypothesis, and for ...
Over the last 8-10 years, concern over a "replication crisis" in science has mounted. The basis of this concern comes from large‐scale direct replication projects in several fields which found low rates of successful replication. ... Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015 ...
A massive effort to test the validity of 100 psychology experiments finds that more than 50 percent of the studies fail to replicate. This is based on a new study published in the journal "Science."
Good experimental design practice includes planning for replication. First, identify the questions the experiment aims to answer. Next, determine the proportion of variability induced by each step ...
The scientists who ran the 21 replication tests didn't just repeat the original experiments — they made them more rigorous. In some cases, they increased the number of participants by a factor ...