Reproducibility of Scientific Results

The terms “reproducibility crisis” and “replication crisis” gained currency in conversation and in print over the last decade (e.g., Pashler & Wagenmakers 2012), as disappointing results emerged from large scale reproducibility projects in various medical, life and behavioural sciences (e.g., Open Science Collaboration, OSC 2015). In 2016, a poll conducted by the journal Nature reported that more than half (52%) of scientists surveyed believed science was facing a “replication crisis” (Baker 2016). More recently, some authors have moved to more positive terms for describing this episode in science; for example, Vazire (2018) refers instead to a “credibility revolution” highlighting the improved methods and open science practices it has motivated.

The term “crisis” is used to refer collectively to at least the following:

  • the virtual absence of replication studies in the published literature in many scientific fields (e.g., Makel, Plucker, & Hegarty 2012),
  • widespread failure to reproduce results of published studies in large systematic replication projects (e.g., OSC 2015; Begley & Ellis 2012),
  • evidence of publication bias (Fanelli 2010a),
  • a high prevalence of “questionable research practices”, which inflate the rate of false positives in the literature (Simmons, Nelson, & Simonsohn 2011; John, Loewenstein, & Prelec 2012; Agnoli et al. 2017; Fraser et al. 2018), and
  • the documented lack of transparency and completeness in the reporting of methods, data and analysis in scientific publication (Bakker & Wicherts 2011; Nuijten et al. 2016).

The associated open science reform movement aims to rectify conditions that led to the crisis. This is done by promoting activities such as data sharing and public pre-registration of studies, and by advocating stricter editorial policies around statistical reporting including publishing replication studies and statistically non-significant results.

This review consists of four distinct parts. First, we look at the term “reproducibility” and related terms like “repeatability” and “replication”, presenting some definitions and conceptual discussion about the epistemic function of different types of replication studies. Second, we describe the meta-science research that has established and characterised the reproducibility crisis, including large scale replication projects and surveys of questionable research practices in various scientific communities. Third, we look at attempts to address epistemological questions about the limitations of replication, and what value it holds for scientific inquiry and the accumulation of knowledge. The fourth and final part describes some of the many initiatives the open science reform movement has proposed (and in many cases implemented) to improve reproducibility in science. In addition, we reflect there on the values and norms which those reforms embody, noting their relevance to the debate about the role of values in the philosophy of science.

1. Replicating, Repeating, and Reproducing Scientific Results

A starting point in any philosophical exploration of reproducibility and related notions is to consider the conceptual question of what such notions mean. According to some (e.g., Cartwright 1991), the terms “replication”, “reproduction” and “repetition” denote distinct concepts, while others use these terms interchangeably (e.g., Atmanspacher & Maasen 2016a). Different disciplines can have different understandings of these terms too. In computational disciplines, for example, reproducibility often refers to the ability to reproduce computations alone, that is, it relates exclusively to sharing and sufficiently annotating data and code (e.g., Peng 2011, 2015). In those disciplines, replication describes the redoing of whole experiments (Barba 2017, Other Internet Resources). In psychology and other social and life sciences, however, reproducibility may refer to either the redoing of computations, or the redoing of experiments. The Reproducibility Projects, coordinated by the Center for Open Science, redo entire studies, including data collection and analysis. A recent funding program announcement by DARPA (US Defense Advanced Research Projects Agency) distinguished between reproducibility and replicability, where the former refers to computational reproducibility and the latter to the redoing of experiments. Here we use all three terms—“replication”, “reproduction” and “repetition”—interchangeably, unless explicitly describing the distinctions of other authors.

When describing a study as “replicable”, people could have in mind either of at least two different things. The first is that the study is replicable in principle, in the sense that it can be carried out again, particularly when its methods, procedures and analysis are described in a sufficiently detailed and transparent way. The second is that the study is replicable in the sense that it can be carried out again and, when this happens, the replication study will successfully produce the same or sufficiently similar results as the original. A study may be replicable in the first sense but not in the second: one might be able to replicate the methods, procedures and analysis of a study, but fail to successfully replicate the results of the original study. Similarly, when people talk of a “replication”, they could also have in mind two different things: the replication of the methods, procedures and analysis of a study (irrespective of the results) or, alternatively, the replication of such methods, procedures and analysis as well as the results.

Arguably, most typologies of replication make more or less fine-grained distinctions between direct replications (which closely follow the original study to verify its results) and conceptual replications (which deliberately alter important features of the study to generalize findings or to test the underlying hypothesis in a new way). As suggested, this distinction may not always be known by these terms. For example, roughly the same distinction is referred to as exact and inexact replication by Keppel (1982); concrete and conceptual replication by Sargent (1981); and literal, operational and constructive replication by Lykken (1968). Computational reproducibility is most often direct (reproducing particular analysis outcomes from the same data set using the same code and software), but it can also be conceptual (analysing the same raw data set with alternative approaches, different models or statistical frameworks). For an example of a conceptual computational reproducibility study, see Silberzahn and Uhlmann 2015.

We do not attempt to resolve these disciplinary differences or to create a new typology of replication, and instead we will provide a limited snapshot of the conceptual terrain by surveying three existing typologies—from Stefan Schmidt (2009), from Omar Gómez, Natalia Juristo, and Sira Vegas (2010) and from Hans Radder. Schmidt’s account has been influential and widely-cited in psychology and social sciences, where the replication crisis literature is heavily concentrated. Gómez, Juristo, and Vegas’s (2010) typology of replication is based on a multidisciplinary survey of over 18 scholarly classifications of replication studies which collectively contain more than 79 types of replication. Finally, Radder’s (1996, 2003, 2006, 2009, 2012) typology is perhaps best known within philosophy of science itself.

1.1 An Account from the Social Sciences

Schmidt outlines five functions of replication studies in the social sciences:

  • Function 1. Controlling for sampling error—that is, to verify that previous results in a sample were not obtained purely by chance outcomes which paint a distorted picture of reality
  • Function 2. Controlling for artifacts (internal validity)—that is, ensuring that experimental results are a proper test of the hypothesis (i.e., have internal validity) and do not reflect unintended flaws in the study design (such as when a measurement result is, say, an artifact of a faulty thermometer rather than an actual change in a substance’s temperature)
  • Function 3. Controlling for fraud,
  • Function 4. Enabling generalizability,
  • Function 5. Enabling verification of the underlying hypothesis.

Modifying Hendrik’s (1991) classes of variables that define a research space, Schmidt (2009) presents four classes of variables which may be altered or held constant in order for a given replication study to fulfil one of the above functions. The four classes are:

  • Class 1. Information conveyed to participants (for example, their task instructions).
  • Class 2. Context and background. This is a large class of variables, and it includes: participant characteristics (e.g., age, gender, specific history); the physical setting of the research; characteristics of the experimenter; incidental characteristics of materials (e.g., type of font, colour of the room),
  • Class 3. Participant recruitment, including selection of participants and allocation to conditions (such as experimental or control conditions),
  • Class 4. Dependent variable measures (or in Schmidt’s terms “procedures for the constitution of the dependent variable”, 2009: 93)

Schmidt then systematically works through examples of how each function can be achieved by altering and/or holding a different class or classes of variable constant. For example, to fulfil the function of controlling for sampling error (Function 1), one should alter only variables regarding participant recruitment (Class 3), attempting to keep variables in all other classes as close to the original study as possible. To control for artefacts (Function 2), one should alter variables concerning the context and dependent variable measures (variables in Classes 2 and 4 respectively), but keep variables in Classes 1 and 3 (information conveyed to participants and participant recruitment) as close to the original as possible. Schmidt, like most other authors in this area, acknowledges the practical limits of being able to hold all else constant. Controlling for fraud (Function 3) is served by the same arrangements as controlling for artefacts (Function 2). In Schmidt’s account, controlling for sampling error, artefacts and fraud (Functions 1 to 3) are connected by a theme of confirming the results of the original study. Functions 4 and 5 go beyond this: generalizing to new populations (Function 4) is served by changes to participant recruitment (Class 3), and confirming the underlying hypothesis (Function 5) is served by changes to the information conveyed, the context and dependent variable measures (Classes 1, 2 and 4 respectively) but not by changes to participant recruitment (Class 3, although Schmidt acknowledges that holding the latter class of variables constant whilst varying everything else is often practically impossible). Attempts to enable verification of the underlying research hypothesis (i.e., to fulfil Function 5) alone are what Schmidt classifies as conceptual replications, following Rosenthal (1991). Attempts to fulfil the other four functions are considered variants of direct replications.
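
The mapping Schmidt describes can be summarised schematically. The following sketch (our own illustrative encoding in Python, not Schmidt’s notation) lists, for each function, which classes of variables a replication study alters, with all remaining classes held as close to the original as possible.

```python
# Schematic, non-authoritative encoding of Schmidt's (2009) mapping between
# replication functions and the classes of variables to alter; everything not
# listed under "alter" is held as close to the original study as possible.
CLASSES = {
    1: "information conveyed to participants",
    2: "context and background",
    3: "participant recruitment",
    4: "dependent variable measures",
}

FUNCTIONS = {
    "1. control for sampling error":   {3},         # new sample from the same population
    "2. control for artifacts":        {2, 4},
    "3. control for fraud":            {2, 4},      # same arrangement as Function 2
    "4. enable generalizability":      {3},         # recruit from a new population
    "5. verify underlying hypothesis": {1, 2, 4},   # Schmidt's "conceptual replication"
}

for function, altered in FUNCTIONS.items():
    alter = ", ".join(CLASSES[c] for c in sorted(altered))
    hold = ", ".join(CLASSES[c] for c in sorted(set(CLASSES) - altered))
    print(f"{function}\n  alter:         {alter}\n  hold constant: {hold}\n")
```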

In summary, for Schmidt, direct replications control for sampling error, artifacts, and fraud, and provide information about the reliability and validity of prior empirical work. Conceptual replications help corroborate the underlying theory or substantive (as opposed to statistical) hypothesis in question and the extent to which they generalize in new circumstances and situations. In practice, direct and conceptual replications lie on a continuum, with replication studies varying more or less compared to the original on potentially a great number of dimensions.

1.2 An Interdisciplinary Account

Gómez, Juristo, and Vegas’s (2010) survey of the literature in 18 disciplines identified 79 types of replication, not all of which they considered entirely distinct. They identify five main ways in which a replication study may diverge from an initial study, with some similarities to Schmidt’s four classes above:

  • The site or spatial location of the replication experiment: replication experiments may be conducted in a location that is or is not the same as the site of the initial study.
  • The experimenters conducting a replication may be exclusively the same as the original, exclusively different, or a combination of new and original experimenters.
  • The apparatus, including the design, materials, instruments and other important experimental objects and/or procedures, may vary between original and replication studies.
  • The operationalisations employed may differ, where operationalisation refers to the measurement of variables. For example, in psychology this might include using two different scales for measuring depression (as a dependent variable).
  • Finally, studies may vary in their population properties.

A change in any one or combination of these elements in a replication study corresponds to a different purpose underlying the study, and thereby establishes a different kind of validity. Like Schmidt, Gómez, Juristo, and Vegas then systematically work through how changes to each of the above fulfil different epistemic functions (summarised schematically in the sketch following the list below).

  • Function 1. Conclusion Validity and Controlling for Sampling Error: If each of the five elements above is unchanged in a replication study, then the purpose of the replication is to control for sampling error, that is, to verify that previous results in a sample were not obtained purely by chance outcomes which make the sample misleading or unrepresentative. This provides a safeguard against what is known as a type I error: incorrectly rejecting the null hypothesis (that is, the hypothesis that there is no relationship between two phenomena under investigation) when it is in fact true. These studies establish conclusion validity, that is, the credibility or believability of an observed relationship or phenomenon.
  • Function 2. Internal Validity and Controlling for Artefactual Results: If a replication study differs with respect to the site, experimenters or apparatus, then its purpose is to establish that previously observed results are not an artefact of a particular apparatus, lab or so on. These studies establish internal validity, that is, the extent to which results can be attributed to the experimental manipulation itself rather than to extraneous variables.
  • Function 3. Construct Validity and Determining Limits for Operationalizations: If a replication study differs with respect to operationalisations, then its purpose is to determine the extent to which the effect generalizes across measures of manipulated or dependent variables (e.g., the extent to which the effect does not depend on the particular psychometric test one uses to evaluate depression or IQ). Such studies fulfil the function of establishing construct validity in that they provide evidence that the effect holds across different ways of measuring the constructs.
  • Function 4. External Validity and Determining Limits in the Population Properties: If a replication study differs with respect to its population properties, then its purpose is to ascertain the extent to which the results are generalizable to different populations, populations which, in Gómez, Juristo, and Vegas’s view, concern subjects and experimental objects such as programs. Such studies reinforce external validity—the extent to which the results are generalizable to different populations.
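
As with Schmidt’s account, the mapping can be compressed into a small lookup. The sketch below is our own illustrative encoding, not Gómez, Juristo, and Vegas’s: which element of the study is varied, and which kind of validity the replication thereby helps to establish.

```python
# Compact, illustrative encoding (ours, not Gómez, Juristo, and Vegas's) of the
# mapping just listed: vary an element of the study, establish a kind of validity.
VALIDITY_BY_VARIED_ELEMENT = {
    "nothing (all five elements unchanged)": "conclusion validity (controls for sampling error)",
    "site, experimenters and/or apparatus":  "internal validity (controls for artefactual results)",
    "operationalisations":                   "construct validity (limits of the measures used)",
    "population properties":                 "external validity (limits of generalization)",
}

for varied, validity in VALIDITY_BY_VARIED_ELEMENT.items():
    print(f"vary {varied:40s} -> {validity}")
```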

1.3 A Philosophical Account

Radder (1996, 2003, 2006, 2009, 2012) distinguishes three types of reproducibility. One is the reproducibility of what Radder calls an experiment’s material realization. Using one of Radder’s own examples as an illustration, two people may carry out the same actions to measure the mass of an object. Despite doing the same actions, person A regards themselves as measuring the object’s Newtonian mass while person B regards themselves as measuring the object’s Einsteinian mass. Here, then, the actions or material realization of the experimental procedure can be reproduced, but the theoretical descriptions of their significance differ. Radder, however, does not specify what is required for one material realisation to be a reproduction of another, a pertinent question, especially since, as Radder himself affirms, no reproduction will be exactly the same as any other reproduction (1996: 82–83).

A second type of reproducibility is the reproducibility of an experiment, given a fixed theoretical description. For example, a social scientist might conduct two experiments to examine social conformity. In one experiment, a young child might be instructed to give an answer to a question before a group of other children who are, unknown to the former child, instructed to give wrong answers to the same question. In another experiment, an adult might be instructed to give an answer to a question before a group of other adults who are, unknown to the former adult, instructed to give wrong answers to the same question. If the child and the adult give a wrong answer that conforms to the answers of others, then the social scientist might interpret the result as exemplifying social conformity. For Radder, the theoretical description of the experiment might be fixed, specifying that if some people in a participant’s surroundings give intentionally false answers to the question, then the genuine participant will conform to the behaviour of their peers. However, the material realization of these experiments differs insofar as one concerns children and the other adults. It is difficult to see how, in this example at least, this differs from what either Schmidt or Gómez, Juristo, and Vegas would refer to as establishing generalizability to a different population (Schmidt’s [2009] Class 3 and Function 4; Gómez, Juristo, and Vegas’s [2010] way 5 and Function 4).

The third kind of reproducibility is what Radder calls replicability. This is where different experimental procedures produce the same experimental result (otherwise known as a successful replication). For example, Radder notes that multiple experiments might obtain the result “a fluid of type f has a boiling point b”, despite using different kinds of thermometers to measure this boiling point (2006: 113–114).

Schmidt (2009) points out that the difference between Radder’s second and third types of reproducibility is small in comparison to their differences to the first type. He consequently suggests his alternative distinction between direct and conceptual replication, presumably intending a conceptual replication to cover Radder’s second and third types.

In summary, whilst Gómez, Juristo, and Vegas’s typology draws distinctions in slightly different places to Schmidt’s, its purpose is arguably the same—to explain what types of alterations in replication studies fulfil different scientific goals, such as establishing internal validity or the extent of generalization and so on. With the exception of his discussion of reproducing the material realization, Radder’s other two categories can perhaps be seen as fitting within the larger range of functions described by Schmidt and Gómez et al., who both acknowledge that in practice, direct and conceptual replications lie on a noisy continuum.

2. Meta-Science: Establishing, Monitoring, and Evaluating the Reproducibility Crisis

In psychology, the origin of the reproducibility crisis is often linked to Daryl Bem’s (2011) paper, which reported empirical evidence for the existence of “psi”, otherwise known as Extra Sensory Perception (ESP). This paper passed through the standard peer review process and was published in the high-impact Journal of Personality and Social Psychology. The controversial nature of the findings inspired three independent replication studies, each of which failed to reproduce Bem’s results. However, these replication studies were rejected from four different journals, including the journal that had originally published Bem’s study, on the grounds that the replications were not original or novel research. They were eventually published in PLoS ONE (Ritchie, Wiseman, & French 2012). This created controversy in the field, and was interpreted by many as demonstrating how publication bias impeded science’s self-correction mechanism. In medicine, the origin of the crisis is often attributed to Ioannidis’ (2005) paper “Why Most Published Research Findings Are False”. The paper offered formal arguments about inflated rates of false positives in the literature—where a “false positive” result claims a relationship exists between phenomena when it in fact does not (e.g., a claim that consuming a drug is correlated with symptom relief when it in fact is not). Ioannidis (2005) also reported very low (11%) empirical reproducibility rates from a set of pre-clinical trial replications at Amgen, later independently published by Begley and Ellis (2012). In all disciplines, the replication crisis is also more generally linked to earlier criticisms of Null Hypothesis Significance Testing (e.g., Szucs & Ioannidis 2017), which pointed out the neglect of statistical power (e.g., Cohen 1962, 1994) and a failure to adequately distinguish statistical and substantive hypotheses (e.g., Meehl 1967, 1978). This is discussed further below.
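
The formal core of Ioannidis’s argument can be illustrated with a short calculation. The sketch below is our own, using the standard positive predictive value formula; the numerical scenarios are illustrative choices, not Ioannidis’s own, and it ignores the further aggravating factors (bias, multiple teams) that his paper also models.

```python
# Sketch of the core calculation behind Ioannidis's (2005) argument: the probability
# that a "positive" (statistically significant) finding is true -- the positive
# predictive value, PPV -- depends on the pre-study odds R that a tested relationship
# is real, on statistical power (1 - beta), and on alpha. Scenarios are illustrative.
def ppv(R, power, alpha=0.05):
    return (power * R) / (power * R + alpha)

for R, power in [(1.0, 0.8), (0.25, 0.8), (0.1, 0.2), (0.05, 0.2)]:
    print(f"pre-study odds R = {R:4.2f}, power = {power:.1f} -> PPV = {ppv(R, power):.2f}")

# With low pre-study odds and low power, most "positive" findings are false even
# before publication bias and questionable practices are taken into account.
```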

In response to the events above, a new field identifying as meta-science (or meta-research ) has become established over the last decade (Munafò et al. 2017). Munafò et al. define meta-science as “the scientific study of science itself” (2017: 1). In October 2015, Ioannidis, Fanelli, Dunne, and Goodman identified over 800 meta-science papers published in the five-month period from January to May that year, and estimated that the relevant literature was accruing at the rate of approximately 2,000 papers each year. Referring to the same bodies of work with slightly different terms, Ioannidis et al. define “meta-research” as

an evolving scientific discipline that aims to evaluate and improve research practices. It includes thematic areas of methods, reporting, reproducibility, evaluation, and incentives (how to do, report, verify, correct, and reward science). (2015: 1)

Multiple research centres dedicated to this work now exist, including, for example, the Tilburg University Meta-Research Center in psychology, the Meta-Research Innovation Center at Stanford (METRICS), and others listed in Ioannidis et al. 2015 (see Other Internet Resources ). Relevant research in medical fields is also covered in Stegenga 2018.

Projects that self-identify as meta-science or meta-research include:

  • Large, crowd-sourced, direct (or close) replication projects such as The Reproducibility Projects in Psychology (OSC 2015) and Cancer Biology (Errington et al. 2014) and the Many Labs projects in psychology (e.g., Klein et al. 2014);
  • Computational reproducibility projects, that is, redoing analysis using the same original data set (e.g., Chang & Li 2015);
  • Bibliographic studies documenting the extent of publication bias in different scientific fields and changes over time (e.g., Fanelli 2010a, 2010b, 2012);
  • Surveys of the use of Questionable Research Practices (QRPs) amongst researchers and their impact on the publication literature (e.g., John, Loewenstein, & Prelec 2012; Fiedler & Schwarz 2016; Agnoli et al. 2017; Fraser et al. 2018);
  • Surveys of the completeness, correctness and transparency of methods and analysis reporting in scientific journals (e.g., Nuijten et al. 2016; Bakker & Wicherts 2011; Cumming et al. 2007; Fidler et al. 2006);
  • Survey and interview studies of researchers’ understanding of core methodological and statistical concepts, and real and perceived obstacles to improving practices (Bakker et al. 2016; Washburn et al. 2018; Allen, Dorozenko, & Roberts 2016);
  • Evaluation of incentives to change behaviour, thereby improving reproducibility and encouraging more open practices (e.g., Kidwell et al. 2016).

The most well known of these projects is undoubtedly the Reproducibility Project: Psychology, coordinated by what is now the Center for Open Science in Charlottesville, VA (then the Open Science Collaboration). It involved 270 crowd-sourced researchers in 64 different institutions in 11 different countries. Researchers attempted direct replications of 100 studies published in three leading psychology journals in the year 2008. Each study was replicated only once. Replications attempted to follow original protocols as closely as possible, though some differences were unavoidable (e.g., some replication studies were done with European samples when the original studies used US samples). In almost all cases, replication studies used larger sample sizes than the original studies and therefore had greater statistical power—that is, a greater probability of correctly rejecting the null hypothesis (i.e., that no relationship exists) when the hypothesis is false. A number of measures of reproducibility were reported (a schematic sketch of how such measures might be computed follows the list):

  • The proportion of studies in which there was a match in the statistical significance between original and replication. (Here, the statistical significance of a result is the probability that it would occur given the null hypothesis, and p values are common measures of such probabilities. A replication study and an original study would have a match in statistical significance if, for example, they both specified that the probability of the original and replication results occurring given the null hypothesis is less than 5%—i.e., if the p values for results in both studies are below 0.05.) Thirty-six percent (36%) of results were successfully reproduced according to this measure.
  • The proportion of studies in which the Effect Size (ES) of the replication study fell within the 95% Confidence Interval (CI) of the original. (Here, an ES represents the strength of a relationship between phenomena—a toy example of which is how strongly consumption of a drug is correlated with symptom relief—and a Confidence Interval provides some indication of the probability that the ES of the replication study is close to the ES of the original study.) Forty seven percent (47%) of results were successfully reproduced according to this measure.
  • The correlation between original ES and replication ES. Replication study ESs were roughly half the size of original ESs.
  • The proportion of studies for which subjective ratings by independent researchers indicated a match between the replication and the original. Thirty nine percent (39%) were considered successful reproductions according to this measure. The closeness of this figure to measure 1 suggests that raters relied very heavily on p values in making their judgements.
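
The following is a minimal sketch of how the first three of these measures might be computed for a set of original/replication pairs. The numbers are invented for illustration and are not data from the OSC (2015) project; the confidence interval is computed via the Fisher z transformation, one standard choice among several.

```python
# Invented numbers for five original/replication pairs -- NOT data from OSC (2015).
# Effect sizes are expressed as correlations r; all originals were "significant".
import numpy as np

n_orig = 40                                  # assumed original sample size
r_orig = np.array([0.45, 0.38, 0.52, 0.31, 0.41])
r_rep  = np.array([0.21, 0.05, 0.48, 0.12, 0.30])
p_rep  = np.array([0.060, 0.640, 0.001, 0.280, 0.007])

# Measure 1: proportion of replications that are themselves statistically significant.
match_significance = np.mean(p_rep < 0.05)

# Measure 2: proportion of replication effect sizes falling within the original
# study's 95% confidence interval (computed here via the Fisher z transformation).
z = np.arctanh(r_orig)
se = 1 / np.sqrt(n_orig - 3)
ci_lo, ci_hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
within_ci = np.mean((r_rep >= ci_lo) & (r_rep <= ci_hi))

# Measure 3: correlation between original and replication effect sizes, plus the
# ratio of their mean magnitudes (the OSC reported roughly a halving of effect sizes).
es_correlation = np.corrcoef(r_orig, r_rep)[0, 1]
es_ratio = r_rep.mean() / r_orig.mean()

print(f"significance match: {match_significance:.0%}")
print(f"replication ES within original 95% CI: {within_ci:.0%}")
print(f"correlation of effect sizes: {es_correlation:.2f}; replication/original ES ratio: {es_ratio:.2f}")
```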

There have been objections to the implementation and interpretation of this project, most notably by Gilbert et al. (2016), who took issue with the extent to which the replication studies were indeed direct replications. For example, Gilbert et al. highlighted six specific examples of “low fidelity protocols”, that is, cases where replication studies differed, in their view substantially, from the original (in one case, using a European rather than a US sample of participants). However, Anderson et al. (2016) explained in a reply that in half of those cases the authors of the original study had endorsed the replication as being direct or close to the original on relevant dimensions, and that, furthermore, independently rated similarity between original and replication studies failed to predict replication success. Others (e.g., Etz & Vandekerckhove 2016) have applied Bayesian reanalysis to the OSC’s (2015) data and conclude that up to 75% (as opposed to the OSC’s 36–47%) of replications could be considered successful, though they note that in many cases this is only with very weak evidence (i.e., Bayes factors of <10). They too conclude that the failure to reproduce many effects is explained by the overestimation of original effect sizes, itself a product of publication bias. A Reproducibility Project: Cancer Biology (also coordinated by the Center for Open Science) is currently underway (Errington et al. 2014), originally attempting to replicate 50 of the highest impact studies in cancer biology published between 2010 and 2012. This project has recently announced that it will complete with only 18 replication studies, as too few originals reported enough information to proceed with full replications (Kaiser 2018). Results of the first 10 studies are reportedly mixed, with only 5 being considered “mostly repeatable” (Kaiser 2018).

The Many Labs project (Klein et al. 2014) coordinated 36 independent replications of 13 classic psychology phenomena (from 12 studies; one study tested two effects), including anchoring, sunk cost bias and priming, amongst other well-known effects in psychology. In terms of matching statistical significance, the project demonstrated that 11 of the 13 effects could be successfully replicated. It also showed considerable variation in many of the effect sizes across the 36 replications.

In biomedical research, there have also been a number of large scale reproducibility projects. An early one by Begley and Ellis (2012, but discussed earlier in Ioannidis 2005) attempted to replicate 53 landmark pre-clinical studies and reported an alarming reproducibility rate of only 11%, that is, only 6 of the 53 results could be successfully reproduced. Subsequent attempts at large scale replications in this field have produced more optimistic estimates, but routinely failed to successfully reproduce more than half of the published results. Freedman et al. (2015) report five replication projects by independent groups of researchers which produced reproducibility estimates ranging from 22% to 49%. They estimate the cost of irreproducible research in US biomedical science alone to be in the order of USD$28 billion per year. A reproducibility project in Experimental Philosophy is an exception to the general trend, reporting reproducibility rates of 70% (Cova et al. forthcoming).

Finally, the Social Science Replication Project (SSRP) redid 21 experimental social science studies published in the journals Nature and Science between 2010 and 2015. Depending on the measure taken, the replication success rate was 57–67% (Camerer et al. 2018).

The causes of irreproducible results are largely the same across disciplines we have mentioned. This is not surprising given that they stem from problems with statistical methods, publishing practices and the incentive structures created in a “publish or perish” research culture, all of which are largely shared, at least in the life and behavioral sciences.

Whilst replication is often casually referred to as a cornerstone of the scientific method, direct replication studies (as they might be understood from Schmidt or Gómez, Juristo, and Vegas’s typologies above) are a rare event in the published literature of some scientific disciplines, most notably the life and social sciences. For example, such replication attempts constitute roughly 1% of the published psychology literature (Makel, Plucker, & Hegarty 2012). The proportion in published ecology and evolution literature is even smaller (Kelly 2017, Other Internet Resources).

This virtual absence of replication studies in the literature can be explained by the fact that many scientific journals have historically had explicit policies against publishing replication studies (Mahoney 1985)—thus giving rise to a “publication bias”. Over 70% of editors from 79 social science journals said they preferred new studies over replications, and over 90% said they did not encourage the submission of replication studies (Neuliep & Crandall 1990). In addition, many science funding bodies also fund only “novel”, “original” and/or “groundbreaking” research (Schmidt 2009).

A second type of publication bias has also played a substantial role in the reproducibility crisis, namely a bias towards “statistically significant” or “positive” results. Unlike the bias against replication studies, this is rarely an explicitly stated policy of a journal. Publication bias towards statistically significant findings has a long history, and was first documented in psychology by Sterling (1959). Developments in text mining techniques have led to more comprehensive estimates. For example, Fanelli’s work has demonstrated the extent of publication bias in various disciplines, and the proportions of statistically significant results given below are from his 2010a paper. He has also documented the increase of this bias over time (2012) and explored the causes of the bias, including the relationship between publication bias and a publish or perish research culture (2010b).

In many disciplines (e.g., psychology, psychiatry, materials science, pharmacology and toxicology, clinical medicine, biology and biochemistry, economics and business, microbiology and genetics) the proportion of statistically significant results is very high, close to or exceeding 90% (Fanelli 2010a). This is despite the fact that in many of these fields the average statistical power is low—that is, the average probability that a study will correctly reject a false null hypothesis is low. For example, in psychology the proportion of published results that are statistically significant is 92%, despite the fact that the average power of studies in this field to detect medium effect sizes (arguably typical of the discipline) is roughly 44% (Szucs & Ioannidis 2017). If there were no bias towards publishing statistically significant results, the proportion of significant results should roughly match the average statistical power of the discipline. The excess in statistical significance (in this case, the difference between 92% and 44%) is therefore an indicator of the strength of the bias. For a second example, in environment/ecology and in plant and animal sciences the proportions of statistically significant results are 74% and 78% respectively, admittedly lower than in psychology. However, the most recent estimate of statistical power, again for medium effect sizes, in ecology and animal behaviour is 23–26% (Smith, Hardy, & Gammell 2011); an earlier, more optimistic assessment was 40–47% (Jennions & Møller 2003). For a third example, the proportion of statistically significant results in neuroscience and behaviour is 85%, while our best estimate of statistical power in neuroscience is at best 31%, with a lower bound estimate of 8% (Button et al. 2013). The associated file-drawer problem (Rosenthal 1979)—where researchers relegate “failed”, statistically non-significant studies to their file drawers, hidden from public view—has long been established in psychology and other disciplines, and is known to lead to distortions in meta-analysis (where a “meta-analysis” is a study which analyses results across multiple other studies).
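
The arithmetic behind this “excess significance” indicator is simple enough to set out explicitly. The sketch below is our own illustration, using the figures quoted above (midpoints taken where a range is given) and adopting the simplifying assumption, noted above, that without publication bias the share of significant results would roughly equal average power.

```python
# Back-of-the-envelope "excess significance" indicator, using the figures quoted in
# the text (midpoints taken where a range is given). Assumes, simplistically, that
# absent publication bias the share of significant results would match average power.
fields = {
    # field: (proportion of published results that are significant, estimated average power)
    "psychology":                 (0.92, 0.44),   # Szucs & Ioannidis 2017
    "ecology/animal behaviour":   (0.76, 0.245),  # 74-78% significant; power 23-26%
    "neuroscience and behaviour": (0.85, 0.31),   # power per Button et al. 2013 (upper estimate)
}

for field, (p_significant, power) in fields.items():
    excess = p_significant - power
    print(f"{field:28s} significant: {p_significant:.0%}   power: {power:.0%}   excess: {excess:.0%}")
```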

In addition to creating the file-drawer problem described above, publication bias has been held at least partially responsible for the high prevalence of Questionable Research Practices (QRPs) uncovered both in self-report survey research (John, Loewenstein, & Prelec 2012; Agnoli et al. 2017; Fraser et al. 2018) and in journal studies that have detected, for example, unusual distributions of p values (Masicampo & Lalande 2012; Hartgerink et al. 2016). Pressure to publish, now ubiquitous across academic institutions, means that researchers often cannot afford to simply assign “failed” or statistically non-significant studies to the file drawer, so instead they p hack and cherry-pick results (as discussed below) back to significance, and back into the published literature. Simmons, Nelson, and Simonsohn (2011) explained and demonstrated with simulated results how engaging in such practices inflates the false positive error rate of the published literature, leading to a lower rate of reproducible results.
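
A minimal simulation can make the mechanism concrete. The sketch below is written in the spirit of Simmons, Nelson, and Simonsohn’s (2011) demonstrations rather than being a reproduction of their code: it simulates one common QRP (collecting more data after checking for significance) under a true null hypothesis, so every “significant” result it finds is a false positive.

```python
# Minimal simulation, in the spirit of Simmons et al. (2011) but not their code, of
# one QRP: run a two-group experiment with n = 20 per group, test for significance,
# and if the result is not significant add 20 more observations per group and test
# again. Both groups are drawn from the same distribution, so the null hypothesis is
# true and every "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 20_000, 0.05
false_positives = 0

for _ in range(n_sims):
    a, b = rng.normal(size=20), rng.normal(size=20)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
        continue
    # The QRP: collect more data after inspecting whether the result is significant.
    a = np.concatenate([a, rng.normal(size=20)])
    b = np.concatenate([b, rng.normal(size=20)])
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"false positive rate with optional stopping: {false_positives / n_sims:.3f}")
# Prints roughly 0.08 rather than the nominal 0.05 -- and this is just one QRP.
```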

“P hacking” refers to a set of practices which include: checking the statistical significance of results before deciding whether to collect more data; stopping data collection early because results have reached statistical significance; deciding whether to exclude data points (e.g., outliers) only after checking the impact on statistical significance and not reporting the impact of the data exclusion; adjusting statistical models, for instance by including or excluding covariates based on the resulting strength of the main effect of interest; and rounding off a p value to meet a statistical significance threshold (e.g., presenting 0.053 as P < .05). “Cherry picking” includes failing to report dependent or response variables, relationships, conditions or treatments that did not reach statistical significance or some other desired threshold. “HARKing” (Hypothesising After Results are Known) includes presenting ad hoc and/or unexpected findings as though they had been predicted all along (Kerr 1998), and presenting exploratory work as though it was confirmatory hypothesis testing (Wagenmakers et al. 2012). Five of the most widespread QRPs are listed below in Table 1 (from Fraser et al. 2018), with associated survey measures of prevalence.

Table 1: The prevalence of some common Questionable Research Practices: percentage of researchers who reported having used the QRP at least once (adapted from Fraser et al. 2018; the two right-hand columns report the two samples surveyed by Fraser et al.).

Questionable Research Practice | Agnoli et al. 2017 | John, Loewenstein, & Prelec 2012 | Fraser et al. 2018 | Fraser et al. 2018
Not reporting response (outcome) variables that failed to reach statistical significance # | 47.9 | 63.4 | 64.1 | 63.7
Collecting more data after inspecting whether the results are statistically significant * | 53.2 | 55.9 | 36.9 | 50.7
Rounding-off a p value or other quantity to meet a pre-specified threshold * | 22.2 | 22.0 | 27.3 | 17.5
Deciding to exclude data points after first checking the impact on statistical significance * | 39.7 | 38.2 | 24.0 | 23.9
Reporting an unexpected finding as having been predicted from the start ^ | 37.4 | 27.0 | 48.5 | 54.2

# cherry picking, * p hacking, ^ HARKing

Null Hypothesis Significance Testing (NHST)—discussed above—is a commonly diagnosed cause of the current replication crisis (see Szucs & Ioannidis 2017). The ubiquitous nature of NHST in the life and behavioural sciences is well documented, most recently by Cristea and Ioannidis (2018). This is an important precondition for establishing its role as a cause, since it could not be a cause if its actual use were rare. The dichotomous nature of NHST facilitates publication bias (Meehl 1967, 1978). For example, the language of accept and reject in hypothesis testing maps conveniently onto the acceptance and rejection of manuscripts, a fact that led Rosnow and Rosenthal (1989) to decry that “surely God loves the .06 nearly as much as the .05” (1989: 1277). Techniques that do not enshrine a dichotomous threshold would be harder to employ in the service of publication bias. For example, a case has been made that estimation using effect sizes and confidence intervals (introduced above) would be less prone to being used in the service of publication bias (Cumming 2012; Cumming & Calin-Jageman 2017).

As already mentioned, the average statistical power in various disciplines is low. Not only is power often low, but it is virtually never reported; less than 10% of published studies in psychology report statistical power and even fewer in ecology do (Fidler et al. 2006). Explanations for the widespread neglect of statistical power often highlight the many common misconceptions and fallacies associated with p values (e.g., Haller & Krauss 2002; Gigerenzer 2018). For example, the inverse probability fallacy [ 1 ] has been used to explain why so many researchers fail to calculate and report statistical power (Oakes 1986).
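
To make the notion of statistical power concrete, the following sketch (our own, using the standard noncentral-t calculation rather than any particular package’s power routine) computes the power of a two-sided, two-sample t-test to detect a “medium” standardized effect (Cohen’s d = 0.5) at several per-group sample sizes.

```python
# Illustrative power calculation using the standard noncentral-t approach (our own
# sketch): power of a two-sided, two-sample t-test to detect a medium standardized
# effect (Cohen's d = 0.5) at alpha = .05, for several per-group sample sizes.
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5              # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability of exceeding the critical value in either tail under the alternative.
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

for n in (20, 30, 64, 128):
    print(f"n = {n:3d} per group -> power = {power_two_sample_t(0.5, n):.2f}")

# Roughly: n = 30 per group gives power near 0.47 (in the vicinity of the ~44% average
# reported for psychology above), while about 64 per group is needed to reach ~0.80.
```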

In 2017, a group of 72 authors proposed in a Nature Human Behaviour paper that the alpha level in statistical significance testing be lowered to 0.005 (as opposed to the current standard of 0.05) to improve the reproducibility rate of published research (Benjamin et al. 2018). A reply from a different set of 88 authors was published in the same journal, arguing against this proposal and stating instead that researchers should justify their alpha level based on context (Lakens et al. 2018). Several other replies have followed, including a call from Andrew Gelman and colleagues to abandon statistical significance altogether (McShane et al. 2018, Other Internet Resources). The exchange has become known on social media as the “Alpha Wars” (e.g., in the Barely Significant blog, Other Internet Resources). Independently, the American Statistical Association released a statement on the use of p values for the first time in its history, cautioning against their overinterpretation and pointing out the limits of the information they offer about replication (Wasserstein & Lazar 2016), and devoted its 2017 annual convention to the theme “Scientific Method for the 21st Century: A World Beyond \(p < 0.05\)” (see Other Internet Resources).

A number of recent high-profile cases of scientific fraud have contributed considerably to the amount of press around the reproducibility crisis in science. Often these cases (e.g., Diederik Stapel in psychology) are used as a hook for media coverage, even though the crisis itself has very little to do with scientific fraud. (Note also that the Questionable Research Practices above are not typically counted as “fraud” or even “scientific misconduct” despite their ethically dubious status.) For example, Fang, Grant Steen, and Casadevall (2012) estimated that 43% of retracted articles in biomedical research are withdrawn because of fraud. However, roughly half a million biomedical articles are published annually and only 400 of those are retracted (Oransky 2016, founder of the website RetractionWatch), so retractions of any kind amount to a very small proportion of the literature (approximately 0.1%). There are, of course, many cases of pharmaceutical companies exercising financial pressure on scientists and the publishing industry, which raise speculation about how many undetected (or unretracted) cases there may still be in the literature. Having said that, there is widespread consensus amongst scientists in the field that the main cause of the current reproducibility crisis is the current incentive structure in science (publication bias, publish or perish, non-transparent statistical reporting, lack of rewards for data sharing). Whilst this incentive structure can push some researchers to outright fraud, such cases appear to make up only a very small proportion of the literature.

3. Epistemological Issues Related to Replication

Many scientists believe that replication is epistemically valuable in some way, that is to say, that replication serves a useful function in enhancing our knowledge, understanding or beliefs about reality. This section first discusses a problem about the epistemic value of replication studies—called the “experimenters’ regress”—and it then considers the claim that replication plays an epistemically valuable role in distinguishing scientific inquiry. It lastly examines a recent attempt to formalise the logic of replication in a Bayesian framework.

Collins (1985) articulated a widely discussed problem that is now known as the experimenters’ regress . He initially lays out the problem in the context of measurement (Collins 1985: 84). Suppose a scientist is trying to determine the accuracy of a measurement device and also the accuracy of a measurement result. Perhaps, for example, a scientist is using a thermometer to measure the temperature of a liquid, and it delivers a particular measurement result, say, 12 degrees Celsius.

The problem arises because of the interdependence of the accuracy of the measurement result and the accuracy of the measurement device: to know whether a particular measurement result is accurate, we need to test it against a measurement result that is previously known to be accurate, but to know that the result is accurate, we need to know that it has been obtained via an accurate measuring device, and so on. This, according to Collins, creates a “circle” which he refers to as the “experimenters’ regress”.

Collins extends the problem to scientific replication more generally. Suppose that an experiment B is a replication study of an initial experiment A , and that B ’s result apparently conflicts with A ’s result. This seeming conflict may have one of two interpretations:

  • The results of A and B deliver genuinely conflicting verdicts over the truth of the hypothesis under investigation
  • Experiment B was not in fact a proper replication of experiment A .

The regress poses a problem about how to choose between these interpretations, a problem which threatens the epistemic value of replication studies if there are no rational grounds for choosing in a particular way. Determining whether one experiment is a proper replication of another is complicated by the facts that scientific writing conventions often omit precise details of experimental methodology (Collins 2016), and, furthermore, much of the knowledge that scientists require to execute experiments is tacit and “cannot be fully explicated or absolutely established” (Collins 1985: 73).

In the context of experimental methodology, Collins wrote:

To know an experiment has been well conducted, one needs to know whether it gives rise to the correct outcome. But to know what the correct outcome is, one needs to do a well-conducted experiment. But to know whether the experiment has been well conducted…! (2016: 66; ellipses original)

Collins holds that in such cases where a conflict of results arises, scientists tend to split into two groups, each holding opposing interpretations of the results. According to Collins, where such groups are “determined” and the “controversy runs deep” (Collins 2016: 67), the dispute between the groups cannot be resolved via further experimentation, for each additional result is subject to the problem posed by the experimenters’ regress.[2] In such cases, Collins claims that particular non-epistemic factors will partly determine which interpretation becomes the lasting view:

the career, social, and cognitive interests of the scientists, their reputations and that of their institutions, and the perceived utility for future work. (Franklin & Collins 2016: 99)

Franklin was the most vociferous opponent of Collins, although recent collaboration between the two has fostered some agreement (Collins 2016). Franklin presented a set of strategies for validating experimental results, all of which relate to “rational argument” on epistemic grounds (Franklin 1989: 459; 1994). Examples include, for instance, appealing to experimental checks on measurement devices or eliminating potential sources of error in the experiment (Franklin & Collins 2016). He claimed that the fact that such strategies were evidenced in scientific practice “argues against those who believe that rational arguments plays little, if any, role” in such validation (Franklin 1989: 459), with Collins being an example. He interprets Collins as suggesting that the strategies for resolving debates over the validation of results are social factors or “culturally accepted practices” (Franklin 1989: 459) which do not provide reasons to underpin rational belief about results. Franklin (1994) further claims that Collins conflates the difficulty in successfully executing experiments with the difficulty of demonstrating that experiments have been executed, with Feest (2016) interpreting him to say that although such execution requires tacit knowledge, one can nevertheless appeal to strategies to demonstrate the validity of experimental findings.

Feest (2016) examines a case study involving debates about the Mozart effect in psychology (which, roughly speaking, is the effect whereby listening to Mozart beneficially affects some aspect of intelligence or brain structure). Like Collins, she agrees that there is a problem in determining whether conflicting results suggest a putative replication experiment is not a proper replication attempt, in part because there is uncertainty about whether scientific concepts such as the Mozart effect have been appropriately operationalised in earlier or later experimental contexts. Unlike Collins (on her interpretation), however, she does not think that this uncertainty arises because scientists have inescapably tacit knowledge of the linguistic rules about the meaning and application of concepts like the Mozart effect. Rather the uncertainty arises because such concepts are still themselves developing and because of assumptions about the world that are required to successfully draw inferences from it. Experimental methodology then serves to reveal the previously tacit assumptions about the application of concepts and the legitimacy of inferences, assumptions which are then susceptible to scrutiny.

For example, in her study of the Mozart effect, she notes that replication studies of the Mozart effect failed to find that Mozart music had a beneficial influence on spatial abilities. Rauscher, who was the first to report results supporting the Mozart effect, suggested that the later studies were not proper replications of her study (Rauscher, Shaw, and Ky 1993, 1995). She clarified that the Mozart effect applied only to a particular category of spatial abilities (spatio-temporal processes) and that the later studies operationalised the Mozart effect in terms of different spatial abilities (spatial recognition). Here, then, there was a difficulty in determining whether to interpret failed replication results as evidence against the initial results or rather as an indication that the replication studies were not proper replications. Feest claims this difficulty arose because of tacit knowledge or assumptions: assumptions about the application of the Mozart effect concept to different kinds of spatial abilities, about whether the world is such that Mozart music has an effect on such abilities and about whether the failure of Mozart to impact other kinds of spatial abilities warrants the inference that the Mozart effect does not exist. Contra Collins, however, experimental methodology enabled the explication and testing of these assumptions, thus allowing scientists to overcome the interpretive impasse.

Against this background, her overall argument is that scientists often are and should be sceptical towards each other’s results. However, this is not because of inescapably tacit knowledge and the inevitable failure of epistemic strategies for validating results. Rather, it is at least in part because of varying tacit assumptions that researchers hold about the meaning of concepts, about the world, and about what inferences can be drawn from it. Progressive experimentation serves to reveal these tacit assumptions, which can then be scrutinised, leading to the accumulation of knowledge.

There is also other philosophical literature on the experimenters’ regress, including Teira’s (2013) paper arguing that particular experimental debiasing procedures are defensible against the regress from a contractualist perspective, according to which self-interested scientists have reason to adopt good methodological standards.

There is a widespread belief that science is distinct from other knowledge accumulation endeavours, and some have suggested that replication distinguishes (or is at least essential to) science in this respect. (See also the entry on science and pseudo-science.) According to the Open Science Collaboration, “Reproducible research practices are at the heart of sound research and integral to the scientific method” (OSC 2015: 7). Schmidt echoes this theme: “To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception” (2009: 90). Braude (1979) goes so far as to say that reproducibility is a “demarcation criterion between science and nonscience” (1979: 2). Similarly, Nosek, Spies, and Motyl state that:

[T]he scientific method differentiates itself from other approaches by publicly disclosing the basis of evidence for a claim…. In principle, open sharing of methodology means that the entire body of scientific knowledge can be reproduced by anyone. (2012: 618)

If replication played such an essential or distinguishing role in science, we might expect it to be a prominent theme in the history of science. Steinle (2016) considers the extent to which it is such a theme. He presents a variety of cases from the history of science where replication played very different roles, although he understands “replication” narrowly, to refer to cases where an experiment is re-run by different researchers. He claims that the role and value of replication in experimental research is “much more complex than easy textbook accounts make us believe” (2016: 60), particularly since each scientific inquiry is always tied to a variety of contextual considerations that can affect the importance of replication. Such considerations include the relationship between experimental results and the background of accepted theory at the time, the practical and resource constraints on pursuing replication, and the perceived credibility of the researchers. These contextual factors, he claims, mean that replication was a key or even overriding determinant of the acceptance of research claims in some cases, but not in others.

For example, sometimes replication was sufficient to secure acceptance of a research claim, even if it conflicted with the background of accepted theory and left theoretical questions unresolved. A case of this is high-temperature superconductivity, the effect whereby an electric current can pass with zero resistance through a conductor at relatively high temperatures. In 1986, physicists Georg Bednorz and Alex Müller reported finding a material which acted as a superconductor at 35 kelvin (−238 degrees Celsius). Scientists around the world successfully replicated the effect, and Bednorz and Müller were awarded the Nobel Prize in Physics a year after their announcement. This case is remarkable since not only did their effect contradict the accepted physical theory at the time, but there is still no extant theory that adequately explains the effects they reported (Di Bucchianico 2014).

As a contrasting example, however, sometimes claims were accepted without any replication. In the 1650s, the German scientist Otto von Guericke designed and operated the world’s first vacuum pump that would visibly suck air out of a larger space. He performed experiments with his device before various audiences. Yet the replication of his experiments by others would have been very difficult, if not impossible: not only was Guericke’s pump both expensive and complicated to build, but it was also unlikely that his descriptions of it sufficed to enable anyone else to build the pump and to consequently replicate his findings. Despite this, Steinle claims that “no doubts were raised about his results”, probably as a result of his “public performances that could be witnessed by a large number of participants” (2016: 55).

Steinle takes such historical cases to provide normative guidance for understanding the epistemic value of replication as context-sensitive: whether replication is necessary or sufficient for establishing a research claim will depend on a variety of considerations, such as those mentioned earlier. He consequently eschews wide-reaching claims, such as the claims that “it’s all about replicability” or that “replicability does not decide anything” (2016: 60).

Earp and Trafimow (2015) attempt to formalise the way in which replication is epistemically valuable, and they do this using a Bayesian framework to explicate the inferences drawn from replication studies. They present the framework in a context similar to that of Collins (1985), noting that “it is well-nigh impossible to say conclusively what [replication results] mean” (Earp & Trafimow, 2015: 3). But while replication studies are often not conclusive, they do believe that such studies can be informative , and their Bayesian framework depicts how this is so.

The framework is set out with an example. Suppose an aficionado of Researcher A is highly confident that anything said by Researcher A is true. Some other researcher, Researcher B, then attempts to replicate an experiment by Researcher A, and Researcher B finds results that conflict with those of Researcher A. Earp and Trafimow claim that the aficionado might continue to be confident in Researcher A’s findings, but the aficionado’s confidence is likely to decrease slightly. As the number of failed replication attempts increases, the aficionado’s confidence accordingly decreases, eventually falling below 50%, so that the aficionado places more confidence in the replication failures than in the findings initially reported by Researcher A.

Here, then, suppose we are interested in the probability that the original result reported by Researcher A is true given Researcher B’s first replication failure. Earp and Trafimow represent this probability with the notation \(p(T\mid F)\), where p is a probability function, T represents the proposition that the original result is true and F represents Researcher B’s replication failure. According to Bayes’s theorem below, this probability is calculable from the aficionado’s degree of confidence that the original result is true prior to learning of the replication failure \(p(T)\), their degree of expectation of the replication failure on the condition that the original result is true \(p(F\mid T)\), and the degree to which they would unconditionally expect a replication failure prior to learning of the replication failure \(p(F)\):

\[ p(T\mid F) = \frac{p(F\mid T) \times p(T)}{p(F)} \tag{1}\]

Relatedly, we could instead be interested in the confidence ratio that the original result is true or false given the failure to replicate. This ratio is representable as \(\frac{p(T\mid F)}{p(\neg T\mid F)}\), where \(\neg T\) represents the proposition that the original result is false. According to the standard Bayesian probability calculus, this ratio is in turn equal to the product of two ratios concerning

  • the confidence that the original result is true \(\frac{p(T)}{p(\neg T)}\), and
  • the expectation of a replication failure on the condition that the result is true or false \(\frac{p(F\mid T)}{p(F\mid \neg T)}\).

This relation is expressed in the equation:

\[ \frac{p(T\mid F)}{p(\neg T\mid F)} = \frac{p(T)}{p(\neg T)} \times \frac{p(F\mid T)}{p(F\mid \neg T)} \tag{2}\]

Now Earp and Trafimow assign some values to the terms on the right-hand side of equation (2). Supposing that the aficionado is confident in the original results, they set the ratio \(\frac{p(T)}{p(\neg T)}\) to 50, meaning that the aficionado is initially fifty times more confident that the results are true than that the results are false.

They also set the ratio \(\frac{p(F\mid T)}{p(F\mid \neg T)}\) concerning the conditional expectation of a replication failure to 0.5, meaning that the aficionado is considerably less confident that there will be a replication failure if the original result is true than if it is false. They point out that the extent to which the aficionado is less confident depends on the quality of so-called auxiliary assumptions about the replication experiment. Here, auxiliary assumptions are assumptions which enable one to infer that particular things should be observable if the theory under test is true. The intuitive idea is that the higher the quality of the assumptions about a replication study, the more one would expect to observe a successful replication if the original result were true. While they do not specify precisely what makes such auxiliary assumptions have high “quality” in this context, presumably this quality concerns the extent to which the assumptions are likely to be true and the extent to which the replication experiment is an appropriate test of the veracity of the original results if the assumptions are true.

Once the ratios on the right-hand side of equation (2) are set in this way, one can see that a replication failure would reduce one’s confidence in the original results:

\[ \frac{p(T\mid F)}{p(\neg T\mid F)} = 50 \times 0.5 = 25 \]

Here, then, a replication failure would reduce the aficionado’s confidence that the original result was true, so that the aficionado would be only 25 times more confident that the result is true given a failure (as per \(\frac{p(T\mid F)}{p(\neg T\mid F)}\)) rather than 50 times more confident that it is true (as per \(\frac{p(T)}{p(\neg T)}\)).

Nevertheless, the aficionado may still be confident that the original result is true, but we can see how such confidence would decrease with successive replication failures. More formally, let \(F_N\) be the last replication failure in a sequence of \(N\) replication failures \(\langle F_1,F_2,\ldots,F_N\rangle\). Then, assuming the replication failures are treated as independent pieces of evidence, the aficionado’s confidence in the original result after the \(N\)th replication failure is expressible in the equation: [3]

\[ \frac{p(T\mid F_1, F_2, \ldots, F_N)}{p(\neg T\mid F_1, F_2, \ldots, F_N)} = \frac{p(T)}{p(\neg T)} \times \prod_{i=1}^{N} \frac{p(F_i\mid T)}{p(F_i\mid \neg T)} \]

For example, suppose there are 10 replication failures, and so \(N=10\). Suppose further that the confidence ratio for each replication failure is set, as before, to \(\frac{p(F_i\mid T)}{p(F_i\mid \neg T)} = 0.5\). The aficionado’s confidence ratio then becomes:

\[ \frac{p(T\mid F_1, \ldots, F_{10})}{p(\neg T\mid F_1, \ldots, F_{10})} = 50 \times 0.5^{10} \approx 0.05 \]

Here, then, the aficionado’s confidence in the original result decreases so that they are more confident that it was false than that it was true. Hence, on Earp and Trafimow’s Bayesian account, successive replication failures can progressively erode one’s confidence that an original result is true, even if one was initially highly confident in the original result and even if no single replication failure by itself was conclusive. [4]

Some putative merits of Earp and Trafimow’s account, then, are that it provides a formalisation on which replication attempts are informative even if they are not conclusive, and that this formalisation assigns a role both to the number of replication attempts and to the quality of the auxiliary assumptions about the replications.
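To make the arithmetic of this account concrete, the following short Python sketch computes the aficionado’s confidence ratio after a series of replication failures. It is our illustration rather than code from Earp and Trafimow (2015): it simply multiplies the prior ratio by one likelihood ratio per failure, treating the failures as independent pieces of evidence, and it reuses the example values above (a prior ratio of 50 and a likelihood ratio of 0.5 per failure).

```python
# A minimal sketch of the Bayesian odds-updating described above; an
# illustration, not code from Earp and Trafimow (2015). It assumes each
# replication failure contributes an independent likelihood ratio.

def posterior_odds(prior_odds, likelihood_ratios):
    """Return the ratio p(T | failures) / p(not-T | failures).

    prior_odds        -- the prior ratio p(T) / p(not-T)
    likelihood_ratios -- one ratio p(F_i | T) / p(F_i | not-T) per failure
    """
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds

# The entry's example: prior odds of 50, likelihood ratio of 0.5 per failure.
print(posterior_odds(50, [0.5]))       # one failure:  25.0
print(posterior_odds(50, [0.5] * 10))  # ten failures: ~0.049, i.e., below 1,
                                       # so the original result is now judged
                                       # more likely false than true
```

On this picture, no single failure is decisive, but the cumulative product of modest likelihood ratios eventually drives the confidence ratio below 1.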

4. Open Science Reform

The aforementioned meta-science has unearthed a range of problems which give rise to the reproducibility crisis, and the open science movement has proposed or promoted various solutions—or reforms—for these problems. These reforms can be grouped into four categories: (a) methods and training, (b) reporting and dissemination, (c) peer review processes, and (d) evaluating new incentive structures (loosely following the categories used by Munafò et al. 2017 and Ioannidis et al. 2015). In subsections 4.1–4.4 below, we present a non-exhaustive list of initiatives in each of the above categories. These initiatives are reflections of various values and norms that are at the heart of the open science movement, and we discuss these values and norms in 4.5.

  • Combating bias. The development of methods for combating bias, for example, masked or blind analysis techniques to combat confirmation bias (e.g., MacCoun & Perlmutter 2017).
  • Support. Providing methodological support for researchers, including published guidelines and statistical consultancy (for example, as offered by the Center for Open Science) and large online courses such as that developed by Daniel Lakens (see Other Internet Resources).
  • Collaboration. Promoting collaboration and team/crowd-sourced science to combat low power and other methodological limitations of single studies. The Reproducibility Projects themselves are an example of this, but there are other initiatives too, such as StudySwap in psychology and the Collaborative Replications and Education Project (CREP), which aims to increase the prevalence of replications through undergraduate education (see Other Internet Resources for both of these, and Munafò et al. 2017 for a more detailed description).
  • The TOP Guidelines. The Transparency and Openness Promotion (TOP) guidelines (Nosek et al. 2015) have, as of the end of May 2018, almost 5,000 journals and organizations as signatories. Developed within psychology, the TOP guidelines have formed the basis of other discipline-specific guidelines, such as the Tools for Transparency in Ecology and Evolution (TTEE). As the name suggests, these guidelines promote more complete and transparent reporting of methodological and statistical practices. This in turn enables authors, reviewers and editors to consider detailed aspects of sample size planning and design decisions, and to clearly distinguish between confirmatory (planned) analysis and exploratory (post hoc) analysis.
  • Pre-registration. In its simplest form, pre-registration involves making a public, date-stamped statement of predictions and/or hypotheses before data are collected, viewed or analysed. The purpose is to distinguish prediction from postdiction (Nosek et al. 2018), or what is elsewhere referred to as distinguishing confirmatory from exploratory research (Wagenmakers et al. 2012), a distinction perhaps more commonly known as hypothesis-testing versus hypothesis-generating research. Pre-registration of predictive research helps control for HARKing (Kerr 1998) and hindsight bias and, within frequentist Null Hypothesis Significance Testing, helps contain the false positive error rate to the set alpha level. There are several platforms that host pre-registrations, such as the Open Science Framework (osf.io) and AsPredicted (aspredicted.org). The Open Science Framework also hosts a “pre-registration challenge” offering monetary rewards for publishing pre-registered work.
  • Specific Journal Initiatives. Some high impact journals, having been singled out in the science media as having particularly problematic publishing practices (e.g., Schekman 2013), have taken exceptional steps to improve the completeness, transparency and reproducibility of the research they publish. For example, since 2013, Nature and the Nature research journals have engaged in a range of editorial activities aimed at improving the reproducibility of research published in their journals (see the editorial announcement, Nature 496: 398, 25 April 2013, doi:10.1038/496398a). In 2017, they introduced checklists and reporting summaries (published alongside articles) in an effort to improve transparency and reproducibility. In 2018, they produced discipline-specific versions for Nature Human Behaviour and Nature Ecology & Evolution. Within psychology, the journal Psychological Science (the flagship journal of the Association for Psychological Science) was the first to adopt open science practices, such as the COS Open Science badges described below. Following a meeting of ecology and evolution journal editors in 2015, a number of journals in these fields have run editorials on this topic, often committing to the TTEE guidelines (discussed above). Conservation Biology has in addition adopted a checklist for associate editors (Parker et al. 2016).
  • Registered reports. Registered reports shift the point at which peer review occurs in the research process, in an effort to combat publication bias against null (negative) results. Manuscripts are submitted and reviewed, and a publication decision is made, on the basis of the introduction, methods and planned analysis alone. If accepted, authors then have a defined period of time to carry out the planned research and submit the results. Provided the authors followed their original plans (or adequately justified deviations from them), the journal will honour its decision to publish, regardless of the outcome of the results. In psychology, the Registered Report format has been championed by Chris Chambers, with the journal Cortex being the first to adopt the format under Chambers’ editorship (Chambers 2013, 2017; Nosek & Lakens 2014). Currently (as of the end of May 2018), 108 journals in a range of biomedical, psychology and neuroscience fields offer the format (see Registered Reports in Other Internet Resources).
  • Pre-prints. Well-established in some sciences like physics, the use of pre-print servers is relatively new in biological and social sciences.
  • Open Science badges. A recent review of initiatives for improving data sharing identified the awarding of open data and open materials badges as the most effective scheme (Rowhani-Farid, Allen, & Barnett 2017). One such badge scheme is coordinated by the Center for Open Science, which currently awards three badges: Open Data, Open Materials and Pre-Registration. Badges are attached to articles whose authors meet a specific set of criteria for engaging in these activities. Kidwell et al. (2016) evaluated the effectiveness of badges in the journal Psychological Science and found substantial increases (from 3% to 39%) in data sharing over a period of less than two years. Such increases were not found in similar journals without badge schemes over the same period.

4.5 Values, Tone, and Scientific Norms in Open Science Reform

There has long been philosophical debate about what role values do and should play in science (Churchman 1948; Rudner 1953; Douglas 2016), and the reproducibility crisis is intimately connected to questions about the operations of, and interconnections between, such values. In particular, Nosek, Spies, and Motyl (2012) argue that there is a tension between truth and publishability. More specifically, for reasons discussed in section 2 above, the accuracy of scientific results is compromised by the value which journals place on novel and positive results and, consequently, by scientists who, valuing career success, seek to publish only such results in these journals. Many others in addition to Nosek et al. (Hackett 2005; Martin 1992; Sovacool 2008) have also taken issue with the value which journals and funding bodies place on novelty.

Some might interpret this tension as a manifestation of how epistemic values (such as truth and replicability) can be compromised by (arguably) non-epistemic values, such as the value of novel, interesting or surprising results. Epistemic values are typically taken to be values that, in the words of Steel, “promote the acquisition of true beliefs” (2010: 18; see also Goldman 1999). Canonical examples of epistemic values include the predictive accuracy and internal consistency of a theory. Epistemic values are often contrasted with putative non-epistemic or non-cognitive values, which include ethical or social values such as, for example, the novelty of a theory or its ability to improve well-being by lessening power inequalities (Longino 1996). Of course, there is no complete consensus as to precisely what counts as an epistemic or non-epistemic value (Rooney 1992; Longino 1996). Longino, for example, claims that, other things being equal, novelty counts in favour of accepting a theory, and convincingly argues that, in some contexts, it can serve as a “protection against unconscious perpetuation of the sexism and androcentrism” in traditional science (1997: 22). However, she does not discuss novelty specifically in the context of the reproducibility crisis.

Giner-Sorolla (2012), however, does discuss novelty in the context of the crisis, and he offers another perspective on its value. He claims that one reason novelty has been used to define what is publishable or fundable is that it is relatively easy for researchers to establish and for reviewers and editors to detect. Yet, Giner-Sorolla argues, novelty for its own sake perhaps should not be valued, and should in fact be recognized as merely an operationalisation of a deeper concept, such as “ability to advance the field” (567). Giner-Sorolla goes on to point out how such shallow operationalisations of important concepts often lead to problems, for example, using statistical significance to measure the importance of results, or measuring the quality of research by how well outcomes fit with experimenters’ prior expectations.

Values are closely connected to discussions about norms in the open science movement. Vazire (2018) and others invoke norms of science—communality, universalism, disinterestedness and organised skepticism—in setting the goals for open science, norms originally articulated by Robert Merton (1942). Each such norm arguably reflects a value which Merton advocated, and each norm may be opposed by a counternorm which denotes behaviour that is in conflict with the norm. For example, the norm of communality (which Merton called “communism”) reflects the value of collaboration and the common ownership of scientific goods, since the norm recommends such collaboration and common ownership. Advocates of open science see such norms, and the values which they reflect, as an aim for open science. For example, the norm of communality is reflected in sharing and making data open, and in open access publishing. In contrast, the counternorm of secrecy is associated with a closed, for-profit publishing system (Anderson et al. 2010). Likewise, assessing scientific work on its merits upholds the norm of universalism—that the evaluation of research claims should not depend on the socio-demographic characteristics of the proponents of such claims. In contrast, assessing work by the age, the status, the institution, or the metrics of the journal it is published in reflects a counternorm of particularism.

Vazire (2018) and others have argued that, at the moment, scientific practice is dominated by counternorms and that a move towards Mertonian norms is a goal of the open science reform movement. In particular, self-interestedness, as opposed to the norm of disinterestedness, motivates p-hacking and other questionable research practices. Similarly, a desire to protect one’s professional reputation motivates resistance to having one’s work replicated by others (Vazire 2018). This in turn reinforces a counternorm of organized dogmatism rather than organized skepticism, which, according to Merton, involves the “temporary suspension of judgment and the detached scrutiny of beliefs” (Merton 1973).

Anderson et al.’s (2010) focus groups and surveys of scientists suggest that scientists do want to adhere to Merton’s norms but that the current incentive structure of science makes this difficult. Changing the structure of penalty and reward systems within science to promote communality, universalism, disinterestedness and organized skepticism instead of their counternorms is an ongoing challenge for the open science reform movement. As Pashler and Wagenmakers (2012) have said:

replicability problems will not be so easily overcome, as they reflect deep-seated human biases and well-entrenched incentives that shape the behavior of individuals and institutions. (2012: 529)

The effort to promote such values and norms has generated heated controversy. Some early responses to the Reproducibility Project: Psychology and the Many Labs projects were highly critical, not just of the substance of the work but also of its nature and process. Calls for openness were interpreted as reflecting mistrust, and attempts to replicate others’ work as personal attacks (e.g., Schnall 2014 in Other Internet Resources). Nosek, Spies, & Motyl (2012) argue that calls for openness should not be interpreted as mistrust:

Opening our research process will make us feel accountable to do our best to get it right; and, if we do not get it right, to increase the opportunities for others to detect the problems and correct them. Openness is not needed because we are untrustworthy; it is needed because we are human. (2012: 626)

Exchanges related to this have become known as the tone debate.

5. Conclusion

The subject of reproducibility is associated with a turbulent period in contemporary science, one that has prompted a re-evaluation of the values, incentives, practices and structures which underpin scientific inquiry. While the meta-science has painted a bleak picture of reproducibility in some fields, it has also inspired a parallel movement to strengthen the foundations of science. Nevertheless, there is more progress to be made, especially in understanding and evaluating the proposed solutions to the reproducibility crisis. In this regard, there are fruitful avenues for future research, including a deeper exploration of the role that epistemic and non-epistemic values can or should play in scientific inquiry.

Bibliography

  • Agnoli, Franca, Jelte M. Wicherts, Coosje L. S. Veldkamp, Paolo Albiero, and Roberto Cubelli, 2017, “Questionable Research Practices among Italian Research Psychologists”, Jakob Pietschnig (ed.), PLoS ONE , 12(3): e0172792. doi:10.1371/journal.pone.0172792
  • Allen, Peter J., Kate P. Dorozenko, and Lynne D. Roberts, 2016, “Difficult Decisions: A Qualitative Exploration of the Statistical Decision Making Process from the Perspectives of Psychology Students and Academics”, Frontiers in Psychology , 7(February): 188. doi:10.3389/fpsyg.2016.00188
  • Anderson, Christopher J., Štěpán Bahnik, Michael Barnett-Cowan, Frank A. Bosco, Jesse Chandler, C. R. Chartier, F. Cheung, et al., 2016, “Response to Comment on ‘Estimating the Reproducibility of Psychological Science’”, Science , 351(6277): 1037. doi:10.1126/science.aad9163
  • Anderson, Melissa S., Emily A. Ronning, Raymond De Vries, and Brian C. Martinson, 2010, “Extending the Mertonian Norms: Scientists’ Subscription to Norms of Research”, The Journal of Higher Education , 81(3): 366–393. doi:10.1353/jhe.0.0095
  • Atmanspacher, Harald and Sabine Maasen, 2016a, “Introduction”, in Atmanspacher and Maasen 2016b: 1–8. doi:10.1002/9781118865064.ch0
  • ––– (eds.), 2016b, Reproducibility: Principles, Problems, Practices, and Prospects , Hoboken, NJ: John Wiley & Sons. doi:10.1002/9781118865064
  • Baker, Monya, 2016, “1,500 Scientists Lift the Lid on Reproducibility”, Nature , 533(7604): 452–454. doi:10.1038/533452a
  • Bakker, Marjan, Chris H. J. Hartgerink, Jelte M. Wicherts, and Han L. J. van der Maas, 2016, “Researchers’ Intuitions About Power in Psychological Research”, Psychological Science , 27(8): 1069–1077. doi:10.1177/0956797616647519
  • Bakker, Marjan and Jelte M. Wicherts, 2011, “The (Mis)Reporting of Statistical Results in Psychology Journals”, Behavior Research Methods , 43(3): 666–678. doi:10.3758/s13428-011-0089-5
  • Begley, C. Glenn and Lee M. Ellis, 2012, “Raise Standards for Preclinical Cancer Research: Drug Development”, Nature , 483(7391): 531–533. doi:10.1038/483531a
  • Bem, Daryl J., 2011, “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”, Journal of Personality and Social Psychology , 100(3): 407–425.
  • Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, Eric-Jan Wagenmakers, Richard Berk, Kenneth A. Bollen, et al., 2018, “Redefine Statistical Significance”, Nature Human Behaviour , 2(1): 6–10. doi:10.1038/s41562-017-0189-z
  • Braude, Stephen E., 1979, ESP and Psychokinesis. A Philosophical Examination , Philadelphia: Temple University Press.
  • Button, Katherine S., John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò, 2013, “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience”, Nature Reviews Neuroscience , 14(5): 365–376. doi:10.1038/nrn3475
  • Camerer C.F., et al., 2018, “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015”, Nature Human Behaviour , 2: 637–644. doi: 10.1038/s41562-018-0399-z
  • Cartwright, Nancy, 1991, “Replicability, Reproducibility and Robustness: Comments on Harry Collins”, History of Political Economy , 23(1): 143–155.
  • Chambers, Christopher D., 2013, “Registered Reports: A New Publishing Initiative at Cortex”, Cortex , 49(3): 609–610. doi:10.1016/j.cortex.2012.12.016
  • –––, 2017, The Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice , Princeton: Princeton University Press.
  • Chang, Andrew C. and Phillip Li, 2015, “Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ‘Usually Not’”, Finance and Economics Discussion Series , 2015(83): 1–26. doi:10.17016/FEDS.2015.083
  • Churchman, C. West, 1948, “Statistics, Pragmatics, Induction”, Philosophy of Science , 15(3): 249–268. doi:10.1086/286991
  • Collins, Harry M., 1985, Changing Order: Replication and Induction in Scientific Practice , London; Beverly Hills: Sage Publications.
  • –––, 2016, “Reproducibility of experiments: experiments’ regress, statistical uncertainty principle, and the replication imperative” in Atmanspacher and Maasen 2016b: 65–82. doi:10.1002/9781118865064.ch4
  • Cohen, Jacob, 1962, “The Statistical Power of Abnormal-Social Psychological Research: A Review”, The Journal of Abnormal and Social Psychology , 65(3): 145–153. doi:10.1037/h0045186
  • –––, 1994, “The Earth Is Round (\(p < .05\))”, American Psychologist , 49(12): 997–1003, doi:10.1037/0003-066X.49.12.997
  • Cova, Florian, Brent Strickland, Angela Abatista, Aurélien Allard, James Andow, Mario Attie, James Beebe, et al., forthcoming, “Estimating the Reproducibility of Experimental Philosophy”, Review of Philosophy and Psychology , early online: 14 June 2018. doi:10.1007/s13164-018-0400-9
  • Cristea, Ioana Alina and John P. A. Ioannidis, 2018, “P Values in Display Items Are Ubiquitous and Almost Invariably Significant: A Survey of Top Science Journals”, Christos A. Ouzounis (ed.), PLoS ONE , 13(5): e0197440. doi:10.1371/journal.pone.0197440
  • Cumming, Geoff, 2012, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis . New York: Routledge.
  • Cumming, Geoff and Robert Calin-Jageman, 2017, Introduction to the New Statistics: Estimation, Open Science and Beyond , New York: Routledge.
  • Cumming, Geoff, Fiona Fidler, Martine Leonard, Pavel Kalinowski, Ashton Christiansen, Anita Kleinig, Jessica Lo, Natalie McMenamin, and Sarah Wilson, 2007, “Statistical Reform in Psychology: Is Anything Changing?”, Psychological Science , 18(3): 230–232. doi:10.1111/j.1467-9280.2007.01881.x
  • Di Bucchianico, Marilena, 2014, “A Matter of Phronesis: Experiment and Virtue in Physics, A Case Study”, in Virtue Epistemology Naturalized , Abrol Fairweather (ed.), Cham: Springer International Publishing, 291–312. doi:10.1007/978-3-319-04672-3_17
  • Dominus, Susan, 2017, “When the Revolution Came for Amy Cuddy”, The New York Times , October 21, Sunday Magazine, page 29.
  • Douglas, Heather, 2016, “Values in Science”, in Paul Humphreys, The Oxford Handbook of Philosophy of Science , New York: Oxford University Press, pp. 609–630.
  • Earp, Brian D. and David Trafimow, 2015, “Replication, Falsification, and the Crisis of Confidence in Social Psychology”, Frontiers in Psychology , 6(May): 621. doi:10.3389/fpsyg.2015.00621
  • Errington, Timothy M., Elizabeth Iorns, William Gunn, Fraser Elisabeth Tan, Joelle Lomax, and Brian A Nosek, 2014, “An Open Investigation of the Reproducibility of Cancer Biology Research”, ELife , 3(December): e04333. doi:10.7554/eLife.04333
  • Etz, Alexander and Joachim Vandekerckhove, 2016, “A Bayesian Perspective on the Reproducibility Project: Psychology”, Daniele Marinazzo (ed.), PLoS ONE , 11(2): e0149794. doi:10.1371/journal.pone.0149794
  • Fanelli, Daniele, 2010a, “Do Pressures to Publish Increase Scientists’ Bias? An Empirical Support from US States Data”, Enrico Scalas (ed.), PLoS ONE , 5(4): e10271. doi:10.1371/journal.pone.0010271
  • –––, 2010b, “‘Positive’ Results Increase Down the Hierarchy of the Sciences”, Enrico Scalas (ed.), PLoS ONE , 5(4): e10068. doi:10.1371/journal.pone.0010068
  • –––, 2012, “Negative Results Are Disappearing from Most Disciplines and Countries”, Scientometrics , 90(3): 891–904. doi:10.1007/s11192-011-0494-7
  • Fang, Ferric C., R. Grant Steen, and Arturo Casadevall, 2012, “Misconduct Accounts for the Majority of Retracted Scientific Publications”, Proceedings of the National Academy of Sciences , 109(42): 17028–17033. doi:10.1073/pnas.1212247109
  • Feest, Uljana, 2016, “The Experimenters’ Regress Reconsidered: Replication, Tacit Knowledge, and the Dynamics of Knowledge Generation”, Studies in History and Philosophy of Science Part A , 58(August): 34–45. doi:10.1016/j.shpsa.2016.04.003
  • Fidler, Fiona, Mark A. Burgman, Geoff Cumming, Robert Buttrose, and Neil Thomason, 2006, “Impact of Criticism of Null-Hypothesis Significance Testing on Statistical Reporting Practices in Conservation Biology”, Conservation Biology , 20(5): 1539–1544. doi:10.1111/j.1523-1739.2006.00525.x
  • Fidler, Fiona, Yung En Chee, Bonnie C. Wintle, Mark A. Burgman, Michael A. McCarthy, and Ascelin Gordon, 2017, “Metaresearch for Evaluating Reproducibility in Ecology and Evolution”, BioScience , 67(3): 282–289. doi:10.1093/biosci/biw159
  • Fiedler, Klaus and Norbert Schwarz, 2016, “Questionable Research Practices Revisited”, Social Psychological and Personality Science , 7(1): 45–52. doi:10.1177/1948550615612150
  • Fiske, Susan T., 2016, “A Call to Change Science’s Culture of Shaming”, Association for Psychological Science Observer , 29(9). [ Fiske 2016 available online ]
  • Franklin, Allan, 1989, “The Epistemology of Experiment”, in David Gooding, Trevor Pinch, and Simon Schaffer (eds.), The Uses of Experiment: Studies in the Natural Sciences , Cambridge: Cambridge University Press, pp. 437–460.
  • –––, 1994, “How to Avoid the Experimenters’ Regress”, Studies in History and Philosophy of Science Part A , 25(3): 463–491. doi:10.1016/0039-3681(94)90062-0
  • Franklin, Allan and Harry Collins, 2016, “Two Kinds of Case Study and a New Agreement”, in The Philosophy of Historical Case Studies , Tilman Sauer and Raphael Scholl (eds.), Cham: Springer International Publishing, 319: 95–121. doi:10.1007/978-3-319-30229-4_6
  • Fraser, Hannah, Tim Parker, Shinichi Nakagawa, Ashley Barnett, and Fiona Fidler, 2018, “Questionable Research Practices in Ecology and Evolution”, Jelte M. Wicherts (ed.), PLoS ONE , 13(7): e0200303. doi:10.1371/journal.pone.0200303
  • Freedman, Leonard P., Iain M. Cockburn, and Timothy S. Simcoe, 2015, “The Economics of Reproducibility in Preclinical Research”, PLoS Biology , 13(6): e1002165. doi:10.1371/journal.pbio.1002165
  • Giner-Sorolla, Roger, 2012, “Science or Art? How Aesthetic Standards Grease the Way Through the Publication Bottleneck but Undermine Science”, Perspectives on Psychological Science , 7(6): 562–571. doi:10.1177/1745691612457576
  • Gigerenzer, Gerd, 2018, “Statistical Rituals: The Replication Delusion and How We Got There”, Advances in Methods and Practices in Psychological Science , 1(2): 198–218. doi:10.1177/2515245918771329
  • Gilbert, Daniel T., Gary King, Stephen Pettigrew, and Timothy D. Wilson, 2016, “Comment on ‘Estimating the Reproducibility of Psychological Science’”, Science , 351(6277): 1037–1037. doi:10.1126/science.aad7243
  • Goldman, Alvin I., 1999, Knowledge in a Social World , Oxford: Clarendon. doi:10.1093/0198238207.001.0001
  • Gómez, Omar S., Natalia Juristo, and Sira Vegas, 2010, “Replications Types in Experimental Disciplines”, in Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement - ESEM ’10 , Bolzano-Bozen, Italy: ACM Press. doi:10.1145/1852786.1852790
  • Hackett, B., 2005, “Essential tensions: Identity, control, and risk in research”, Social Studies of Science , 35(5): 787–826. doi:10.1177/0306312705056045
  • Haller, Heiko, and Stefan Krauss, 2002, “Misinterpretations of Significance: a Problem Students Share with Their Teachers?” Methods of Psychological Research—Online , 7(1): 1–20. [ Haller & Kraus 2002 available online ]
  • Hartgerink, Chris H.J., Robbie C.M. van Aert, Michèle B. Nuijten, Jelte M. Wicherts, and Marcel A.L.M. van Assen, 2016, “Distributions of p -Values Smaller than .05 in Psychology: What Is Going On?”, PeerJ , 4(April): e1935. doi:10.7717/peerj.1935
  • Hendrick, Clyde, 1991. “Replication, Strict Replications, and Conceptual Replications: Are They Important?”, in Neuliep 1991: 41–49.
  • Ioannidis, John P. A., 2005, “Why Most Published Research Findings Are False”, PLoS Medicine , 2(8): e124. doi:10.1371/journal.pmed.0020124
  • Ioannidis, John P. A., Daniele Fanelli, Debbie Drake Dunne, and Steven N. Goodman, 2015, “Meta-Research: Evaluation and Improvement of Research Methods and Practices”, PLOS Biology , 13(10): e1002264. doi:10.1371/journal.pbio.1002264
  • Jennions, Michael D. and Anders Pape Møller, 2003, “A Survey of the Statistical Power of Research in Behavioral Ecology and Animal Behavior”, Behavioral Ecology , 14(3): 438–445. doi:10.1093/beheco/14.3.438
  • John, Leslie K., George Loewenstein, and Drazen Prelec, 2012, “Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling”, Psychological Science , 23(5): 524–532. doi:10.1177/0956797611430953
  • Kaiser, Jocelyn, 2018, “Plan to Replicate 50 High-Impact Cancer Papers Shrinks to Just 18”, Science , 31 July 2018. doi:10.1126/science.aau9619
  • Keppel, Geoffrey, 1982, Design and Analysis. A Researcher’s Handbook , second edition, Englewood Cliffs, NJ: Prentice-Hall.
  • Kerr, Norbert L., 1998, “HARKing: Hypothesizing After the Results Are Known”, Personality and Social Psychology Review , 2(3): 196–217. doi:10.1207/s15327957pspr0203_4
  • Kidwell, Mallory C., Ljiljana B. Lazarević, Erica Baranski, Tom E. Hardwicke, Sarah Piechowski, Lina-Sophia Falkenberg, Curtis Kennett, et al., 2016, “Badges to Acknowledge Open Practices: A Simple, Low-Cost, Effective Method for Increasing Transparency”, Malcolm R Macleod (ed.), PLOS Biology , 14(5): e1002456. doi:10.1371/journal.pbio.1002456
  • Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams, Štěpán Bahník, Michael J. Bernstein, Konrad Bocian, et al., 2014, “Investigating Variation in Replicability: A ‘Many Labs’ Replication Project”, Social Psychology , 45(3): 142–152. doi:10.1027/1864-9335/a000178
  • Lakens, Daniel, Federico G. Adolfi, Casper J. Albers, Farid Anvari, Matthew A. J. Apps, Shlomo E. Argamon, Thom Baguley, et al., 2018, “Justify Your Alpha”, Nature Human Behaviour , 2(3): 168–171. doi:10.1038/s41562-018-0311-x
  • Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton: Princeton University Press.
  • –––, 1996, “Cognitive and Non-Cognitive Values in Science: Rethinking the Dichotomy”, in Feminism, Science, and the Philosophy of Science , Lynn Hankinson Nelson and Jack Nelson (eds.), Dordrecht: Springer Netherlands, 39–58. doi:10.1007/978-94-009-1742-2_3
  • –––, 1997, “Feminist Epistemology as a Local Epistemology: Helen E. Longino”, Aristotelian Society Supplementary Volume , 71(1): 19–35. doi:10.1111/1467-8349.00017
  • Lykken, David T., 1968, “Statistical Significance in Psychological Research”, Psychological Bulletin , 70(3, Pt.1): 151–159. doi:10.1037/h0026141
  • Madden, Charles S., Richard W. Easley, and Mark G. Dunn, 1995, “How Journal Editors View Replication Research”, Journal of Advertising , 24(December): 77–87. doi:10.1080/00913367.1995.10673490
  • Makel, Matthew C., Jonathan A. Plucker, and Boyd Hegarty, 2012, “Replications in Psychology Research: How Often Do They Really Occur?”, Perspectives on Psychological Science , 7(6): 537–542. doi:10.1177/1745691612460688
  • MacCoun, Robert J. and Saul Perlmutter, 2017, “Blind Analysis as a Correction for Confirmatory Bias in Physics and in Psychology”, in Psychological Science Under Scrutiny , Scott O. Lilienfeld and Irwin D. Waldman (eds.), Hoboken, NJ: John Wiley & Sons, pp. 295–322. doi:10.1002/9781119095910.ch15
  • Martin, B., 1992, “Scientific fraud and the power structure of science”, Prometheus , 10(1): 83–98. doi:10.1080/08109029208629515
  • Masicampo, E.J. and Daniel R. Lalande, 2012, “A Peculiar Prevalence of p Values Just below .05”, Quarterly Journal of Experimental Psychology , 65(11): 2271–2279. doi:10.1080/17470218.2012.711335
  • Mahoney, Michael J., 1985, “Open Exchange and Epistemic Progress”, American Psychologist , 40(1): 29–39. doi:10.1037/0003-066X.40.1.29
  • Meehl, Paul E., 1967, “Theory-Testing in Psychology and Physics: A Methodological Paradox”, Philosophy of Science , 34(2): 103–115. doi:10.1086/288135
  • –––, 1978, “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology”, Journal of Consulting and Clinical Psychology , 46(4): 806–834. doi:10.1037/0022-006X.46.4.806
  • Merton, Robert K., 1942 [1973], “A Note on Science and Technology in a Democratic Order”, Journal of Legal and Political Sociology , 1(1–2): 115–126; reprinted as “The Normative Structure of Science”, in Robert K. Merton (ed.) The Sociology of Science: Theoretical and Empirical Investigations , Chicago, IL: University of Chicago Press.
  • Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis, 2017, “A Manifesto for Reproducible Science”, Nature Human Behaviour , 1(1): 0021. doi:10.1038/s41562-016-0021
  • Neuliep, James William (ed.), 1991, Replication Research in the Social Sciences , (Journal of social behavior and personality; 8: 6), Newbury Park, CA: Sage Publications.
  • Neuliep, James W. and Rick Crandall, 1990, “Editorial Bias Against Replication Research”, Journal of Social Behavior and Personality , 5(4): 85–90
  • Nosek, Brian A. and Daniël Lakens, 2014, “Registered Reports: A Method to Increase the Credibility of Published Results”, Social Psychology , 45(3): 137–141. doi:10.1027/1864-9335/a000192
  • Nosek, Brian A., Jeffrey R. Spies, and Matt Motyl, 2012, “Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability”, Perspectives on Psychological Science , 7(6): 615–631. doi:10.1177/1745691612459058
  • Nosek, B. A., G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, et al., 2015, “Promoting an Open Research Culture”, Science , 348(6242): 1422–1425. doi:10.1126/science.aab2374,
  • Nosek, Brian A., Charles R. Ebersole, Alexander C. DeHaven, and David T. Mellor, 2018, “The Preregistration Revolution”, Proceedings of the National Academy of Sciences , 115(11): 2600–2606. doi:10.1073/pnas.1708274114
  • Nuijten, Michèle B., Chris H. J. Hartgerink, Marcel A. L. M. van Assen, Sacha Epskamp, and Jelte M. Wicherts, 2016, “The Prevalence of Statistical Reporting Errors in Psychology (1985–2013)”, Behavior Research Methods , 48(4): 1205–1226. doi:10.3758/s13428-015-0664-2
  • Oakes, Michael, 1986, Statistical Inference: A Commentary for the Social and Behavioral Sciences , New York: Wiley.
  • Open Science Collaboration (OSC), 2015, “Estimating the Reproducibility of Psychological Science”, Science , 349(6251): 943–951. doi:10.1126/science.aac4716
  • Oransky, Ivan, 2016, “Half of Biomedical Studies Don’t Stand up to Scrutiny and What We Need to Do about That”, The Conversation , 11 November 2016. [ Oransky 2016 available online ]
  • Parker, T.H., E. Main, S. Nakagawa, J. Gurevitch, F. Jarrad, and M. Burgman, 2016, “Promoting Transparency in Conservation Science: Editorial”, Conservation Biology , 30(6): 1149–1150. doi:10.1111/cobi.12760
  • Pashler, Harold and Eric-Jan Wagenmakers, 2012, “Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?”, Perspectives on Psychological Science , 7(6): 528–530. doi:10.1177/1745691612465253
  • Peng, Roger D., 2011, “Reproducible Research in Computational Science”, Science , 334(6060): 1226–1227. doi:10.1126/science.1213847
  • –––, 2015, “The Reproducibility Crisis in Science: A Statistical Counterattack”, Significance , 12(3): 30–32. doi:10.1111/j.1740-9713.2015.00827.x
  • Radder, Hans, 1996, In And About The World: Philosophical Studies Of Science And Technology , Albany, NY: State University of New York Press.
  • –––, 2003, “Technology and Theory in Experimental Science”, in Hans Radder (ed.), The Philosophy of Scientific Experimentation , Pittsburgh: University of Pittsburgh Press, pp. 152–173.
  • –––, 2006, The World Observed/The World Conceived , Pittsburgh, PA: University of Pittsburgh Press.
  • –––, 2009, “Science, Technology and the Science-Technology Relationship”, in Anthonie Meijers (ed.), Philosophy of Technology and Engineering Sciences , Amsterdam: Elsevier, pp. 65–91. doi:10.1016/B978-0-444-51667-1.50007-0
  • –––, 2012, The Material Realization of Science: From Habermas to Experimentation and Referential Realism , Boston: Springer. doi:10.1007/978-94-007-4107-2
  • Rauscher, Frances H., Gordon L. Shaw, and Catherine N. Ky, 1993, “Music and Spatial Task Performance”, Nature , 365(6447): 611–611. doi:10.1038/365611a0
  • Rauscher, Frances H., Gordon L. Shaw, and Katherine N. Ky, 1995, “Listening to Mozart Enhances Spatial-Temporal Reasoning: Towards a Neurophysiological Basis”, Neuroscience Letters , 185(1): 44–47. doi:10.1016/0304-3940(94)11221-4
  • Ritchie, Stuart J., Richard Wiseman, and Christopher C. French, 2012, “Failing the Future: Three Unsuccessful Attempts to Replicate Bem’s ‘Retroactive Facilitation of Recall’ Effect”, Sam Gilbert (ed.), PLoS ONE , 7(3): e33423. doi:10.1371/journal.pone.0033423
  • Rooney, Phyllis, 1992, “On Values in Science: Is the Epistemic/Non-Epistemic Distinction Useful?”, PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association , 1992(1): 13–22. doi:10.1086/psaprocbienmeetp.1992.1.192740
  • Rosenthal, Robert, 1979, “The File Drawer Problem and Tolerance for Null Results”, Psychological Bulletin , 86(3): 638–641. doi:10.1037/0033-2909.86.3.638
  • –––, 1991, “Replication in Behavioral Research”, in Neuliep 1991: 1–39.
  • Rosnow, Ralph L. and Robert Rosenthal, 1989, “Statistical Procedures and the Justification of Knowledge in Psychological Science”, American Psychologist , 44(10): 1276–1284. doi:10.1037/0003-066X.44.10.1276
  • Rowhani-Farid, Anisa, Michelle Allen, and Adrian G. Barnett, 2017, “What Incentives Increase Data Sharing in Health and Medical Research? A Systematic Review”, Research Integrity and Peer Review , 2: 4. doi:10.1186/s41073-017-0028-9
  • Rudner, Richard, 1953, “The Scientist Qua Scientist Makes Value Judgments”, Philosophy of Science , 20(1): 1–6. doi:10.1086/287231
  • Sargent, C.L., 1981, “The Repeatability Of Significance And The Significance Of Repeatability”, European Journal of Parapsychology , 3: 423–433.
  • Schekman, Randy, 2013, “How Journals like Nature, Cell and Science Are Damaging Science | Randy Schekman”, The Guardian , December 9, sec. Opinion, [ Schekman 2013 available online ]
  • Schmidt, Stefan, 2009, “Shall We Really Do It Again? The Powerful Concept of Replication Is Neglected in the Social Sciences”, Review of General Psychology , 13(2): 90–100. doi:10.1037/a0015108
  • Silberzahn, Raphael and Eric L. Uhlmann, 2015, “Many Hands Make Tight Work: Crowdsourcing Research Can Balance Discussions, Validate Findings and Better Inform Policy”, Nature , 526(7572): 189–192.
  • Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn, 2011, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, Psychological Science , 22(11): 1359–1366. doi:10.1177/0956797611417632
  • Smith, Daniel R., Ian C.W. Hardy, and Martin P. Gammell, 2011, “Power Rangers: No Improvement in the Statistical Power of Analyses Published in Animal Behaviour”, Animal Behaviour , 81(1): 347–352. doi:10.1016/j.anbehav.2010.09.026
  • Sovacool, B. K., 2008, “Exploring scientific misconduct: Isolated individuals, impure institutions, or an inevitable idiom of modern science?” Journal of Bioethical Inquiry , 5: 271–282. doi: 10.1007/s11673-008-9113-6
  • Steel, Daniel, 2010, “Epistemic Values and the Argument from Inductive Risk*”, Philosophy of Science , 77(1): 14–34. doi:10.1086/650206
  • Stegenga, Jacob, 2018, Medical Nihilism , Oxford: Oxford University Press.
  • Steinle, Friedrich, 2016, “Stability and Replication of Experimental Results: A Historical Perspective”, in Atmanspacher and Maasen 2016b: 39–68. doi:10.1002/9781118865064.ch3
  • Sterling, Theodore D., 1959, “Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance – or Vice Versa”, Journal of the American Statistical Association , 54(285): 30–34. doi:10.1080/01621459.1959.10501497
  • Sutton, Jon, 2018, “Tone Deaf?”, The Psychologist , 31: 12–13. [ Sutton 2018 available online ]
  • Szucs, Denes and John P. A. Ioannidis, 2017, “Empirical Assessment of Published Effect Sizes and Power in the Recent Cognitive Neuroscience and Psychology Literature”, Eric-Jan Wagenmakers (ed.), PLoS Biology , 15(3): e2000797. doi:10.1371/journal.pbio.2000797
  • Teira, David, 2013, “A Contractarian Solution to the Experimenter’s Regress”, Philosophy of Science , 80(5): 709–720. doi:10.1086/673717
  • Vazire, Simine, 2018, “Implications of the Credibility Revolution for Productivity, Creativity, and Progress”, Perspectives on Psychological Science , 13(4): 411–417. doi:10.1177/1745691617751884
  • Wagenmakers, Eric-Jan, Ruud Wetzels, Denny Borsboom, Han L. J. van der Maas, and Rogier A. Kievit, 2012, “An Agenda for Purely Confirmatory Research”, Perspectives on Psychological Science , 7(6): 632–638. doi:10.1177/1745691612463078
  • Washburn, Anthony N., Brittany E. Hanson, Matt Motyl, Linda J. Skitka, Caitlyn Yantis, Kendal M. Wong, Jiaqing Sun, et al., 2018, “Why Do Some Psychology Researchers Resist Adopting Proposed Reforms to Research Practices? A Description of Researchers’ Rationales”, Advances in Methods and Practices in Psychological Science , 1(2): 166–173. doi:10.1177/2515245918757427
  • Wasserstein, Ronald L. and Nicole A. Lazar, 2016, “The ASA’s Statement on p-Values: Context, Process, and Purpose”, The American Statistician , 70(2): 129–133. doi:10.1080/00031305.2016.1154108
Academic Tools

How to cite this entry. Preview the PDF version of this entry at the Friends of the SEP Society. Look up topics and thinkers related to this entry at the Internet Philosophy Ontology Project (InPhO). Enhanced bibliography for this entry at PhilPapers, with links to its database.
Other Internet Resources

  • Barba, Lorena A., 2017, “Science Reproducibility Taxonomy”, presentation slides for the 2017 Workshop on Reproducibility Taxonomies for Computing and Computational Science.
  • Kelly, Clint, 2017, “Redux: Do Behavioral Ecologists Replicate Their Studies?”, presented at Ignite Session 12, Ecological Society of America, Portland, Oregon, 8 August. [Kelly 2017 abstract available online]
  • McShane, Blakeley B., David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett, 2018, “Abandon Statistical Significance”, arXiv.org, first version 22 September 2017; latest revision, 8 September 2018.
  • Schnall, Simone, 2014, “Social Media and the Crowd-Sourcing of Social Psychology”, Blog, Department of Psychology, Cambridge University, November 18.
  • Tilburg University Meta-Research Center
  • Meta-Research Innovation Center at Stanford (METRICS)
  • The saga of the summer 2017, a.k.a. ‘the alpha wars’, Barely Significant blog by Ladislas Nalborczyk.
  • 2017 American Statistical Association Symposium on Statistical Inference: Scientific Method for the 21st Century: A World Beyond \(p <0.05\)
  • Improving Your Statistical Inferences, Daniel Lakens, 2018, Coursera.
  • StudySwap: A Platform for Interlab Replication, Collaboration, and Research Resource Exchange, Open Science Framework
  • Collaborative Replications and Education Project (CREP), Open Science Framework
  • Registered Reports: Peer review before results are known to align scientific values and practices, Center for Open Science

Related Entries

Bayes’ Theorem | epistemology: Bayesian | measurement: in science | operationalism | science: theory and observation in | scientific knowledge: social dimensions of | scientific method | scientific research and big data

Copyright © 2018 by Fiona Fidler < fidlerfm @ unimelb . edu . au > John Wilcox < wilcoxje @ stanford . edu >



Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results

The massive project shows that reproducibility problems plague even top scientific journals

Brian Handwerk

Science Correspondent


Academic journals and the press regularly serve up fresh helpings of fascinating psychological research findings. But how many of those experiments would produce the same results a second time around?

According to work presented today in Science, fewer than half of 100 studies published in 2008 in three top psychology journals could be replicated successfully. The international effort included 270 scientists who re-ran other people's studies as part of The Reproducibility Project: Psychology, led by Brian Nosek of the University of Virginia.

The eye-opening results don't necessarily mean that those original findings were incorrect or that the scientific process is flawed. When one study finds an effect that a second study can't replicate, there are several possible reasons, says co-author Cody Christopherson of Southern Oregon University. Study A's result may be false, or Study B's results may be false—or there may be some subtle differences in the way the two studies were conducted that impacted the results.

“This project is not evidence that anything is broken. Rather, it's an example of science doing what science does,” says Christopherson. “It's impossible to be wrong in a final sense in science. You have to be temporarily wrong, perhaps many times, before you are ever right.”

Across the sciences, research is considered reproducible when an independent team can conduct a published experiment, following the original methods as closely as possible, and get the same results. It's one key part of the process for building evidence to support theories. Even today, 100 years after Albert Einstein presented his general theory of relativity, scientists regularly repeat tests of its predictions and look for cases where his famous description of gravity does not apply.

"Scientific evidence does not rely on trusting the authority of the person who made the discovery," team member Angela Attwood , a psychology professor at the University of Bristol, said in a statement "Rather, credibility accumulates through independent replication and elaboration of the ideas and evidence."

The Reproducibility Project, a community-based crowdsourcing effort, kicked off in 2011 to test how well this measure of credibility applies to recent research in psychology. Scientists, some recruited and some volunteers, reviewed a pool of studies and selected one for replication that matched their own interest and expertise. Their data and results were shared online and reviewed and analyzed by other participating scientists for inclusion in the large Science study.

To help improve future research, the project analysis attempted to determine which kinds of studies fared the best, and why. They found that surprising results were the hardest to reproduce, and that the experience or expertise of the scientists who conducted the original experiments had little to do with successful replication.

The findings also offered some support for the oft-criticized statistical tool known as the P value, which measures whether a result is significant or likely due to chance. A higher value suggests a result may well be a fluke of chance, while a lower value means the result is considered statistically significant.

The project analysis showed that a low P value was fairly predictive of which psychology studies could be replicated. Twenty of the 32 original studies with a P value of less than 0.001 could be replicated, for example, while just 2 of the 11 papers with a value greater than 0.04 were successfully replicated.

But Christopherson suspects that most of his co-authors would not want the study to be taken as a ringing endorsement of P values, because they recognize the tool's limitations. And at least one P value problem was highlighted in the research: the original studies had relatively little variability in P value, because most journals have established a cutoff of 0.05 for publication. The trouble is that this value can be reached by being selective about data sets, which means scientists looking to replicate a result should also carefully consider the methods and the data used in the original study.

It's also not yet clear whether psychology might be a particularly difficult field for reproducibility—a similar study is currently underway on cancer biology research. In the meantime, Christopherson hopes that the massive effort will spur more such double-checks and revisitations of past research to aid the scientific process.

“Getting it right means regularly revisiting past assumptions and past results and finding new ways to test them. The only way science is successful and credible is if it is self-critical,” he notes. 

Unfortunately there are disincentives to pursuing this kind of research, he says: “To get hired and promoted in academia, you must publish original research, so direct replications are rarer. I hope going forward that the universities and funding agencies responsible for incentivizing this research—and the media outlets covering them—will realize that they've been part of the problem, and that devaluing replication in this way has created a less stable literature than we'd like.”



Brian Handwerk is a science correspondent based in Amherst, New Hampshire.

Reproducibility and Replicability in Science (National Academies Press, 2019)

Chapter 5: Replicability

Replication is one of the key ways scientists build confidence in the scientific merit of results. When the result from one study is found to be consistent with the result of another study, it is more likely to represent a reliable claim to new knowledge. As Popper (2005, p. 23) wrote (using “reproducibility” in its generic sense):

We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence,’ but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.

However, a successful replication does not guarantee that the original scientific results of a study were correct, nor does a single failed replication conclusively refute the original claims. A failure to replicate previous results can be due to any number of factors, including the discovery of an unknown effect, inherent variability in the system, inability to control complex variables, substandard research practices, and, quite simply, chance. The nature of the problem under study and the prior likelihoods of possible results in the study, the type of measurement instruments and research design selected, and the novelty of the area of study and therefore lack of established methods of inquiry can also contribute to non-replicability. Because of the complicated relationship between replicability and its variety of sources, the validity of scientific results should be considered in the context of an entire body of evidence, rather than an individual study or an individual replication. Moreover, replication may be a matter of degree, rather than a binary result of “success” or “failure.” 1 We explain in Chapter 7 how research synthesis, especially meta-analysis, can be used to evaluate the evidence on a given question.

ASSESSING REPLICABILITY

How does one determine the extent to which a replication attempt has been successful? When researchers investigate the same scientific question using the same methods and similar tools, the results are not likely to be identical—unlike in computational reproducibility, in which bitwise agreement between two results can be expected (see Chapter 4). We repeat our definition of replicability, with emphasis added: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

Determining consistency between two different results or inferences can be approached in a number of ways (Simonsohn, 2015; Verhagen and Wagenmakers, 2014). Even if one considers only quantitative criteria for determining whether two results qualify as consistent, there is variability across disciplines (Zwaan et al., 2018; Plant and Hanisch, 2018). The Royal Netherlands Academy of Arts and Sciences (2018, p. 20) concluded that “it is impossible to identify a single, universal approach to determining [replicability].” As noted in Chapter 2, different scientific disciplines are distinguished in part by the types of tools, methods, and techniques used to answer questions specific to the discipline, and these differences include how replicability is assessed.

___________________

1 See, for example, the cancer biology project in Table 5-1 in this chapter.

Acknowledging the different approaches to assessing replicability across scientific disciplines, however, we emphasize eight core characteristics and principles:

  • Attempts at replication of previous results are conducted following the methods and using similar equipment and analyses as described in the original study or under sufficiently similar conditions ( Cova et al., 2018 ). 2 Yet regardless of how similar the replication study is, no second event can exactly repeat a previous event.
  • The concept of replication between two results is inseparable from uncertainty, as is also the case for reproducibility (as discussed in Chapter 4 ).
  • Any determination of replication (between two results) needs to take account of both proximity (i.e., the closeness of one result to the other, such as the closeness of the mean values) and uncertainty (i.e., variability in the measures of the results).
  • To assess replicability, one must first specify exactly what attribute of a previous result is of interest. For example, is only the direction of a possible effect of interest? Is the magnitude of effect of interest? Is surpassing a specified threshold of magnitude of interest? With the attribute of interest specified, one can then ask whether two results fall within or outside the bounds of “proximity-uncertainty” that would qualify as replicated results.
  • Depending on the selected criteria (e.g., measure, attribute), assessments of a set of attempted replications could appear quite divergent. 3
  • A judgment that “Result A is replicated by Result B” must be identical to the judgment that “Result B is replicated by Result A.” There must be a symmetry in the judgment of replication; otherwise, internal contradictions are inevitable.
  • There could be advantages to inverting the question from “Does Result A replicate Result B (given their proximity and uncertainty)?” to “Are Results A and B sufficiently divergent (given their proximity and uncertainty) so as to qualify as a non-replication?” It may be advantageous, in assessing degrees of replicability, to define a relatively high threshold of similarity that qualifies as “replication,” a relatively low threshold of similarity that qualifies as “non-replication,” and an intermediate zone between the two thresholds that is considered “indeterminate.” If a second study has low power and wide uncertainties, it may be unable to produce any but indeterminate results.
  • While a number of different standards for replicability/non-replicability may be justifiable, depending on the attributes of interest, a standard of “repeated statistical significance” has many limitations because the level of statistical significance is an arbitrary threshold ( Amrhein et al., 2019a ; Boos and Stefanski, 2011 ; Goodman, 1992 ; Lazzeroni et al., 2016 ). For example, one study may yield a p-value of 0.049 (declared significant at the p ≤ 0.05 level) while a second yields a p-value of 0.051 (declared nonsignificant by the same threshold), and the studies are therefore said not to replicate. However, if the second study had instead yielded a p-value of 0.03, a reviewer would say it had successfully replicated the first study, even though that result could diverge more sharply (by proximity and uncertainty) from the original study than in the first comparison. Rather than focus on an arbitrary threshold such as statistical significance, it is more revealing to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, standard deviations (or uncertainties), and additional metrics tailored to the subject matter.

2 Cova et al. (2018, fn. 3) discuss the challenge of defining sufficiently similar as well as the interpretation of the results:

In practice, it can be hard to determine whether the ‘sufficiently similar’ criterion has actually been fulfilled by the replication attempt, whether in its methods or in its results ( Nakagawa and Parker, 2015 ). It can therefore be challenging to interpret the results of replication studies, no matter which way these results turn out ( Collins, 1975 ; Earp and Trafimow, 2015 ; Maxwell et al., 2015 ).

3 See Table 5-1 for an example of this in the reviews of a psychology replication study by Open Science Collaboration (2015) and Patil et al. (2016) .

The final point above is reinforced by a recent special edition of the American Statistician in which the use of a statistical significance threshold in reporting is strongly discouraged due to overuse and wide misinterpretation ( Wasserstein et al., 2019 ). A figure from Amrhein et al. (2019b) also demonstrates this point (see Figure 5-1 ).
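To make the contrast concrete, the following sketch (our own, using purely hypothetical effect estimates and standard errors) compares two studies first by the “repeated statistical significance” standard and then by a simple proximity-uncertainty check that asks whether the original effect estimate falls within the replication’s 95 percent confidence interval.

```python
# Hypothetical comparison of an original study and a replication.
# Effects are mean differences with standard errors; all numbers are made up.
from scipy.stats import norm

def two_sided_p(effect, se):
    """Two-sided p-value for a normal test of effect = 0."""
    z = effect / se
    return 2 * norm.sf(abs(z))

def ci95(effect, se):
    """95% confidence interval for the effect."""
    half = 1.96 * se
    return effect - half, effect + half

original    = {"effect": 0.40, "se": 0.20}   # p ~ 0.046
replication = {"effect": 0.38, "se": 0.195}  # p ~ 0.051: "nonsignificant"

# Standard 1: repeated statistical significance (threshold-dependent).
sig_orig = two_sided_p(**original) <= 0.05
sig_rep  = two_sided_p(**replication) <= 0.05
print("repeated-significance verdict:",
      "replicated" if sig_orig and sig_rep else "not replicated")

# Standard 2: proximity-uncertainty check -- does the original effect
# fall inside the replication's 95% CI?
lo, hi = ci95(**replication)
print("replication 95% CI:", (round(lo, 2), round(hi, 2)))
print("proximity-uncertainty verdict:",
      "consistent" if lo <= original["effect"] <= hi else "divergent")
```

With these made-up numbers the two estimates are nearly identical, yet the threshold rule declares a failure to replicate while the proximity-uncertainty check judges the results consistent.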

One concern voiced by some researchers about using a proximity-uncertainty attribute to assess replicability is that such an assessment favors studies with large uncertainties; the potential consequence is that many researchers would choose to perform low-power studies to increase the replicability chances ( Cova et al., 2018 ). While two results with large uncertainties and within proximity, such that the uncertainties overlap with each other, may be consistent with replication, the large uncertainties indicate that not much confidence can be placed in that conclusion.

[Figure 5-1, from Amrhein et al. (2019b), appears here.]

CONCLUSION 5-1: Different types of scientific studies lead to different or multiple criteria for determining a successful replication. The choice of criteria can affect the apparent rate of non-replication, and that choice calls for judgment and explanation.

CONCLUSION 5-2: A number of parametric and nonparametric methods may be suitable for assessing replication across studies. However, a restrictive and unreliable approach would accept replication only when the results in both studies have attained “statistical significance,” that is, when the p-values in both studies have fallen below a selected threshold. Rather, in determining replication, it is important to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, standard deviations (uncertainties), and additional metrics tailored to the subject matter.

THE EXTENT OF NON-REPLICABILITY

The committee was asked to assess what is known and, if necessary, identify areas that may need more information to ascertain the extent of non-replicability in scientific and engineering research. The committee examined current efforts to assess the extent of non-replicability within several fields, reviewed literature on the topic, and heard from expert panels during its public meetings. We also drew on the previous work of committee members and other experts in the field of replicability of research.

Some efforts to assess the extent of non-replicability in scientific research directly measure rates of replication, while others examine indirect measures to infer the extent of non-replication. Approaches to assessing non-replicability rates include

  • direct and indirect assessments of replicability;
  • perspectives of researchers who have studied replicability;
  • surveys of researchers; and
  • retraction trends.

This section discusses each of these lines of evidence.

Assessments of Replicability

The most direct method to assess replicability is to perform a study following the original methods of a previous study and to compare the new results to the original ones. Some high-profile replication efforts in recent years include studies by Amgen, which showed low replication rates in biomedical research ( Begley and Ellis, 2012 ), and work by the Center for Open Science on psychology ( Open Science Collaboration, 2015 ), cancer research ( Nosek and Errington, 2017 ), and social science ( Camerer et al., 2018 ). In these examples, a set of studies was selected and a single replication attempt was made to confirm the results of each previous study (i.e., one-to-one comparisons). In other replication studies, teams of researchers performed multiple replication attempts on a single original result, that is, many-to-one comparisons (see e.g., Klein et al., 2014 ; Hagger et al., 2016 ; and Cova et al., 2018 in Table 5-1 ).

Other measures of replicability include assessments that can provide indicators of bias, errors, and outliers, including, for example, computational data checks of reported numbers and comparison of reported values against a database of previously reported values. Such assessments can identify data that are outliers to previous measurements and may signal the need for additional investigation to understand the discrepancy. 4 Table 5-1 summarizes the direct and indirect replication studies assembled by the committee. Other sources of non-replicability are discussed later in this chapter in the Sources of Non-Replicability section.

4 There is risk of missing a new discovery by rejecting data outliers without further investigation.
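The kind of indirect check described above can be illustrated in a few lines. The sketch below is ours, with an invented set of previously reported values rather than any real database; in keeping with footnote 4, an outlier flag is an invitation to investigate, not a verdict.

```python
# Hypothetical indirect check: compare a newly reported measurement against
# a (fictional) database of previously reported values for the same property.
import statistics

previous_values = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.1, 4.9, 5.0]  # fictional
new_value = 6.4                                                        # fictional

mean = statistics.mean(previous_values)
sd = statistics.stdev(previous_values)
z = (new_value - mean) / sd

# A large |z| marks the value as an outlier relative to the prior literature.
# It may be an error, or it may be a genuine discovery; either way it warrants
# further investigation rather than automatic rejection.
if abs(z) > 3:
    print(f"Flag for review: z = {z:.1f} relative to previously reported values")
else:
    print(f"Consistent with prior reports: z = {z:.1f}")
```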

Many direct replication studies are not reported as such. Replication—especially of surprising results or those that could have a major impact—occurs in science often without being labelled as a replication. Many scientific fields conduct reviews of articles on a specific topic—especially on new topics or topics likely to have a major impact—to assess the available data and determine which measurements and results are rigorous (see Chapter 7 ). Therefore, replicability studies included as part of the scientific literature but not cited as such add to the difficulty in assessing the extent of replication and non-replication.

One example of this phenomenon relates to research on hydrogen storage capacity. The U.S. Department of Energy (DOE) issued a target storage capacity in the mid-1990s. One group using carbon nanotubes reported surprisingly high values that met DOE’s target ( Hynek et al., 1997 ); other researchers who attempted to replicate these results could not do so. At the same time, other researchers were also reporting high values of hydrogen capacity in other experiments. In 2003, an article reviewed previous studies of hydrogen storage values and reported new research results, which were later replicated ( Broom and Hirscher, 2016 ). None of these studies was explicitly called an attempt at replication.

Based on the content of the collected studies in Table 5-1 , one can observe that the

  • majority of the studies are in the social and behavioral sciences (including economics) or in biomedical fields, and
  • methods of assessing replicability are inconsistent and the replicability percentages depend strongly on the methods used.

The replication studies such as those shown in Table 5-1 are not necessarily indicative of the actual rate of non-replicability across science for a number of reasons: the studies to be replicated were not randomly chosen, the replications had methodological shortcomings, many replication studies are not reported as such, and the reported replication studies found widely varying rates of non-replication ( Gilbert et al., 2016 ). At the same time, replication studies often provide more and better-quality evidence than most original studies alone, and they highlight such methodological features as high precision or statistical power, preregistration, and multi-site collaboration ( Nosek, 2016 ). Some would argue that focusing on replication of a single study as a way to improve the efficiency of science is ill-placed. Rather, reviews of cumulative evidence on a subject, to gauge both the overall effect size and generalizability, may be more useful ( Goodman, 2018 ; and see Chapter 7 ).

Apart from specific efforts to replicate others’ studies, investigators will typically confirm their own results, as in a laboratory experiment, prior to publication.

TABLE 5-1 Examples of Replication Studies

Field and Author(s) | Description | Results | Type of Assessment
Experimental Philosophy ( ) | A group of 20 research teams performed replication studies of 40 experimental philosophy studies published between 2003 and 2015 | 70% of the 40 studies were replicated by comparing the original effect size to the confidence interval (CI) of the replication. | Direct
Behavioral Science, Personality Traits Linked to Life Outcomes ( ) | Performed replications of 78 previously published associations between the Big Five personality traits and consequential life outcomes | 87% of the replication attempts were statistically significant in the expected direction, and effects were typically 77% as strong as the corresponding original effects. | Direct
Behavioral Science, Ego-Depletion Effect ( ) | Multiple laboratories (23 in total) conducted replications of a standardized ego-depletion protocol based on a sequential-task paradigm by | Meta-analysis of the studies revealed that the size of the ego-depletion effect was small with 95% CI that encompassed zero (d = 0.04, 95% CI [−0.07, 0.15]). |
General Biology, Preclinical Animal Studies ( ) | Attempt by researchers from Bayer HealthCare to validate data on potential drug targets obtained in 67 projects by copying models exactly or by adapting them to internal needs | Published data were completely in line with the results of the validation studies in 20%-25% of cases. | Direct
Oncology, Preclinical Studies ( ) | Attempt by Amgen team to reproduce the results of 53 “landmark” studies | Scientific results were confirmed in 11% of the studies. | Direct
Genetics, Preclinical Studies ( ) | Replication of data analyses provided in 18 articles on microarray-based gene expression studies | Of the 18 studies, 2 analyses (11%) were replicated; 6 were partially replicated or showed some discrepancies in results; and 10 could not be replicated. | Direct
Experimental Psychology ( ) | Replication of 13 psychological phenomena across 36 independent samples | 77% of phenomena were replicated consistently. | Direct
Experimental Psychology, Many Labs 2 ( ) | Replication of 28 classic and contemporary published studies | 54% of replications produced a statistically significant effect in the same direction as the original study, 75% yielded effect sizes smaller than the original ones, and 25% yielded larger effect sizes than the original ones. | Direct
Experimental Psychology ( ) | Attempt to independently replicate selected results from 100 studies in psychology | 36% of the replication studies produced significant results, compared to 97% of the original studies. The mean effect sizes were halved. | Direct
Experimental Psychology ( ) | Using reported data from the replication study in psychology, reanalyzed the results | 77% of the studies replicated by comparing the original effect size to an estimated 95% CI of the replication. | Direct
Experimental Psychology ( ) | Attempt to replicate 21 systematically selected experimental studies in the social sciences published in and in 2010-2015 | Found a significant effect in the same direction as the original study for 62% (13 of 21) studies, and the effect size of the replications was on average about 50% of the original effect size. | Direct
Empirical Economics ( ) | 2-year study that collected programs and data from authors and attempted to replicate their published results on empirical economic research | Two of nine replications were successful, three “near” successful, and four unsuccessful; findings suggest that inadvertent errors in published empirical articles are a commonplace rather than a rare occurrence. | Direct
Economics ( ) | Progress report on the number of journals with data sharing requirements and an assessment of 167 studies | 10 journals explicitly note they publish replications; of 167 published replication studies, approximately 66% were unable to confirm the original results; 12% disconfirmed at least one major result of the original study, while confirming others. | N/A
Economics ( ) | An effort to replicate 18 studies published in the and the from 2011-2014 | Significant effect in the same direction as the original study found for 11 replications (61%); on average, the replicated effect size was 66% of the original. | Direct
Chemistry ( ; ) | Collaboration with National Institute of Standards and Technology (NIST) to check new data against NIST database, 13,000 measurements | 27% of papers reporting properties of adsorption had data that were outliers; 20% of papers reporting carbon dioxide isotherms as outliers. | Indirect
Chemistry ( ) | Collaboration with NIST, Thermodynamics Research Center (TRC) databases, prepublication check of solubility, viscosity, critical temperature, and vapor pressure | 33% experiments had data problems, such as uncertainties too small, reported values outside of TRC database distributions. | Indirect
Biology, Reproducibility Project: Cancer Biology | Large-scale replication project to replicate key results in 29 cancer papers published in , , and other high-impact journals | The first five articles have been published; two replicated important parts of the original papers, one did not replicate, and two were uninterpretable. | Direct
Psychology, Statistical Checks ( ) | Statcheck tool used to test statistical values within psychology articles from 1985-2013 | 49.6% of the articles with null hypothesis statistical test (NHST) results contained at least one inconsistency (8,273 of the 16,695 articles), and 12.9% (2,150) of the articles with NHST results contained at least one gross inconsistency. | Indirect
Engineering, Computational Fluid Dynamics ( ) | Full replication studies of previously published results on bluff-body aerodynamics, using four different computational methods | Replication of the main result was achieved in three out of four of the computational efforts. | Direct
Psychology, Many Labs 3 ( ) | Attempt to replicate 10 psychology studies in one online session | 3 of 10 studies replicated at p < 0.05. | Direct
Psychology ( ) | Argued that one of the failed replications in Ebersole et al. was due to changes in the procedure. They randomly assigned participants to a version closer to the original or to Ebersole et al.’s version. | The original study replicated when the original procedures were followed more closely, but not when the Ebersole et al. procedures were used. | Direct
Psychology ( ) | 17 different labs attempted to replicate one study on facial feedback by . | None of the studies replicated the result at p < 0.05. | Direct
Psychology ( ) | Pointed out that all of the studies in the replication project changed the procedure by videotaping participants. Conducted a replication in which participants were randomly assigned to be videotaped or not. | The original study was replicated when the original procedure was followed (p = 0.01); the original study was not replicated when the video camera was present (p = 0.85). | Direct
Psychology ( ) | 31 labs attempted to replicate a study by Schooler and Engstler-Schooler (1990). | Replicated the original study. The effect size was much larger when the original study was replicated more faithfully (the first set of replications inadvertently introduced a change in the procedure). | Direct

NOTES: Some of the studies in this table also appear in Table 4-1 as they evaluated both reproducibility and replicability. N/A = not applicable.

a From Cova et al. (2018 , p. 14): “For studies reporting statistically significant results, we treated as successful replications for which the replication 95 percent CI [confidence interval] was not lower than the original effect size. For studies reporting null results, we treated as successful replications for which original effect sizes fell inside the bounds of the 95 percent CI.”

b From Soto (2019 , p. 7, fn. 1): “Previous large-scale replication projects have typically treated the individual study as the primary unit of analysis. Because personality-outcome studies often examine multiple trait-outcome associations, we selected the individual association as the most appropriate unit of analysis for estimating replicability in this literature.”

More generally, independent investigators may replicate prior results of others before conducting, or in the course of conducting, a study to extend the original work. These types of replications are not usually published as separate replication studies.

Perspectives of Researchers Who Have Studied Replicability

Several experts who have studied replicability within and across fields of science and engineering provided their perspectives to the committee. Brian Nosek, cofounder and director of the Center for Open Science, said there was “not enough information to provide an estimate with any certainty across fields and even within individual fields.” In a recent paper discussing scientific progress and problems, Richard Shiffrin, professor of psychology and brain sciences at Indiana University, and colleagues argued that there are “no feasible methods to produce a quantitative metric, either across science or within the field” to measure the progress of science ( Shiffrin et al., 2018 , p. 2632). Skip Lupia, now serving as head of the Directorate for Social, Behavioral, and Economic Sciences at the National Science Foundation, said that there is not sufficient information to be able to definitively answer the extent of non-reproducibility and non-replicability, but there is evidence of p- hacking and publication bias (see below), which are problems. Steven Goodman, the codirector of the Meta-Research Innovation Center at Stanford University (METRICS), suggested that the focus ought not be on the rate of non-replication of individual studies, but rather on cumulative evidence provided by all studies and convergence to the truth. He suggested the proper question is “How efficient is the scientific enterprise in generating reliable knowledge, what affects that reliability, and how can we improve it?”

Surveys of scientists about issues of replicability or on scientific methods are indirect measures of non-replicability. For example, Nature published the results of a survey in 2016 in an article titled “1,500 Scientists Lift the Lid on Reproducibility” ( Baker, 2016 ); 5 this article reported that a large percentage of researchers who responded to an online survey believe that replicability is a problem. This article has been widely cited by researchers studying subjects ranging from cardiovascular disease to crystal structures ( Warner et al., 2018 ; Ziletti et al., 2018 ). Surveys and studies have also assessed the prevalence of specific problematic research practices, such as a 2018 survey about questionable research practices in ecology and evolution ( Fraser et al., 2018 ).

5 Nature uses the word “reproducibility” to refer to what we call “replicability.”

However, many of these surveys rely on poorly defined sampling frames to identify populations of scientists and do not use probability sampling techniques. The fact that nonprobability samples “rely mostly on people . . . whose selection probabilities are unknown [makes it] difficult to estimate how representative they are of the [target] population” ( Dillman, Smyth, and Christian, 2014 , pp. 70, 92). In fact, we know that people with a particular interest in or concern about a topic, such as replicability and reproducibility, are more likely to respond to surveys on the topic ( Brehm, 1993 ). As a result, we caution against using surveys based on nonprobability samples as the basis of any conclusion about the extent of non-replicability in science.

High-quality researcher surveys are expensive and pose significant challenges, including constructing exhaustive sampling frames, reaching adequate response rates, and minimizing other nonresponse biases that might differentially affect respondents at different career stages or in different professional environments or fields of study ( Corley et al., 2011 ; Peters et al., 2008 ; Scheufele et al., 2009 ). As a result, the attempts to date to gather input on topics related to replicability and reproducibility from larger numbers of scientists ( Baker, 2016 ; Boulbes et al., 2018 ) have relied on convenience samples and other methodological choices that limit the conclusions that can be made about attitudes among the larger scientific community or even for specific subfields based on the data from such surveys. More methodologically sound surveys following guidelines on adoption of open science practices and other replicability-related issues are beginning to emerge. 6 See Appendix E for a discussion of conducting reliable surveys of scientists.

Retraction Trends

Retractions of published articles may be related to their non-replicability. As noted in a recent study on retraction trends ( Brainard, 2018 , p. 392), “Overall, nearly 40% of retraction notices did not mention fraud or other kinds of misconduct. Instead, the papers were retracted because of errors, problems with reproducibility [or replicability], and other issues.” Overall, about one-half of all retractions appear to involve fabrication, falsification, or plagiarism. Journal article retractions in biomedicine increased from 50-60 per year in the mid-2000s, to 600-700 per year by the mid-2010s ( National Library of Medicine, 2018 ), and this increase attracted much commentary and analysis (see, e.g., Grieneisen and Zhang, 2012 ).

6 See https://cega.berkeley.edu/resource/the-state-of-social-science-betsy-levy-paluck-bitssannual-meeting-2018 .

A recent comprehensive review of an extensive database of 18,000 retracted papers dating back to the 1970s found that while the number of retractions has grown, the rate of increase has slowed; approximately 4 of every 10,000 papers are now retracted ( Brainard, 2018 ). Overall, the number of journals that report retractions has grown from 44 journals in 1997 to 488 journals in 2016; however, the average number of retractions per journal has remained essentially flat since 1997.

These data suggest that more journals are attending to the problem of articles that need to be retracted rather than a growing problem in any one discipline of science. Fewer than 2 percent of authors in the database account for more than one-quarter of the retracted articles, and the retractions of these frequent offenders are usually based on fraud rather than errors that lead to non-replicability. The Institute of Electrical and Electronics Engineers alone has retracted more than 7,000 abstracts from conferences that took place between 2009 and 2011, most of which had authors based in China ( McCook, 2018 ).

The body of evidence on the extent of non-replicability gathered by the committee is not a comprehensive assessment across all fields of science nor even within any given field of study. Such a comprehensive effort would be daunting due to the vast amount of research published each year and the diversity of scientific and engineering fields. Among studies of replication that are available, there is no uniform approach across scientific fields to gauge replication between two studies. The experts who contributed their perspectives to the committee all question the feasibility of such a science-wide assessment of non-replicability.

While the evidence base assessed by the committee may not be sufficient to permit a firm quantitative answer on the scope of non-replicability, it does support several findings and a conclusion.

FINDING 5-1: There is an uneven level of awareness of issues related to replicability across fields and even within fields of science and engineering.

FINDING 5-2: Efforts to replicate studies aimed at discerning the effect of an intervention in a study population may find a similar direction of effect, but a different (often smaller) size of effect.

FINDING 5-3: Studies that directly measure replicability take substantial time and resources.

FINDING 5-4: Comparing results across replication studies may be compromised because different replication studies may test different study attributes and rely on different standards and measures for a successful replication.

FINDING 5-5: Replication studies in the natural and clinical sciences (general biology, genetics, oncology, chemistry) and social sciences (including economics and psychology) report frequencies of replication ranging from fewer than one out of five studies to more than three out of four studies.

CONCLUSION 5-3: Because many scientists routinely conduct replication tests as part of a follow-on work and do not report replication results separately, the evidence base of non-replicability across all science and engineering research is incomplete.

SOURCES OF NON-REPLICABILITY

Non-replicability can arise from a number of sources. In some cases, non-replicability arises from the inherent characteristics of the systems under study. In others, decisions made by a researcher or researchers in study execution that reasonably differ from the original study, such as judgment calls on data cleaning or selection of parameter values within a model, may also result in non-replication. Other sources of non-replicability arise from conscious or unconscious bias in reporting, mistakes and errors (including misuse of statistical methods), and problems in study design, execution, or interpretation in either the original study or the replication attempt. In many instances, non-replication between two results could be due to a combination of multiple sources, but it is not generally possible to identify the source without careful examination of the two studies. Below, we review these sources of non-replicability and discuss how researchers’ choices can affect each. Unless otherwise noted, the discussion below focuses on the non-replicability between two results (i.e., a one-to-one comparison) when assessed using proximity and uncertainty of both results.

Non-Replicability That Is Potentially Helpful to Science

Non-replicability is a normal part of the scientific process and can be due to the intrinsic variation and complexity of nature, the scope of current scientific knowledge, and the limits of current technologies. Highly surprising and unexpected results are often not replicated by other researchers. In other instances, a second researcher or research team may purposefully make decisions that lead to differences in parts of the study. As long as these differences are reported with the final results, these may be reasonable actions to take yet result in non-replication. In scientific reporting, uncertainties within the study (such as the uncertainty within measurements, the potential interactions between parameters, and the variability of the system under study) are estimated, assessed, characterized, and accounted for through uncertainty and probability analysis. When uncertainties are unknown and not accounted for, this can also lead to non-replicability. In these instances, non-replicability of results is a normal consequence of studying complex systems with imperfect knowledge and tools. When non-replication of results due to sources such as those listed above is investigated and resolved, it can lead to new insights, better uncertainty characterization, and increased knowledge about the systems under study and the methods used to study them. See Box 5-1 for examples of how investigations of non-replication have been helpful to increasing knowledge.

The susceptibility of any line of scientific inquiry to sources of non-replicability depends on many factors, including factors inherent to the system under study, such as the

  • complexity of the system under study;
  • understanding of the number and relations among variables within the system under study;
  • ability to control the variables;
  • levels of noise within the system (or signal to noise ratios);
  • mismatch between the scale of the phenomena and the scale at which they can be measured;
  • stability across time and space of the underlying principles;
  • fidelity of the available measures to the underlying system under study (e.g., direct or indirect measurements); and
  • prior probability (pre-experimental plausibility) of the scientific hypothesis.

Studies that pursue lines of inquiry that are able to better estimate and analyze the uncertainties associated with the variables in the system and control the methods that will be used to conduct the experiment are more replicable. On the other end of the spectrum, studies that are more prone to non-replication often involve indirect measurement of very complex systems (e.g., human behavior) and require statistical analysis to draw conclusions. To illustrate how these characteristics can lead to results that are more or less likely to replicate, consider the attributes of complexity and controllability. The complexity and controllability of a system contribute to the underlying variance of the distribution of expected results and thus the likelihood of non-replication. 7

7 Complexity and controllability in an experimental system affect its susceptibility to non-replicability independently from the way prior odds, power, or p- values associated with hypothesis testing affect the likelihood that an experimental result represents the true state of the world.

The systems that scientists study vary in their complexity. Although all systems have some degree of intrinsic or random variability, some systems are less well understood, and their intrinsic variability is more difficult to assess or estimate. Complex systems tend to have numerous interacting components (e.g., cell biology, disease outbreaks, friction coefficient between two unknown surfaces, urban environments, complex organizations and populations, and human health). Interrelations and interactions among multiple components cannot always be predicted and neither can the resulting effects on the experimental outcomes, so an initial estimate of uncertainty may be an educated guess.

Systems under study also vary in their controllability. If the variables within a system can be known, characterized, and controlled, research on such a system tends to produce more replicable results. For example, in social sciences, a person’s response to a stimulus (e.g., a person’s behavior when placed in a specific situation) depends on a large number of variables—including social context, biological and psychological traits, verbal and nonverbal cues from researchers—all of which are difficult or impossible to control completely. In contrast, a physical object’s response to a physical stimulus (e.g., a liquid’s response to a rise in temperature) depends almost entirely on variables that can either be controlled or adjusted for, such as temperature, air pressure, and elevation. Because of these differences, one expects that studies that are conducted in the relatively more controllable systems will replicate with greater frequency than those that are in less controllable systems. Scientists seek to control the variables relevant to the system under study and the nature of the inquiry, but when these variables are more difficult to control, the likelihood of non-replicability will be higher. Figure 5-2 illustrates the combinations of complexity and controllability.

Many scientific fields have studies that span these quadrants, as demonstrated by the following examples from engineering, physics, and psychology. Veronique Kiermer, PLOS executive editor, in her briefing to the committee noted: “There is a clear correlation between the complexity of the design, the complexity of measurement tools, and the signal to noise ratio that we are trying to measure.” (See also Goodman et al., 2016 , on the complexity of statistical and inferential methods.)

[Figure 5-2, illustrating the four combinations of complexity and controllability, appears here.]

Engineering. Aluminum-lithium alloys were developed by engineers because of their strength-to-weight ratio, primarily for use in aerospace engineering. The process of developing these alloys spans the four quadrants. Early generation of binary alloys was a simple system that showed high replicability (Quadrant A). Second-generation alloys had higher amounts of lithium and resulted in lower replicability that appeared as failures in manufacturing operations because the interactions of the elements were not understood (Quadrant C). The third-generation alloys contained less lithium and higher relative amounts of other alloying elements, which made it a more complex system but better controlled (Quadrant B), with improved replicability. The development of any alloy is subject to a highly controlled environment. Unknown aspects of the system, such as interactions among the components, cannot be controlled initially and can lead to failures. Once these are understood, conditions can be modified (e.g., heat treatment) to bring about higher replicability.

Physics. In physics, measurement of the electronic band gap of semiconducting and conducting materials using scanning tunneling microscopy is a highly controlled, simple system (Quadrant A). The searches for the Higgs boson and gravitational waves were separate efforts, and each required the development of large, complex experimental apparatus and careful characterization of the measurement and data analysis systems (Quadrant B). Some systems, such as radiation portal monitors, require setting thresholds for alarms without knowledge of when or if a threat will ever pass through them; the variety of potential signatures is high and there is little controllability of the system during operation (Quadrant C). Finally, a simple system with little controllability is that of precisely predicting the path of a feather dropped from a given height (Quadrant D).

Psychology. In psychology, Quadrant A includes studies of basic sensory and perceptual processes that are common to all human beings, such as the Purkinje shift (i.e., a change in sensitivity of the human eye under different levels of illumination). Quadrant D includes studies of complex social behaviors that are influenced by culture and context; for example, a study of the effects of a father’s absence on children’s ability to delay gratification revealed stronger effects among younger children ( Mischel, 1961 ).

Inherent sources of non-replicability arise in every field of science, but they can vary widely depending on the specific system undergoing study. When the sources are knowable, or arise from experimental design choices, researchers need to identify and assess these sources of uncertainty insofar as they can be estimated. Researchers also need to report the steps that were intended to reduce uncertainties inherent in the study or that differ from the original study (e.g., data cleaning decisions that resulted in a different final dataset). The committee agrees with those who argue that the testing of assumptions and the characterization of the components of a study are as important to report as are the ultimate results of the study ( Plant and Hanisch, 2018 ), including studies using statistical inference and reporting p-values ( Boos and Stefanski, 2011 ). Every scientific inquiry encounters an irreducible level of uncertainty, whether this is due to random processes in the system under study, limits to researchers’ understanding or ability to control that system, or limitations of the ability to measure. If researchers do not adequately consider and report these uncertainties and limitations, this can contribute to non-replicability.

RECOMMENDATION 5-1: Researchers should, as applicable to the specific study, provide an accurate and appropriate characterization of relevant uncertainties when they report or publish their research. Researchers should thoughtfully communicate all recognized uncertainties and estimate or acknowledge other potential sources of uncertainty that bear on their results, including stochastic uncertainties and uncertainties in measurement, computation, knowledge, modeling, and methods of analysis.

Unhelpful Sources of Non-Replicability

Non-replicability can also be the result of human error or poor researcher choices. Shortcomings in the design, conduct, and communication of a study may all contribute to non-replicability.

These defects may arise at any point along the process of conducting research, from design and conduct to analysis and reporting, and errors may be made because the researcher was ignorant of best practices, was sloppy in carrying out research, made a simple error, or had unconscious bias toward a specific outcome. Whether arising from lack of knowledge, perverse incentives, sloppiness, or bias, these sources of non-replicability warrant continued attention because they reduce the efficiency with which science progresses, and time spent resolving non-replicability issues that are caused by these sources does not add to scientific understanding. That is, they are unhelpful in making scientific progress. We consider here a selected set of such avoidable sources of non-replication:

  • publication bias
  • misaligned incentives
  • inappropriate statistical inference
  • poor study design
  • incomplete reporting of a study

We will discuss each source in turn.

Publication Bias

Both researchers and journals want to publish new, innovative, ground-breaking research. The publication preference for statistically significant, positive results produces a biased literature through the exclusion of statistically nonsignificant results (i.e., those that do not show an effect that is sufficiently unlikely if the null hypothesis is true). As noted in Chapter 2 , there is great pressure to publish in high-impact journals and for researchers to make new discoveries. Furthermore, it may be difficult for researchers to publish even robust nonsignificant results, except in circumstances where the results contradict what has come to be an accepted positive effect. Replication studies and studies with valuable data but inconclusive results may be similarly difficult to publish. This publication bias results in a published literature that does not reflect the full range of evidence about a research topic.

One powerful example is a set of clinical studies performed on the effectiveness of tamoxifen, a drug used to treat breast cancer. In a systematic review (see Chapter 7 ) of the drug’s effectiveness, 23 clinical trials were reviewed; 22 of the 23 studies did not reach statistical significance at the criterion of p < 0.05, yet the cumulative review of the set of studies showed a large effect (a reduction of 16% [±3] in the odds of death among women of all ages assigned to tamoxifen treatment [ Peto et al., 1988 , p. 1684]).

Another approach to quantifying the extent of non-replicability is to model the false discovery rate—that is, the number of research results that are expected to be “false.” Ioannidis (2005) developed a simulation model to do so for studies that rely on statistical hypothesis testing, incorporating the pre-study (i.e., prior) odds, the statistical tests of significance, investigator bias, and other factors. Ioannidis concluded, and used as the title of his paper, that “most published research findings are false.” Some researchers have criticized Ioannidis’s assumptions and mathematical argument ( Goodman and Greenland, 2007 ); others have pointed out that the takeaway message is that any initial results that are statistically significant need further confirmation and validation.
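The arithmetic behind this conclusion is compact. A simplified, bias-free version of the model computes the probability that a statistically significant finding reflects a true effect (the positive predictive value) from the pre-study odds, statistical power, and significance level; the inputs below are illustrative choices of ours, not values from Ioannidis’s paper.

```python
# Simplified (bias-free) form of the Ioannidis (2005) calculation:
#   PPV = (1 - beta) * R / ((1 - beta) * R + alpha)
# where R is the pre-study odds that a tested hypothesis is true,
# alpha is the significance level, and (1 - beta) is the statistical power.
def positive_predictive_value(prior_odds, power, alpha=0.05):
    true_positives = power * prior_odds
    false_positives = alpha  # per unit of null hypotheses tested
    return true_positives / (true_positives + false_positives)

# Illustrative (hypothetical) scenarios:
print(positive_predictive_value(prior_odds=1.0, power=0.80))  # ~0.94: plausible hypotheses, high power
print(positive_predictive_value(prior_odds=0.1, power=0.20))  # ~0.29: exploratory field, low power
```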

Analyzing the distribution of published results for a particular line of inquiry can offer insights into potential bias, which can relate to the rate of non-replicability. Several tools are being developed to compare a distribution of results to what that distribution would look like if all claimed effects were representative of the true distribution of effects. Figure 5-3 shows how publication bias can result in a skewed view of the body of evidence when only positive results that meet the statistical significance threshold are reported. When a new study fails to replicate the previously published results—for example, if a study finds no relationship between variables when such a relationship had been shown in previously published studies—it appears to be a case of non-replication. However, if the published literature is not an accurate reflection of the state of the evidence because only positive results are regularly published, the new study could actually have replicated previous but unpublished negative results. 8
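A small simulation (ours, with arbitrary parameter values) illustrates the mechanism: when only statistically significant estimates reach the literature, the published record overstates the true effect, and an accurate new study can then look like a failed replication.

```python
# Simulate many small studies of the same true effect, then "publish" only
# those that reach p < 0.05. All parameters are arbitrary illustrations.
import random
import statistics

random.seed(1)
true_effect, se, n_studies = 0.10, 0.10, 10_000
crit = 1.96 * se  # estimates beyond this are "statistically significant"

estimates = [random.gauss(true_effect, se) for _ in range(n_studies)]
published = [e for e in estimates if abs(e) > crit]

print("true effect:", true_effect)
print("mean of all estimates:      ", round(statistics.mean(estimates), 3))
print("mean of published estimates:", round(statistics.mean(published), 3))
print("share of studies published: ", round(len(published) / n_studies, 2))
```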

Several techniques are available to detect and potentially adjust for publication bias, all of which are based on the examination of a body of research as a whole (i.e., cumulative evidence), rather than individual replication studies (i.e., one-on-one comparison between studies). These techniques cannot determine which of the individual studies are affected by bias (i.e., which results are false positives) or identify the particular type of bias, but they arguably allow one to identify bodies of literature that are likely to be more or less accurate representations of the evidence. The techniques, discussed below, are funnel plots, a p -curve test of excess significance, and assessing unpublished literature.

Funnel Plots. One of the most common approaches to detecting publication bias involves constructing a funnel plot that displays each effect size against its precision (e.g., sample size of study). Asymmetry in the plotted values can reveal the absence of studies with small effect sizes, especially in studies with small sample sizes—a pattern that could suggest publication/selection bias for statistically significant effects (see Figure 5-3 ). There are criticisms of funnel plots, however: some argue that the shape of a funnel plot is largely determined by the choice of method ( Tang and Liu, 2000 ), and others maintain that funnel plot asymmetry may not accurately reflect publication bias ( Lau et al., 2006 ).

[Figure 5-3 appears here.]

8 Earlier in this chapter, we discuss an indirect method for assessing non-replicability in which a result is compared to previously published values; results that do not agree with the published literature are identified as outliers. If the published literature is biased, this method would inappropriately reject valid results. This is another reason for investigating outliers before rejecting them.
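A common quantitative companion to the funnel plot is an Egger-type regression test, which regresses each study’s standardized effect (effect divided by its standard error) on its precision (one divided by the standard error); an intercept far from zero suggests asymmetry. The sketch below uses fabricated study results and is subject to the same caveats as the funnel plot itself.

```python
# Egger-style regression test for funnel-plot asymmetry (illustrative sketch).
# Inputs are fabricated effect sizes and standard errors for a set of studies.
import numpy as np

effects = np.array([0.42, 0.35, 0.51, 0.28, 0.22, 0.19, 0.15])  # fictional
ses     = np.array([0.20, 0.18, 0.25, 0.12, 0.10, 0.08, 0.05])  # fictional

z = effects / ses          # standardized effects
precision = 1.0 / ses      # study precision

# Regress z on precision; the intercept estimates asymmetry.
X = np.column_stack([np.ones_like(precision), precision])
(intercept, slope), *_ = np.linalg.lstsq(X, z, rcond=None)

print(f"Egger intercept: {intercept:.2f} (values far from 0 suggest asymmetry)")
print(f"Slope (precision-adjusted effect estimate): {slope:.2f}")
```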

P -Curve. One fairly new approach is to compare the distribution of results (e.g., p- values) to the expected distributions (see Simonsohn et al., 2014a , 2014b ). P- curve analysis tests whether the distribution of statistically significant p- values shows a pronounced right-skew, 9 as would be expected when the results are true effects (i.e., the null hypothesis is false), or whether the distribution is not as right-skewed (or is even flat, or, in the most extreme cases, left-skewed), as would be expected when the original results do not reflect the proportion of real effects ( Gadbury and Allison, 2012 ; Nelson et al., 2018 ; Simonsohn et al., 2014a ).
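A rough way to convey the intuition (a simplified stand-in for the full p-curve method, not the procedure of Simonsohn et al.) is to note that, when there is no true effect, p-values that happen to fall below .05 are uniformly distributed over that interval, so markedly more than half of them falling below .025 indicates the right-skew expected of genuine effects. The p-values below are invented.

```python
# Crude right-skew check on a set of statistically significant p-values
# (a simplified stand-in for a full p-curve analysis; data are invented).
from scipy.stats import binomtest

significant_p_values = [0.003, 0.011, 0.004, 0.021, 0.001,
                        0.017, 0.008, 0.030, 0.002, 0.006]

low = sum(p < 0.025 for p in significant_p_values)
# Under the null of no true effects, each significant p-value is equally
# likely to fall below or above .025.
result = binomtest(low, n=len(significant_p_values), p=0.5, alternative="greater")
print(f"{low}/{len(significant_p_values)} p-values below .025; "
      f"binomial test p = {result.pvalue:.3f}")
```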

Test of Excess Significance. A closely related statistical idea for checking publication bias is the test of excess significance. This test evaluates whether the number of statistically significant results in a set of studies is improbably high given the size of the effect and the power to test it in the set of studies ( Ioannidis and Trikalinos, 2007 ), which would imply that the set of results is biased and may include exaggerated results or false positives. When there is a true effect, one expects the proportion of statistically significant results to be equal to the statistical power of the studies. If a researcher designs her studies to have 80 percent power against a given effect, then about 80 percent of her studies would be expected to produce statistically significant results if the true effect is that large (fewer if the effect is smaller or the null hypothesis is sometimes true). Schimmack (2012) has demonstrated that the proportion of statistically significant results across a set of psychology studies often far exceeds the estimated statistical power of those studies; this pattern of results that is “too good to be true” suggests that the results were either not obtained following the rules of statistical inference (i.e., conducting a single statistical test that was chosen a priori), that not all attempted studies were reported (i.e., there is a “file drawer” of statistically nonsignificant studies that do not get published), or that the results were p-hacked or cherry picked (see Chapter 2 ).
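In the same spirit, the logic of the excess significance test can be sketched as a comparison between the observed number of statistically significant results and the number expected from the studies’ estimated power; the counts and power below are invented for illustration.

```python
# Sketch of a test of excess significance (Ioannidis and Trikalinos, 2007),
# using invented numbers: 28 of 30 studies significant despite ~50% power.
from scipy.stats import binom

n_studies = 30
observed_significant = 28
estimated_power = 0.50  # average power of the studies against the observed effect

expected_significant = estimated_power * n_studies
# Probability of seeing this many (or more) significant results if the
# studies' true success rate equals their estimated power.
p_excess = binom.sf(observed_significant - 1, n_studies, estimated_power)

print(f"expected ~{expected_significant:.0f} significant results, "
      f"observed {observed_significant}")
print(f"probability of that many or more by chance: {p_excess:.2g}")
```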

In many fields, the proportion of published papers that report a positive (i.e., statistically significant) result is around 90 percent ( Fanelli, 2012 ). This raises concerns when combined with the observation that most studies have far less than 90 percent statistical power (i.e., would only successfully detect an effect, assuming an effect exists, far less than 90% of the time) ( Button et al., 2013 ; Fraley and Vazire, 2014 ; Szucs and Ioannidis, 2017 ; Yarkoni, 2009 ; Stanley et al., 2018 ).

9 Distributions that have more p -values of low value than high are referred to as “right-skewed.” Similarly, “left-skewed” distributions have more p -values of high than low value.

Some researchers believe that the publication of false positives is common and that reforms are needed to reduce this. Others believe that there has been an excessive focus on Type I errors (i.e., false positives) in hypothesis testing at the possible expense of an increase in Type II errors (i.e., false negatives, or failing to confirm true hypotheses) ( Fiedler et al., 2012 ; Finkel et al., 2015 ; LeBel et al., 2017 ).

Assessing Unpublished Literature. One approach to countering publication bias is to search for and include unpublished papers and results when conducting a systematic review of the literature. Such comprehensive searches are not standard practice. For medical reviews, one estimate is that only 6 percent of reviews included unpublished work ( Hartling et al., 2017 ), although another found that 50 percent of reviews did so ( Ziai et al., 2017 ). In economics, there is a large and active group of researchers collecting and sharing “grey” literature, research results outside of peer reviewed publications ( Vilhuber, 2018 ). In psychology, an estimated 75 percent of reviews included unpublished research ( Rothstein, 2006 ). Unpublished but recorded studies (such as dissertation abstracts, conference programs, and research aggregation websites) may become easier for reviewers to access with computerized databases and with the availability of preprint servers. When a review includes unpublished studies, researchers can directly compare their results with those from the published literature, thereby estimating file-drawer effects.

Misaligned Incentives

Academic incentives—such as tenure, grant money, and status—may influence scientists to compromise on good research practices ( Freeman, 2018 ). Faculty hiring, promotion, and tenure decisions are often based in large part on the “productivity” of a researcher, such as the number of publications, number of citations, and amount of grant money received ( Edwards and Roy, 2017 ). Some have suggested that these incentives can lead researchers to ignore standards of scientific conduct, rush to publish, and overemphasize positive results ( Edwards and Roy, 2017 ). Formal models have shown how these incentives can lead to high rates of non-replicable results ( Smaldino and McElreath, 2016 ). Many of these incentives may be well intentioned, but they could have the unintended consequence of reducing the quality of the science produced, and poorer quality science is less likely to be replicable.

Although it is difficult to assess how widespread these unhelpful sources of non-replicability are, factors such as publication bias toward results qualifying as “statistically significant” and misaligned incentives for academic scientists create conditions that favor the publication of non-replicable results and inferences.

Inappropriate Statistical Inference

Confirmatory research is research that starts with a well-defined research question and a priori hypotheses before collecting data; confirmatory research can also be called hypothesis testing research. In contrast, researchers pursuing exploratory research collect data and then examine the data for potential variables of interest and relationships among variables, forming a posteriori hypotheses; as such, exploratory research can be considered hypothesis generating research. Exploratory and confirmatory analyses are often described as two different stages of the research process. Some have distinguished between the “context of discovery” and the “context of justification” ( Reichenbach, 1938 ), while others have argued that the distinction is on a spectrum rather than categorical. Regardless of the precise line between exploratory and confirmatory research, researchers’ choices between the two affects how they and others interpret the results.

A fundamental principle of hypothesis testing is that the same data that were used to generate a hypothesis cannot be used to test that hypothesis ( de Groot, 2014 ). In confirmatory research, the details of how a statistical hypothesis test will be conducted must be decided before looking at the data on which it is to be tested. When this principle is violated, significance testing, confidence intervals, and error control are compromised. Thus, it cannot be assured that false positives are controlled at a fixed rate. In short, when exploratory research is interpreted as if it were confirmatory research, there can be no legitimate statistically significant result.
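One simple way to honor this principle is to partition the data before exploration, generate hypotheses on one portion, and reserve the held-out portion for a single prespecified confirmatory test. The sketch below illustrates that workflow with simulated data; the variable names and threshold are ours.

```python
# Split data so the portion used to generate a hypothesis is never the
# portion used to test it. Data and threshold are purely illustrative.
import random
from scipy.stats import pearsonr

random.seed(42)
# Fictional dataset: one outcome and several candidate predictors.
n = 200
data = [{"y": random.gauss(0, 1), "x1": random.gauss(0, 1),
         "x2": random.gauss(0, 1), "x3": random.gauss(0, 1)} for _ in range(n)]

random.shuffle(data)
explore, confirm = data[:n // 2], data[n // 2:]

def corr(rows, var):
    """Correlation (and p-value) between a predictor and the outcome."""
    return pearsonr([r[var] for r in rows], [r["y"] for r in rows])

# Exploratory half: look for the predictor most correlated with the outcome.
best = max(["x1", "x2", "x3"], key=lambda v: abs(corr(explore, v)[0]))
print("hypothesis generated on exploratory half:", best)

# Confirmatory half: a single pre-specified test of that one hypothesis.
r, p = corr(confirm, best)
print(f"confirmatory test: r = {r:.2f}, p = {p:.3f}")
```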

Researchers often learn from their data, and some of the most important discoveries in the annals of science have come from unexpected results that did not fit any prior theory. For example, Arno Allan Penzias and Robert Woodrow Wilson found unexpected noise in data collected in the course of their work on microwave receivers for radio astronomy observations. After attempts to explain the noise failed, the “noise” was eventually determined to be cosmic microwave background radiation, and these results helped scientists to refine and confirm theories about the “big bang.” While exploratory research generates new hypotheses, confirmatory research is equally important because it tests the hypotheses generated and can give valid answers as to whether these hypotheses have any merit. Exploratory and confirmatory research are essential parts of science, but they need to be understood and communicated as two separate types of inquiry, with two different interpretations.

A well-conducted exploratory analysis can help illuminate possible hypotheses to be examined in subsequent confirmatory analyses. Even a stark result in an exploratory analysis has to be interpreted cautiously, pending further work to test the hypothesis using a new or expanded dataset. It is often unclear from publications whether the results came from an exploratory or a confirmatory analysis. This lack of clarity can misrepresent the reliability and broad applicability of the reported results.

In Chapter 2 , we discussed the meaning, overreliance, and frequent misunderstanding of statistical significance, including misinterpreting the meaning and overstating the utility of a particular threshold, such as p < 0.05. More generally, a number of flaws in design and reporting can reduce the reliability of a study’s results.

Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives ( John et al., 2012 ; Munafo et al., 2017 ). A study from the late 1980s gives a striking example of how such post hoc analysis can be misleading. The International Study of Infarct Survival was a large-scale, international, randomized trial that examined the potential benefit of aspirin for patients who had had a heart attack. After data collection and analysis were complete, the publishing journal asked the researchers to do additional analysis to see if certain subgroups of patients benefited more or less from aspirin. Richard Peto, one of the researchers, refused to do so because of the risk of finding invalid but seemingly significant associations. In the end, Peto relented and performed the analysis, but with a twist: he also included a post hoc analysis that divided the patients into the twelve astrological signs, and found that Geminis and Libras did not benefit from aspirin, while Capricorns benefited the most ( Peto, 2011 ). This obviously spurious relationship illustrates the dangers of analyzing data with hypotheses and subgroups that were not prespecified.
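The hazard Peto was dramatizing is easy to reproduce: when there is no true effect anywhere, testing a dozen arbitrary subgroups will regularly turn up at least one “significant” one. The simulation below is ours, with arbitrary settings.

```python
# With no true effect, post hoc testing of 12 arbitrary subgroups frequently
# yields at least one "significant" result. All settings are arbitrary.
import random
from scipy.stats import ttest_ind

random.seed(0)
n_trials, n_per_group = 1000, 40
trials_with_false_positive = 0

for _ in range(n_trials):
    found = False
    for _subgroup in range(12):  # e.g., twelve astrological signs
        treated = [random.gauss(0, 1) for _ in range(n_per_group)]
        control = [random.gauss(0, 1) for _ in range(n_per_group)]
        if ttest_ind(treated, control).pvalue < 0.05:
            found = True
            break
    trials_with_false_positive += found

print(f"At least one 'significant' subgroup in "
      f"{trials_with_false_positive / n_trials:.0%} of simulated trials")
```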

Little information is available about the prevalence of such inappropriate statistical practices as p-hacking, cherry picking, and hypothesizing after results are known (HARKing), discussed below. While surveys of researchers raise the issue—often using convenience samples—methodological shortcomings mean that they are not necessarily a reliable source for a quantitative assessment. 10

P-hacking and Cherry Picking. P-hacking is the practice of collecting, selecting, or analyzing data until a result of statistical significance is found. Different ways to p-hack include stopping data collection once p ≤ 0.05 is reached, analyzing many different relationships and only reporting those for which p ≤ 0.05, varying the exclusion and inclusion rules for data so that p ≤ 0.05, and analyzing different subgroups in order to get p ≤ 0.05. Researchers may p-hack without knowing or without understanding the consequences (Head et al., 2015). This is related to the practice of cherry picking, in which researchers may (unconsciously or deliberately) pick through their data and results and selectively report those that meet criteria such as meeting a threshold of statistical significance or supporting a positive result, rather than reporting all of the results from their research.

10 For an example of one study of this issue, see Fraser et al. (2018).
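As a rough sketch of the first route listed above, optional stopping, the simulation below "collects" batches of null data and stops as soon as p ≤ 0.05; the batch size and maximum sample size are arbitrary illustrative choices.

```python
# Sketch of one p-hacking route: optional stopping. The data are pure noise;
# peeking after every batch and stopping at p <= 0.05 inflates the
# false-positive rate well beyond the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, batch, max_n, alpha = 2000, 10, 100, 0.05
false_positives = 0

for _ in range(n_sims):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.standard_normal(batch))   # "treatment" group (no effect)
        b.extend(rng.standard_normal(batch))   # "control" group
        if stats.ttest_ind(a, b).pvalue <= alpha:
            false_positives += 1               # stop and declare a finding
            break

print(f"False-positive rate with optional stopping: {false_positives / n_sims:.2f}")
# Typically around 0.2 with these settings, versus 0.05 if the sample size
# had been fixed in advance and the data analyzed only once.
```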

HARKing. Confirmatory research begins with identifying a hypothesis based on observations, exploratory analysis, or building on previous research. Data are collected and analyzed to see if they support the hypothesis. HARKing applies to confirmatory research that incorrectly bases the hypothesis on the data collected and then uses that same data as evidence to support the hypothesis. It is unknown to what extent inappropriate HARKing occurs in various disciplines, but some have attempted to quantify the consequences of HARKing. For example, a 2015 article compared hypothesized effect sizes against non-hypothesized effect sizes and found that effects were significantly larger when the relationships had been hypothesized, a finding consistent with the presence of HARKing ( Bosco et al., 2015 ).

Poor Study Design

Before conducting an experiment, a researcher must make a number of decisions about study design. These decisions—which vary depending on type of study—could include the research question, the hypotheses, the variables to be studied, avoiding potential sources of bias, and the methods for collecting, classifying, and analyzing data. Researchers’ decisions at various points along this path can contribute to non-replicability. Poor study design can include not recognizing or adjusting for known biases, not following best practices in terms of randomization, poorly designing materials and tools (ranging from physical equipment to questionnaires to biological reagents), confounding in data manipulation, using poor measures, or failing to characterize and account for known uncertainties.

In 2010, economists Carmen Reinhart and Kenneth Rogoff published an article that showed that if a country's debt exceeds 90 percent of the country's gross domestic product, economic growth slows and declines slightly (by 0.1 percent). These results were widely publicized and used to support austerity measures around the world (Herndon et al., 2013). However, in 2013, with access to Reinhart and Rogoff's original spreadsheet of data and analysis (which the authors had saved and made available for the replication effort), researchers reanalyzing the original studies found several errors in the analysis and data selection. One error was an incomplete set of countries used in the analysis that established the relationship between debt and economic growth. When data from Australia, Austria, Belgium, Canada, and Denmark were correctly included, and other errors were corrected, economic growth in the countries with debt above 90 percent of gross domestic product was actually +2.2 percent, rather than −0.1 percent. In response, Reinhart and Rogoff acknowledged the errors, calling it "sobering that such an error slipped into one of our papers despite our best efforts to be consistently careful." Reinhart and Rogoff said that while the error led to a "notable change" in the calculation of growth in one category, they did not believe it "affects in any significant way the central message of the paper." 11

The Reinhart and Rogoff error was fairly high profile, and a quick Internet search would let any interested reader know that the original paper contained errors. Many errors, however, go undetected or are acknowledged only through a brief correction in the publishing journal. A 2015 study looked at a sample of more than 250,000 p-values reported in eight major psychology journals over a period of 28 years. The study found that many of the p-values reported in papers were inconsistent with a recalculation of the p-value and that in one out of eight papers, this inconsistency was large enough to affect the statistical conclusion (Nuijten et al., 2016).
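The underlying consistency check is simple to sketch for a reported t-test: recompute the two-sided p-value from the reported statistic and degrees of freedom and compare it with the reported p. The values below are hypothetical, and the fixed tolerance is a stand-in for the more careful rounding rules a real checking tool would apply.

```python
# Sketch of the recalculation behind such consistency checks: given a reported
# t statistic and degrees of freedom, recompute the implied two-sided p-value
# and flag reports where it disagrees with the reported p. Illustrative only.
from scipy import stats

def check_t_report(t_value, df, reported_p, tol=0.005):
    """Return (recomputed_p, consistent) for a two-sided t-test report."""
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)
    return recomputed_p, abs(recomputed_p - reported_p) <= tol

# Hypothetical reported results: "t(28) = 2.20, p = .04" and "t(28) = 1.70, p = .03"
for t_value, df, reported_p in [(2.20, 28, 0.04), (1.70, 28, 0.03)]:
    recomputed, ok = check_t_report(t_value, df, reported_p)
    print(f"t({df}) = {t_value}, reported p = {reported_p}, "
          f"recomputed p = {recomputed:.3f} -> {'consistent' if ok else 'inconsistent'}")
```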

Errors can occur at any point in the research process: measurements can be recorded inaccurately, typographical errors can occur when inputting data, and calculations can contain mistakes. If these errors affect the final results and are not caught prior to publication, the research may be non-replicable. Unfortunately, these types of errors can be difficult to detect. In the case of computational errors, transparency in data and computation may make it more likely that the errors can be caught and corrected. For other errors, such as mistakes in measurement, errors might not be detected until and unless a failed replication that does not make the same mistake indicates that something was amiss in the original study. Errors may also be made by researchers despite their best intentions (see Box 5-2 ).

Incomplete Reporting of a Study

During the course of research, researchers make numerous choices about their studies. When a study is published, some of these choices are reported in the methods section. A methods section often covers what materials were used, how participants or samples were chosen, what data collection procedures were followed, and how data were analyzed. The failure to report some aspect of the study—or to do so in sufficient detail—may make it difficult for another researcher to replicate the result. For example, if a researcher only reports that she "adjusted for comorbidities" within the study population, this does not provide sufficient information about exactly how the comorbidities were adjusted for, and it does not give enough guidance for future researchers to follow the protocol. Similarly, if a researcher does not give adequate information about the biological reagents used in an experiment, a second researcher may have difficulty replicating the experiment. Even if a researcher reports all of the critical information about the conduct of a study, other seemingly inconsequential details that have an effect on the outcome could remain unreported.

11 See https://archive.nytimes.com/www.nytimes.com/interactive/2013/04/17/business/17economixresponse.html .

Just as reproducibility requires transparent sharing of data, code, and analysis, replicability requires transparent sharing of how an experiment was conducted and the choices that were made. This allows future researchers, if they wish, to attempt replication as close to the original conditions as possible.

Fraud and Misconduct

At the extreme, sources of non-replicability that do not advance scientific knowledge—and do much to harm science—include misconduct and fraud in scientific research. Instances of fraud are uncommon but can be sensational. Despite fraud's infrequent occurrence, and regardless of how highly publicized cases may be, the fact that it is uniformly bad for science means that it is worthy of attention within this study.

Researchers who knowingly use questionable research practices with the intent to deceive are committing misconduct or fraud. It can be difficult in practice to differentiate between honest mistakes and deliberate misconduct because the underlying action may be the same while the intent is not.

Reproducibility and replicability emerged as general concerns in science around the same time as research misconduct and detrimental research practices were receiving renewed attention. Interest in both reproducibility and replicability as well as misconduct was spurred by some of the same trends and a small number of widely publicized cases in which discovery of fabricated or falsified data was delayed, and the practices of journals, research institutions, and individual labs were implicated in enabling such delays ( National Academies of Sciences, Engineering, and Medicine, 2017 ; Levelt Committee et al., 2012 ).

In the case of Anil Potti at Duke University, a researcher using genomic analysis on cancer patients was later found to have falsified data. This experience prompted the study and the report, Evolution of Translational Omics: Lessons Learned and the Way Forward (Institute of Medicine, 2012), which in turn led to new guidelines for omics research at the National Cancer Institute. Around the same time, in a case that came to light in the Netherlands, social psychologist Diederik Stapel had gone from manipulating to fabricating data over the course of a career with dozens of fraudulent publications. Similarly, highly publicized concerns about misconduct by Cornell University professor Brian Wansink highlight how consistent failure to adhere to best practices for collecting, analyzing, and reporting data—intentional or not—can blur the line between helpful and unhelpful sources of non-replicability. In this case, a Cornell faculty committee ascribed to Wansink: "academic misconduct in his research and scholarship, including misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship." 12

A subsequent report, Fostering Integrity in Research ( National Academies of Sciences, Engineering, and Medicine, 2017 ), emerged in this context, and several of its central themes are relevant to questions posed in this report.

According to the definition adopted by the U.S. federal government in 2000, research misconduct is fabrication of data, falsification of data, or plagiarism "in proposing, performing, or reviewing research, or in reporting research results" (Office of Science and Technology Policy, 2000, p. 76262). The federal policy requires that research institutions report all allegations of misconduct in research projects supported by federal funding that have advanced from the inquiry stage to a full investigation, and to report on the results of those investigations.

12 See http://statements.cornell.edu/2018/20180920-statement-provost-michael-kotlikoff.cfm .

Other detrimental research practices (see National Academies of Sciences, Engineering, and Medicine, 2017) include failing to follow sponsor requirements or disciplinary standards for retaining data, authorship misrepresentation other than plagiarism, refusing to share data or methods, and misleading statistical analysis that falls short of falsification. In addition to the behaviors of individual researchers, detrimental research practices also include actions taken by organizations, such as failure on the part of research institutions to maintain adequate policies, procedures, or capacity to foster research integrity and assess research misconduct allegations, and abusive or irresponsible publication practices by journal editors and peer reviewers.

Just as information on rates of non-reproducibility and non-replicability in research is limited, knowledge about research misconduct and detrimental research practices is scarce. Reports of research misconduct allegations and findings are released by the National Science Foundation Office of Inspector General and the Department of Health and Human Services Office of Research Integrity (see National Science Foundation, 2018d ). As discussed above, new analyses of retraction trends have shed some light on the frequency of occurrence of fraud and misconduct. Allegations and findings of misconduct increased from the mid-2000s to the mid-2010s but may have leveled off in the past few years.

Analysis of retractions of scientific articles in journals may also shed some light on the problem ( Steen et al., 2013 ). One analysis of biomedical articles found that misconduct was responsible for more than two-thirds of retractions ( Fang et al., 2012 ). As mentioned earlier, a wider analysis of all retractions of scientific papers found about one-half attributable to misconduct or fraud ( Brainard, 2018 ). Others have found some differences according to discipline ( Grieneisen and Zhang, 2012 ).

One theme of Fostering Integrity in Research is that research misconduct and detrimental research practices are a continuum of behaviors ( National Academies of Sciences, Engineering, and Medicine, 2017 ). While current policies and institutions aimed at preventing and dealing with research misconduct are certainly necessary, detrimental research practices likely arise from some of the same causes and may cost the research enterprise more than misconduct does in terms of resources wasted on the fabricated or falsified work, resources wasted on following up this work, harm to public health due to treatments based on acceptance of incorrect clinical results, reputational harm to collaborators and institutions, and others.

No branch of science is immune to research misconduct, and the committee did not find any basis to differentiate the relative level of occurrence in various branches of science. Some but not all researcher misconduct has been uncovered through reproducibility and replication attempts, which are the self-correcting mechanisms of science. From the available evidence, documented cases of researcher misconduct are relatively rare, as suggested by a rate of retractions in scientific papers of approximately 4 in 10,000 (Brainard, 2018).

CONCLUSION 5-4: The occurrence of non-replicability is due to multiple sources, some of which impede and others of which promote progress in science. The overall extent of non-replicability is an inadequate indicator of the health of science.


One of the pathways by which the scientific community confirms the validity of a new scientific discovery is by repeating the research that produced it. When a scientific effort fails to independently confirm the computations or results of a previous study, some fear that it may be a symptom of a lack of rigor in science, while others argue that such an observed inconsistency can be an important precursor to new discovery.

Concerns about reproducibility and replicability have been expressed in both scientific and popular media. As these concerns came to light, Congress requested that the National Academies of Sciences, Engineering, and Medicine conduct a study to assess the extent of issues related to reproducibility and replicability and to offer recommendations for improving rigor and transparency in scientific research.

Reproducibility and Replicability in Science defines reproducibility and replicability and examines the factors that may lead to non-reproducibility and non-replicability in research. Unlike the typical expectation of reproducibility between two computations, expectations about replicability are more nuanced, and in some cases a lack of replicability can aid the process of scientific discovery. This report provides recommendations to researchers, academic institutions, journals, and funders on steps they can take to improve reproducibility and replicability in science.



Genuine replication and pseudoreplication: what’s the difference?

By Stanley E. Lazic ( @StanLazic )

Replication is a key idea in science and statistics, but is often misunderstood by researchers because they receive little education or training on experimental design. Consequently, the wrong entity is replicated in many experiments, leading to pseudoreplication or the “unit of analysis” problem [1,2]. This results in exaggerated sample sizes and a potential increase in both false positives and false negatives – the worst of all possible worlds.

Replication can mean many things

Replication is not always easy to understand because many parts of an experiment can be replicated, and a non-exhaustive list includes:

  • Replicating the measurements taken on a set of samples. Examples include taking two blood pressure readings on each person or dividing a blood sample into two aliquots and measuring the concentration of a substance in each aliquot.
  • Replicating the application of a treatment or intervention to a biological entity of interest. This is the traditional way of increasing the sample size, by increasing the number of treatment–entity pairs; for example, the number of times a drug or vehicle control is randomly and independently applied to a set of rats.


  • Replicating the experimental procedure under different conditions. Repeating the experimental procedure several times, but where a known source of variation is present on each occasion. An example is a multi-centre clinical trial where differences between centres may exist. Another example is a large animal experiment that is broken down into two smaller experiments to make it manageable, and each smaller experiment is run by a different technician.
  • Replicating the experiment by independent researchers. Repeating the whole experiment by researchers that were not part of the initial experiment. This occurs when a paper is published and others try to obtain the same results.

To add to the confusion, terms with related meanings exist, such as repeatability, reproducibility, and replicability. Furthermore, the reasons for having or increasing replication are diverse and include a need to increase statistical power, a desire to make the results more generalisable, or the result of a practical constraint, such as an inability to recruit enough patients in one centre and so multiple centres are needed.

Requirements for genuine replication

How do you design an experiment to have genuine replication and not pseudoreplication? First, ensure that replication is at the level of the biological question or scientific hypothesis. For example, to test the effectiveness of a drug in rats, give the drug to multiple rats, and compare the result with other rats that received a control treatment (corresponding to example 2 above). Multiple measurements on each rat (example 1 above) do not count towards genuine replication.

To test if a drug kills proliferating cells in a well compared to a control condition, you will need multiple drug and control wells, since the drug is applied on a per-well basis. But you may worry that the results from a single experimental run will not generalise – even if you can perform a valid statistical test – because results from in vitro experiments can be highly variable. You could then repeat the experiment four times (corresponding to example 3 above), and the sample size is now four, not the total number of wells that were used across all of the experimental runs. This second option requires more work, will take longer, and will usually have lower power, but it provides a more robust result because the experimenter’s ability to reproduce the treatment effect across multiple experimental runs has been replicated.

To test if pre-registered studies report different effect sizes from traditional studies that are not pre-registered, you will need multiple studies of both types (corresponding to example 5 above). The number of subjects in each of these studies is irrelevant for testing this study-level hypothesis.

Replication at the level of the question or hypothesis is a necessary but not sufficient condition for genuine replication – three criteria must be satisfied [1,3]:

  • For experiments, the biological entities of interest must be randomly and independently assigned to treatment groups. If this criterion holds, the biological entities are also called the experimental units [1,3].
  • The treatment(s) should be applied independently to each experimental unit. Injecting animals with a drug is an independent application of a treatment, whereas putting the drug in the drinking water shared by all animals in a cage is not.
  • The experimental units should not influence each other, especially on the measured outcome variables. This criterion is often impossible to verify – how do you prove that the aggressive behaviour of one rat in a cage is not influencing the behaviour of the other rats?

It follows that cells in a well or neurons in a brain or slice culture can rarely be considered genuine replicates because the above criteria are unlikely to be met, whereas fish in a tank, rats in a cage, or pigs in a pen could be genuine replicates in some cases but not in others. If the criteria are not met, the solution is to replicate one level up in the biological or technical hierarchy. For example, if you're interested in the effect of a drug on cells in an in vitro experiment, but cannot use cells as genuine replicates, then the number of wells can be the replicates, and the measurements on cells within a well can be averaged so that the number of data points corresponds to the number of wells, that is, the sample size. (Hierarchical or multi-level models can also be used and don't require values to be averaged because they take the structure of the data into account, but they are harder to implement and interpret compared with averaging followed by simpler statistical methods.) Similarly, if rats in a cage cannot be considered genuine replicates, then calculating a cage-averaged value and using cages as genuine replicates is an appropriate solution (or a multi-level model).
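A small simulation sketch of that averaging advice (well counts, cell counts, and variance values are illustrative assumptions): with no true drug effect but genuine well-to-well variation, treating every cell as an independent replicate inflates the false-positive rate, whereas analyzing one mean per well keeps it near the nominal 5%.

```python
# Sketch: cells nested in wells, treatment applied per well, no true effect.
# Pseudoreplication (cells as N) versus the correct unit of analysis (wells as N).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, wells_per_group, cells_per_well, alpha = 2000, 6, 50, 0.05
sigma_well, sigma_cell = 1.0, 1.0            # well-to-well and cell-to-cell noise

def simulate_group():
    """Cells nested in wells; no drug effect, but each well has its own offset."""
    well_offsets = rng.normal(0.0, sigma_well, wells_per_group)
    return well_offsets[:, None] + rng.normal(0.0, sigma_cell,
                                              (wells_per_group, cells_per_well))

naive_fp = averaged_fp = 0
for _ in range(n_sims):
    drug, control = simulate_group(), simulate_group()
    # Pseudoreplication: every cell treated as an independent data point.
    if stats.ttest_ind(drug.ravel(), control.ravel()).pvalue < alpha:
        naive_fp += 1
    # Genuine replication: one averaged value per well (the experimental unit).
    if stats.ttest_ind(drug.mean(axis=1), control.mean(axis=1)).pvalue < alpha:
        averaged_fp += 1

print(f"False-positive rate, cells as N: {naive_fp / n_sims:.2f}")     # far above 0.05
print(f"False-positive rate, wells as N: {averaged_fp / n_sims:.2f}")  # close to 0.05
```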

If genuine replication is too low, the experiment may be unable to answer any scientific questions of interest. Therefore issues about replication must be resolved when designing an experiment, not after the data have been collected. For example, if cages are the genuine replicates and not the rats, then putting fewer rats in a cage and having more cages will increase power; and power is maximised with one rat per cage, but this may be undesirable for other reasons.
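That design trade-off can also be checked by simulation. In the sketch below (effect size, variance components, and cage numbers are illustrative assumptions), the same 20 rats per group are housed in 5, 10, or 20 cages, cage means are analyzed as the replicates, and estimated power rises as the number of cages increases.

```python
# Sketch: with total rats fixed and cages as the experimental unit,
# spreading the rats over more cages yields more statistical power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, alpha = 2000, 0.05
delta, sigma_cage, sigma_rat = 1.0, 1.0, 1.0    # illustrative effect and noise levels
designs = [(5, 4), (10, 2), (20, 1)]            # (cages per group, rats per cage) = 20 rats per group

def cage_means(n_cages, rats_per_cage, group_mean):
    """One averaged value per cage; cages, not rats, are the replicates."""
    cage_offsets = rng.normal(group_mean, sigma_cage, n_cages)
    rats = cage_offsets[:, None] + rng.normal(0.0, sigma_rat, (n_cages, rats_per_cage))
    return rats.mean(axis=1)

for n_cages, rats_per_cage in designs:
    hits = 0
    for _ in range(n_sims):
        treated = cage_means(n_cages, rats_per_cage, delta)   # true cage-level effect
        control = cage_means(n_cages, rats_per_cage, 0.0)
        hits += stats.ttest_ind(treated, control).pvalue < alpha
    print(f"{n_cages} cages x {rats_per_cage} rats per group: power ~ {hits / n_sims:.2f}")
# Power increases as the same 20 rats per group are spread over more cages.
```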

Confusing pseudoreplication for genuine replication reduces our ability to learn from experiments, understand nature, and develop treatments for diseases. It is also easily fixed. The requirements for genuine replication, like the definition of a p-value, are often misunderstood by researchers, despite many papers on the topic. An open-access overview is provided in reference [1], and reference [3] has a detailed discussion along with analysis options for many experimental designs.

[1] Lazic SE, Clarke-Williams CJ, Munafo MR (2018). What exactly is "N" in cell culture and animal experiments? PLoS Biol 16(4):e2005282. https://doi.org/10.1371/journal.pbio.2005282

[2] Lazic SE (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience 11:5. https://doi.org/10.1186/1471-2202-11-5

[3] Lazic SE (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge University Press, Cambridge, UK. https://www.cambridge.org/Lazic


Stanley E. Lazic is Co-founder  and Chief Scientific Officer at Prioris.ai Inc.

Prioris.ai, Suite 459, 207 Bank Street, Ottawa ON, K2P 2N2, Canada.


The Happy Scientist

What is Science: Repeat and Replicate

In the scientific process, we should not rely on the results of a single test. Instead, we should perform the test over and over. Why? If it works once, shouldn't it work the same way every time? Yes, it should, so if we repeat the experiment and get a different result, then we know that there is something about the test that we are not considering.


In studying the processes of science, you will often run into two words which seem similar: Repetition and Replication.

Repetition

Sometimes it is a matter of random chance, as in the case of flipping a coin. Just because it comes up heads the first time does not mean that it will always come up heads. By repeating the experiment over and over, we can see if our result really supports our hypothesis (What is a Hypothesis?), or if it was just random chance.

Sometimes the result might be due to some variable that you have not recognized. In our example of flipping a coin, the individual's technique for flipping the coin might influence the results. To take that into consideration, we repeat the experiment over and over with different people, looking closely for any results that don't fit into the idea we are testing.

Results that don't fit are important! Figuring out why they do not fit our hypothesis can give us an opportunity to learn new things, and get a better understanding of the idea we are testing.

Replication

Once we have repeated our testing over and over, and think we understand the results, then it is time for replication. That means getting other scientists to perform the same tests, to see whether they get the same results. As with repetition, the most important things to watch for are results that don't fit our hypothesis, and for the same reason. Those different results give us a chance to discover more about our idea. The different results may be because the person replicating our tests did something different, but they also might be because that person noticed something that we missed.

What if you are wrong!

If we did miss something, it is OK, as long as we performed our tests honestly and scientifically. Science is not about proving that "I am right!" Instead, it is a process for trying to learn more about the universe and how it works. It is usually a group effort, with each scientist adding her own perspective to the idea, giving us a better understanding and often raising new questions to explore.


The role of replication studies in ecology

Hannah Fraser

1 School of BioSciences, University of Melbourne, Parkville VIC, Australia

Ashley Barnett

Timothy H. Parker

2 Biology Department, Whitman College, Walla Walla WA, USA

Fiona Fidler

3 School of BioSciences, School of Historical and Philosophical Studies, University of Melbourne, Parkville VIC, Australia

Associated Data

All data and analysis code are available at https://osf.io/bqc74/ with a stable https://doi.org/10.17605/OSF.IO/BQC74 .

Recent large‐scale projects in other disciplines have shown that results often fail to replicate when studies are repeated. The conditions contributing to this problem are also present in ecology, but there have not been any equivalent replication projects. Here, we survey ecologists' understanding of and opinions about replication studies. The majority of ecologists in our sample considered replication studies to be important (97%), not prevalent enough (91%), worth funding even given limited resources (61%), and suitable for publication in all journals (62%). However, there is a disconnect between this enthusiasm and the prevalence of direct replication studies in the literature which is much lower (0.023%: Kelly 2019) than our participants' median estimate of 10%. This may be explained by the obstacles our participants identified including the difficulty of conducting replication studies and of funding and publishing them. We conclude by offering suggestions for how replications could be better integrated into ecological research.

Repeating (replicating) studies can tell you how valid and generalizable the findings are. Ecologists think that replicating studies is important and valuable but a very small proportion of the ecology and evolutionary biology literature is replicated.


1. INTRODUCTION

While replication is often upheld as a cornerstone of scientific methodology, attempts to replicate studies appear rare, at least in some disciplines. Studies looking at the prevalence of self‐identified “replication studies” in the literature find rates of 0.023% in ecology (Kelly,  2019 ), 0.1% in education (Makel & Plucker,  2014 ), and 1% in psychology (Makel, Plucker, & Hegarty,  2012 ). These figures reflect the rate of direct replications where the method from the original study is repeated as closely as possible. Of course, the feasibility of direct replication studies in many areas of ecology is limited by factors such as the challenge of conducting research in originally studied ecosystems which may be remote from potential replicators, the large spatial and temporal scales of many ecological studies, and the dynamic nature of ecosystems (Schnitzer & Carson,  2016 ; Shavit & Ellison,  2017 ). However, some subfields, such as behavioral ecology, suffer less from these restrictions and direct (or at least close replications) are more feasible (Nakagawa & Parker,  2015 ).

In the current study, we are concerned with how researchers think about replication, whether they consider it important, and what epistemic role they believe replication plays in the formulation of scientific evidence.

1.1. The role of replication in science

Different kinds of replication studies fulfill different epistemic functions, or purposes. It is common to distinguish between "direct" and "conceptual" replications, where direct replications repeat an original study using methods, instruments, and sampling procedures as close to the original as possible (recognizing that exact replications are largely hypothetical constructs in most disciplines) and conceptual replications make deliberate variations. The dichotomy between direct and conceptual is an oversimplification of a noisy continuum, and many more fine-grained typologies exist (for a summary see Fidler & Wilcox, 2018), including ecology and evolutionary biology-specific ones (Kelly, 2006; Nakagawa & Parker, 2015). Broadly speaking, replication studies at the "direct" end of the continuum assess the "conclusion" validity of the original findings (whether the originally observed relationship between measured variables is reliable). Those original findings might be invalid because sampling error led to a misleading result, or because of questionable research practices or even fraud. Replication studies at the "conceptual" end of the continuum test generalizability and robustness; this includes what has previously been termed "quasireplication", where studies are replicated in different species or ecosystems. Where a replication study is placed on the direct-conceptual continuum and what epistemic function it fulfills depends on the scope of the claim in the original study and how the replication study conforms to or differs from that. For example, imagine I am conducting research in the Great Barrier Reef, and I collect data from some locations in the northern part of the reef. If, after analyzing my results, I make explicit inferences to the Great Barrier Reef as a whole, then studies anywhere along the reef employing the same methods and protocols as the original could reasonably be considered direct replications (within reasonable time constraints, of course). However, if I had constrained my inference to just the northern reef, it would not be reasonable to consider new studies sampling other locations direct replications. Replications beyond the Great Barrier Reef, for instance on coral reefs in the Red Sea, would be conceptual replications in both cases. In Table 1, we illustrate how varying different elements of a study while holding others constant can allow us to interrogate different aspects of its conclusion. However, as the example of the reef demonstrates, whether any given replication is considered direct or conceptual is intrinsically tied to the scope of the inference in the original research claim.

Table 1. Direct and conceptual replications in ecology. "S" means that the study element in the replication study is similar enough to the original study that it would be considered a fair test of the original hypothesis, and "D" means that the study element is distinctly different in original and replication studies, testing beyond the original hypothesis.

Location | Environmental conditions | Study system | Variables | Epistemic function
S | S | S | S | Controls for result being driven by sampling error, QRPs, mistakes, fraud
D | S | S | S | Controls for result being driven by its specific location within the stated scope of the study
S | D | S | S | Controls for result depending on the particular environmental conditions at the time of study
S | S | S | D | Controls for result being an artifact of how the research question was operationalized
S | S | D | S | Investigates whether the result generalizes to new study systems (often called "quasireplication")
S/D | S/D | S/D | S/D | Investigates the generalizability and robustness of the result to multiple simultaneous changes in study design, and potential new interactions

It is worth noting in advance of the next section that the large‐scale replication studies from other disciplines we describe there, and their associated replication success rates, refer exclusively to direct replication studies.

1.2. Cause for concern over replication rates

Over the last 8–10 years, concern over a “replication crisis” in science has mounted. The basis of this concern comes from large‐scale direct replication projects in several fields which found low rates of successful replication. Studies included in these projects all attempted fair tests of the original hypothesis, and most were conducted with advice from the original authors. This may mean that the location or time of the replication study differed from the original, but only in cases where location was not specified as being part of the scope of the claim in the original study.

Rates of successful direct replications range from 36% to 62% in psychology (Camerer et al., 2018; Open Science Collaboration, 2015), from 11% to 49% in preclinical biomedicine (Freedman, Cockburn, & Simcoe, 2015), and from 67% to 78% in economics research (Camerer et al., 2016), depending on the study and the measure of "successful" used (see Fidler et al., 2017 for a summary).

Low rates of successful replication are usually attributed to poor reliability because of low statistical power in the original studies (Maxwell, Lau, & Howard,  2015 ); publication bias toward statistically significant results (Fanelli,  2010 , 2012 ; Franco et al.,  2014 ); and the use of questionable research practices (e.g., selectively reporting statistically significant variables, hypothesizing after results known: Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli,  2017 ; Fraser, Parker, Nakagawa, Barnett, & Fidler,  2018 ; John, Loewenstein, & Prelec,  2012 ).

So far, there have been no equivalent, large‐scale replication projects in ecology or related fields. However, meta‐analytic studies have shown that several classic behavioral ecology findings do not reliably replicate (Sánchez‐Tójar et al.,  2018 ; Seguin & Forstmeier,  2012 ; Wang et al.,  2018 ). In addition, all of the conditions expected to drive low rates of replication mentioned above appear common in ecology and evolution (Fidler et al.,  2017 ; Parker et al.,  2016 ): low power (Jennions & Moller,  2000 ), publication bias (Cassey, Ewen, Blackburn, & Moller,  2004 ; Fanelli,  2012 ; Franco et al.,  2014 ; Jennions & Moller,  2002 ; Murtaugh,  2002 ), and prevalence of questionable research practices (Fraser et al.,  2018 ).

1.3. Scientists' attitudes toward replication

In the late 1980s, sociologists of science Mulkay and Gilbert interviewed a sample of biochemists about their replication practices. In particular, they were interested in whether these scientists replicated others' work. Most reported that they did not. And yet, the scientists uniformly claimed that their own work had been independently replicated by others. This seems to suggest an implausible state of affairs where everyone's work is replicated but no one is doing the replicating (Box 1).

Box 1. Excerpt from Mulkay and Gilbert (1991), page 156

Interviewer: Does this imply that you don't repeat other people's experiments?

Respondent: Never

Interviewer: Does anyone repeat yours?

Respondent: Oh. Does anybody repeat my experiments? Yes, they do. I have read where people have purified rat liver enzyme from other sources. They get basically the same sub-unit composition. I'm always happy, by the way, when I see that somebody has done something and repeated some of our work, because I always worry…

Mulkay and Gilbert's explanation of this potential contradiction rested on the notion of “conceptual slippage.” That is, the definition of “replication” that researchers bring to mind when asked about replicating others' work was narrow, centering around direct or exact replication. When considering whether their own work had been replicated by others, they broadened their definition of replication, allowing conceptual replication (different operationalizations and measurements, extensions, etc.). Mulkay and Gilbert referred to the former as “mere replication” and report that it was rarely valued by the scientists in their interview sample. For example, one interviewee referring to another laboratory that is known to replicate studies said: “They actually take pride in the fact they are checking papers that have been published by others, with the result that a great deal of confirmatory work precludes their truly innovative contribution to the literature” (Mulkay & Gilbert,  1991 , p. 155).

Dismissal of the value of direct replication research is echoed in Madden, Easley, and Dunn's (1995) survey of 107 social and natural science journal editors, aimed at discovering how journal editors view replication research. Comments from two natural science editors exemplify this: "Our attention is focused on avoiding replication! There are so many interesting subjects which have not been studied that it is a stupid thing to make the same work again" and "Why do you want to replicate already published work? If there is some interest [sic] puzzle, of course, but replication for its own sake is never encouraged". Similarly, Ahadi, Hellas, Ihantola, Korhonen, and Petersen (2016) found a correlation between the perceived value of publishing original research and the perception that replication studies are less valuable in terms of obtaining citations and grant funding.

This negative stigma feeds into the difficulty of publishing replication studies. Ahadi et al. (2016) found that only 10% of computer education researchers whose replication found the same result as the original study, and 8% of those who found a different result, were able to publish their replication studies. Baker and Penny (2016) examined the rate of publishing psychology replication studies and found that it was around 12% for replication studies that found the same result and 10% for replication studies that found a different result to the original. This is compounded by the fact that very few people submit replication studies in the first place (Baker & Penny, 2016).

1.4. Rationale for the current study

Our goal here is to document and evaluate researchers' self‐reported understanding of, attitudes toward, and (where applicable) objections and obstacles to engaging in replication studies.

The current work investigates Kelly's ( 2006 ) argument that there exists in ecology “a general disdain by thesis committees… and journal editors for nonoriginal research” (p232). Echoing findings by Ahadi et al. ( 2016 ), Kelly proposed that replication studies may be hard to publish when they agree with the original findings because they do not add anything novel to the literature and also when they disagree with the original findings because the evidence from the original study is given greater weight than the refuting evidence. The current project is, in the broadest sense, an empirical investigation of these issues.

2. MATERIALS AND METHODS

2.1. Survey participants

We distributed paper and online versions of our anonymous survey (created in Qualtrics Provo, UT, USA, pdf of survey available at https://osf.io/bqc74/ ) at the Ecological Society of America (ESA) 2017 conference (4,500+ attendees) and EcoTas 2017 (joint conference for the Australian and New Zealand Ecological Societies, 350–450 attendees), in line with ethics approval from the University of Melbourne Human Research Ethics Committee (Ethics ID 1749316.1). We set up a booth in the conference hall at ESA and actively approached passers‐by, asking them to take part in our survey. At EcoTas, we distributed the survey by roaming the conference on foot and announcing the survey in conference sessions. Participants at EcoTas were offered the opportunity to go into the draw and win a piece of artwork representing their research. We promoted the survey on twitter at both conferences. In total, ecologists returned 439 surveys, 218 from ESA, and 221 from EcoTas. Our sample comprises ecologists mostly from Australia, New Zealand, and North America. We have no reason to expect these populations to differ from other populations of ecologists in their opinions regarding replication. However, replication studies in other locations would be needed to assess the generalizability of our results.

2.2. Survey instrument

Our survey included multiple‐choice questions about the following:

  • How important replication is in ecology
  • Whether replication is necessary for the results to be believed or trusted
  • Whether there is enough replication taking place
  • Whether replication is a good use of resources
  • How often replication studies should be published
  • Whether participants check for a replication if the study is plausible or implausible
  • What types of study do participants consider replications (ranging from computational reproducibility to direct and quasi/conceptual replications)
We also asked participants to specify the percentage of studies they believe to be replicated in ecology using a slider bar, and asked free-text response questions about the following:
  • Aside from replications, what might make participants believe or trust a result
  • What are the obstacles to replication

2.3. Data analysis

The code and data required to computationally reproduce our results and qualitative responses are available from https://osf.io/bqc74/ . For each of the multiple‐choice questions, we plotted the proportion (with 95% Confidence Intervals, CIs) of researchers who selected each of the options (e.g., the proportion of researchers who indicated that replication was “Very Important,” “Somewhat Important,” or “Not Important” in ecology) using ggplot2 (Valero‐Mora,  2015 , version 3.2.1) in R (R Development Core Team,  2018 , version 3.5.1). All 95% CIs are Wilson Score Intervals calculated in binom (Dorai‐Raj,  2014 , version 1.1) except for those calculated for the estimate of the prevalence of replication studies in ecology which were generated using parametric assumptions in Rmisc (Hope,  2013 , version 1.5).
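For readers working outside R, here is a minimal Python sketch of the Wilson score interval used throughout these results (the counts come from the first result reported in Section 3.1; the interval should match the one in the text up to rounding).

```python
# Sketch of a 95% Wilson score interval for a proportion. The paper's own
# analysis used the R binom package; this is an equivalent calculation.
from statsmodels.stats.proportion import proportion_confint

# 425 of 437 participants rated replication studies as very or somewhat important.
count, nobs = 425, 437
low, high = proportion_confint(count, nobs, alpha=0.05, method="wilson")
print(f"{count / nobs:.0%} (95% CI {low:.0%}-{high:.0%})")   # about 97% (95%-98%)
```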

3. RESULTS

3.1. Prevalence and importance of replication

Our sample of ecologists' median estimate of the proportion of replicated studies was 10% (mean 22%, 95% CIs 20%–24%, n  = 393). A high proportion of ecologists were very positive about replication. The vast majority (97%, 95%CI: 95%–98%, n  = 425 of 437 participants) of ecologists answering our survey stated that replication studies are (very or somewhat) important (Figure  1a ), and 91% (95% CI: 88%–93%, n  = 385 of 424 participants) agreed that they would like to see more (or much more) replication taking place in ecology (Figure  1b ). Many also agreed that it is “crucial” (61%, 95%CI: 56%–65%, n  = 261 of 428 participants, Figure  1c ) and that replication studies should be published in all journals (63%, 95%CI: 58–67, n  = 269 of 427 participants, Figure  1d ).

Figure 1. Proportion of participants (with 95% CIs) selecting each option for the following questions: (a) how important is replication in ecology (n = 437 participants), (b) does enough replication take place (n = 424 participants), (c) do you consider replication studies to be a good use of resources in ecology (n = 437 participants), and (d) how often should replication studies be published (n = 443 responses from 427 participants).

Around a third of our sample agreed that replication is important with caveats, suggesting that given limited funding, the focus should remain on novel research (37%, 95%CI: 32%–41%, n  = 157 of 428 participants, Figure  1c ) or that they should only be published in special editions or specific journals (30%, 95%CI: 25%–34%, n  = 126 of 427 participants). We specifically worded these response items (i.e., pointing to funding scarcity, and publishing only in special issues) to mitigate demand characteristics, that is, undue influence to provide a positive answer to a survey question.

Very few ecologists expressed an overall negative perspective of replication studies; 1% (95%CI: 0.6%–3.0%, n  = 6 of 437 participants, Figure  1a ) agreed that they were not important, 1% (95%CI: 0.5%–2.7%, n  = 5 of 424 participants, Figure  1b ) indicated that there should be “less” or “much less” replication conducted, 0.5% (95%CI: 0.1%–1.7%, n  = 2 of 428 participants, Figure  1c ) agreed that replication studies are a waste of time and money, 6% (95%CI: 4%–9%, n  = 27 of 427 participants, Figure  1d ) indicated that replication studies should only be published if the results differ from those in the original study, and 0.23% indicated that replications should never be published (95%CI: 0.04%–1.3%, n  = 1 of 427 participants, Figure  1d ).

3.2. Believability and trust

When asked “does an effect or phenomenon need to be successfully replicated before you believe or trust it,” 43% (95%CI 38%–48%, n  = 188 of 437 participants) said “yes,” 11% (95%CI: 9%–15%, n  = 50 of 437 participants) said “no,” and 46% (95%CI: 41%–50%, n  = 199 of 437 participants) said maybe. This leaves open the question of what participants do use to determine the epistemic value of a finding. Fortunately, 395 (of the total 437) participants provided free text responses when asked what, aside from replication, made an effect or phenomenon more believable or trustworthy (Table  2 ).

Table 2. Researchers' (n = 395) free-text responses to a question asking "Is there anything else [aside from replication studies] that you consider to be especially important in determining believability or trustworthiness?" We show summary-level results only, with illustrative quotations.

Study design (242 comments)
  Indicative quotes: "Sound methodology… appropriate controls, using different approaches/method to prove the same hypothesis"; "Temporal consistency of relationships. Test of consistency across environmental contexts"
  Topics covered: scale of the study; sample size; use of controls; statistical approach; confounding factors

Open science practices (68 comments)
  Indicative quotes: "Open, publicly available data and code!"; "whether raw data/analysis is presented in published paper supplements or hidden away"
  Topics covered: transparent methods; analysis code available; data available; study preregistered

Reputation (66 comments)
  Indicative quotes: "Sound scientific history of publications. Well regarded in academic or practitioner community"; "Reputation of journals (sometimes, but sometimes reputable journals publish crap.)"
  Topics covered: funding source; conflicts of interest; reputation of journal, institution, or researcher

Consistency of current finding with existing knowledge (61 comments)
  Indicative quotes: "theoretical validity (ie is it biologically supportable through established knowledge or does it severely contradict established theory)"; "Are results consistent with similar research? If not, the new research is revolutionary and has a higher bar to convince me"
  Topics covered: consistency with the reader's understanding, prior literature, and existing theory

Statistical qualities of the results (53 comments)
  Indicative quotes: "degree to which data build the case for the claim (i.e., different approaches (e.g., experimental and observational, different experimental approaches), sites, length of the study) all are useful"; "Sample size, power, strength of the effect, how much the findings can be generalised"
  Topics covered: large effect size; small p-value; result supported by multiple tests; validity of the data

3.3. Checking for replications

We asked how often participants checked for replication studies when they come across an effect or phenomenon that was plausible versus implausible. Very few participants (9%, 95%CI: 7%–12%, n = 39 of 429 participants) "almost always" checked whether a study had been replicated if they thought the result was plausible. Participants were more likely to check for replication studies if they found the effect implausible, but even then only 27% (95%CI: 23%–31%, n = 116 of 429 participants) of participants said that they "almost always" checked (Figure 2).

Figure 2. Percentage of participants reporting that they check for replications at different frequencies if the original study seemed plausible versus implausible. Error bars are 95% Wilson confidence intervals (n = 429 participants).

3.4. What is a replication study?

In order to get a picture of what our sampled ecologists consider to be replication studies, we asked participants to select as many options as they wanted from Table  3 . The top four options represent the spectrum of replication studies from most direct (first option) to most conceptual (fourth option). The number of participants who considered the options to be replication studies decreased with decreasing similarity between original and replication study. Options 5 and 6 in Table  3 are related to computationally reproducing the results by reanalyzing a study's data. Computational reproducibility is a related concept to replication and has similar, if more limited, epistemic purpose: If the analysis is kept the same, it can detect mistakes and inconsistencies in the original analysis (Table  3 , option 5) and, if the analysis is altered, it can test the sensitivity of the findings to alternate modeling decisions (Table  3 , option 6).

Table 3. Statements of different types of variations a new study might make to an original, and the percentage of total participants (n = 430) who considered each variation type a "replication study." Also shown is the mean estimate of the replication rate in ecology, calculated separately for participants who indicated that each option constituted a "replication study."

Statement | Percentage of participants choosing this response (95% CI) | Mean estimate of replication rate in ecology (95% CI)
Redoing an experiment or study as closely as possible to the original (e.g., with the same methods and in the same context, region, or species) | 90% (87–92) | 21% (19–24)
Redoing an experiment or study with the same (or similar) methods in a new context (region or species, etc.) | 73% (69–77) | 24% (21–26)
Redoing an experiment or study with different methods in the same context (region or species, etc.) | 38% (34–43) | 23% (20–27)
Redoing an experiment or study with different methods in a different context (region or species, etc.) | 14% (11–18) | 19% (13–25)
Re-analyzing previously collected data with the same statistical methods/models | 41% (37–46) | 21% (18–25)
Re-analyzing previously collected data with different statistical methods/models | 36% (32–41) | 21% (17–24)
None of the above | 1% (0–2) | NA

We tested whether different understandings of the definition or scope of replication produced different estimates of the rate of replication studies. We divided participants' estimates of replication rates according to which types of study included in Table  3 each participant considered a type of replication. The estimated replication rate was similar in all subsets.

3.5. Obstacles to replication studies

When asked to comment on the obstacles to replication, 407 participants provided free‐text responses, giving insight into why the replication rate might be low (Table  4 ).

Table 4. Summary of free-text responses to the question "In your opinion, what are the main obstacles to replication?"

Difficulty funding and publishing (332 comments)
  Indicative quotes: "Given competitive landscape in academia, replication studies hold little reward for researcher - i.e. no funding/hard to publish/not seen as novel so don't frame you as a research leader in any field"; "Hard to publish…very limited resources for biodiversity/ecology research anyway."
  Topics covered: difficulty funding; short duration of funding; difficulty publishing; expected low citation rate; not "novel"

Academic culture (121 comments)
  Indicative quotes: "I think most scientists want to be known for original work, not for doing 'some else's' science."; "Too many things to do, not enough ecologists."; "Lack of emphasis on its importance. funding tends to favour new/novel research. Stigma - people may dislike others who try to replicate their studies. People may consider it 'lesser or easier science' replicating."
  Topics covered: bad for career advancement; prioritizing important novel work; replications not interesting to do

Logistical constraints (81 comments)
  Indicative quotes: "$$ and availability of research sites. When doing field ecology, it can be extremely difficult to replicate sites"; "Logistics! Field/experiments can be expensive and time consuming - also in small populations!"; "Hard to find the detailed information necessary for proper replication in original study"
  Topics covered: not enough time; insufficient transparency of methods; difficulty accessing original data; few candidate sites/populations/individuals

Environmental variability (36 comments)
  Indicative quotes: "Long term replication studies are vital to ecology however the problem is climate and habitat loss etc all of which can make it very hard to replicate experiments over time"; "Unique attributes of year-to-year variability and the challenges that presents - at least for field-based work for other settings (lab/greenhouse) it seems much more reasonable/worthwhile"
  Topics covered: influence of climate change; landscape-level changes (e.g., caused by clearing or agriculture); year-on-year variation in climate

4. DISCUSSION

4.1. Importance of replication

The overwhelming majority of the ecologists in our study were very positive about replication studies. They considered replication studies to be important, wanted to see more of them in the literature, and supported publishing them (Figure 1a–d). Enthusiasm for replication studies is further underlined by the sheer quantity of free-text comments our participants gave (https://osf.io/bqc74). Although we did not give participants a free-text question about their perspectives on the role of replication studies, some expressed their views in the general comments section at the end of the survey. Some evocative examples include:

• "Ecological replication studies should be necessary where results are applied directly to ecosystem management beyond the local/target species context of the study."
• "Replication means different things in different fields. In biodiversity research replication of studies/phenomena, typically with different settings, species, regions etc., is absolutely essential. The question is when there is enough evidence, i.e. when to stop. There is little point in replicating the study EXACTLY (cf. your question 9 above). In molecular biology or e.g. ecotoxicology it seems that doing the latter actually makes more sense. Different labs should span together and run the same experiment in parallel to eventually publish together."
• "I think journals should automatically publish replications (or failures to replicate) if they published the original study."
• "I would also be interested in how microbiology vs other biology fields replicate results."

However, there is a disconnect between this message of support for replication studies expressed in portions of our survey and the data on how researchers publish, use, and prioritize replications. First, the best available estimate is that only 0.023% of ecology studies are identified by their authors as replications (Kelly, 2019), which is tiny compared to our participants' median estimate that 10% of studies are replicated. The disconnect is evident even within our survey: only a minority of respondents claimed to "almost always" check for replications when investigating a finding (Figure 2), despite emphasizing the importance of replication in other questions and free responses. Around a third of participants also indicated that, given limited funding, the focus should continue to be on novel research (Figure 1c) and that replication studies should only be published in special editions or dedicated replication journals, or only if the results differ (Figure 1d). This, combined with comments such as "People often want to research something novel, I think there's a mental block among scientists when it comes to replication; most recognize it's necessary, but most aren't particularly interested in doing it themselves," suggests a gap between the perceived value of replication studies and the impetus to perform them. Such comments expose the mistake of assuming that replication work—even direct replication—cannot make a novel contribution. For example, working out which aspects of a study are intrinsic to its conclusion and should not be varied in a replication is itself a substantial intellectual contribution (Nosek & Errington, 2017).

This disconnect may be explained by the obstacles identified in this paper, chief among them: (a) researchers are, perhaps rightly (Ahadi et al., 2016; Asendorpf & Conner, 2012; Baker & Penny, 2016), concerned that they would have trouble publishing or funding replication studies; (b) conducting replication studies can be logistically problematic; (c) environmental variation makes conducting replication studies, and interpreting their results, difficult (Shavit & Ellison, 2017); and (d) researchers are unwilling to conduct replication studies because they find them boring and less likely to confer prestige than novel research (Ahadi et al., 2016; Kelly, 2006).

There is movement toward making replication studies more feasible and publishable in other fields, with the inclusion of a criterion describing journals' stance on accepting replication studies as part of the TOP guidelines (Nosek et al., 2015; to which over 5,000 journals are signatories) and the advent of Registered Replication Reports (Simons, Holcombe, & Spellman, 2014) at several psychology journals. Similarly, initiatives like the Many Labs projects (e.g., Klein et al., 2014), StudySwap (https://osf.io/9aj5g/), and the Psychological Science Accelerator (https://psysciacc.org/) build communities that may help overcome the logistical difficulties of replication studies, as well as increase the interest in, and prestige associated with, conducting them. Although no initiatives to directly replicate previously published studies yet exist in ecology, there is a growing movement to improve assessment of the generality of hypotheses through collaborations across large numbers of laboratories, implementing identical experiments in different systems (Borer et al., 2014; Knapp et al., 2017; Peters, Loescher, Sanclements, & Havstad, 2014; Verheyen et al., 2016, 2017). The success of these "distributed experiments" suggests that ecologists may be open to forms of collaboration designed to replicate published work.

4.2. Conceptual slippage

As in Mulkay and Gilbert (1991), we find evidence of conceptual slippage between different types of replication study. We asked participants whether they consider various types of potential studies to be "replication studies"; participants were able to select multiple options. We expected that participants who include conceptual replications in their definition of replication studies would provide higher estimates of the percentage of ecological studies that are replicated. However, there was little difference in participants' estimates of the replication rate regardless of how permissive their definition of replication was (Table 3). This suggests that ecologists have a fluid definition of what a "replication study" is. In addition, the majority of surveys were distributed by hand, and early in the data collection it became evident that some participants were thinking about replicates within a study (i.e., samples) rather than replication of the whole study. As soon as this became evident, we informed each new participant that we were interested in the repetition of whole studies, not replicates or samples within a study. The effect of this confusion on our results is likely to be minimal: virtually all ecology studies contain within-study replicates, yet only 36 of 439 participants (8%) gave answers higher than 50% to the question "What percentage of studies do you think are replicated in ecology?". This 8% presumably captures all the participants who were answering about "replicates" as well as some who have a very broad definition of what constitutes a replication study.

4.3. The continuum of replication

We found a very high level of agreement (90%) that "redoing an experiment or study as closely as possible to the original" (i.e., a direct replication) should be considered a replication study. Most ecologists, however, had a view of replication studies much broader than direct replication, to the extent that 38% considered "redoing an experiment or study with different methods in the same context" and 14% considered "redoing an experiment or study with different methods in a different context" to be replication studies. This permissive definition of a replication study may be driven by the strong influence of environmental variability on the results of ecological research. It is also consistent with Kelly's (2006) observation that conceptual and quasireplication are common in behavioral ecology. Conceptual and quasireplications are required to extend spatial, temporal, and taxonomic generalizability in a field with multitudes of study systems, all of which are strongly influenced by their environment.

Many participating ecologists commented that direct replications may be difficult or impossible in ecology because of the strong influence of environmental variability and the need for long-term studies, concerns that are also voiced by Kelly (2006), Nakagawa and Parker (2015), Kress (2017), and Schnitzer and Carson (2016). Schnitzer and Carson (2016) propose that putting more resources into ensuring that new studies are conducted over large spatial and temporal scales would perform a similar epistemic function to certain types of replication study. Nakagawa and Parker (2015) suggest that the impact of environmental variability can be overcome by conducting multiple robust replications (inevitably in different environmental conditions) and evaluating the overall trends using meta-analysis. In contrast, Kelly (2006) advocates pairing direct and conceptual replications within a single study, providing insights about both the validity and generalizability of the results and increasing the chance of publication (when compared to a direct replication alone). These suggestions have the potential to make replication studies in ecology more feasible and thereby improve the reliability of the ecology literature. Emphasizing the importance of conceptual replications may also make it easier to build a research culture that is more accepting of replication studies.
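As a rough illustration of the aggregation Nakagawa and Parker describe, the sketch below pools effect estimates from several hypothetical replications, run under different environmental conditions, using a DerSimonian-Laird random-effects meta-analysis; every number in it is invented for illustration.

```python
# Sketch: random-effects (DerSimonian-Laird) meta-analysis of effect estimates
# from replications run in different environments. All values are invented.
import numpy as np

effects = np.array([0.42, 0.10, 0.35, -0.05, 0.22])  # per-replication effect estimates
se = np.array([0.15, 0.20, 0.12, 0.25, 0.18])        # their standard errors

w = 1 / se**2                                         # inverse-variance (fixed) weights
fixed_mean = np.sum(w * effects) / np.sum(w)

# Between-study heterogeneity (tau^2), DerSimonian-Laird estimator
q = np.sum(w * (effects - fixed_mean) ** 2)
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - (len(effects) - 1)) / c)

w_re = 1 / (se**2 + tau2)                             # random-effects weights
pooled = np.sum(w_re * effects) / np.sum(w_re)
pooled_se = np.sqrt(1 / np.sum(w_re))

print(f"pooled effect = {pooled:.2f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.2f} to {pooled + 1.96 * pooled_se:.2f}), "
      f"tau^2 = {tau2:.3f}")
```

The random-effects weighting treats between-environment variation as part of the model rather than as noise to be eliminated, which is why this style of pooling suits replications that cannot be run under identical conditions.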

Conceptual replications may already be common in ecology and evolutionary biology, but, presumably because of the desire to appear novel, such studies are almost never identified as replications. Kelly (2006) found that even though direct replications were absent from a sample of studies in three animal behavior journals, more than a quarter of those studies could be classified as conceptual replications with the same study species, and most of the rest were "quasireplications" in which a previously tested hypothesis was studied in a new taxon. It seems, therefore, that testing previously tested hypotheses is the norm; we just do not notice, because researchers explicitly distinguish their work from previously published research rather than calling attention to the ways in which their studies are replications. In fact, almost none of these conceptual or quasireplications are identified as replications by their authors (Kelly, 2019). This brings up two shortcomings of the current system. First, as pointed out earlier, researchers almost never conduct direct replications, so the benefits of direct replication, in terms of convincing tests of internal validity, are nearly absent. Second, even when researchers conduct conceptual or quasireplications, if they are reluctant to call their work replication, some of the inferential value of that work in testing for generality may be missed. Indeed, anecdotally, it seems that inconsistency among conceptual replications is often attributed to biological variation, and that this is typically interpreted as meaning that the hypothesized process is more complex or contingent on other factors than originally thought; the generality of the original hypothesis is often not directly challenged.

5. CONCLUSION

Most of our participating ecologists agreed that replication studies are important; however, some responses are suggestive of ambivalence toward conducting them. Convincing editors to accept Registered Replication Reports, emphasizing the value of less direct, more conceptual replication, and beginning grassroots replication initiatives in ecology and related fields (inspired by StudySwap, the Psychological Science Accelerator, the Many Labs projects, and existing distributed experiments in ecology) may combat ecologists' reluctance to conduct replication studies. Beyond that, we believe that the best approach to replication studies in ecology is to:

  • Identify subsets of studies for which direct or close replication is possible and, given their importance, value and put resources into such replications. If possible, conduct these as Registered Reports (Nosek & Lakens, 2014).
  • For published studies more broadly, promote and value checks of computational reproducibility (see the sketch following this list), including:
    - direct computational reproducibility: analyzing the original data using the original analysis scripts (Powers & Hampton, 2019);
    - conceptual computational reproducibility: analyzing the same data with a different analysis method; and/or
    - robustness/sensitivity analysis: analyzing the same data while strategically varying some elements of the analysis, as in the multiverse approach (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016).
  • Identify subsets of studies for which generalizability is the main concern, and work toward developing "constraints on generality" statements for them (Simons, Shoda, & Lindsay, 2017). Constraints-on-generality statements explicitly identify the conditions under which the authors think their results are or are not valid. This frees replicators from matching conditions exactly and allows replications to test generality within the constraints laid out by the original authors.
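The robustness/sensitivity item above can be illustrated with a tiny multiverse-style check: the same simulated data are re-analysed under every combination of two defensible analysis choices, and the resulting estimates are compared. The data, the choices, and the model below are hypothetical, not drawn from any published study.

```python
# Sketch: a minimal "multiverse"-style robustness check on simulated data.
# The same response is analysed under every combination of two analysis choices:
# whether to log-transform it and whether to drop the most extreme values.
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = np.exp(0.3 * x + rng.normal(scale=0.8, size=200))  # skewed simulated response

for log_y, drop_outliers in itertools.product([False, True], repeat=2):
    yy, xx = y.copy(), x.copy()
    if drop_outliers:
        keep = yy < np.quantile(yy, 0.95)   # drop the top 5% of responses
        yy, xx = yy[keep], xx[keep]
    if log_y:
        yy = np.log(yy)
    slope = sm.OLS(yy, sm.add_constant(xx)).fit().params[1]
    print(f"log-transform={log_y!s:5}  drop-outliers={drop_outliers!s:5}  slope={slope:.3f}")
```

If the estimates agree in sign and rough magnitude across the grid, the conclusion is robust to these choices; if they diverge, the divergence itself is informative.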

CONFLICT OF INTEREST

The authors have no conflict of interest.

AUTHOR CONTRIBUTIONS

Hannah Fraser: Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (supporting); Project administration (equal); Visualization (lead); Writing‐original draft (lead); Writing‐review & editing (lead). Ashley Barnett: Conceptualization (equal); Data curation (supporting); Formal analysis (supporting); Methodology (supporting). Timothy H. Parker: Conceptualization (equal); Writing‐original draft (supporting); Writing‐review & editing (supporting). Fiona Fidler: Conceptualization (equal); Funding acquisition (lead); Investigation (supporting); Methodology (supporting); Project administration (supporting); Resources (lead); Supervision (lead); Writing‐original draft (supporting); Writing‐review & editing (supporting).

Open Research Badges

This article has been awarded Open Data and Open Materials Badges. All materials and data are publicly accessible via the Open Science Framework at https://doi.org/10.17605/OSF.IO/BQC74 .

ACKNOWLEDGMENTS

Franca Agnoli provided feedback that improved the manuscript and 439 anonymous ecologists generously gave their time to fill in our survey.

Fraser, H., Barnett, A., Parker, T. H., & Fidler, F. (2020). The role of replication studies in ecology. Ecology and Evolution, 10, 5197–5207. https://doi.org/10.1002/ece3.6330

This research was funded by the Australian Research Council Future Fellowship FT150100297s.

DATA AVAILABILITY STATEMENT

All data and materials are publicly accessible via the Open Science Framework at https://doi.org/10.17605/OSF.IO/BQC74.

REFERENCES

  • Agnoli, F., Wicherts, J. M., Veldkamp, C. L. S., Albiero, P., & Cubelli, R. (2017). Questionable research practices among Italian research psychologists. PLoS ONE, 12, 1–17. https://doi.org/10.1371/journal.pone.0172792
  • Ahadi, A., Hellas, A., Ihantola, P., Korhonen, A., & Petersen, A. (2016). Replication in computing education research: Researcher attitudes and experiences. Proceedings of the 16th Koli Calling International Conference on Computing Education Research (pp. 2–11).
  • Asendorpf, J. B., & Conner, M. (2012). Recommendations for increasing replicability in psychology. European Journal of Personality, 119, 108–119.
  • Baker, M., & Penny, D. (2016). Is there a reproducibility crisis? Nature, 533, 452–454.
  • Borer, E. T., Harpole, W. S., Adler, P. B., Lind, E. M., Orrock, J. L., Seabloom, E. W., & Smith, M. D. (2014). Finding generality in ecology: A model for globally distributed experiments. Methods in Ecology and Evolution, 5, 65–73. https://doi.org/10.1111/2041-210X.12125
  • Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351, 1433–1437. https://doi.org/10.1126/science.aaf0918
  • Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.
  • Cassey, P., Ewen, J. G., Blackburn, T. M., & Moller, A. P. (2004). A survey of publication bias within evolutionary ecology. Proceedings of the Royal Society B: Biological Sciences, 271, S451–S454. https://doi.org/10.1098/rsbl.2004.0218
  • Dorai-Raj, S. (2014). binom: Binomial confidence intervals for several parameterizations.
  • Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLoS ONE, 5, e10068. https://doi.org/10.1371/journal.pone.0010068
  • Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891–904. https://doi.org/10.1007/s11192-011-0494-7
  • Fidler, F., En Chee, Y., Wintle, B. C., Burgman, M. A., McCarthy, M. A., & Gordon, A. (2017). Metaresearch for evaluating reproducibility in ecology and evolution. BioScience, 67, 282–289. https://doi.org/10.1093/biosci/biw159
  • Fidler, F., & Wilcox, J. (2018). Reproducibility of scientific results. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. Stanford, CA: Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2018/entries/scientific-reproducibility/
  • Franco, A., Malhotra, N., Simonovits, G., Dickersin, K., Rosenthal, R., Begg, C. B., … Miguel, E. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345, 1502–1505. https://doi.org/10.1126/science.1255484
  • Fraser, H., Parker, T. H., Nakagawa, S., Barnett, A., & Fidler, F. (2018). Questionable research practices in ecology and evolution. PLoS ONE, 13, e0200303. https://doi.org/10.1371/journal.pone.0200303
  • Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLoS Biology, 13, 1–9. https://doi.org/10.1371/journal.pbio.1002165
  • Hope, R. M. (2013). Rmisc: Ryan miscellaneous.
  • Jennions, M. D., & Moller, A. P. (2000). A survey of the statistical power of research in behavioral ecology and animal behavior. Behavioral Ecology, 14, 438–445. https://doi.org/10.1093/beheco/14.3.438
  • Jennions, M. D., & Moller, A. P. (2002). Publication bias in ecology and evolution: An empirical assessment using the 'trim and fill' method. Biological Reviews of the Cambridge Philosophical Society, 77, 211–222. https://doi.org/10.1017/S1464793101005875
  • John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. https://doi.org/10.1177/0956797611430953
  • Kelly, C. D. (2006). Replicating empirical research in behavioural ecology: How and why it should be done but rarely is. The Quarterly Review of Biology, 80, 221–236.
  • Kelly, C. D. (2019). Rate and success of study replication in ecology and evolution. PeerJ, 7, e7654. https://doi.org/10.7717/peerj.7654
  • Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, Š., Bernstein, M. J., … Nosek, B. A. (2014). Investigating variation in replicability: A "many labs" replication project. Social Psychology, 45, 142–152. https://doi.org/10.1027/1864-9335/a000178
  • Knapp, A. K., Avolio, M. L., Beier, C., Carroll, C. J. W., Collins, S. L., Dukes, J. S., … Smith, M. D. (2017). Pushing precipitation to the extremes in distributed experiments: Recommendations for simulating wet and dry years. Global Change Biology, 23, 1774–1782. https://doi.org/10.1111/gcb.13504
  • Madden, C., Easley, R., & Dunn, M. (1995). How journal editors view replication research. Journal of Advertising, 24, 77–87. https://doi.org/10.1080/00913367.1995.10673490
  • Makel, M. C., & Plucker, J. A. (2014). Facts are more important than novelty: Replication in the education sciences. Educational Researcher, 43, 304–316. https://doi.org/10.3102/0013189X14545513
  • Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How often do they really occur? Perspectives on Psychological Science, 7, 537–542. https://doi.org/10.1177/1745691612460688
  • Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does "failure to replicate" really mean? American Psychologist, 70, 487–498. https://doi.org/10.1037/a0039400
  • Mulkay, M., & Gilbert, N. (1991). Replication and mere replication (1986). In M. Mulkay (Ed.), Sociology of science: A sociological pilgrimage (pp. 154–166). Buckingham, UK: Open University Press.
  • Murtaugh, P. A. (2002). Journal quality, effect size, and publication bias in meta-analysis. Ecology, 83, 1162–1166.
  • Nakagawa, S., & Parker, T. H. (2015). Replicating research in ecology and evolution: Feasibility, incentives, and the cost-benefit conundrum. BMC Biology, 13, 1–6. https://doi.org/10.1186/s12915-015-0196-3
  • Nosek, B., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., … Yarkoni, T. (2015). Promoting an open research culture. Science, 348, 1422–1425. https://doi.org/10.1126/science.aab2374
  • Nosek, B. A., & Errington, T. M. (2017). Making sense of replications. eLife, 6, 4–7. https://doi.org/10.7554/eLife.23383
  • Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45, 137–141. https://doi.org/10.1027/1864-9335/a000192
  • Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, 943–951.
  • Parker, T. H., Forstmeier, W., Koricheva, J., Fidler, F., Hadfield, J. D., Chee, Y. E., … Nakagawa, S. (2016). Transparency in ecology and evolution: Real problems, real solutions. Trends in Ecology and Evolution, 31, 711–719. https://doi.org/10.1016/j.tree.2016.07.002
  • Peters, D. P. C., Loescher, H. W., Sanclements, M. D., & Havstad, K. M. (2014). Taking the pulse of a continent: Expanding site-based research infrastructure for regional- to continental-scale ecology. Ecosphere, 5, 1–23. https://doi.org/10.1890/ES13-00295.1
  • Powers, S. M., & Hampton, S. E. (2019). Open science, reproducibility, and transparency in ecology. Ecological Applications, 29, 1–8. https://doi.org/10.1002/eap.1822
  • R Development Core Team (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Sánchez-Tójar, A., Nakagawa, S., Sánchez-Fortún, M., Martin, D. A., Ramani, S., Girndt, A., … Schroeder, J. (2018). Meta-analysis challenges a textbook example of status signalling and demonstrates publication bias. eLife, 7, 1–41. https://doi.org/10.7554/eLife.37385
  • Schnitzer, S. A., & Carson, W. P. (2016). Would ecology fail the repeatability test? BioScience, 66, 98–99. https://doi.org/10.1093/biosci/biv176
  • Seguin, A., & Forstmeier, W. (2012). No band color effects on male courtship rate or body mass in the zebra finch: Four experiments and a meta-analysis. PLoS ONE, 7(6), e37785. https://doi.org/10.1371/journal.pone.0037785
  • Shavit, A., & Ellison, A. M. (2017). Stepping in the same river twice: Replication in biological research. New Haven, CT: Yale University Press.
  • Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014). An introduction to registered replication reports at Perspectives on Psychological Science. Perspectives on Psychological Science, 9, 552–555.
  • Simons, D. J., Shoda, Y., & Lindsay, S. D. (2017). Constraints on Generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science, 12, 1123–1128. https://doi.org/10.1177/1745691617708630
  • Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11, 702–712. https://doi.org/10.1177/1745691616658637
  • Valero-Mora, P. M. (2015). ggplot2: Elegant graphics for data analysis. New York, NY: Springer-Verlag.
  • Verheyen, K., De Frenne, P., Baeten, L., Waller, D. M., Hédl, R., Perring, M. P., … Bernhardt-Römermann, M. (2017). Combining biodiversity resurveys across regions to advance global change research. BioScience, 67, 73–83. https://doi.org/10.1093/biosci/biw150
  • Verheyen, K., Vanhellemont, M., Auge, H., Baeten, L., Baraloto, C., Barsoum, N., … Scherer-Lorenzen, M. (2016). Contributions of a global network of tree diversity experiments to sustainable forest plantations. Ambio, 45, 29–41. https://doi.org/10.1007/s13280-015-0685-1
  • Wang, D., Forstmeier, W., Ihle, M., Khadraoui, M., Jer, S., Martin, K., & Kempenaers, B. (2018). Irreproducible text-book "knowledge": The effects of color bands on zebra finch fitness. Evolution, 72(4), 961–976. https://doi.org/10.1111/evo.13459

Hidden Brain

Scientific findings often fail to be replicated, researchers say.


Shankar Vedantam

A massive effort to test the validity of 100 psychology experiments finds that more than 50 percent of the studies fail to replicate. This is based on a new study published in the journal "Science."



More social science studies just failed to replicate. Here’s why this is good.

What scientists learn from failed replications: how to do better science.

by Brian Resnick

Psychologists are still wondering: “What’s going on in there?” They’re just doing it with greater rigor.

One of the cornerstone principles of science is replication. This is the idea that experiments need to be repeated to find out if the results will be consistent. The fact that an experiment can be replicated is how we know its results contain a nugget of truth. Without replication, we can’t be sure.

For the past several years, social scientists have been deeply worried about the replicability of their findings. Incredibly influential, textbook findings in psychology — like the “ego depletion” theory of willpower, or the “marshmallow test” — have been bending or breaking under rigorous retests. And the scientists have learned that what they used to consider commonplace methodological practices were really just recipes to generate false positives. This period has been called the “replication crisis” by some.

And the reckoning is still underway. Recently, a team of social scientists — spanning psychologists and economists — attempted to replicate 21 findings published in the most prestigious general science journals: Nature and Science. Some of the retested studies have been widely influential in science and in pop culture, like a 2011 paper on whether access to search engines hinders our memories, or whether reading books improves a child’s theory of mind (meaning their ability to understand that other people have thoughts and intentions different from their own).

On Monday, they’re publishing their results in the journal Nature Human Behavior. Here’s their take-home lesson: Even studies that are published in the top journals should be taken with a grain of salt until they are replicated. They’re initial findings, not ironclad truth. And they can be really hard to replicate, for a variety of reasons.

Rigorous retests of social science studies often yield less impressive results

The scientists who ran the 21 replication tests didn’t just repeat the original experiments — they made them more rigorous. In some cases, they increased the number of participants by a factor of five, and preregistered their study and analysis designs before a single participant was brought into the lab.

All the original authors (save for one group that couldn’t be reached) signed off on the study designs too. Preregistering is like making a promise not to deviate from the plan or inject bias into the results.

Here are the results: 13 of the 21 results replicated. But perhaps just as notable: Even among the studies that did pass, the effect sizes (that is, the difference between the experimental group and the control group in the experiment, or the size of the change the experimental manipulation made) decreased by around half, meaning that the original findings likely overstated the power of the experimental manipulation.

“Overall, our study shows statistically significant scientific findings should be interpreted rather cautiously until they have been replicated, even if they have been published in the most renowned journals,” Felix Holzmeister, an Austrian economist and one of the study co-authors, says.

It’s not always clear why a study doesn’t replicate. Science is hard.

Many of the papers that were retested contained multiple experiments. Only one experiment from each paper was tested. So these failed replications don’t necessarily mean the theory behind the original findings is totally bunk.

For instance, the famous “Google Effects on Memory” paper — which found that we often don’t remember things as well when we know we can search for them online — did not replicate in this study. But the experiment chosen was a word-priming task (i.e., does thinking about the internet make it harder to retrieve information), and not the more real-world experiment that involved actually answering trivia statements. And other research since has bolstered that paper’s general argument that access to the internet is shifting the relationship we have with, and the utility of, our own memories.

There could be a lot of reasons a result doesn’t replicate. One is that the experimenters doing the replication messed something up.


Another reason can be that the study stumbled on a false positive.

One of the experiments that didn’t replicate was from University of Kentucky psychologist Will Gervais. The experiment tried to see if getting people to think more rationally would make them less willing to report religious belief.

“In hindsight, our study was outright silly,” Gervais says. They had people look at a picture of Rodin’s The Thinker or another statue. They thought The Thinker would nudge people to think harder.

“When we asked them a single question on whether they believe in God, it was a really tiny sample size, and barely significant ... I’d like to think it wouldn’t get published today,” Gervais says. (And note, this study was published in Science, a top journal.)

In other cases, a study may not replicate because the target — the human subjects — has changed. In 2012, MIT psychologist David Rand published a paper in Nature on human cooperation. The experiment involved online participants playing an economics game. He argues that a lot of online study participants have since grown familiar with this game, which makes it a less useful tool to probe real-life behaviors. His experiment didn’t replicate in the new study.

Finding out why a study didn’t replicate is hard work. But it’s exactly the type of work, and thinking, that scientists need to be engaged in. The point of this replication project, and others like it, is not to call out individual researchers. “It’s a reminder of our values,” says Brian Nosek, a psychologist and the director of the Center for Open Science, who collaborated on the new study. Scientists who publish in top journals should know their work may be checked up on. It’s also important, he notes, to know that social science’s inability to be replicable is in itself a replicable finding.

Often, when studies don’t replicate, it’s not that the effort totally disproves the underlying hypothesis. And it doesn’t mean the original study authors were frauds. But replication results do often significantly change the story we tell about the experiment.

For instance, I recently wrote about a replication effort of the famous “marshmallow test” studies, which originally showed that the ability to delay gratification early in life is correlated with success later on. A new paper found this correlation, but when the authors controlled for factors like family background, the correlation went away.

Here’s how the story changed: Delay of gratification is not a unique lever to pull to positively influence other aspects of a person’s life. It’s a consequence of bigger-picture, harder-to-change components of a person.

In science, too often, the first demonstration of an idea becomes the lasting one. Replications are a reminder that in science, this isn’t supposed to be the case. Science ought to embrace and learn from failure.

The “replication crisis” in psychology has been going on for years now. And scientists are reforming their ways.

The “replication crisis” in psychology, as it is often called, started around 2010, when a paper using completely accepted experimental methods was published purporting to find evidence that people were capable of perceiving the future, which is impossible. This prompted a reckoning: Common practices like drawing on small samples of college students were found to be insufficient to find true experimental effects.

Scientists thought if you could find an effect in a small number of people, that effect must be robust. But often, significant results from small samples turn out to be statistical flukes. (For more on this, read our explainer on p-values.)
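A small simulation can make this concrete. Assuming a modest true effect (Cohen's d = 0.3) and illustrative group sizes, experiments with small samples rarely reach p < .05, and the ones that do tend to overestimate the effect:

```python
# Sketch: simulate two-group experiments with a modest true effect (d = 0.3).
# Small samples have low power, and their "significant" results are inflated.
# The effect size, group sizes, and number of simulations are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.3

for n_per_group in (20, 250):
    significant_effects = []
    for _ in range(5000):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(true_d, 1.0, n_per_group)
        result = stats.ttest_ind(treatment, control)
        if result.pvalue < 0.05:
            significant_effects.append(treatment.mean() - control.mean())
    power = len(significant_effects) / 5000
    print(f"n = {n_per_group:3d} per group: power ~ {power:.0%}, "
          f"mean effect among significant runs ~ {np.mean(significant_effects):.2f} (true = 0.30)")
```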

The crisis intensified in 2015 when a group of psychologists, which included Nosek, published a report in Science with evidence of an overarching problem: When 270 psychologists tried to replicate 100 experiments published in top journals, only around 40 percent of the studies held up. The remainder either failed or yielded inconclusive data. And again, the replications that did work showed weaker effects than the original papers. The studies that tended to replicate had more highly significant results compared to the ones that just barely crossed the threshold of significance.

Another important reason to do replications, Nosek says, is to get better at understanding what types of studies are most likely to replicate, and to sharpen scientists’ intuitions about what hypotheses are worthy of testing and which are not.

As part of the new study, Nosek and his colleagues added a prediction component. A group of scientists took bets on which studies they thought would replicate and which they thought wouldn’t. The bets largely tracked with the final results.
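A minimal sketch of how such bets can be checked against outcomes, assuming hypothetical final market prices (read as predicted probabilities of replication) and binary replication results rather than the study's actual data:

```python
# Sketch: do prediction-market prices track replication outcomes?
# Prices and outcomes below are hypothetical placeholders.
import numpy as np

prices = np.array([0.83, 0.76, 0.64, 0.58, 0.41, 0.35, 0.22])  # predicted P(replicates)
outcomes = np.array([1, 1, 1, 0, 1, 0, 0])                     # 1 = replicated

print(f"mean price, replicated studies: {prices[outcomes == 1].mean():.2f}")
print(f"mean price, failed studies:     {prices[outcomes == 0].mean():.2f}")

# Rank-based check (equivalent to the AUC of the market's ranking): how often
# is a replicated study priced above a failed one?
pairs = [(r, f) for r in prices[outcomes == 1] for f in prices[outcomes == 0]]
print(f"P(replicated priced above failed): {np.mean([r > f for r, f in pairs]):.2f}")
```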

As you can see in the chart below, the yellow dots are the studies that did not replicate, and they were all unfavorably ranked by the prediction market survey.

“These results suggest [there’s] something systematic about papers that fail to replicate,” Anna Dreber, a Stockholm-based economist and one of the study co-authors, says.

[Chart: the 21 replication attempts ranked by prediction-market confidence; yellow dots mark the studies that did not replicate, all of which were ranked unfavorably.]

One thing that stands out: Many of the papers that failed to replicate sound a little too good to be true. Take this 2010 paper that finds simply washing hands negates a common human hindsight bias. When we make a tough choice, we often look back on the choice we passed on unfavorably and are biased to find reasons to justify our decision. Washing hands in an experiment “seems to more generally remove past concerns, resulting in a metaphorical ‘clean slate’ effect,” the study’s abstract stated.

It all sounds a little too easy, too simple — and it didn’t replicate.

All that said, there are some promising signs that social science is getting better. More and more scientists are preregistering their study designs, which prevents them from cherry-picking the results and analyses most favorable to their preferred conclusions. Journals are getting better at demanding larger subject pools in experiments and are increasingly insisting that scientists share all the underlying data of their experiments for others to assess.

“The lesson out of this project,” Nosek says, “is a very positive message of reformation. Science is going to get better.”
