
Open Access

Essays articulate a specific perspective on a topic of broad interest to scientists.


Unveiling recent and ongoing adaptive selection in human populations

Roles Conceptualization, Funding acquisition, Visualization, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliation Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America


Published: January 18, 2024

https://doi.org/10.1371/journal.pbio.3002469


Genome-wide scans for signals of selection have become a routine part of the analysis of population genomic variation datasets and have resulted in compelling evidence of selection during recent human evolution. This Essay spotlights methodological innovations that have enabled the detection of selection over very recent timescales, even in contemporary human populations. By harnessing large-scale genomic and phenotypic datasets, these new methods use different strategies to uncover connections between genotype, phenotype, and fitness. This Essay outlines the rationale and key findings of each strategy, discusses challenges in interpretation, and describes opportunities to improve detection and understanding of ongoing selection in human populations.

Citation: Gao Z (2024) Unveiling recent and ongoing adaptive selection in human populations. PLoS Biol 22(1): e3002469. https://doi.org/10.1371/journal.pbio.3002469

Copyright: © 2024 Ziyue Gao. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work is supported by a Research Fellowship (FG-2021-15702) from the Alfred P. Sloan Foundation (https://sloan.org/) and a grant (R35GM146810) from the National Institute of General Medical Sciences to ZG. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The author has declared that no competing interests exist.

Abbreviations: ASMC, ascertained sequentially Markovian coalescent; GWAS, genome-wide association study; IMR, infant mortality rate; MHC, major histocompatibility complex; PGS, polygenic score; SDS, singleton density score

Introduction

A central goal of human evolutionary genetics is to understand the functions and evolutionary histories of genes or genomic regions that are under natural selection. Selection favors genetic variants that lead to advantageous phenotypic changes in specific environments, resulting in increases in allele frequency over time and distinctive patterns of genetic variation in present-day populations (Figs 1, 2A and 2B). Beyond unraveling the origin and evolutionary history of these selective genetic changes, it is of immense interest to gauge their contribution to phenotypic diversity in present-day human populations, as well as their impacts on disease risk and overall fitness (Box 1) in contemporary environments. Therefore, research efforts have increasingly shifted towards identifying and characterizing extremely recent and even ongoing selection.

Fig 1. In this conceptual framework, selection on genotype is mediated by fitness-relevant phenotype and manifests in allele frequency changes and genetic variation patterns. In any specific environment, genotype and environment together shape the phenotype of an individual, which in turn determines the fitness. In addition to its direct effect on the phenotype (solid purple arrow), the environment also modifies the genotype-to-phenotype mapping (i.e., genotype-by-environment interaction; indicated by the dotted purple arrow) and phenotype-to-fitness mapping (dashed purple arrow). Through interactions with other evolutionary forces (indicated by the brown plus sign), natural selection shapes the allele frequency trajectory over time and leaves footprints in genomic variation in present-day populations.

https://doi.org/10.1371/journal.pbio.3002469.g001

Fig 2. (A) The hallmark of positive selection is faster allele frequency increase than would be expected under neutrality. (B) The rapid allele frequency change leaves footprints in the surrounding genomic region, although the specific patterns depend on the strength, tempo, and mode of selection (e.g., selection on standing variation versus on de novo variants). (C) Major methods for detecting positive selection based on present-day genetic variation.

https://doi.org/10.1371/journal.pbio.3002469.g002

Box 1. Glossary

Fitness

A measure of how well an individual can survive or reproduce; it consists of multiple components such as viability, mating success, and fecundity.

Positive selection

An evolutionary process in which a genetic variant becomes more common in a population because it increases the fitness of individuals who carry it.

Negative selection

An evolutionary process that weeds out fitness-reducing genetic variants from the population. Purifying selection acts directly on the deleterious variants, whereas background selection affects nearby variants linked to the deleterious variants.

Positive and negative selection

Two inseparable concepts that describe the same phenomenon from different angles. To facilitate communication, population geneticists often adopt one of these terms focusing on the impact of selection on the derived allele, such that positive selection tends to speed up molecular evolution, whereas negative selection decelerates or prevents it. Nonetheless, in many cases, the identity of the derived allele is ambiguous or less relevant (e.g., during transient selection), and the direction of selection often refers to the effect of selection on the rare allele (for example, a scenario where the rare allele is beneficial is often considered positive selection, although one could consider the same scenario as negative selection against the more common allele).

Genetic adaptation

The process by which organisms evolve heritable characteristics or traits that help them to better survive and reproduce in their specific environment. In many cases, adaptation is used synonymously with positive selection, but adaptation also encompasses other selection modes such as balancing selection and polygenic adaptation.

Stabilizing selection

A type of natural selection that favors individuals with an intermediate value of a fitness-relevant trait. Individuals with deviation from the optimal trait value are selected against, and the result is a stabilization of the trait around a specific value. Stabilizing selection concerns the relationship between phenotype and fitness, regardless of the genetic basis. Other types of phenotype-focused selection include disruptive selection, which favors individuals with extreme trait values, and directional selection, which favors individuals at only one end of the phenotypic spectrum.

Polygenicity

Polygenicity refers to a scenario in which variation in a trait within a population is contributed to by genetic variants at multiple genes or genomic loci rather than by just one or a few. Many complex traits in humans, such as height and disease susceptibility, are highly polygenic.

Pleiotropy

Pleiotropy occurs when a single genetic variant (or gene) influences two or more seemingly unrelated phenotypes in an organism. Two traits are pleiotropically related when certain variants exist that simultaneously affect them.

Numerous scans of the human genome have been carried out for targets of selection acting over intermediate timescales (e.g., over 1,000 generations), but it remains challenging to demonstrate that selection on the identified targets is still ongoing or to detect selection that started only recently. Enabled by the recent availability of population-scale genomic data and the development of efficient algorithms for inferring local genealogical trees, many new methods have been developed in the past 20 years to detect signals of selection from the past few millennia (e.g., [1–4]). Complementary to this approach, ancient DNA data provide direct estimates of past allele frequencies in human populations across time and geography and have refined estimates of the tempo and strength of selection for many of the selection signals identified in modern genomes. Most recently, population-scale biobank-style datasets, encompassing genomic information and phenotypic data on reproduction, disease, mortality, and other quantitative traits, have pinpointed variants associated with various fitness components, at times in a sex-specific manner. These findings signify the presence of ongoing selection occurring within just one or a few generations.

This Essay aims to highlight growing evidence for very recent and ongoing genetic adaptation in the human genome, with a focus on positive selection and directional selection on polygenic traits, as these modes of selection may potentially contribute to genetic and phenotypic differences across populations. It is important to note that the effects of negative selection (such as purifying selection and background selection; Box 1) are evident and prevalent in the human genome. However, due to space limitations, this Essay does not discuss the advances made in the past decade in identifying genomic regions and phenotypes subject to recent and ongoing negative and stabilizing selection (e.g., [5–8]). Instead, it only briefly discusses the challenges associated with detecting and interpreting signals of positive and directional selection in the context of pervasive negative selection. The Essay starts with the latest methodological innovations in inference of positive selection at individual genomic loci, and then discusses techniques for detecting aggregate selection signals across genetic loci that collectively influence a quantitative trait. Rather than delving deeply into the technical details, it emphasizes the connection and distinction among “genotype-focused,” “phenotype-focused,” and “fitness-focused” strategies, as well as the advantages and limitations of each (Fig 3). Some major findings stemming from these innovative approaches are discussed, along with challenges in interpretation of the signals.

Fig 3. (A) A “genotype-focused” strategy focuses on the cumulative effects of historical selection on genetic variation patterns and relies on population genetics modeling to tease apart the influence of other evolutionary forces. Ancient DNA data provide direct information on allele frequency changes, which helps reduce inference uncertainty and confounding by demographic history. (B) A “fitness-focused” strategy focuses on direct association between genotype and fitness component(s) and utilizes allele frequency changes within one generation to detect selection in contemporary populations. As a special case of this strategy, between-sex differences in adult allele frequency or effect size of association to fitness components can be leveraged to detect sex-differential selection. (C) A “phenotype-focused” strategy relies on aggregation of selection signals revealed by genotype-focused or fitness-focused strategies across trait-associated variants identified by genome-wide association studies (GWAS).

https://doi.org/10.1371/journal.pbio.3002469.g003

Positive selection at individual genomic loci

Genomic footprints in present-day genetic variation.

Traditional methods for detecting selection take a genotype-focused approach ( Fig 3A ) by adopting classic population genetics models. Specifically, these models predict changes in allele frequency and patterns of surrounding genomic variation by assuming arbitrary fitness effects of different genotypes at a single genetic locus. The obvious advantage of this modeling approach is that it establishes expectations for genomic signatures of selection while requiring very little phenotypic information, such as how genotypes map to phenotypes or which phenotypes are under selective pressure.

Typical genomic signatures of positive selection include extreme differentiation in allele frequencies across populations, extended haplotypes/linkage disequilibrium, or distortion in the site frequency spectrum of segregating variants (reviewed in [ 9 – 11 ]; Fig 2C(i–iii) ). These statistics capture complementary features of genomic variation, but most are powerful in detecting selection on intermediate timescales (i.e., hundreds of generations or longer). More recent methods increase detection power by considering multiple summary statistics jointly. This idea was initially implemented using a few basic summary statistics [ 12 ] and later expanded through techniques such as Approximate Bayesian Computation [ 13 ] or supervised machine learning (reviewed in [ 14 ]). Thanks to the recently available population-scale genomic data and continuous theoretical and methodological developments, genome-wide scans based on population genetic summary statistics have identified thousands of putative targets under selection, largely independently of biological knowledge regarding the corresponding phenotype or selective pressure.
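To make one of these summary statistics concrete, the sketch below computes Hudson's estimator of FST for a single biallelic SNP from allele counts in two population samples. It is a minimal illustration in Python with hypothetical counts, not part of any published scan; genome-wide analyses average the estimator's numerator and denominator over many SNPs and compare windows against the genomic background.

import numpy as np

def hudson_fst(n1, x1, n2, x2):
    """Hudson's Fst estimator for one biallelic SNP.

    n1, n2: numbers of sampled alleles (2 x diploid individuals) in each population.
    x1, x2: counts of the allele of interest in each population.
    """
    p1, p2 = x1 / n1, x2 / n2
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den

# Hypothetical counts for a strongly differentiated SNP.
print(hudson_fst(n1=200, x1=150, n2=200, x2=40))  # ~0.46, far above the genome-wide average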

Despite being able to pick up selection signals over the past hundreds or thousands of generations, these scans are limited in power for detecting very recent selection because the narrow time window involved leaves very subtle genetic footprints in the site frequency spectrum or haplotype structure. From the perspective of the local genealogical tree, very recent selection only impacts branches near the leaf nodes but leaves most of the tree unchanged. Realizing this, researchers have developed methods that explicitly leverage features of terminal branches of the local genealogical tree. The singleton density score (SDS) is one such method that detects recent allele frequency changes based on extremely rare variants [ 15 ]. Specifically, SDS tests for deficiency of singletons (i.e., variants that appear exactly once in the entire sample) on haplotypes carrying the putatively favored allele, which is indicative of a faster coalescent rate in the recent past ( Fig 2C(iv) ). Along these lines, another method called ascertained sequentially Markovian coalescent (ASMC) detects targets of recent positive selection by inferring pairwise coalescent times and looking for unusually high densities of coalescent events in the recent past ( Fig 2C(v) ) [ 16 , 17 ]. When applied to whole-genome sequences of approximately 3,200 individuals of European ancestry, SDS detected selection signals in the past 2,000 to 3,000 years in the major histocompatibility complex (MHC) region and at variants associated with lactose tolerance and pigmentation [ 15 ]. In comparison, application of ASMC to over 487,000 British individuals identified signals of selection in the past 1,500 years, including those detected by SDS, as well as several new candidate loci harboring genes related to immune response, tumor growth, and other phenotypes [ 17 ].
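The intuition behind SDS can be illustrated with a deliberately simplified sketch: on haplotypes carrying a recently favored allele, coalescence in the very recent past is faster, so fewer singletons accumulate near the focal site. The toy function below contrasts singleton counts between carrier and non-carrier haplotypes. It is a caricature of the idea only (the published SDS models the distance from the focal site to the nearest singleton on each haplotype and returns a standardized score), and all data and names here are hypothetical.

import numpy as np

def singleton_density_contrast(haps, focal):
    """Toy contrast of singleton density between allele classes at a focal site.

    haps:  (n_haplotypes, n_sites) 0/1 array of phased haplotypes in a window.
    focal: column index of the putatively selected variant.
    Returns mean singletons per haplotype for carriers vs non-carriers of the allele;
    a deficit among carriers hints at rapid recent growth of that allele.
    """
    counts = haps.sum(axis=0)
    singleton_cols = np.flatnonzero(counts == 1)       # variants seen exactly once in the sample
    per_hap = haps[:, singleton_cols].sum(axis=1)      # singletons carried by each haplotype
    carrier = haps[:, focal] == 1
    return per_hap[carrier].mean(), per_hap[~carrier].mean()

# Hypothetical data: 40 haplotypes x 500 sites of sparse rare variation.
rng = np.random.default_rng(0)
haps = (rng.random((40, 500)) < 0.02).astype(int)
haps[:, 250] = np.r_[np.ones(25, int), np.zeros(15, int)]   # focal allele carried by 25/40 haplotypes
print(singleton_density_contrast(haps, focal=250))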

With the recent development of algorithms for inference of the ancestral recombination graph or its proxies, several tree-based statistics have been developed for detecting positive selection (reviewed in [ 18 ]; Fig 2C(vi) ). One of these methods, Relate, estimates local genealogy from sequence data and detects selection by searching for rapid propagation of lineages carrying a putatively beneficial allele relative to other lineages, effectively testing for differences in the coalescent rate between haplotypes carrying different alleles [ 19 ]. However, this selection metric is calculated on only one point estimate of the local genealogy. By contrast, a likelihood method called CLUES leverages the posterior distribution of local genealogical trees to infer selection coefficients and allele frequency trajectories at individual loci [ 20 ]. These new methods have confirmed strong selection on variants associated with lactase persistence, immune response, and pigmentation traits in Europeans in the past few thousand years and some signals in other populations (such as the EDAR gene in East Asians), although very few new signals have been detected.

Selection signals in ancient genomes

While modern genomes provide a snapshot of population evolution and allow for indirect inference of past demographic and selective events, genomic sequences from ancient samples enable direct glimpses into the genetic history of human populations. By providing estimates of allele frequencies at multiple time points (Fig 2A and 2B), ancient DNA has shed valuable light on the evolutionary histories of multiple selected variants during the past 15,000 years of human evolution (reviewed in [21–23]). Analysis based on ancient DNA has also been particularly helpful in detecting candidates under spatially or temporally restricted selection.

Ancient DNA has transformed our understanding of selection in humans by resolving complex interactions between selection and demographic history. As recent human history features many episodes of population splits and admixture, signals of selection are often obscured by changes in ancestry [24]. One instance is the evolutionary history of the FADS locus, which contains genes encoding enzymes involved in the synthesis of long-chain polyunsaturated fatty acids. Using present-day genomic data, studies detected strong selection signals on FADS genes in human populations from multiple continents, with different alleles being favored across time and geography [25–29]. However, analysis of ancient DNA showed that the selection signal in Native Americans was largely an artifact driven by parallel selection in European and Asian populations [30]. Another intriguing case is the evolution of pigmentation in west Eurasia in the context of several major admixture events revealed by ancient DNA. The derived alleles associated with lighter skin or eye color at several pigmentation-associated genes exhibited distinct frequencies in different ancestral populations, potentially reflecting differential selective pressures across geography prior to the Mesolithic period (i.e., before 9,000 to 10,000 years ago) [31,32]. Moreover, the observed allele frequencies and ancestry fractions at these pigmentation-associated variants in later admixed populations significantly deviated from neutral expectations, suggesting subsequent selection during the Neolithic, Bronze Age, and historical periods [33–35]. These findings point to continued selective pressure for light pigmentation over the past 2,000 years in west Eurasia and support the concept that admixture may facilitate rapid adaptation by introducing advantageous alleles [34–37].

Ancient DNA data have also refined our knowledge of the onset, duration, and strength of selection events. For example, selection on the variant conferring lactase persistence was initially estimated to begin around 7,500 years ago based on modern genomic data and archeological evidence of dairy production [ 38 ]. Surprisingly, ancient DNA data have shown that the selected allele was rare in Bronze Age Europe until 3,000 years ago, suggesting a much later onset of positive selection than was previously inferred [ 31 ]. In addition, based on the allele frequency trajectory in ancient DNA samples, the positive selection for this allele was inferred to be strong 100 to 150 generations ago but drastically reduced in the past 100 generations [ 39 ]. Significant variation in selection strength has also been found at several other previously identified selected loci [ 39 ]. Overall, ancient DNA studies have confirmed selection signals near multiple genes associated with diet, pigmentation, and immune response revealed in modern genomic data, and have provided fine-resolution insights into the temporal dynamics and geographic distribution of the selected variants and the corresponding selection strengths [ 26 , 34 , 39 , 40 ].
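The kind of refinement described above can be illustrated with a back-of-the-envelope calculation: under a deterministic model of additive (genic) selection, the log-odds of the allele frequency changes by roughly s per generation, so two time points give a crude estimate of the selection coefficient. The sketch below uses hypothetical frequencies loosely in the spirit of the lactase-persistence example; it ignores drift, sampling error, dominance, and ancestry changes, all of which the trajectory-based methods cited above model explicitly.

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def crude_selection_coefficient(p0, pt, generations):
    """Deterministic genic-selection approximation: d logit(p)/dt ~ s per generation.

    A rough point estimate only; it ignores drift, sampling noise, dominance,
    and population structure.
    """
    return (logit(pt) - logit(p0)) / generations

# Hypothetical example: an allele rising from 5% to 50% frequency over ~100 generations.
print(crude_selection_coefficient(0.05, 0.50, 100))  # ~0.03 per generation, i.e., strong selection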

With recurrent observations of selection targeting genes in immune pathways, the quest to discern the specific pathogens driving these selective pressures has been immensely captivating. One strategy for linking selection signals with causative pathogens is to search for variants with unusual allele frequency changes during well-documented catastrophic pandemics. A recent investigation scrutinized ancient genomes of roughly 200 individuals who died before, during, and after the Black Death pandemic in the fourteenth century [41]. This study reported an overall enrichment of allele frequency differentiation in immune genes as well as a handful of potential targets under positive selection. However, serious skepticism has been raised about these findings due to technical concerns [42], and other studies adopting similar designs (though with smaller sample sizes) failed to replicate the selection signals at immune genes overall or at individual candidates [43,44]. These results suggest that the effects of selection imposed by historical pandemics on individual genomic loci are relatively modest, necessitating expansive sample sizes for detection.

Fitness-focused strategy for detecting selection in contemporary populations

The fitness of an individual consists of several components such as viability, mating success, and fecundity. A genetic variant that influences any of these components is subject to natural selection unless its effects on all components cancel out. Based on this reasoning, one can identify loci under ongoing selection using a fitness-focused approach by performing GWAS on proxies for fitness components (Fig 3B). However, traits closely associated with fitness are expected to have low heritability [45], and fitness-related variants tend to be rare. Therefore, identification of these variants via association requires exceedingly large sample sizes, which only became feasible in the past decade. It is worth noting that, due to limited power, this association approach is biased towards detecting common variants and does not pick up fitness-influencing variants that are under strong negative selection.
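In its simplest form, such a scan regresses a fitness proxy on genotype dosage with covariates, one variant at a time. The sketch below simulates hypothetical data and runs that per-variant step with statsmodels; all variable names and parameter values are illustrative, and real analyses use mixed models, genotyping-batch and ancestry covariates, and millions of variants.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000
dosage = rng.binomial(2, 0.3, n)            # genotype dosage at one common variant
sex = rng.integers(0, 2, n)
birth_year = rng.integers(1950, 1971, n)

# Hypothetical fitness proxy: number of children, with a tiny effect of the variant.
children = rng.poisson(np.exp(0.6 + 0.01 * dosage - 0.002 * (birth_year - 1960)))

X = sm.add_constant(np.column_stack([dosage, sex, birth_year]))
fit = sm.OLS(children, X).fit()
print(fit.params[1], fit.pvalues[1])        # estimated effect of the variant on the proxy, and its p-value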

One of the most studied proxies of fertility is the number of children ever born to or fathered by an individual, because it can be easily surveyed and approximates the overall fitness well in modern populations with low mortality. Using data from hundreds of thousands of individuals born in the 1950s to 1970s, dozens of genomic loci have been associated with the number of children [ 46 – 48 ]. Interestingly, among the top associations stands the FADS locus, which also harbors strong signals of historical positive selection in both ancient and present-day DNA samples [ 26 , 28 , 29 , 49 ]. By contrast, the two most significant association regions lack evidence of historical positive selection but demonstrate signals of balancing selection, possibly due to pleiotropic effects ( Box 1 ) on other fitness components or temporally fluctuating selection [ 46 , 50 , 51 ].

Besides reproduction, viability is a key component of fitness. In principle, the number of children an individual has closely reflects that individual's contribution to the population gene pool of the next generation, but current association studies for this trait include only individuals who survived to the end of their reproductive lifespan, leaving out those who did not reach adulthood. To detect common variants linked to early-life survival, Wu and colleagues performed a clever GWAS on time- and location-matched infant mortality rate (IMR) for living individuals in the UK Biobank [52]. The rationale is that individuals who survived tougher environments during infancy, as indexed by a higher local IMR in their birth years, tend to have higher “relative viability.” Interestingly, the two genome-wide significant loci identified by this approach, LCT and TLR6-TLR1-TLR10, are both known targets of recent positive selection in Europeans, with the survival-increasing alleles matching the evolutionarily favored alleles [15,26].

A more direct approach for identifying variants that affect viability is to look for shifts in allele frequency across individuals of different ages [2]. Limited by the age distribution of participants in current cohorts, this method is underpowered to detect allele frequency changes in early life, when selective pressure is expected to be strong. However, in humans, even variants that exclusively affect viability late in life may be under selection, due to late male reproduction, intergenerational resource transfer, and other reasons [53,54]. By testing for changes in allele frequency with age, a study found and replicated two genome-wide significant signals in 2 independent datasets: one overlaps with the APOE ε4 allele, which is associated with reduced lifespan and increased risk of Alzheimer’s disease and cardiovascular diseases [55,56]; the other locus contains variants that are close to the nicotinic acetylcholine receptor gene CHRNA3 and associated with increased smoking quantity [57]. Intriguingly, the relatively common frequencies of these survival-reducing variants in present-day populations suggest that they were not under strong negative selection in the recent past. The authors interpreted the lack of abundant associations as evidence for purifying selection against variants with large effects on late-onset disease and speculated that the APOE and CHRNA3 loci were found because their deleterious effects have recently increased in humans due to environmental changes.
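The core test here is simply whether genotype at a variant predicts the age (or birth cohort) of living participants, since carriers of a survival-reducing allele should be depleted among the old. A minimal version, shown below with hypothetical simulated data, regresses dosage on age and asks whether the slope differs from zero; the published analyses additionally model ancestry, cohort effects, and age-specific ascertainment.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
age = rng.integers(40, 71, 50_000)
# Hypothetical survival-reducing allele: its frequency declines slightly with age at sampling.
freq = 0.15 - 0.0003 * (age - 40)
dosage = rng.binomial(2, freq)

slope, intercept, r, p, se = stats.linregress(age, dosage)
print(f"change in mean dosage per year of age: {slope:.5f} (p = {p:.2e})")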

Fitness-focused strategy for detecting sex-differential selection

The extraordinary level of sexual dimorphism in many animal species, including humans, reflects sex-specific phenotypic effects and sex differences in the fitness landscape. The fitness effect of a genetic variant may differ between sexes in magnitude or sometimes in direction. Such sex-differential selection is challenging to study because Mendelian inheritance equalizes autosomal allele frequencies between the 2 sexes at fertilization in each generation. Nevertheless, the special case of sex-differential selection on viability is expected to leave a distinctive signature in population genetic variation: allele frequency differences between adult females and males (Fig 3B, right). An early study seeking this signature reported signals at hundreds of genetic regions and an enrichment of signals on the X chromosome compared to autosomes [58]. Unfortunately, these findings turned out to be largely false positives driven by random noise, sex-biased genotyping error, and biases due to hemizygosity of the X chromosome in males. Later studies on much larger biobank datasets failed to detect robust signals at any autosomal loci [59] or enrichment on the X chromosome [60].
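At a single autosomal SNP, the signature reduces to a comparison of allele frequencies between adult males and females, for example with a two-proportion z-test as sketched below using hypothetical counts. A significant difference is only a candidate signal: as discussed later, genotyping artifacts and sex-differential participation can produce the same pattern.

import numpy as np
from scipy import stats

def sex_diff_test(x_f, n_f, x_m, n_m):
    """Two-proportion z-test for an allele frequency difference between sexes.

    x_*: allele counts in females/males; n_*: total allele counts (2 x individuals).
    """
    p_f, p_m = x_f / n_f, x_m / n_m
    p = (x_f + x_m) / (n_f + n_m)
    z = (p_f - p_m) / np.sqrt(p * (1 - p) * (1 / n_f + 1 / n_m))
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical counts from roughly 100,000 genotyped individuals of each sex.
print(sex_diff_test(x_f=61_000, n_f=200_000, x_m=60_400, n_m=200_000))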

While signals of sex-differential viability selection are expected to be exceptionally weak at individual loci [61,62], subtle between-sex allele frequency differences across many variants may be detectable in aggregate. Leveraging the genomic and reproductive history data of approximately 250,000 adults in the UK Biobank, Ruzicka and colleagues developed new metrics to measure between-sex allele frequency differentiation over different stages of a life cycle. They found significant shifts in the genome-wide distributions of these metrics, consistent with effects of sex-differential selection on survival, reproductive success, and overall fitness [4].

Limitations of the fitness-focused strategy for selection detection

One curious observation from the studies described above is the limited overlap between fitness-associated variants in contemporary populations and targets under historical positive selection. As the 2 approaches (i.e., fitness-focused and genotype-focused) capture selection signals of very different timescales, one explanation is a highly dynamic selection landscape during recent human evolution. However, the fitness-associated variants identified in biobank-style datasets need to be taken with a grain of salt for several technical reasons.

First, the effect measured by association likely does not reflect the actual fitness effect. Fitness effects that are “visible” to natural selection may be too subtle to be picked up by association studies given current sample sizes, so many targets of ongoing selection might be missed. On the other hand, proxy traits only capture certain aspects of fitness components, so the measured effect of a variant may be greater than its effect on overall fitness in the presence of antagonistic pleiotropy ( Box 1 ). In other words, there may be weaker or even no ongoing positive selection on variants with opposite effects on different fitness components.

Second, as for GWAS in general, uncorrected population stratification remains a concern for fitness-associated variants, especially for those with highly differentiated frequencies across populations. For example, the lactase persistence variant near LCT, the top selection target identified by the IMR GWAS, is among the most differentiated variants across European populations [63]. Despite the authors’ best efforts to correct for population structure, it is still possible that the IMR association signal in UK Biobank data is driven by residual stratification, so the claim of ongoing selection on this variant remains to be validated in independent datasets or by family-based approaches [64].

A related yet different issue applies to analyses based on allele frequency differences between sexes. In addition to sex-biased viability selection, between-sex allele frequency differences can also arise from subtly different population structures between sexes or from sex-biased participation [65]. The UK Biobank requires active participation, and the participants are not representative of the general population in various sociodemographic and health-related characteristics [66]. Should a genetic variant affect the inclination to participate differently in men and women, a subtle allele frequency difference between sexes is expected. Consistent with this hypothesis, a “GWAS of sex” performed in 5 biobank-style datasets found significant positive autosomal single-nucleotide polymorphism heritability in those that require active participation (including UK Biobank) but not in those with relatively passive recruitment, although this contrast is confounded by differences in sample size across datasets [65]. Therefore, an important future step will be to replicate the findings in more population-representative datasets or family-based studies to rule out or quantify the contribution of sex-differential participation bias.

Directional selection on quantitative traits

Integration of GWAS results with genetic variation patterns.

GWAS have provided unprecedented insights into the genetic architecture of human phenotypes, revealing significant heritability and high polygenicity (Box 1) of most traits, as well as unexpectedly small effect sizes for most associated variants. These observations are surprisingly close to the assumptions of classical quantitative genetics models [67]. In the context of adaptation, the measurable heritability means that at least a portion of the phenotypic variation within a population is attributable to existing genetic polymorphisms, which, in response to changes in selective pressure on the phenotype, offer the raw material for genetic adaptation without having to wait for new mutations. In turn, the high polygenicity and tiny effect sizes of most variants suggest that the selective pressure on any individual allele may be too small to leave discernible genomic footprints but may be detectable in aggregate. These considerations point to the importance of examining polygenic signals of selection on traits during human evolution via a phenotype-focused strategy (Fig 3C) [68].

If all or most trait-influencing variants can be identified in an unbiased manner, signals at these loci can be interrogated jointly to uncover selection on the trait. The most straightforward idea for detecting polygenic adaptation is to directly combine GWAS results and population genetic summary statistics (e.g., some of those in Fig 2C) [15,19,33–35,69]. Common approaches include tests for shifts in the distribution of single-locus summary statistics indicative of selection (e.g., FST) at GWAS hits [69], or correlations between GWAS summary statistics (such as effect direction, magnitude, and significance level) and population genetic summary statistics. This approach has been applied to both present-day and ancient DNA data, and several studies explicitly leveraged population admixture events in recent human history to gain insights into the timing of selection [15,19,33–35,70]. Overall, these studies found consistent evidence of selection on variants underlying anthropometric, pigmentation, and immune-related trait variation in human populations in the past 10,000 years.
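One simple implementation of the approach described above asks whether a selection statistic at trait-associated SNPs is shifted relative to its distribution at random sets of control SNPs, as in the permutation sketch below. The data and the function name are hypothetical, and published tests additionally match control SNPs on allele frequency, linkage disequilibrium, and other covariates.

import numpy as np

def enrichment_p(stat_hits, stat_controls, n_perm=10_000, seed=0):
    """Permutation p-value: is the mean statistic at GWAS hits higher than at
    random control sets of the same size?"""
    rng = np.random.default_rng(seed)
    observed = stat_hits.mean()
    null = np.array([
        rng.choice(stat_controls, size=stat_hits.size, replace=False).mean()
        for _ in range(n_perm)
    ])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# Hypothetical Fst values at 50 GWAS hits and 5,000 background SNPs.
rng = np.random.default_rng(3)
controls = rng.beta(0.5, 10, 5_000)     # genome-wide background distribution
hits = rng.beta(0.8, 10, 50)            # distribution shifted slightly upward at hits
print(enrichment_p(hits, controls))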

Rooted in the classic quantitative genetics model, more direct tests for polygenic adaptation have been devised around the concept of the “genotypic value” (also known as the “breeding value” in quantitative genetics when nonadditive genetic effects are ignored), which describes the total contribution of all genetic variants of an individual to their phenotypic value. The polygenic score (PGS)—the sum of allele effect sizes across all independent GWAS loci—provides a proxy for the genotypic value that can be applied at the individual or population level. In addition to empirical comparisons of observed PGS to a null distribution based on sets of matching variants [71,72], formal tests for polygenic signals of selection on quantitative traits have been developed in the population genetics framework [73,74]. In a way, these tests are analogous to tests for single-locus selection, but instead of rapid change or differentiation of allele frequency, signals of polygenic adaptation come from unexpected changes or overdispersion of PGS in the history of one or multiple populations [73,74].
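Computationally, the PGS is just a weighted allele count, as the minimal sketch below shows with hypothetical effect sizes and genotypes. Tests of the kind described above then ask whether the means of such scores across populations, or their changes over time, are more dispersed than genetic drift alone would allow.

import numpy as np

def polygenic_score(dosages, betas):
    """PGS = sum over independent GWAS loci of (per-allele effect size x allele dosage).

    dosages: (n_individuals, n_loci) matrix of 0/1/2 allele counts.
    betas:   (n_loci,) vector of per-allele effect sizes from a GWAS.
    """
    return dosages @ betas

# Hypothetical data: 1,000 individuals scored at 200 independent trait-associated loci.
rng = np.random.default_rng(4)
betas = rng.normal(0, 0.05, 200)
dosages = rng.binomial(2, 0.4, size=(1_000, 200))
pgs = polygenic_score(dosages, betas)
print(pgs.mean(), pgs.std())    # population-mean PGS and its spread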

GWAS results have also been explicitly incorporated into the coalescent framework. By combining GWAS effect sizes and inferred local genealogical trees at GWAS loci, Edge and Coop developed methods for reconstructing the trajectory of population-mean PGS over time [75]. They applied these methods to test for polygenic signals of selection for increased height in the British population but found only very weak signals concordant with prior reports [15,70,73,74]. Taking a different approach, Stern and colleagues extended their method CLUES to estimate selection intensity on a polygenic trait by considering the allele frequency trajectories of GWAS loci conditional on the inferred local coalescent trees [76]. Contrary to the conclusions of prior studies, this method detected no signal of recent directional selection on height or body mass index, but it replicated signals for some other traits previously reported to be under recent selection, such as pigmentation traits, age at first birth, glycated hemoglobin, and educational attainment. By combining the theory of quantitative genetics and population genetics and incorporating empirical GWAS findings, these new methods have unveiled many signals of selection on quantitative traits during recent human evolution and are paving the way for many more future findings.

Correlation between phenotypes and fitness components

Analogous to the fitness-focused approach for detecting ongoing selection at individual loci, selection effects on a polygenic trait can be estimated from phenotypic or genetic correlations between the trait and a proxy for a fitness component [77]. Approaches include regression of a measure of reproductive success on PGSs for traits of interest [78,79] or estimation of the genetic correlation between traits of interest and proxies for fitness [80,81]. Partially consistent with previous epidemiological studies, these studies found selection in contemporary human populations for genetic variants underlying earlier age at first birth and shorter stature in females, as well as for those underlying increased body mass index and reduced educational attainment in both sexes.
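The regression approach has a simple quantitative-genetic reading: if relative fitness (e.g., number of children divided by the cohort mean) is regressed on a standardized PGS, the slope equals the covariance between relative fitness and the score, which by the Robertson–Price identity is the within-generation selection differential on the score. The sketch below is a hypothetical simulation of that calculation, not a reanalysis of any dataset.

import numpy as np

rng = np.random.default_rng(5)
n = 100_000
pgs = rng.normal(0, 1, n)                              # standardized PGS (mean 0, SD 1)
children = rng.poisson(np.exp(0.5 - 0.02 * pgs))       # hypothetical weak selection against high scores
rel_fitness = children / children.mean()

# Slope of relative fitness on the standardized score = cov(w, PGS) / var(PGS);
# by the Robertson-Price identity this is the selection differential on the score.
slope = np.cov(rel_fitness, pgs)[0, 1] / np.var(pgs)
print(slope)    # ~ -0.02 under these hypothetical parameters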

Hypothesizing that variants influencing polygenic traits may be under sexually antagonistic selection on viability, Zhu and colleagues developed a test based on between-sex allele frequency differences and sex-specific phenotypic effect sizes from GWAS. They found suggestive signals of selection on testosterone levels [82], which is consistent with recent findings of a positive correlation between testosterone level and mortality in females and an inverse relationship in males [83]. Nonetheless, because the model makes some strong assumptions, such as allele frequencies being at equilibrium and selection coefficients proportional to phenotypic effect sizes, it remains questionable whether the detected signal is specific to sexually antagonistic selection or could also reflect the effects of other evolutionary processes.

Challenges in validating and interpreting polygenic signals of selection on quantitative traits

Despite significant progress in detecting polygenic adaptation in the past decade, serious concerns quickly emerged regarding the validity and interpretation of the reported signals, for both technical and conceptual reasons [84]. First, technical biases in GWAS may lead to false positive signals or biased effect size estimates at individual loci. For example, the strong signals of selection on height in Europeans were found to result largely from uncorrected population stratification and weakened considerably when effect size estimates from GWAS of less-structured samples were used [75,85,86]. The inherent ascertainment bias and limited portability of GWAS results cast additional uncertainty on the reliability of selection signals when GWAS summary statistics from a study group are applied to selection tests in other groups [84,87,88]. Furthermore, although intuitive and powerful for combining information across sites, PGSs, especially those constructed with variants that do not reach genome-wide significance, can further exacerbate biases of GWAS results due to residual population stratification [89].

Moreover, most current methods fundamentally test for deviation from neutrality (i.e., no selection on any trait or variant at all), so the detected signals may reflect effects of other modes of selection. Despite the debate on the prevalence of polygenic adaptation, there is a consensus that GWAS variants with large effect sizes are under negative selection, indicated by the strong negative correlation between variant effect size and minor allele frequency (beyond the expectation under detection bias) [ 90 – 92 ]. This phenomenon is consistent with the action of stabilizing selection ( Box 1 ) on quantitative traits: For a population centered around the phenotypic optimum, mutations that affect fitness-relevant phenotypes tend to shift the population away from the optimum and thus be deleterious [ 93 ]. The prevalence of stabilizing selection leads to challenges in detection and interpretation of population differences in PGS. First, under stabilizing selection, adaptive genetic changes do not always mirror shifts in the phenotypic optimum. Environmental changes can alter not only the optimal trait level ( Fig 1 ; dashed purple arrow) but also the mean environmental contribution to the phenotype ( Fig 1 ; solid purple arrow), which induces “genetic compensation” in the opposite direction [ 84 , 94 ]. Second, although stabilizing selection around the same trait optimum constrains phenotypic differentiation between populations, it accelerates genetic differentiation at trait-influencing loci. This counterintuitive effect of stabilizing selection, combined with incomplete and biased ascertainment of GWAS loci, inflates differences in PGS between populations and may even yield spurious signals of polygenic adaptation [ 95 ]. These considerations underscore the importance of regarding stabilizing selection (with a constant trait optimum throughout time and space) as a null model for devising and interpreting tests for polygenic adaptation, especially those reliant on inter-population comparisons.

Even when the selection signals are technically sound and effects of stabilizing selection are adequately considered, it remains a formidable challenge to tell which traits are directly under selection, given the prevalent pleiotropy (Box 1) across human complex traits [96–98]. Aware of this issue, researchers developed methods that aim to disentangle the effects of selection on genetically correlated traits and found evidence of indirect selection (e.g., signals of selection on educational attainment due to selection on other traits) and opposing selection (e.g., selection for increased type 2 diabetes risk and decreased glycated hemoglobin), which can help reject the hypothesis that a certain trait is under direct selection [76]. Yet, this study only tested for correlated responses in 137 pairs of traits and may have missed signals driven by multi-way pleiotropy or unmeasured traits [96]. In other words, given current data and methods, one can at best conclude that there is selection on variants associated with certain trait(s), but not selection on the trait(s) themselves.

Conclusion and future directions

Rapid growth in genomic datasets and advances in computational techniques have enabled identification of parts of the human genome under very recent or ongoing adaptive selection. Early genome-wide selection scans relied on the cumulative effects of selection over relatively long timescales, but statistical innovations have enabled efficient computation using large numbers of modern genomes to study selection over narrower time frames. The utilization of ancient DNA data has further reduced inference uncertainty and confounding due to demographic history, providing valuable insights into the temporal dynamics and geographic distribution of the selected variants. We now have lists of candidate targets with compelling evidence of selection during the past 15,000 years, along with partial information regarding variation in the selection strength.

As selection on genotypes is mediated by differences in fitness-relevant phenotypes ( Fig 1 ), a complete understanding of selective events involves not only the causal variants but also the relevant phenotypes and selective forces [ 99 ]. Integrating rich phenotypic data with genomic information in population-scale datasets has facilitated the establishment of associations between variants and phenotypes. Following numerous association studies conducted for both organismal and molecular phenotypes, it is increasingly clear that pleiotropy is widespread across human traits [ 97 , 98 , 100 ]. It is possible that many inferred selective variants will be associated with multiple phenotypes in future GWAS, so the new questions will become: which of these phenotypes, if any, is mediating selection; where does the selection pressure come from; and is selection still ongoing in present-day populations?

The expanding biobank datasets will be pivotal in addressing these questions. First, they offer an opportunity to directly identify individual variants, or groups of variants, associated with fitness components. The partial overlap between fitness-associated variants and those targeted by historical positive selection may arise from limited power to detect subtle fitness effects, antagonistic effects on different fitness components (and/or between sexes), or spatial or temporal variation in fitness effects. With the anticipation of long-term longitudinal data, possibly spanning from birth to death, becoming available in the next few decades, it will be possible to develop new statistics that better approximate various fitness components and integrate them throughout a complete life cycle, thus enhancing power to identify variants that influence overall fitness. It is important to note that, since such discoveries are associations in nature, replication in additional biobank datasets or by family-based studies will be crucial.

Second, the rich phenotype data, coupled with theoretical advancements, can potentially distinguish between traits directly or indirectly under selection. Although it remains uncertain which and how many pleiotropically related traits collectively shape the fitness landscape, emerging evidence suggests that, at least for some traits, a model featuring many traits under stabilizing selection aligns well with the empirical GWAS results [3]. These considerations strongly advocate for incorporating pleiotropy alongside stabilizing selection in future models and simulations that characterize genetic signatures of polygenic adaptation [101,102]. Findings from such models, combined with variant-level pleiotropic effect size estimates from empirical association studies, may unveil clearer adaptation signals and help differentiate between traits directly or indirectly influenced by selection.

Lastly, given the emerging evidence of sex differences in the phenotypic and fitness effects of the same variant [4,82], along with the varying prediction accuracy of PGSs across different contexts (e.g., age, sex, income level) [88], more context-dependent effects will likely be unmasked. These findings may imply gene-by-environment interactions on phenotype and fitness, hinting at the environmental conditions that exert selective pressure. This information, when combined with archeological data about past environments, diets, and lifestyles of human populations, may aid in rejecting and formulating new hypotheses regarding the recent selective forces that have shaped human genomic and phenotypic variation.

Acknowledgments

Thanks to Iain Mathieson for helpful discussion and critical feedback on the manuscript.



Why and How to Switch to Genomic Selection: Lessons From Plant and Animal Breeding Experience

Aline Fugeray-Scarbel, Catherine Bastien, Mathilde Dupont-Nivet, Stéphane Lemarié


Edited by: José Manuel Yáñez, University of Chile, Chile

Reviewed by: Piter Bijma, Wageningen University and Research, Netherlands; Rounak Dey, Harvard University, United States

*Correspondence: Mathilde Dupont-Nivet, [email protected]

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received 2020 Nov 15; Accepted 2021 Jun 11; Collection date 2021.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The present study is a transversal analysis of the interest in genomic selection across plant and animal species. It focuses on the arguments that may convince breeders to switch to genomic selection. The arguments are classified into three different “bricks.” The first brick considers the addition of genotyping to improve the accuracy of the prediction of breeding values. The second consists of saving costs and/or shortening the breeding cycle by replacing all or a portion of the phenotyping effort with genotyping. The third concerns population management to improve the choice of parents, either to optimize crossbreeding or to maintain genetic diversity. We analyse the relevance of these different bricks for a wide range of animal and plant species and seek to explain the differences between species according to their biological specificities and the organization of breeding programs.

Keywords: prediction accuracy, duration of breeding cycle, selection intensity, breeder’s equation, breeding organization, breeding program, selection costs, Mendelian sampling

Introduction

Genomic selection (GS) was first introduced by Lande and Thompson (2000) and popularized by Meuwissen et al. (2001) . This method is based on the use of high-density single nucleotide polymorphisms (SNP) genotyping to predict breeding values. The development of a genomic breeding program requires two steps: (1) in a reference population, individuals are genotyped and phenotyped, and a statistical model is then built to estimate SNP effects on phenotypes and develop corresponding prediction equations; and (2) new candidates for selection may or may not be phenotyped but are always genotyped, and their breeding values are predicted using prediction equations and phenotypes when available.
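The two steps map naturally onto a penalized regression: marker effects are estimated in the genotyped-and-phenotyped reference population, and genomic breeding values are then predicted for genotyped-only candidates. The sketch below uses ridge regression in scikit-learn on hypothetical simulated data; it is similar in spirit to RR-BLUP/GBLUP but is not the pipeline of any particular breeding program, and the penalty value and data dimensions are arbitrary choices for illustration.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
n_ref, n_cand, n_snp = 2_000, 500, 5_000

# Step 1: reference population, genotyped and phenotyped; fit marker effects.
G_ref = rng.binomial(2, 0.3, size=(n_ref, n_snp)).astype(float)
true_effects = rng.normal(0, 0.05, n_snp)
y_ref = G_ref @ true_effects + rng.normal(0, 5.0, n_ref)     # polygenic trait plus noise

snp_means = G_ref.mean(axis=0)
model = Ridge(alpha=float(n_snp))     # heavy shrinkage of per-SNP effects, akin to RR-BLUP
model.fit(G_ref - snp_means, y_ref - y_ref.mean())

# Step 2: selection candidates, genotyped only; predict genomic estimated breeding values.
G_cand = rng.binomial(2, 0.3, size=(n_cand, n_snp)).astype(float)
gebv = model.predict(G_cand - snp_means)
print(gebv[:5])                       # candidates would be ranked by these predictions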

Genomic selection was first implemented in dairy cattle. GS methods were developed quickly there because the breeding value of bulls for milk production could be estimated early and precisely through genomic prediction equations rather than later through a costly progeny test (Schaeffer, 2006; Hayes et al., 2009; Venot et al., 2016; Wiggans et al., 2017). However, GS is based on generic methods and technology and can potentially be implemented for the breeding of any plant or animal species, as long as breeding aims at improving polygenic traits. Indeed, selection for many plant species addresses polygenic traits. This is the case for field crop species with yield and other traits (e.g., adaptation to climate, size). For vegetables, major traits are controlled by numerous quantitative trait loci (QTL) [see Zhao et al. (2019) for quality traits and Bai et al. (2018) for biotic and abiotic stresses]. However, some traits of interest in plants are monogenic, such as the color or form of the harvested organ (vegetables) and some disease resistances (crops or vegetables). GS is not relevant for improving these monogenic traits.

Depending on the species, the organization of the breeding programs and the constraints on breeding may be very different. As a consequence, GS can be implemented differently to relax these species- and program-specific constraints. For example, annual crops have short life cycles, but phenotyping these crops to cover environmental variability between locations and years and to control for genotype-by-environment interactions is very costly. The life cycle is also short for some animal species, such as poultry or pigs, but a major constraint in these species is the lethal phenotyping of specific traits of economic interest; such traits are therefore evaluated from the performance of sibs. Finally, the best strategy for implementing GS for a given species can change over time according to the evolution of technology, knowledge, and breeding goals.

Many articles have explored the advantages and drawbacks of GS, focusing on a single species or a limited set of related species (Meuwissen et al., 2013; Heslot et al., 2015; Lin et al., 2016). These reports are often prospective, being produced before the actual implementation of GS, rather than retrospective. Additionally, they often list all the advantages irrespective of their importance. Thus, they do not identify the reasoning that ultimately or potentially convinces stakeholders to switch to GS. Finally, with the exception of the work by Jonas and de Koning (2013) and Hickey et al. (2017), to our knowledge, there has been no joint analysis of the arguments for GS in plant and animal species.

In this article, our aim is to propose a common framework for analysing the multiple arguments for implementing GS in a large range of plant and animal species for the breeding of polygenic traits. Our first aim is to synthesize the basic arguments justifying the potential interest in using genomic information in any plant or animal breeding program. These arguments are defined as complementary basic bricks that can possibly be combined for the breeding of any species of interest in agriculture. As far as possible, we focus on the tipping point that is specific to each species and not on a whole range of possible applications of GS in the future. Before presenting these bricks and their importance, we first briefly review key features of the biological specificities and organization of breeding programs for the various species.

This study is based on an analysis of the literature and on multiple exchanges within the expert group R2D2 supported by the INRAE SELGEN metaprogram 1 . This group included French researchers in the field of GS (including geneticists as well as economists) in a wide range of commercially selected animal and plant species (dairy and beef cattle, dairy and meat sheep, dairy goats, pigs, horses, laying hens, broilers, fish, wheat, rice, maize, peas, forage crops, forest trees, fruit trees, oil palm trees, tomato, and grapevine). Members of this group have met twice a year since 2012, with part of each meeting being devoted to the discussion of GS strategies for each species and the comparison of these strategies among species.

Materials and Methods

Biological and breeding organization specificities.

Before analyzing the different arguments for the interest in GS implementation, it is important to review the major differences in breeding organizations and constraints among species. These differences are due to the biological specificities of each species and to the different types of selection products that are commercialized (e.g., seed, artificial insemination semen, young plants, broodstock, and juveniles). These selection products are generally such that they enable the wide diffusion of genetic gains at a rather low cost, according to the known biological specificities of each species and the available technologies.

The first important characteristic of the breeding organization is the duration of the breeding cycle. A breeding cycle is initiated by crosses/mating between selected parents, and the duration of this cycle is the time between two initial crosses. This duration does not take into account the duration of commercial development, which may necessitate extra generations for multiplying selected plants or animals. For various reasons, the breeding cycle duration varies from 1 year to one or several decades. Here, we review the main constraints determining the duration of the breeding cycle.

The breeding cycle duration is first constrained by the age at sexual maturity, which determines the shortest possible length of time between two successive generations and is incompressible due to biological limits. For animals and trees, a breeding cycle corresponds to one generation. For crops and vegetables, the breeding cycle typically encompasses several generations to produce enough seeds for repeated field trials at multiple locations. Additionally, when doubled haploids cannot be produced at a reasonable price (e.g., peas), several generations are necessary to obtain homozygous genotypes, which is usually a requirement for obtaining intellectual property rights and variety registration.

The duration of the breeding cycle can also be constrained by the way in which phenotyping is implemented. For example, cycles are longer when offspring evaluation is required to select males on the basis of female traits (e.g., male selection for dairy production in ruminants) or when backward selection is necessary (e.g., in forest tree breeding). This is also the case when selection is based on values obtained from hybrid combination (e.g., in maize). Cycles can also be long when some traits can only be phenotyped at late stages (e.g., wood production for forest trees, persistence of forage crops, and horse sport performances).

Finally, taking into account all the constraints described above, the duration of the breeding cycle in animal species before the implementation of GS is highly variable, ranging from 1 year for broilers to 2 years for laying hens and pigs, 2–4 years for most fish species, 4–5 years for small ruminants, 5–6 years for cattle, and up to 10–11 years for horses. In plant species, the duration of the breeding cycle is always long, ranging from 8 to 10 years for wheat and ryegrass (or 5 years when doubled haploids are used) to the longest durations of up to 20 years for a selection cycle in fruit or forest trees.

The second important characteristic of breeding programs concerns genetic evaluation. Two factors, evaluation cost and accuracy, are to some extent interrelated, because accuracy can be improved by spending more on the evaluation of each candidate. However, accuracy depends primarily on the traits that must be evaluated to meet the needs of farmers, stakeholders, and society. Some traits are difficult to evaluate precisely because they exhibit low heritability (e.g., pig prolificacy) or strong genotype × environment interactions, because they are difficult to measure (e.g., resistance to Aphanomyces in pea, feed efficiency in fish, sensory quality in tomato, and resilience to drought in forest trees), or because their measurement is lethal so that they can only be recorded on sibs (e.g., meat quality or animal disease resistance). The evaluation cost is related to the traits that need to be evaluated but is also determined by the species involved (linked to size, prolificacy, and age at sexual maturity) and by the possibility of controlling environmental effects: this cost is very high for bulls or trees, quite low for fish or poultry, and intermediate for crops.

Genomic Selection Bricks

Based on the analysis of the evolution of breeding programs in numerous animal and plant species, we hypothesize that the switch from classical selection to GS was, in each case, initially driven by one main, simple argument. We describe a small set of possibilities that is sufficient to cover all situations and analyse why and how GS has been, or could be, implemented in different species. Each possibility, which we refer to as a brick, is a specific argument for adopting GS. A brick should not be taken as a detailed description of reality, since genomic breeding programs generally combine several bricks; rather, bricks are simple alternatives, one of which is implemented first and convinces stakeholders to switch to GS. In every scenario considered here, we assume that a reference population is available, composed of individuals that are phenotyped and genotyped with a technology providing enough SNPs.

Three bricks are proposed. These bricks correspond to decisions made by the breeder when defining the breeding scheme. Two bricks directly concern the improvement of traits: adding genotyping while retaining phenotyping in order to increase accuracy (brick A), or replacing all or part of the phenotyping effort with genotyping in order to reduce the cost of evaluating candidates (brick B). The last brick (brick C) aims at improving the choice of parents, either to optimize crossbreeding or to preserve genetic diversity; it concerns both the improvement of traits and population management.

Each of the three bricks impacts some of the parameters of the breeder's equation, which describes the expected annual genetic gain for recurrent selection of polygenic traits:

ΔG = (i × r × σ_G) / T

where ΔG is the genetic gain per year, i is the intensity of selection, r is the accuracy of selection, σ_G is the genetic standard deviation, and T is the duration of the breeding cycle. Thus, to improve ΔG, one must increase i, r, or σ_G, or decrease T. The effects of the bricks on the parameters of the breeder's equation are summarized in Table 1. Each brick has several positive impacts, which are denoted by "+" for a direct impact and "(+)" for an indirect impact. These impacts are explained in more detail below.

Table 1. Impact of each brick on the parameters of the breeder's equation.

(A) Add genotyping to increase selection accuracy: r +; i (+); T (+)
(B) Replace all or part of the phenotyping effort with genotyping: i +; T +
(C) Improve the choice of parents to optimize crossbreeding or preserve genetic diversity: σ_G +

i, intensity of selection; r, accuracy of selection; σ_G, genetic standard deviation; T, duration of the breeding cycle. Impacts are denoted by "+" for a direct impact and "(+)" for an indirect impact.
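To make these trade-offs concrete, the minimal sketch below evaluates the breeder's equation for a few scenarios; all parameter values are hypothetical, chosen only to illustrate how each brick acts on ΔG, and are not taken from any particular breeding program.

```python
def annual_genetic_gain(i, r, sigma_g, t):
    """Breeder's equation: expected genetic gain per year, dG = i * r * sigma_g / t."""
    return i * r * sigma_g / t

# Baseline: classical selection with late phenotyping/progeny testing (illustrative values only).
baseline = annual_genetic_gain(i=1.4, r=0.90, sigma_g=1.0, t=6.0)

# Brick A: genotyping added on top of phenotyping -> higher r, slightly more candidates (higher i).
brick_a = annual_genetic_gain(i=1.6, r=0.95, sigma_g=1.0, t=6.0)

# Brick B: candidate phenotyping replaced by genotyping -> lower r but a much shorter cycle.
brick_b = annual_genetic_gain(i=1.4, r=0.70, sigma_g=1.0, t=2.5)

for name, gain in [("baseline", baseline), ("brick A", brick_a), ("brick B", brick_b)]:
    print(f"{name}: dG = {gain:.2f} genetic SD per year ({gain / baseline:.2f}x baseline)")
```

Even with a lower accuracy, the brick B scenario gains substantially per year because T sits in the denominator, which is consistent with the early adoption of this brick in species with long cycles, as discussed below.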

Brick A: Adding Genotyping to Increase Selection Accuracy, r

In this brick, all candidates for selection or their relatives (for the evaluation of lethal or sex-specific traits) are still phenotyped, but genotyping is added to improve the accuracy of the estimated breeding values of candidates for selection. Hence, additional costs related to genotyping are incurred to increase selection accuracy. This brick is interesting in two main situations.

The first situation corresponds to cases where the direct phenotyping of candidates for selection is not possible (e.g., when measurement is lethal, as for some disease resistance traits, or for sex-limited traits). In these cases, information from related individuals is traditionally used to predict breeding values. In practice, this situation occurs when full or half sibs are phenotyped in a collateral test or when candidates are chosen on the basis of pedigree information. Without genomic information, all individuals of the same family receive the same estimated breeding value. However, because of Mendelian sampling, individuals of the same family do not all have the same genotype, and they differ in their "true" breeding values. Thus, in conventional breeding programs where candidates are evaluated from their relatives, genetic gains are limited because within-family genetic variability cannot be exploited. GS makes it possible to discriminate between full sibs on the basis of molecular information, which improves selection accuracy.

Brick A is interesting in a second type of situation in which traits are complicated to measure, and/or a large amount of data is required to accurately predict breeding values. This is the case for traits with low heritability or those characterized by high genotype by environment interactions, leading to imprecise breeding values or costly phenotyping designs. In this case, genomic prediction is of particular interest to increase the accuracy of breeding values. In this brick, the phenotyping of candidates (and possibly their relatives) is maintained for all individuals but can be reduced to some extent (e.g., fewer repetitions, earlier measurements, and fewer environments) to lower costs and/or decrease the difficulty of phenotyping.

The main expected effect of brick A is the improvement of the accuracy, r, due to the genomic information. It may therefore be worthwhile for the breeder to decrease the precision of the phenotypic measurements (which are still collected on all candidates), freeing resources to evaluate more candidates and, consequently, to apply a higher intensity of selection, i. In some cases, the reduction of the phenotyping effort can be achieved through earlier phenotyping, which may also shorten the duration of the breeding cycle, T.
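A minimal simulation (all locus numbers, allele frequencies, and effect sizes are invented for illustration) shows the Mendelian-sampling argument behind the first situation: without genomic data, full sibs share a single pedigree-based expectation, whereas their realized breeding values differ, and it is this within-family variation that GS can exploit.

```python
import numpy as np

rng = np.random.default_rng(42)
n_loci, n_sibs = 1000, 20

# Parental genotypes coded as two haplotypes of 0/1 alleles per locus (hypothetical frequencies).
freq = rng.uniform(0.1, 0.9, n_loci)
sire = rng.binomial(1, freq, size=(2, n_loci))
dam = rng.binomial(1, freq, size=(2, n_loci))

# Additive marker effects drawn at random, purely for illustration.
effects = rng.normal(0, 0.05, n_loci)

def gamete(parent):
    # Mendelian sampling: each locus transmits one of the two parental alleles at random.
    pick = rng.integers(0, 2, n_loci)
    return parent[pick, np.arange(n_loci)]

sibs = np.array([gamete(sire) + gamete(dam) for _ in range(n_sibs)])  # allele dosages 0/1/2
true_bv = sibs @ effects

# Without genomic data, every full sib gets the same pedigree-based expectation (mid-parent value).
mid_parent = 0.5 * ((sire.sum(axis=0) + dam.sum(axis=0)) @ effects)

print(f"pedigree expectation (identical for all sibs): {mid_parent:.2f}")
print(f"true breeding values range from {true_bv.min():.2f} to {true_bv.max():.2f} "
      f"(sd = {true_bv.std():.2f}), variation that GS can exploit but pedigree cannot")
```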

Brick B: Replace All or Part of the Phenotyping Effort With Genotyping to Increase i or Decrease T

Another advantage of introducing genomic information is that large reductions in the phenotyping effort can be achieved by removing phenotyping for all, or a large fraction, of the selection candidates. The impacts of this brick are a decrease in phenotyping costs and the possibility of performing selection very early. It relies on the assumption that genomic prediction accuracy is reasonably close to the accuracy of phenotype-based evaluation. It must be emphasized that the reduction of phenotyping concerns the candidates only: the phenotyping effort must be maintained in the reference population to keep the prediction equations accurate. Two versions of this brick are possible, depending on whether all or only part of the phenotyping effort is replaced by genotyping.

In the first situation, phenotyping of the candidates is fully replaced by genotyping. In this case, the duration of the breeding cycle decreases because genotyping enables early selection, without waiting for phenotypes to be measured. For example, progeny testing can be removed for species or traits for which the progeny-testing procedure is lengthy and extends the cycle well beyond the age at sexual maturity. However, the gain in T is constrained by the age at sexual maturity (T cannot be lower than this age) and/or by the need to include several generations in one selection cycle, for example, to produce enough seeds for field trials in crops.

In the second situation, genomic prediction is used as a pre-screening step to eliminate the worst candidates. The pre-screened candidates are then selected using phenotypes and genotypes together. More candidates for selection can thus be produced, leading to an increase in the selection intensity, i. Note that the combined use of phenotypes and genotypes for the pre-screened candidates corresponds to brick A and can therefore also improve the accuracy, r. Pre-screening is thus a particular case in which two bricks are combined: the primary objective corresponds to brick B, but the genomic prediction used for pre-screening also enables brick A to be implemented at no extra cost.

The expected impact of this brick is mainly a decrease in T or an increase in i. The reason for the decrease in T is quite different from that under brick A. Under brick A, T decreases because phenotyping can be performed earlier, but all candidates are still phenotyped; under brick B, T decreases because the primary objective is to remove or greatly limit phenotyping. Brick B can potentially have negative effects on the accuracy, r. The change in r therefore needs to be estimated and balanced against the gains in i and T to evaluate the overall interest of brick B.
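The effect of pre-screening on i can be illustrated with standard truncation-selection theory, in which the selection intensity is i = φ(x)/p for a selected proportion p and truncation point x = Φ⁻¹(1 − p). The sketch below uses made-up candidate numbers and, for simplicity, treats the two-stage selection as a single truncation at the overall selected proportion.

```python
from scipy.stats import norm

def selection_intensity(p):
    """Truncation selection on a normally distributed criterion: i = phi(x) / p, x = Phi^-1(1 - p)."""
    x = norm.ppf(1.0 - p)
    return norm.pdf(x) / p

n_selected = 50        # parents finally retained (hypothetical)
n_phenotyped = 500     # candidates that can be phenotyped under the classical scheme
n_genotyped = 5000     # cheaper genotyped candidates available with genomic pre-screening

i_classical = selection_intensity(n_selected / n_phenotyped)
i_prescreen = selection_intensity(n_selected / n_genotyped)

print(f"classical   : p = {n_selected / n_phenotyped:.3f}, i = {i_classical:.2f}")
print(f"pre-screened: p = {n_selected / n_genotyped:.3f}, i = {i_prescreen:.2f}")
```

With p dropping from 10% to 1% in this toy setting, i rises from roughly 1.8 to roughly 2.7; whether such a gain outweighs any loss in r is exactly the balance described above.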

Brick C: Improve the Choice of Parents to Either Optimize Crossbreeding or Preserve Genetic Diversity

This brick mainly concerns the management of genetic diversity. GS can be used to help choose which parents to cross. Under classical selection, because of budget constraints, only a fraction of the possible crosses can be performed. Two versions of this brick are possible, depending on whether the objective is short-term genetic progress or the long-term preservation of genetic diversity and maximization of recombination.

In the first case, especially in crops, the selected candidates come from the abundant and diverse progeny of a cross between two parents. Breeders have to choose the parents that are crossed at the beginning of a new breeding cycle. To obtain the best possible progeny, the parents must not only present the highest breeding values but must also show a good level of complementarity. The choice of complementary parents can be made on the basis of phenotypic information (i.e., choosing parents with complementary performance), but it can be improved significantly by using genomic information.

In the second case, genotypes provide access to genomic relationships, which are more precise than classical pedigree relationships and, thus, enable better management of genetic variability based on knowledge of the realized kinship between individuals.

The expected impact of this brick is mainly an increase in σ G . In the short term, the objective is to better exploit current genetic diversity. In the long term, the objective is to maintain genetic diversity and the level of σ G after several breeding cycles. This expected gain from brick C is of potential interest for all species.
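As an illustration of how the genomic information behind brick C is typically summarized, the sketch below computes a genomic relationship matrix from simulated SNP dosages using VanRaden's first method; the marker data, allele frequencies, and dimensions are invented, and the source text does not prescribe this particular formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_snp = 8, 2000

# Simulated SNP dosages (0/1/2); in practice these come from genotyping the candidates.
p = rng.uniform(0.05, 0.95, n_snp)                  # allele frequencies (here: simulated)
M = rng.binomial(2, p, size=(n_ind, n_snp)).astype(float)

# VanRaden (2008), method 1: G = Z Z' / (2 * sum p(1-p)), with Z the centred dosage matrix.
Z = M - 2.0 * p
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# Average relationship of each candidate to the rest: a simple input for diversity management.
off_diag = G - np.diag(np.diag(G))
mean_rel = off_diag.sum(axis=1) / (n_ind - 1)
print("genomic inbreeding-like diagonal:", np.round(np.diag(G) - 1.0, 3))
print("mean relationship to other candidates:", np.round(mean_rel, 3))
```

The diagonal of G relates to genomic inbreeding, and the off-diagonal elements (realized kinships) are the quantities used to constrain the loss of diversity when choosing parental contributions.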

For each animal or plant species, we determined which brick was initially used to implement GS, or which brick could be used first where GS has not yet been implemented. This analysis is based on the literature and on discussions within the R2D2 expert group. The literature includes articles reporting how GS has been implemented, articles assessing the efficiency of particular strategies through simulations or specific experiments, and articles in which experts discuss how GS could be implemented for a given species.

For brick A, the first situation we considered is when the direct phenotyping of candidates for selection is not possible. It occurs in animal species such as fish and pigs, in which the evaluation of sib performance is widely applied and increases selection accuracy ( Sonesson and Meuwissen, 2009 ; Tribout, 2011 ; Robledo et al., 2018 ). A related approach could be interesting for forest trees ( Plomion et al., 2016 ) or forage plants when polycrosses (crosses using a mixture of pollens) are implemented: GS allows pedigrees to be reconstructed, in particular by identifying fathers, thus increasing the accuracy of selection ( Riday, 2011 ; Vidal et al., 2017 ).

Brick A is also interesting in a second type of situation, in which traits are complicated to measure and/or a large amount of data is required to accurately predict breeding values. The interest of GS for improving selection accuracy for traits with low heritability is frequently highlighted for animal species such as beef cattle ( Garrick, 2011 ), poultry ( Wolc et al., 2016 ), or pigs ( Tribout, 2011 ). In pigs, GS has actually been implemented in the Landrace breed in France since 2016 for the selection of reproduction traits, which exhibit low heritability ( Bouquet et al., 2017 ). Bouquet et al. (2017) showed that, even though accuracy remains modest for these low-heritability traits, it is much higher than with pedigree-based evaluation. The use of GS to better select for low-heritability traits is also reported in plant breeding ( Endelman et al., 2014 ; Michel et al., 2017 , for baking quality in winter wheat). Nevertheless, the plant breeding literature more often highlights the interest of GS for controlling genotype-by-environment interactions ( Crossa et al., 2017 ; Rutkoski et al., 2017 ). As considerable environmental variation may occur from one year to the next, genomic prediction based on data from several years may help control for this interaction ( He et al., 2016 ). GS is also used to improve predictions when the objective is to develop hybrid varieties in plants (e.g., maize) or to cross purebred parental lines in animals (e.g., poultry or pigs). GS can be used to predict crossing ability in cases where it has not yet been evaluated (e.g., in pigs; Tribout, 2011 ; Samorè and Fontanesi, 2016 ; Tusell et al., 2016 ) or to improve such predictions by predicting the specific combining ability (SCA) of pairs of individuals, in addition to their general combining ability (GCA) ( Bernardo, 1994 ; Zhao et al., 2015 ; Kadam and Lorenz, 2018 ; Seye et al., 2020 ).

Globally, in the history of GS, brick A was not the first to be implemented, because it generally increases the total cost of breeding. The adoption of this brick therefore took more time, as it was necessary to carefully estimate the benefit of GS and convince breeders of its usefulness. Nevertheless, the interest of this brick increases as the cost of genotyping decreases.

Concerning brick B, the strategy in which phenotyping is fully replaced by genotyping has been used, or is planned, in species with long breeding cycles, such as dairy cattle and trees. In dairy cattle, the introduction of GS in France in 2009 changed the organization of the breeding program and greatly decreased T, from 5 years to 2–2.5 years for the male pathway. It allowed progeny testing to be reduced, as the breeding values of young bulls could be estimated at birth, without waiting to observe their daughters' performance ( Schaeffer, 2006 ; Hayes et al., 2009 ; Venot et al., 2016 ), which also led to a considerable decrease in cost. Because of these obvious advantages, this scenario represented the first large-scale implementation of GS in many countries. Because it decreases the duration of the breeding cycle, this strategy is also of interest for forage plants ( Annicchiarico et al., 2015 ; Lin et al., 2016 ). For crop species such as wheat, variety development requires several generations of selfing to multiply seeds. One possible strategy is to replace one or more normal breeding cycles with short breeding cycles based only on GS ( Beyene et al., 2015 ; Longin et al., 2015 ; Cericola et al., 2018 ; Pembleton et al., 2018 ). Trees (forest trees and fruit trees) are also species for which the duration of the selection cycle is very long and could be drastically decreased by implementing brick B ( Kumar et al., 2012 ; Biscarini et al., 2017 ; Grattapaglia, 2017 ; Gorjanc et al., 2018 ). Decreasing the duration of the breeding cycle is also an interesting option for the selection of roosters in laying hen breeding ( Le Roy et al., 2014 ).

The other option of brick B, in which genomic prediction is used for pre-screening, has been adopted in dairy sheep and goats, whose selection cycles are shorter than in dairy cattle. In these cases, the choice was made to evaluate more sires by pre-screening candidates on the genomic breeding values of the young sires, with the objective of increasing the selection intensity. The chosen sires are then evaluated using both genomic and pedigree information ( Carillier et al., 2013 ; Baloche et al., 2014 ; Larroque et al., 2014 ). This strategy has been implemented in France since 2015 for sheep and since 2018 for goats. The pre-screening option has also been tested in oil palm and found to be of considerable interest for choosing the individuals with the highest genetic value in hybrid crosses before progeny testing ( Nyouma et al., 2019 ).

For brick C, the first interest is to use genomic information for a better choice of parents, to obtain high genetic gain but also to enhance genetic variability. The key point is to choose new crosses that will have a large progeny variance. Strategies to guide the choice of parents were used in plant breeding long before genomic information became available [see, e.g., Dudley (1984) ]. Bernardo (2014) tested the adaptation of classical methods for choosing parents (classes of loci and the usefulness criterion) to genomic information, through simulation and with data from maize inbreds. The use of genomic information for this purpose has been studied further in plant breeding with inbred lines ( Lehermeier et al., 2017 ; Allier et al., 2019a , b ) and in animal breeding with outbred populations, with applications to dairy cattle ( Santos et al., 2019 ; Bijma et al., 2020 ).
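The usefulness criterion mentioned above is often written UC = μ + i·h·σ (a formulation usually attributed to Schnell and Utz), where μ is the expected progeny mean of a cross, σ its predicted genetic standard deviation, h the selection accuracy within the cross, and i the intensity applied among progeny. The toy comparison below uses invented values only, to show how a cross with a slightly lower mean but more variable progeny can be ranked first.

```python
from scipy.stats import norm

def usefulness(mu, sigma, h, p):
    """Usefulness criterion: expected mean of the selected fraction p of a cross's progeny."""
    x = norm.ppf(1.0 - p)
    i = norm.pdf(x) / p          # selection intensity among progeny
    return mu + i * h * sigma

# Hypothetical candidate crosses: (expected progeny mean, predicted progeny genetic SD).
crosses = {"A x B": (10.0, 1.0), "A x C": (9.5, 2.0), "B x C": (10.2, 0.5)}

for name, (mu, sigma) in crosses.items():
    print(f"{name}: UC = {usefulness(mu, sigma, h=0.8, p=0.05):.2f}")
```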

The second interest of brick C is a better management of genetic variability based on knowledge of the realized kinship between individuals. In classical selection, the loss of genetic diversity is evaluated through the evolution of inbreeding, estimated from pedigrees. Genomic relationships should be more precise than pedigree relationships. However, results are mixed: Sonesson et al. (2012) showed that genomic control of inbreeding is more efficient under genomic selection, whereas Henryon et al. (2019) found that optimum-contribution selection provided more genetic gain when using pedigree rather than genomic relationships, and Meuwissen et al. (2020) showed that different ways of computing genomic relationships lead to different conclusions about their efficiency for managing genetic diversity. In French dairy goats, the joint management of genetic progress and genetic variability has been implemented since 2006, by minimizing pedigree relationships for a desired genetic gain ( Colleau, 2002 ). This method was modified to take genomic relationships into account in 2018, when GS began to be implemented in French dairy goats ( Colleau et al., 2017 ).

Globally, brick C alone does not appear to be a sufficient reason for switching to GS. It is instead a bonus added to the advantages of another, main brick, and the corresponding methods still need to be developed and improved.

Prerequisites for Implementing Genomic Selection?

Even though its level cannot be acted upon in the short term, linkage disequilibrium (LD) is one of the parameters that determine the efficiency of GS. LD is an intrinsic characteristic of a population, a consequence of the evolutionary forces that shaped it. With lower LD, a larger reference population and/or a higher marker density (in both the reference and candidate populations) will be needed to achieve a useful genomic prediction accuracy. LD therefore affects the cost of implementing GS, and very low LD could prevent a switch to GS.
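One way to see how LD and reference population size interact is through a widely used deterministic approximation of genomic prediction accuracy (of the Daetwyler type), r ≈ √(Nh² / (Nh² + M_e)), where N is the reference population size, h² the heritability, and M_e the effective number of independent chromosome segments, which is larger when LD is lower. The values below are purely illustrative.

```python
import math

def expected_accuracy(n_ref, h2, m_e):
    """Deterministic approximation of genomic prediction accuracy (Daetwyler-type formula)."""
    return math.sqrt(n_ref * h2 / (n_ref * h2 + m_e))

h2 = 0.3
for m_e in (500, 5000):              # low LD -> more independent segments -> larger M_e
    for n_ref in (1000, 10000, 50000):
        print(f"M_e = {m_e:5d}, N = {n_ref:6d}: expected accuracy = "
              f"{expected_accuracy(n_ref, h2, m_e):.2f}")
```

Under this approximation, enlarging the reference population helps much less when M_e is large, which is why low-LD populations need disproportionately larger reference populations or denser markers.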

When considering criteria that can be acted upon, the first criterion for moving toward GS is the availability of genotyping tools. Genotyping costs have decreased substantially in recent years, but technological development and large-scale genotyping still represent large investments. Human investment is also needed to acquire and use genomic information. Finally, an efficient reference population is another key point. Although large reference populations have now been established in widely used species or breeds and/or those with high economic impact, this can still be a limiting factor for species with a small market or for small animal breeds. Genotyping investments are affordable to different degrees depending on the market size of the species, the structure of the breeding programs, and the organization of the sector. For example, for dairy cattle, the market size and the huge number of animals all over the world make it possible to pool resources, assemble large reference populations, and achieve low genotyping prices. Moreover, progeny testing in a classical breeding program is very expensive; thus, replacing it with genotyping is of economic interest. These characteristics, together with the opportunity to strongly reduce T, enabled dairy cattle to switch to GS very early. The switch to GS is more difficult for species with smaller markets, those in which selection is carried out by smaller companies, or those in which GS will not decrease the cost of the breeding program. However, as genotyping costs decrease, GS becomes accessible for an increasing number of species. A significant lever is the launch of publicly supported applied research programs for the implementation of GS, which often enable the genotyping of the first commercial candidates and help private breeders switch to GS.

Historically, different bricks have been preferentially chosen as arguments for switching to GS at different times: brick B was the first to be chosen because it can reduce costs very significantly. Brick A came later, and the interest in this brick has increased as genotyping cost has decreased. Finally, brick C is often cited as a very interesting option but does not appear to be important enough to convince stakeholders to switch to GS. However, we expect that this brick will be used more widely in combination with bricks A and/or B to better valorise existing genomic data. Moreover, the theoretical developments related to brick C are more recent, and this brick could therefore become more popular in the future, especially for new traits, which may exhibit low genetic variability in current populations. Globally, one brick is initially decisive in switching to GS, but all benefits of GS are combined thereafter.

Limits of Our Study

The main difficulty of this work is that, for many species, GS has been implemented only very recently or is about to start; it is therefore not always possible to look back at what has occurred in the past. GS is now implemented in most animal species: dairy and beef cattle, dairy goats, fish, pigs, poultry, dairy sheep, and, in some countries, horses. In plants, GS has been implemented mainly in species of major economic importance. Moreover, for some species (maize, wheat, and, to some extent, poultry), selection is performed by private companies, and information on their breeding strategies is therefore limited.

At the beginning of our study, we thought that reviewing the literature and talking with experts would be sufficient to identify the main reason motivating the breeders of a given species to switch from a conventional breeding scheme to one incorporating GS. However, most papers in this field are highly prospective, exploring all possible advantages of GS without ranking them. The other available papers report simulation studies that show the efficiency of different strategies but do not truly discuss actual implementation. Finally, in some species (trees, maize, wheat, and forage plants), different strategies are relevant, and no single option presents a clear advantage over the others. For example, in maize, Windhausen et al. (2012) proposed a decrease in phenotyping effort (brick B), while Lehermeier et al. (2017) demonstrated the interest of brick C, and experts have reported that, in at least one private company, brick A is implemented through a decrease in phenotyping effort (fewer repetitions and fewer environments). In these species, it is possible that different private companies have not chosen the same option for implementing GS.

Consequences of the Biological Specificities/Breeding Organization

The specificities of the general organization of breeding for each species were summarized in the section "Biological and Breeding Organization Specificities." We discuss here the consequences of these specificities for the interest in switching to GS and for the strategy used to implement it.

Switching to GS occurs more easily in contexts where the phenotyping of a candidate is rather costly, so that GS enables savings on phenotyping, at least for some of the candidates. Dairy cattle are a good illustration of this.

Considering the species in which GS has already been implemented, the choice between bricks A and B can be related to biological or breeding organization specificities. For example, among the species in which brick A is implemented, most are characterized by a moderate phenotyping cost and a rather short selection cycle, whereas most of the species in which brick B is implemented are characterized by a long selection cycle and, thus, a high unit cost per candidate. However, these characteristics are not sufficient to predict the brick chosen for GS. For example, in forest and fruit trees, brick B seems at first glance to be a good choice because of the long duration of the breeding cycle ( Kumar et al., 2012 ; Biscarini et al., 2017 ; Grattapaglia, 2017 ). However, because of the late age of sexual maturity in forest trees, brick B can have a limited impact on the duration of the breeding cycle. Meanwhile, implementing brick A, by combining genotyping with new selection criteria for traits that are difficult and costly to phenotype (biotic and abiotic stress resistance, for instance), could be of primary interest ( Lenz et al., 2020 ).

Thus, there is no universal key for choosing the initial brick. This choice is determined by the whole context, including biological specificities, breeding organization, and the economic context (commercial organization and regulations).

Need for More Generic Studies of GS Efficiency

A large part of the literature analysing the reasons for implementing GS is limited to one species, or even to one population or a limited set of closely related species/populations. These specific investigations are necessary to define the best GS strategy for each context. However, a transversal analysis such as the one proposed here can be of interest in at least two situations. The first corresponds to the case where, for a given species, little research has been published on implementing GS. In such a case, the breeder may be interested in what can be learned from other species or populations, and this paper provides a synthetic overview of the options that have been studied or implemented elsewhere. The second corresponds to the case where a reader wants an overview of the possible applications of GS rather than its application in one particular species or population. This can be the case, for example, for scholars from other disciplines such as the social sciences. Such an overview can also help explain the options for implementing genomic selection, in rather simple terms, to a general audience (e.g., undergraduate students and decision makers not directly involved in breeding).

We show here that the arguments for implementing GS are generic and apply to a wide range of species. More precisely, we show that a very limited set of basic common bricks can be used to summarize the reasons for implementing GS in a wide range of animal and plant species. We think that transversal analyses, comparing a given strategy across very different contexts (biological specificities and breeding organization), should be extended. We expect, in the coming years, to have more evidence on the actual implementation of GS for a larger range of species and on its impact. If this is the case, the current analysis could be updated and probably made more rigorous, with a quantitative analysis of the literature. Of course, in the end, stakeholders will make their own choices, which will be specific to each context, but such transversal analyses should help to anticipate the advantages and drawbacks of different strategies.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author Contributions

R2D2 members provided the relevant literature, participated in discussions to compare the breeding programs in different species, and helped to define the bricks and their relevance for different species. SL proposed the original idea of bricks. AF-S performed the bibliographic analysis of the reviewed papers. CB, AF-S, MD-N, and SL analysed all the information (e.g., manuscripts and discussions). MD-N, SL, and AF-S wrote the final manuscript. All authors read and approved the final manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding. This work was funded by the SELGEN INRA metaprogram (R2D2 project–2012–2019).

https://colloque.inrae.fr/metaprograms-workshops_eng/Metaprograms/Selgen

  • Allier A., Lehermeier C., Charcosset A., Moreau L., Teyssèdre S. (2019a). Improving short- and long-term genetic gain by accounting for within-family variance in optimal cross-selection. Front. Genet. 10:1006. 10.3389/fgene.2019.01006 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Allier A., Moreau L., Charcosset A., Teyssèdre S., Lehermeier C. (2019b). Usefulness criterion and post-selection parental contributions in multi-parental crosses: application to polygenic trait introgression. G3 9 1469–1479. 10.1534/g3.119.400129 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Annicchiarico P., Nazzicari N., Li X., Wei Y., Pecetti L., Brummer E. C. (2015). Accuracy of genomic selection for alfalfa biomass yield in different reference populations. BMC Genomics 16:1020. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Bai T., Kissoudis C., Yan Z., Visser R. G. F., van der Linden G. (2018). Plant behaviour under combined stress: tomato responses to combined salinity and pathogen stress. Plant J. 93 781–793. 10.1111/tpj.13800 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Baloche G., Legarra A., Sallé G., Larroque H., Astruc J.-M., Robert-Granié C., et al. (2014). Assessment of accuracy of genomic prediction for French Lacaune dairy sheep. J. Dairy Sci. 97 1107–1116. 10.3168/jds.2013-7135 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Bernardo R. (1994). Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci. 34 20–25. 10.2135/cropsci1994.0011183x003400010003x [ DOI ] [ Google Scholar ]
  • Bernardo R. (2014). Genomewide selection of parental inbreds: classes of loci and virtual biparental populations. Crop Sci. 54 2586–2595. 10.2135/cropsci2014.01.0088 [ DOI ] [ Google Scholar ]
  • Beyene Y., Semagn K., Mugo S., Tarekegne A., Babu R., Meisel B., et al. (2015). Genetic gains in grain yield through genomic selection in eight bi-parental maize populations under drought stress. Crop Sci. 55 154–163. 10.2135/cropsci2014.07.0460 [ DOI ] [ Google Scholar ]
  • Bijma P., Wientjes Y. C. J., Calus M. P. L. (2020). Breeding top genotypes and accelerating response to recurrent selection by selecting parents with greater gametic variance. Genetics 214 91–107. 10.1534/genetics.119.302643 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Biscarini F., Nazzicari N., Bink M., Arús P., Aranzana M. J., Verde I., et al. (2017). Genome-enabled predictions for fruit weight and quality from repeated records in European peach progenies. BMC Genomics 18:432. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Bouquet A., Canapale M., Brenaut P., Bellec T., Flatres-Grall L., Ligonesche B. (2017). “Mise en place de la sélection génomique dans le schéma de sélection de la population Landrace Français,” in Proceedings of the 49èmes Journées de la Recherche Porcine (Paris: IFIP – Institut du Porc), 31–36. [ Google Scholar ]
  • Carillier C., Larroque H., Palhière I., Clément V., Rupp R., Robert-Granié C. (2013). A first step toward genomic selection in the multi-breed French dairy goat population. J. Dairy Sci. 96 7294–7305. 10.3168/jds.2013-6789 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Cericola F., Lenk I., Fè D., Byrne S., Jensen C. S., Pedersen M. G., et al. (2018). Optimized use of low-depth genotyping-by-sequencing for genomic prediction among multi-parental family pools and single plants in perennial Ryegrass ( Lolium perenne L.). Front. Plant Sci. 9:369. 10.3389/fpls.2018.00369 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Colleau J. J. (2002). An indirect approach to the extensive calculation of relationship coefficients. Genet. Sel. Evol. 34:409. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Colleau J. J., Palhière I., Rodríguez-Ramilo S. T., Legarra A. (2017). A fast indirect method to compute functions of genomic relationships concerning genotyped and ungenotyped individuals, for diversity management. Genet Sel. Evol. 49:87. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Crossa J., Pérez-Rodríguez P., Cuevas J., Montesinos-López O., Jarquín D., de Los Campos G., et al. (2017). Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 22 961–975. [ DOI ] [ PubMed ] [ Google Scholar ]
  • Dudley J. W. (1984). A method for identifying populations containing favorable alleles not present in elite germplasm. Crop Sci. 24 1053–1054. 10.2135/cropsci1984.0011183X002400060011x [ DOI ] [ Google Scholar ]
  • Endelman J. B., Atlin G. N., Beyene Y., Semagn K., Zhang X., Sorrells M. E., et al. (2014). Optimal design of preliminary yield trials with genome-wide markers. Crop Sci. 54 48–59. 10.2135/cropsci2013.03.0154 [ DOI ] [ Google Scholar ]
  • Garrick D. J. (2011). The nature, scope and impact of genomic prediction in beef cattle in the United States. Genet. Sel. Evol. 43:17. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Gorjanc G., Gaynor R. C., Hickey J. M. (2018). Optimal cross selection for long-term genetic gain in two-part programs with rapid recurrent genomic selection. Theor. Appl. Genet. 131 1953–1966. 10.1007/s00122-018-3125-3 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Grattapaglia D. (2017). “Status and perspectives of genomic selection in forest tree breeding,” in Genomic Selection for Crop Improvement , eds Varshney R. K., Roorkiwal M., Sorrells M. E. (Cham: Springer International Publishing; ), 199–249. 10.1007/978-3-319-63170-7_9 [ DOI ] [ Google Scholar ]
  • Hayes B. J., Bowman P. J., Chamberlain A. J., Goddard M. E. (2009). Invited review: genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92 433–443. 10.3168/jds.2008-1646 [ DOI ] [ PubMed ] [ Google Scholar ]
  • He S., Schulthess A. W., Mirdita V., Zhao Y., Korzun V., Bothe R., et al. (2016). Genomic selection in a commercial winter wheat population. Theor. Appl. Genet. 129 641–651. 10.1007/s00122-015-2655-1 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Henryon M., Liu H., Berg P., Su G., Nielsen H. M., Gebregiwergis G. T., et al. (2019). Pedigree relationships to control inbreeding in optimum-contribution selection realise more genetic gain than genomic relationships. Genet. Sel. Evol. 51:39. 10.1186/s12711-019-0475-5 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Heslot N., Jannink J.-L., Sorrells M. E. (2015). Perspectives for genomic selection applications and research in plants. Crop Sci. 55:1. 10.2135/cropsci2014.03.0249 [ DOI ] [ Google Scholar ]
  • Hickey J. M., Chiurugwi T., Mackay I., Powell W. (2017). Genomic prediction unifies animal and plant breeding programs to form platforms for biological discovery. Nat. Genet. 49 1297–1303. 10.1038/ng.3920 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Jonas E., de Koning D. J. (2013). Does genomic selection have a future in plant breeding? Trends Biotechnol. 31 497–504. 10.1016/j.tibtech.2013.06.003 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Kadam D. C., Lorenz A. J. (2018). “Toward redesigning hybrid maize breeding through genomics-assisted breeding,” in The Maize Genome Compendium of Plant Genomes , eds Bennetzen J., Flint-Garcia S., Hirsch C., Tuberosa R. (Cham: Springer; ), 367–388. 10.1007/978-3-319-97427-9_21 [ DOI ] [ Google Scholar ]
  • Kumar S., Chagné D., Bink M. C. A. M., Volz R. K., Whitworth C., Carlisle C. (2012). Genomic selection for fruit quality traits in Apple ( Malus×domestica Borkh.). PLoS One 7:e36674. 10.1371/journal.pone.0036674 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Lande R., Thompson R. (2000). Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124 743–756. 10.1093/genetics/124.3.743 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Larroque H., Barillet F., Baloche G., Astruc J.-M., Buisson D., Shumbusho F., et al. (2014). “Toward genomic breeding programs in French dairy sheep and goats,” in Proceedings of the 10th World Congress of Genetics Applied to Livestock Production , (Vancouver, BC: ), 17–22. [ Google Scholar ]
  • Le Roy P., Chapuis H., Guémené D. (2014). Sélection génomique: quelles perspectives pour les filières avicoles? INRA Prod. Anim. 27 331–336. 10.20870/productions-animales.2014.27.5.3080 [ DOI ] [ Google Scholar ]
  • Lehermeier C., Teyssèdre S., Schön C. C. (2017). Genetic gain increases by applying the usefulness criterion with improved variance prediction in selection of crosses. Genetics 207 1651–1661. 10.1534/genetics.117.300403 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Lenz P. R. N., Nadeau S., Mottet M. J., Perron M., Isabel N., Beaulieu J., et al. (2020). Multi-trait genomic selection for weevil resistance, growth, and wood quality in Norway spruce. Evol. Appl. 13 76–94. 10.1111/eva.12823 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Lin Z., Cogan N. O. I., Pembleton L. W., Spangenberg G. C., Forster J. W., Hayes B. J., et al. (2016). Genetic gain and inbreeding from genomic selection in a simulated commercial breeding program for perennial ryegrass. Plant Genome 9 1–12. 10.3835/plantgenome2015.06.0046 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Longin C. F. H., Mi X., Würschum T. (2015). Genomic selection in wheat: optimum allocation of test resources and comparison of breeding strategies for line and hybrid breeding. Theor. Appl. Genet. 128 1297–1306. 10.1007/s00122-015-2505-1 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Meuwissen T., Hayes B., Goddard M. (2013). Accelerating improvement of livestock with genomic selection. Annu. Rev. Anim. Biosci. 1 221–237. 10.1146/annurev-animal-031412-103705 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Meuwissen T., Sonesson A. K., Gebregiwergis G. T., Woolliams J. A. (2020). Management of genetic diversity in the era of genomics. Front. Genet. 11:880. 10.3389/fgene.2020.00880 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Meuwissen T. H. E., Hayes B. J., Goddard M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157 1819–1829. 10.1093/genetics/157.4.1819 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Michel S., Kummer C., Gallee M., Hellinger J., Ametz C., Akgöl B., et al. (2017). Improving the baking quality of bread wheat by genomic selection in early generations. Theor. Appl. Genet. 131 477–493. 10.1007/s00122-017-2998-x [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Nyouma A., Bell J. M., Jacob F., Cros D. (2019). From mass selection to genomic selection: one century of breeding for quantitative yield components of oil palm ( Elaeis guineensis Jacq.). Tree Genet. Genomes 15:69. [ Google Scholar ]
  • Pembleton L. W., Inch C., Baillie R. C., Drayton M. C., Thakur P., Ogaji Y. O., et al. (2018). Exploitation of data from breeding programs supports rapid implementation of genomic selection for key agronomic traits in perennial ryegrass. Theor. Appl. Genet. 131 1891–1902. 10.1007/s00122-018-3121-7 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Plomion C., Bastien C., Bogeat-Triboulot M.-B., Bouffier L., Déjardin A., Duplessis S. (2016). Forest tree genomics: 10 achievements from the past 10 years and future prospects. Ann. For Sci. 73 77–103. 10.1007/s13595-015-0488-3 [ DOI ] [ Google Scholar ]
  • Riday H. (2011). Paternity testing: a non-linkage based marker-assisted selection scheme for outbred forage species. Crop Sci. 51:631. 10.2135/cropsci2010.07.0390 [ DOI ] [ Google Scholar ]
  • Robledo D., Palaiokostas C., Bargelloni L., Martínez P., Houston R. (2018). Applications of genotyping by sequencing in aquaculture breeding and genetics. Rev. Aquac. 10 670–682. 10.1111/raq.12193 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Rutkoski J. E., Crain J., Poland J., Sorrells M. E. (2017). “Genomic selection for small grain improvement,” in Genomic Selection for Crop Improvement , eds Varshney R. K., Roorkiwal M., Sorrells M. E. (Cham: Springer International Publishing; ), 99–130. 10.1007/978-3-319-63170-7_5 [ DOI ] [ Google Scholar ]
  • Samorè A. B., Fontanesi L. (2016). Genomic selection in pigs: state of the art and perspectives. Ital. J. Anim. Sci. 15 211–232. 10.1080/1828051x.2016.1172034 [ DOI ] [ Google Scholar ]
  • Santos D. J. A., Cole J. B., Lawlor T. J., VanRaden P. M., Tonhati H., Ma L. (2019). Variance of gametic diversity and its application in selection programs. J. Dairy Sci. 102 5279–5294. 10.3168/jds.2018-15971 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Schaeffer L. R. (2006). Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123 218–223. 10.1111/j.1439-0388.2006.00595.x [ DOI ] [ PubMed ] [ Google Scholar ]
  • Seye A. I., Bauland C., Charcosset A., Moreau L. (2020). Revisiting hybrid breeding designs using genomic predictions: simulations highlight the superiority of incomplete factorials between segregating families over topcross designs. Theor. Appl. Genet. 133 1995–2010. 10.1007/s00122-020-03573-5 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Sonesson A. K., Meuwissen T. H. (2009). Testing strategies for genomic selection in aquaculture breeding programs. Genet. Sel. Evol. 41:37. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Sonesson A. K., Woolliams J. A., Meuwissen T. H. (2012). Genomic selection requires genomic control of inbreeding. Genet. Sel. Evol. 44:27. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Tribout T. (2011). Perspectives d’application de la sélection génomique dans les schémas d’amélioration génétique porcins. INRA Prod. Anim. 24 369–376. 10.20870/productions-animales.2011.24.4.3270 [ DOI ] [ Google Scholar ]
  • Tusell L., Gilbert H., Riquet J., Mercat M.-J., Legarra A., Larzul C. (2016). Pedigree and genomic evaluation of pigs using a terminal-cross model. Genet. Sel. Evol. 48:32. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Venot E., Barbat A., Boichard D., Ducrocq V., Croiseau P., Frit S., et al. (2016). “French genomic experience: genomics for all ruminant species,” in Proceedings of the 2016 Interbull Meeting , (Puerto Varas (Chili)), 5. [ Google Scholar ]
  • Vidal M., Plomion C., Raffin A., Harvengt L., Bouffier L. (2017). Forward selection in a maritime pine polycross progeny trial using pedigree reconstruction. Ann. For Sci. 74:21. [ Google Scholar ]
  • Wiggans G. R., Cole J. B., Hubbard S. M., Sonstegard T. S. (2017). Genomic selection in dairy cattle: the USDA experience. Annu. Rev. Anim. Biosci. 5 309–327. 10.1146/annurev-animal-021815-111422 [ DOI ] [ PubMed ] [ Google Scholar ]
  • Windhausen V. S., Atlin G. N., Hickey J. M., Crossa J., Jannink J. L., Sorrells M. E., et al. (2012). Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 2 1427–1436. 10.1534/g3.112.003699 [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Wolc A., Kranis A., Arango J., Settar P., Fulton J. E., O’Sullivan N. P., et al. (2016). Implementation of genomic selection in the poultry industry. Anim. Front. 6 23–31. 10.2527/af.2016-0004 32704858 [ DOI ] [ Google Scholar ]
  • Zhao J., Sauvage C., Zhao J., Bitton F., Bauchet G., Liu D., et al. (2019). Meta-analysis of genome-wide association studies provides insights into genetic control of tomato flavor. Nat. Commun. 10:1534. 10.1038/s41467-019-09462-w [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Zhao Y., Mette M. F., Reif J. C. (2015). Genomic selection in hybrid breeding. Plant Breed. 134 1–10. 10.1111/pbr.12231 [ DOI ] [ Google Scholar ]


Review Article | Published: November 2007

Recent and ongoing selection in the human genome

Rasmus Nielsen, Ines Hellmann, Melissa Hubisz, Carlos Bustamante & Andrew G. Clark

Nature Reviews Genetics volume 8, pages 857–868 (2007)


Genes or genomic regions that are under selection will typically be functionally important and will often be disease associated. They are, therefore, of interest not only to evolutionary biologists, but also to researchers in the fields of functional genomics and disease genetics.

Both negative selection acting against deleterious mutations and positive selection acting in favour of beneficial mutations are common in the human genome.

Although most selection acting on segregating mutations in disease genes is negative selection — acting against deleterious, predominantly recessive mutations — some mutations in complex diseases might also have been affected by positive selection in the past or present.

Several genome-wide scans for loci that are under selection have been carried out. These scans have provided a large amount of new information, but have also generated controversy as the concordance between results is not always high.

The main reason for the lack of concordance is probably that different tests differ in their power to detect different forms of selection. However, statistical problems relating to assumptions about demography, recombination and ascertainment biases can also affect the results of some studies.

The recent availability of genome-scale genotyping data has led to the identification of regions of the human genome that seem to have been targeted by selection. These findings have increased our understanding of the evolutionary forces that affect the human genome, have augmented our knowledge of gene function and promise to increase our understanding of the genetic basis of disease. However, inferences of selection are challenged by several confounding factors, especially the complex demographic history of human populations, and concordance between studies is variable. Although such studies will always be associated with some uncertainty, steps can be taken to minimize the effects of confounding factors and improve our interpretation of their findings.



Article   PubMed   Google Scholar  

Nielsen, R., Hubisz, M. J. & Clark, A. G. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics 168 , 2373–2382 (2004).

Kong, A. et al. A high-resolution recombination map of the human genome. Nature Genet. 10 , 10 (2002).

Myers, S., Bottolo, L., Freeman, C., McVean, G. & Donnelly, P. A fine-scale map of recombination rates and hotspots across the human genome. Science 310 , 321–324 (2005).

Teshima, K. M., Coop, G. & Przeworski, M. How reliable are empirical genomic scans for selective sweeps? Genome Res. 16 , 702–712 (2006).

Teshima, K. M. & Przeworski, M. Directional positive selection on an allele of arbitrary dominance. Genetics 172 , 713–718 (2006).

MacCallum, C. & Hill, E. Being positive about selection. PLoS Biol. 4 , 293–295 (2006).

Shu, W. et al. Altered ultrasonic vocalization in mice with a disruption in the Foxp2 gene. Proc. Natl Acad. Sci. USA 102 , 9643–9648 (2005).

Williams, G. C. Adaptation and Natural Selection: A Critique of Some Current Evolutionary Thought (Princeton Univ. Press, Princeton, 1966).

Gould, S. J. & Lewontin, R. C. Spandrels of San-Marco and the Panglossian paradigm — a critique of the adaptationist program. Proc. R. Soc. London Series B Biol. Sci. 205 , 581–598 (1979).

CAS   Google Scholar  

Diaz, G. A. et al. Gaucher disease: The origins of the Ashkenazi Jewish N370S and 84GG acid β-glucosidase mutations. Am. J. Hum. Genet. 66 , 1821–1832 (2000).

Hugot, J. P. et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature 411 , 599–603 (2001).

Schwartz, K., Carrier, L., Guicheney, P. & Komajda, M. Molecular-basis of familial cardiomyopathies. Circulation 91 , 532–540 (1995).

Saxena, R. et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316 , 1331–1336 (2007).

Steinthorsdottir, V. et al. A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nature Genet. 39 , 770–775 (2007).

Zeggini, E. et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 316 , 1336–1341 (2007).

Verrelli, B. C. et al. Evidence for balancing selection from nucleotide sequence analyses of human G6PD . Am. J. Hum. Genet. 71 , 1112–1128 (2002).

Allen, S. J. et al. α + -thalassemia protects children against disease caused by other infections as well as malaria. Proc. Natl Acad. Sci. USA 94 , 14736–14741 (1997).

Schroeder, S. A., Gaughan, D. M. & Swift, M. Protection against bronchial-asthma by Cftr δ-f508 mutation — a heterozygote advantage in cystic-fibrosis. Nature Med. 1 , 703–705 (1995).

Neel, J. V. Diabetes mellitus: a 'thrifty' genotype rendered detrimental by 'progress'? Am. J. Hum. Genet. 14 , 353–362 (1962).

Thomas, P. D. & Kejariwal, A. Coding single-nucleotide polymorphisms associated with complex vs. Mendelian disease: evolutionary evidence for differences in molecular effects. Proc. Natl Acad. Sci. USA 101 , 15398–15403 (2004).

Zlotogora, J. Multiple mutations responsible for frequent genetic diseases in isolated populations. Eur. J. Hum. Genet. 15 , 272–278 (2007).

Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res. 11 , 863–874 (2001).

Sunyaev, S. et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10 , 591–597 (2001). This paper describes the most popular bioinformatical method for predicting disease mutations without phenotypic data.

Fan, Y., Linardopoulou, E., Friedman, C., Williams, E. & Trask, B. J. Genomic structure and evolution of the ancestral chromosome fusion site in 2q13–12q14.1 and paralogous regions on other human chromosomes. Genome Res. 12 , 1651–1662 (2002).

Kim, Y. & Stephan, W. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160 , 765–777 (2002).

Hudson, R. R., Kreitman, M. & Aguade, M. A test of neutral molecular evolution based on nucleotide data. Genetics 116 , 153–159 (1987). The original paper describing the combined use of divergence and diversity data to detect selection.

Fisher, S. E., Vargha-Khadem, F., Watkins, K. E., Monaco, A. P. & Pembrey, M. E. Localisation of a gene implicated in a severe speech and language disorder. Nature Genet. 18 , 168–170 (1998).

Lai, C. S. et al. The SPCH1 region on human 7q31: genomic characterization of the critical interval and localization of translocations associated with speech and language disorder. Am. J. Hum. Genet. 67 , 357–368 (2000).

Lai, C. S., Fisher, S. E., Hurst, J. A., Vargha-Khadem, F. & Monaco, A. P. A forkhead-domain gene is mutated in a severe speech and language disorder. Nature 413 , 519–523 (2001).

Zhang, J., Webb, D. M. & Podlaha, O. Accelerated protein evolution and origins of human-specific features: Foxp2 as an example. Genetics 162 , 1825–1835 (2002).

Mekel-Bobrov, N. et al. Ongoing adaptive evolution of ASPM , a brain size determinant in Homo sapiens . Science 309 , 1720–1722 (2005).

Currat, M. et al. Comment on 'Ongoing adaptive evolution of ASPM , a brain size determinant in Homo sapiens ' and 'Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans'. Science 313 , 172 (2006).

Yu, F. L. et al. Comment on 'Ongoing adaptive evolution of ASPM , a brain size determinant in Homo sapien s'. Science 316 , 367 (2007).

Mekel-Bobrov, N. et al. The ongoing adaptive evolution of ASPM and microcephalin is not explained by increased intelligence. Hum. Mol. Genet. 16 , 600–608 (2007).

Nielsen, R. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154 , 931–942 (2000).

Clark, A. G., Hubisz, M. J., Bustamante, C. D., Williamson, S. H. & Nielsen, R. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 15 , 1496–1502 (2005).

Hudson, R. R. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18 , 337–338 (2002).

Download references

Acknowledgements

We would like to thank D. Reich, M. Przeworski and two anonymous reviewers for their helpful comments on earlier versions of this manuscript. This work was supported by Danmarks Grundforskningsfond and the US National Institutes of Health grants R01HG003229 and U01HL084706.

Author information

Authors and Affiliations

Center for Comparative Genomics, University of Copenhagen, Universitetsparken 15, Kbh Ø, 2100, Denmark

Rasmus Nielsen & Ines Hellmann

Department of Human Genetics, University of Chicago, 920 E. 58th Street, Chicago, 60637, Illinois, USA

Melissa Hubisz

Department of Biological Statistics and Computational Biology, Cornell University, 1198 Comstock Hall, Ithaca, 14853, New York, USA

Carlos Bustamante

Department of Molecular Biology and Genetics, Cornell University, 107 Biotechnology Building, Ithaca, 14853, New York, USA

Andrew G. Clark


Corresponding author

Correspondence to Rasmus Nielsen.


Glossary

Genetic drift: The stochastic change in the population frequency of a mutation due to the sampling process that is inherent in reproduction.

Adaptation: Heritable changes in genotype or phenotype that result in increased fitness.

Fitness: A measure of the capacity of an organism to survive and reproduce.

Effective population size: The size of a population measured by the expected effect (through genetic drift) of the population size on genetic variability. N e is typically much lower than the actual population size ( N ).

Selective sweep: The process by which new favourable mutations become fixed so quickly that physically linked alleles also become either fixed or lost, depending on their linkage phase.

Linkage disequilibrium: A measure of genetic associations between alleles at different loci, which indicates whether allelic or marker associations on the same chromosome are more common than expected.

Fixation: The situation in which a mutation has reached a frequency of 100% in a natural population.

Site frequency spectrum: The distribution of allele frequencies at a single site of a DNA sequence, averaged over multiple sites.

Haplotype: The allelic composition over a contiguous stretch of a chromosome.

Association studies: Also known as genome-wide association studies. Genetic variants across the whole genome (or markers linked to these variants) are genotyped in a population for which phenotypic information is available (such as disease occurrence, or a range of different trait values). If a correlation is observed between genotype and phenotype, there is said to be an association between the variant and the disease or trait.

Meiotic drive: Any process that causes some alleles to be overrepresented in the gametes formed during meiosis.

Population bottleneck: A marked reduction in population size followed by the survival and expansion of a small random sample of the original population.

Population structure: A departure from random mating that is typically caused by geographical subdivision.

Balancing selection: A selection regime that results in the maintenance of two or more alleles at a single locus in a population.

Genetic hitchhiking: The increase in frequency of a selectively neutral or weakly selected mutation due to linkage with a positively selected mutation.


About this article

Cite this article.

Nielsen, R., Hellmann, I., Hubisz, M. et al. Recent and ongoing selection in the human genome. Nat Rev Genet 8 , 857–868 (2007). https://doi.org/10.1038/nrg2187


Issue date: November 2007

DOI: https://doi.org/10.1038/nrg2187




  • Methodology article
  • Open access
  • Published: 06 January 2006

Gene selection and classification of microarray data using random forest

  • Ramón Díaz-Uriarte
  • Sara Alvarez de Andrés

BMC Bioinformatics volume 7, Article number: 3 (2006)


Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.

Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.

Selection of relevant genes for sample classification (e.g., to differentiate between patients with and without cancer) is a common task in most gene expression studies (e.g., [ 1 – 6 ]). When facing gene selection problems, biomedical researchers often show interest in one of the following objectives:

To identify relevant genes for subsequent research; this involves obtaining a (probably large) set of genes that are related to the outcome of interest, and this set should include genes even if they perform similar functions and are highly correlated.

To identify small sets of genes that could be used for diagnostic purposes in clinical practice; this involves obtaining the smallest possible set of genes that can still achieve good predictive performance (thus, "redundant" genes should not be selected).

We will focus here on the second objective. Most gene selection approaches in class prediction problems combine ranking genes (e.g., using an F-ratio or a Wilcoxon statistic) with a specific classifier (e.g., discriminant analysis, nearest neighbor). Selecting an optimal number of features to use for classification is a complicated task, although some preliminary guidelines, based on simulation studies by [ 4 ], are available. Frequently an arbitrary decision as to the number of genes to retain is made (e.g., keep the 50 best ranked genes and use them with a linear discriminant analysis as in [ 1 , 7 ]; keep the best 150 genes as in [ 8 ]). This approach, although it can be appropriate when the only objective is to classify samples, is not the most appropriate if the objective is to obtain the smallest possible sets of genes that will allow good predictive performance. Another common approach, with many variants (e.g., [ 9 – 11 ]), is to repeatedly apply the same classifier over progressively smaller sets of genes (where we exclude genes based either on the ranking statistic or on the effect of the elimination of a gene on error rate) until a satisfactory solution is achieved (often the smallest error rate over all sets of genes tried). A potential problem of this second approach, if the elimination is based on univariate rankings, is that the ranking of a gene is computed in isolation from all other genes, or at most in combinations of pairs of genes [ 12 ], and without any direct relation to the classification algorithm that will later be used to obtain the class predictions. Finally, gene selection is generally regarded as much more challenging in multi-class situations (where there are three or more classes to be differentiated), as evidenced by recent papers in this area (e.g., [ 2 , 8 ]). Therefore, classification algorithms that directly provide measures of variable importance (related to the relevance of the variable in the classification) are of great interest for gene selection, especially if the classification algorithm itself presents features that make it well suited for the types of problems frequently faced with microarray data. Random forest is one such algorithm.

Random forest is an algorithm for classification developed by Leo Breiman [ 13 ] that uses an ensemble of classification trees [ 14 – 16 ]. Each of the classification trees is built using a bootstrap sample of the data, and at each split the candidate set of variables is a random subset of the variables. Thus, random forest uses both bagging (bootstrap aggregation), a successful approach for combining unstable learners [ 16 , 17 ], and random variable selection for tree building. Each tree is unpruned (grown fully), so as to obtain low-bias trees; at the same time, bagging and random variable selection result in low correlation of the individual trees. The algorithm yields an ensemble that can achieve both low bias and low variance (from averaging over a large ensemble of low-bias, high-variance but low correlation trees).
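As a rough illustration of how such a forest is grown in practice, the sketch below uses the randomForest R package that this paper itself relies on; expr and class are hypothetical placeholders for an expression matrix (samples in rows, genes in columns) and a factor of class labels, not objects defined in the paper:

    library(randomForest)

    set.seed(1)
    rf <- randomForest(x = expr, y = class,
                       ntree = 5000,                     # trees in the ensemble
                       mtry = floor(sqrt(ncol(expr))),   # genes tried at each split (the package default for classification)
                       nodesize = 1,                     # grow each tree fully (unpruned)
                       importance = TRUE)                # keep permutation-based variable importances

    ## Out-of-bag (OOB) error rate after the last tree has been added:
    tail(rf$err.rate[, "OOB"], 1)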

Random forest has excellent performance in classification tasks, comparable to support vector machines. Although random forest is not widely used in the microarray literature (but see [ 18 – 23 ]), it has several characteristics that make it ideal for these data sets:

Can be used when there are many more variables than observations.

Can be used both for two-class problems and for multi-class problems involving more than two classes.

Has good predictive performance even when most predictive variables are noise, and therefore it does not require a pre-selection of genes (i.e., "shows strong robustness with respect to large feature sets", sensu [ 4 ]).

Does not overfit.

Can handle a mixture of categorical and continuous predictors.

Incorporates interactions among predictor variables.

The output is invariant to monotone transformations of the predictors.

There are high quality and free implementations: the original Fortran code from L. Breiman and A. Cutler, and an R package from A. Liaw and M. Wiener [ 24 ].

Returns measures of variable (gene) importance.

There is little need to fine-tune parameters to achieve excellent performance. The most important parameter to choose is mtry , the number of input variables tried at each split, but it has been reported that the default value is often a good choice [ 24 ]. In addition, the user needs to decide how many trees to grow for each forest ( ntree ) as well as the minimum size of the terminal nodes ( nodesize ). These three parameters will be thoroughly examined in this paper.

Given these promising features, it is important to understand the performance of random forest compared to alternative state-of-the-art prediction methods with microarray data, as well as the effects of changes in the parameters of random forest. In this paper we present, as necessary background for the main topic of the paper (gene selection), the first thorough examination of these issues, including evaluating the effects of mtry , ntree and nodesize on error rate using nine real microarray data sets and simulated data.

The main question addressed in this paper is gene selection using random forest. A few authors have previously used variable selection with random forest. [ 25 ] and [ 20 ] use filtering approaches and, thus, do not take advantage of the measures of variable importance returned by random forest as part of the algorithm. Svetnik, Liaw, Tong and Wang [ 26 ] propose a method that is somewhat similar to our approach. The main difference is that [ 26 ] first find the "best" dimension (p) of the model, and then choose the p most important variables. This is a sound strategy when the objective is to build accurate predictors, without any regard for model interpretability. But this might not be the most appropriate for our purposes as it shifts the emphasis away from selection of specific genes, and in genomic studies the identity of the selected genes is relevant (e.g., to understand molecular pathways or to find targets for drug development).

The last issue addressed in this paper is the multiplicity (or lack of uniqueness or lack of stability) problem. Variable selection with microarray data can lead to many solutions that are equally good from the point of view of prediction rates, but that share few common genes. This multiplicity problem has been emphasized by [ 27 ] and [ 28 ] and recent examples are shown in [ 29 ] and [ 30 ]. Although multiplicity of results is not a problem when the only objective of our method is prediction, it casts serious doubts on the biological interpretability of the results [ 27 ]. Unfortunately most "methods papers" in bioinformatics do not evaluate the stability of the results obtained, leading to a false sense of trust in the biological interpretability of the output obtained. Our paper presents a thorough and critical evaluation of the stability of the lists of selected genes with the proposed (and two competing) methods.

In this paper we present the first comprehensive evaluation of random forest for classification problems with microarray data, including an assessment of the effects of changes in its parameters and we show it to be an excellent performer even in multi-class problems, and without any need to fine-tune parameters or pre-select relevant genes. We then propose a new method for gene selection in classification problems (for both two-class and multi-class problems) that uses random forest; the main advantage of this method is that it returns very small sets of genes that retain a high predictive accuracy, and is competitive with existing methods of gene selection.

Evaluation of performance and comparisons with alternative approaches

We have used both simulated and real microarray data sets to evaluate the variable selection procedure. For the real data sets, original reference paper and main features are shown in Table 1 and further details are provided in the supplementary material [see Additional file 1 ]. To evaluate if the proposed procedure can recover the signal in the data and can eliminate redundant genes, we need to use simulated data, so that we know exactly which genes are relevant. Details on the simulated data are provided in the methods and in the supplementary material [see Additional file 1 ].

We have compared the predictive performance of the variable selection approach with: a) random forest without any variable selection (using mtry = sqrt(number of genes), ntree = 5000, nodesize = 1); b) three other methods that have shown good performance in reviews of classification methods with microarray data [ 7 , 31 , 32 ] but that do not include any variable selection; c) three methods that carry out variable selection. For the three methods that do not carry out variable selection, Diagonal Linear Discriminant Analysis (DLDA), K nearest neighbor (KNN), and Support Vector Machines (SVM) with linear kernel, we have used, based on [ 7 ], the 200 genes with the largest F-ratio of between to within groups sums of squares. For KNN, the number of neighbors (K) was chosen by cross-validation as in [ 7 ]. The methods that incorporate variable selection are two different versions of Shrunken centroids (SC) [ 33 ], SC.l and SC.s, as well as Nearest neighbor + variable selection (NN.vs); further details are provided in the methods and in the supplementary material [see Additional file 1 ].

Estimation of error rates

To estimate the prediction error rate of all methods we have used the .632+ bootstrap method [ 34 , 35 ]. The .632+ bootstrap method uses a weighted average of the resubstitution error (the error when a classifier is applied to the training data) and the error on samples not used to train the predictor (the "leave-one-out" bootstrap error); this average is weighted by a quantity that reflects the amount of overfitting. It must be emphasized that the error rate used when performing variable selection is not what we report as the prediction error rate in Tables 2 or 3 . To calculate the prediction error rate as reported, for example, in Table 2 , the .632+ bootstrap method is applied to the complete procedure, and thus the samples used to compute the leave-one-out bootstrap error used in the .632+ method are samples that are not used when fitting the random forest, or carrying out variable selection. The .632+ bootstrap method was also used when evaluating the competing methods.
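For reference, the weighting behind the .632+ estimator can be sketched as follows (a minimal sketch following Efron and Tibshirani's formulation as we recall it; this illustrative function is not the code used in the paper):

    ## err.resub : resubstitution (apparent) error of the complete procedure
    ## err.loob  : leave-one-out bootstrap error (from samples not drawn into the bootstrap sample)
    ## gamma     : no-information error rate
    err632plus <- function(err.resub, err.loob, gamma) {
      err.loob <- min(err.loob, gamma)                 # cap the bootstrap error at the no-information rate
      R <- ifelse(err.loob > err.resub && gamma > err.resub,
                  (err.loob - err.resub) / (gamma - err.resub),  # relative overfitting rate
                  0)
      w <- 0.632 / (1 - 0.368 * R)                     # weight given to the bootstrap error
      (1 - w) * err.resub + w * err.loob
    }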

Effects of parameters of random forest on prediction error rate

Before examining gene selection, we first evaluated the effect of changes in parameters of random forest on its classification performance. Random forest returns a measure of error rate based on the out-of-bag cases for each fitted tree, the OOB error, and this is the measure of error we will use here to assess the effects of parameters. We examined whether the OOB error rate is substantially affected by changes in mtry , ntree , and nodesize .

Figure 1 and the figure "error.vs.mtry.pdf" in Additional file 2 show that, for both real and simulated data, the relation of OOB error rate with mtry is largely independent of ntree (for ntree between 1000 and 40000) and nodesize (nodesizes 1 and 5). In addition, the default setting of mtry ( mtryFactor = 1 in the figures) is often a good choice in terms of OOB error rate. In some cases, increasing mtry can lead to small decreases in error rate, and decreases in mtry often lead to increases in the error rate. This is especially the case with simulated data with very few relevant genes (with very few relevant genes, small mtry results in many trees being built that do not incorporate any of the relevant genes). Since the OOB error and the relation between OOB error and mtry do not change whether we use nodesize of 1 or 5, and because the increase in computing speed from using nodesize of 5 is inconsequential, all further analyses will use only the default nodesize = 1. These results show the robustness of random forest to changes in its parameters; nevertheless, to re-examine robustness of gene selection to these parameters, in the rest of the paper we will report results for different settings of ntree and mtry (and these results will again show the robustness of the gene selection results to changes in ntree and mtry ).

Figure 1. Out-of-bag (OOB) error rate vs. mtryFactor for the nine microarray data sets. mtryFactor is the multiplicative factor applied to the default mtry (sqrt(number of genes)); thus, an mtryFactor of 3 means the number of genes tried at each split is 3 * sqrt(number of genes), and an mtryFactor of 0 means the number of genes tried was 1. The mtryFactors examined were {0, 0.05, 0.1, 0.17, 0.25, 0.33, 0.5, 0.75, 0.8, 1, 1.15, 1.33, 1.5, 2, 3, 4, 5, 6, 8, 10, 13}. Results are shown for six different values of ntree = {1000, 2000, 5000, 10000, 20000, 40000}, with nodesize = 1.
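The kind of sweep summarized in Figure 1 can be reproduced in outline with a loop of the following form (a sketch; expr and class are the same hypothetical placeholders as above, and only a subset of the mtryFactors is shown):

    library(randomForest)

    mtry.factors <- c(0.25, 0.5, 1, 2, 4, 8)
    default.mtry <- floor(sqrt(ncol(expr)))

    oob.err <- sapply(mtry.factors, function(f) {
      m  <- max(1, round(f * default.mtry))    # number of genes tried at each split
      rf <- randomForest(x = expr, y = class, ntree = 2000, mtry = m, nodesize = 1)
      tail(rf$err.rate[, "OOB"], 1)            # OOB error after the last tree
    })

    names(oob.err) <- mtry.factors
    oob.err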

The error rates of random forest (without gene selection) compared with the alternative methods, using the real microarray data, and estimated in all cases using the .632+ bootstrap method, are shown in Table 2 . These results clearly show that random forest has a predictive performance comparable to that of the alternative methods, without any need for pre-selection of genes or tuning of its parameters.

Gene selection using random forest

Random forest returns several measures of variable importance. The most reliable measure is based on the decrease of classification accuracy when values of a variable in a node of a tree are permuted randomly [ 13 , 36 ], and this is the measure of variable importance (in its unscaled version – see Additional file 1 ) that we will use in the rest of the paper. (In the Supplementary material [see Additional file 1 ] we show that this measure of variable importance is not the same as a non-parametric statistic of difference between groups, such as could be obtained with a Kruskal-Wallis test). Other measures of variable importance are available, however, and future research should compare the performance of different measures of importance.
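In the randomForest package, this unscaled permutation-based importance can be extracted as sketched below (again assuming the hypothetical expr and class objects):

    rf  <- randomForest(x = expr, y = class, ntree = 5000, importance = TRUE)

    ## type = 1 requests the mean decrease in accuracy under permutation;
    ## scale = FALSE returns the unscaled version used in this paper.
    imp <- importance(rf, type = 1, scale = FALSE)[, 1]

    ## Genes ranked from most to least important:
    head(sort(imp, decreasing = TRUE), 10)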

To select genes we iteratively fit random forests, at each iteration building a new forest after discarding those variables (genes) with the smallest variable importances; the selected set of genes is the one that yields the smallest OOB error rate. Note that in this section we are using OOB error to choose the final set of genes, not to obtain unbiased estimates of the error rate of this rule. Because of the iterative approach, the OOB error is biased down and cannot be used to assess the overall error rate of the approach, for reasons analogous to those leading to "selection bias" [ 34 , 37 ]. To assess prediction error rates we will use the bootstrap, not OOB error (see above). (Using error rates affected by selection bias to select the optimal number of genes is not necessarily a bad procedure from the point of view of selecting the final number of genes; see [ 38 ]).

In our algorithm we examine all forests that result from eliminating, iteratively, a fraction, fraction.dropped , of the genes (the least important ones) used in the previous iteration. By default, fraction.dropped = 0.2, which allows for relatively fast operation, is consistent with the idea of an "aggressive variable selection" approach, and increases the resolution as the number of genes considered becomes smaller. We do not recalculate variable importances at each step as [ 26 ] mention severe overfitting resulting from recalculating variable importances. After fitting all forests, we examine the OOB error rates from all the fitted random forests. We choose the solution with the smallest number of genes whose error rate is within u standard errors of the minimum error rate of all forests. Setting u = 0 is the same as selecting the set of genes that leads to the smallest error rate. Setting u = 1 is similar to the common "1 s.e. rule", used in the classification trees literature [ 14 , 15 ]; this strategy can lead to solutions with fewer genes than selecting the solution with the smallest error rate, while achieving an error rate that is not different, within sampling error, from the "best solution". In this paper we will examine both the "1 s.e. rule" and the "0 s.e. rule".
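A bare-bones sketch of this backwards elimination is given below. It mirrors the description above rather than reproducing the released varSelRF code (which wraps the full procedure) and, unlike the procedure used in the paper, it recomputes importances at each iteration for simplicity; expr and class remain hypothetical placeholders, and genes are assumed to be named in colnames(expr):

    library(randomForest)

    genes   <- colnames(expr)
    history <- list()

    while (length(genes) >= 2) {
      rf  <- randomForest(x = expr[, genes, drop = FALSE], y = class,
                          ntree = 2000, importance = TRUE)
      oob <- tail(rf$err.rate[, "OOB"], 1)
      history[[length(history) + 1]] <- list(genes = genes, oob = oob)

      ## Drop the 20% least important genes (fraction.dropped = 0.2).
      imp   <- importance(rf, type = 1, scale = FALSE)[, 1]
      keep  <- floor(0.8 * length(genes))
      genes <- names(sort(imp, decreasing = TRUE))[seq_len(keep)]
    }

    ## "0 s.e." rule: the smallest gene set whose OOB error equals the minimum observed.
    oobs   <- sapply(history, function(h) h$oob)
    chosen <- history[[max(which(oobs == min(oobs)))]]$genes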

On the simulated data sets [see Additional file 1 , Tables 3 and 4] backwards elimination often leads to very small sets of genes, often much smaller than the set of "true genes". The error rate of the variable selection procedure, estimated using the .632+ bootstrap method, indicates that the variable selection procedure does not lead to overfitting, and can achieve the objective of aggressively reducing the set of selected genes. In contrast, when the simplification procedure is applied to simulated data sets without signal (see Tables 1 and 2 in Additional file 1 ), the number of genes selected is consistently much larger and, as should be the case, the estimated error rate using the bootstrap corresponds to that achieved by always betting on the most probable class.

Results for the real data sets are shown in Tables 2 and 3 (see also Additional file 1 , Tables 5, 6, 7, for additional results using different combinations of ntree = {2000, 5000, 20000}, mtryFactor = {1, 13}, se = {0, 1}, fraction.dropped = {0.2, 0.5}). Error rates (see Table 2 ) when performing variable selection are in most cases comparable (within sampling error) to those from random forest without variable selection, and comparable also to the error rates from competing state-of-the-art prediction methods. The number of genes selected varies by data set, but generally (Table 3 ) the variable selection procedure leads to small (< 50) sets of predictor genes, often much smaller than those from competing approaches (see also Table 8 in Additional file 1 and discussion). There are no relevant differences in error rate related to differences in mtry , ntree or whether we use the "s.e. 1" or "s.e. 0" rules. The use of the "s.e. 1" rule, however, tends to result in smaller sets of selected genes.

Stability (uniqueness) of results

Following [ 39 , 40 ], and [ 41 ], we have evaluated the stability of the variable selection procedure using the bootstrap. This allows us to assess how often a given gene, selected when running the variable selection procedure in the original sample, is selected when running the procedure on bootstrap samples.
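In outline, this amounts to re-running the complete selection procedure on bootstrap resamples of the arrays and tabulating how often each originally selected gene reappears (a sketch; select_genes() is a hypothetical stand-in for the full backwards elimination above):

    set.seed(1)
    orig.genes <- select_genes(expr, class)          # genes selected on the original sample

    B <- 200
    boot.sets <- lapply(seq_len(B), function(b) {
      idx <- sample(nrow(expr), replace = TRUE)      # bootstrap resample of the arrays
      select_genes(expr[idx, , drop = FALSE], class[idx])
    })

    ## Frequency with which each originally selected gene is re-selected:
    freq <- sapply(orig.genes, function(g) mean(sapply(boot.sets, function(s) g %in% s)))
    sort(freq, decreasing = TRUE)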

The results here will focus on the real microarray data sets (results from the simulated data are presented in Additional file 1 ). Table 3 (see also Additional file 1 , Tables 5, 6, 7, for other combinations of ntree , mtryFactor , fraction.dropped , se ) shows the variation in the number of genes selected in bootstrap samples, and the frequency with which the genes selected in the original sample appear among the genes selected from the bootstrap samples. In most cases, there is a wide range in the number of genes selected; more importantly, the genes selected in the original samples are rarely selected in more than 50% of the bootstrap samples. These results are not strongly affected by variations in ntree or mtry ; using the "s.e. 1" rule can lead, in some cases, to increased stability of the results.

As a comparison, we also show in Table 3 the stability of two alternative approaches for gene selection, the shrunken centroids method, and a filter approach combined with a Nearest Neighbor classifier (see Table 8 in Additional file 1 for results of SC.l). Error rates are comparable, but both alternative methods lead to much larger sets of selected genes than backwards variable selection with random forests. The alternative approaches seem to lead to somewhat more stable results in variable selection (probably a consequence of the large number of genes selected), but in practical applications this increase in stability is probably far outweighed by the very large number of selected genes.

We have first presented an exhaustive evaluation of the performance of random forest for classification problems with microarray data, and shown it to be competitive with alternative methods, without requiring any fine-tuning of parameters or pre-selection of variables. The performance of random forest without variable selection is also equivalent to that of alternative approaches that fine-tune the variable selection process (see below).

We have then examined the performance of an approach for gene selection using random forest, and compared it to alternative approaches. Our results, using both simulated and real microarray data sets, show that this method of gene selection accomplishes the proposed objectives. Our method returns very small sets of genes compared to alternative variable selection methods, while retaining predictive performance. Our method of gene selection will not return sets of genes that are highly correlated, because they are redundant. This method will be most useful under two scenarios: a) when considering the design of diagnostic tools, where having a small set of probes is often desirable; b) to help understand the results from other gene selection approaches that return many genes, so as to understand which ones of those genes have the largest signal to noise ratio and could be used as surrogates for complex processes involving many correlated genes. A backwards elimination method, precursor to the one used here, has been already used to predict breast tumor type based on chromosomic alterations [ 18 ].

We have also thoroughly examined the effects of changes in the parameters of random forest (specifically mtry , ntree , nodesize ) and the variable selection algorithm ( se , fraction.dropped ). Changes in these parameters have in most cases negligible effects, suggesting that the default values are often good options, but we can make some general recommendations. Time of execution of the code increases ≈ linearly with ntree . Larger ntree values lead to slightly more stable values of variable importances, but for the data sets examined, ntree = 2000 or ntree = 5000 seem quite adequate, with further increases having negligible effects. The change in nodesize from 1 to 5 has negligible effects, and thus its default setting of 1 is appropriate. For the backwards elimination algorithm, the parameter fraction.dropped can be adjusted to modify the resolution of the number of variable selected; smaller values of fraction.dropped lead to finer resolution in the examination of number of genes, but to slower execution of the code. Finally, the parameter se has also minor effects on the results of the backwards variable selection algorithm but a value of se = 1 leads to slightly more stable results and smaller sets of selected genes.

In contrast to other procedures (e.g., [ 3 , 8 ]) our procedure does not require pre-specifying the number of genes to be used, but rather adaptively chooses the number of genes. [ 3 ] have conducted an evaluation of several gene selection algorithms, including genetic algorithms and various ranking methods; these authors show results for the Leukemia and NCI60 data sets, but the Leukemia results are not directly comparable since [ 3 ] focus on a three-class problem. They report the best results with the NCI60 data set estimated with the .632 bootstrap rule (compared to the .632+ method that we use, the .632 can be downwardly biased, especially with highly overfit rules like nearest neighbor that they use – [ 35 ]). These best error rates are 0.408 for their evolutionary algorithm with 30 genes and 0.318 for 40 top-ranked genes. Using a number of genes slightly larger than ours, these error rates are similar to ours; however, these are the best error rates achieved over a range of ranking methods and error rates, and not the result of a complete procedure that automatically determines the best number of genes and ranking scheme (such as our method provides). [ 8 ] conducted a comparative study of feature selection and multi-class classification. Although they use four-fold cross-validation instead of the bootstrap to assess error rates, their results for three data sets common to both studies (Srbct, Lymphoma, NCI60) are similar to, or worse than, ours. In contrast to our method, their approach pre-selects a set of 150 genes for prediction and their best error rates are those over a set of seven different algorithms and eight different rank selection methods, where no algorithm or gene selection was consistently the best. In contrast, our results with one single algorithm and gene selection method (random forest) match or outperform their results.

Recently, several approaches that adaptively select the best number of genes or features have been reported. For the Leukemia data set our method consistently returns sets of two genes, similar to [ 27 ] using an exhaustive search method, and lower than the numbers given by [ 42 ] of 3 to 25. [ 2 ] have proposed a Bayesian model averaging (BMA) approach for gene selection; comparing the results for the two common data sets between our study and theirs, in one case (Leukemia) our procedure returns a much smaller set of genes (2 vs. 15), whereas in another (Breast, 2 class) their BMA procedure returns 8 fewer genes (14 vs. 6); in contrast to BMA, however, our procedure does not require setting a limit on the maximum number of relevant genes to be selected. [ 43 ] have developed a method for gene selection and classification, LS Bound, related to least-squares SVMs; their method uses an initial pre-filtering (they choose 1000 initial genes) and it is not clear how it could be applied to multi-class problems. The performance of their procedure with the leukemia data set is better than that reported by our method, but they use a total of 72 samples (the original 38 training plus the 34 validation of [ 44 ]), thus making these results hard to compare. With the colon data set, however, their best performing results are not better than ours with a number of features that is similar to ours. [ 5 ] proposed two Bayesian classification algorithms that incorporate gene selection (though it is not clear how their algorithms can be used in multi-class problems). The results for the Leukemia data set are not comparable to ours (as they use the validation set of 34 samples), but their results for the colon data set show error rates of 0.167 to 0.242, slightly larger than ours (although these authors used random partitions with 50 training and 12 testing samples instead of the .632+ bootstrap to assess error rate), with between 8 and 15 features selected (somewhat larger than those from random forest). Finally, [ 31 ] applied both shrunken centroids and a genetic algorithm + KNN technique to the NCI60 and Srbct data sets; their results with shrunken centroids are similar to ours with that technique, but the genetic algorithm + KNN technique used larger sets of genes (155 and 72 for the NCI60 and Srbct, respectively) than variable selection with random forest using the suggested parameters. In summary, then, our proposed procedure matches or outperforms alternative approaches for gene selection in terms of error rate and number of genes selected, without any need to fine-tune parameters or preselect genes; in addition, this method is equally applicable to two-class and multi-class problems, and has software readily available. Thus, the newly proposed method is an ideal candidate for gene selection in classification problems with microarray data.

A reviewer has alerted us to the paper by Jiang et al. [ 45 ], previously unknown to us. In fact, our approach is virtually the same as the one used by Jiang et al., with the exception that these authors recompute variable importances at each step (we do not do this in this paper, although the option is available in our code) and, more importantly, that their gene selection is based both on the OOB error and on the prediction error when the forest trained with one data set is applied to a second, independent, data set; thus, this approach for gene selection is not feasible when we only have one data set. Jiang et al. [ 45 ] also show the excellent performance of variable selection using random forest when applied to their data sets.

The final issue addressed in this paper is instability or multiplicity of the selected sets of genes. From this point of view, the results are slightly disappointing. But so are the results of the competing methods. And so are the results of most examined methods so far with microarray data, as shown in [ 29 ] and [ 30 ] and discussed thoroughly by [ 27 ] for classification and by [ 28 ] for the related problem of the effect of threshold choice in gene selection. However, and except for the above cited papers and [ 6 , 46 ] and [ 5 ], this is an issue that still seems largely ignored in the microarray literature. As these papers and the statistical literature on variable selection (e.g., [ 40 , 47 ]) discuss, the causes of the problem are small sample sizes and the extremely small ratio of samples to variables (i.e., number of arrays to number of genes). Thus, we might need to learn to live with the problem, and try to assess the stability and robustness of our results by using a variety of gene selection approaches, and examining whether there is a subset of features that tends to be repeatedly selected. This concern is explicitly taken into account in our results, and facilities for examining this problem are part of our R code.

The multiplicity problem, however, does not need to result in large prediction errors. This and other papers [ 7 , 27 , 31 , 32 , 48 , 49 ] (see also above) show that very different classifiers often lead to comparable and successful error rates with a variety of microarray data sets. Thus, although improving prediction rates is important, when trying to address questions of biological mechanism or discover therapeutic targets, probably a more challenging and relevant issue is to identify sets of genes with biological relevance.

Two areas of future research are using random forest for the selection of potentially large sets of genes that include correlated genes, and improving the computational efficiency of these approaches; in the present work, we have used parallelization of the "embarrassingly parallelizable" tasks using MPI with the Rmpi and Snow packages [ 50 , 51 ] for R. In a broader context, further work is warranted on the stability properties and biological relevance of this and other gene-selection approaches, because the multiplicity problem casts doubts on the biological interpretability of most results based on a single run of one gene-selection approach.
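As a rough sketch of that parallelization, the independent bootstrap runs can be farmed out with the snow package along the following lines (a socket cluster is shown for portability, whereas the runs in the paper used LAM/MPI through Rmpi; select_genes(), expr and class are the same hypothetical stand-ins as above):

    library(snow)

    cl <- makeCluster(8, type = "SOCK")                      # the paper used an MPI cluster instead
    clusterEvalQ(cl, library(randomForest))
    clusterExport(cl, c("expr", "class", "select_genes"))    # ship the hypothetical objects to the workers

    boot.sets <- clusterApplyLB(cl, seq_len(200), function(b) {
      idx <- sample(nrow(expr), replace = TRUE)
      select_genes(expr[idx, , drop = FALSE], class[idx])
    })

    stopCluster(cl)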

The proposed method can be used for variable selection fulfilling the objectives above: we can obtain very small sets of non-redundant genes while preserving predictive accuracy. These results clearly indicate that the proposed method can be profitably used with microarray data and is competitive with existing methods. Given its performance and availability, random forest and variable selection using random forest should probably become part of the "standard tool-box" of methods for the analysis of microarray data.

Simulated data sets

Data have been simulated using different numbers of classes of patients (2 to 4), number of independent dimensions (1 to 3), and number of genes per dimension (5, 20, 100). In all cases, we have set the number of subjects per class to 25. Each independent dimension has the same relevance for discrimination of the classes. The data come from a multivariate normal distribution with variance of 1, a (within-class) correlation among genes within dimension of 0.9, and a within-class correlation of 0 between genes from different dimensions, as those are independent. The multivariate means have been set so that the unconditional prediction error rate [ 52 ] of a linear discriminant analysis using one gene from each dimension is approximately 5%. To each data set we have added 2000 random normal variates (mean 0, variance 1) and 2000 random uniform [-1,1] variates. In addition, we have generated data sets for 2, 3, and 4 classes where no genes have signal (all 4000 genes are random). For the non-signal data sets we have generated four replicate data sets for each level of number of classes. Further details are provided in the supplementary material [see Additional file 1 ].
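One class-informative dimension of such a data set can be generated along the following lines (a sketch under the stated assumptions; the class-mean offset delta is a placeholder, whereas the paper calibrates the means to give an LDA error rate of roughly 5%):

    library(MASS)

    n.per.class <- 25
    genes.dim   <- 20                                   # genes in this dimension
    Sigma <- matrix(0.9, genes.dim, genes.dim)          # within-class correlation of 0.9
    diag(Sigma) <- 1                                    # variance of 1

    delta <- 1                                          # placeholder class-mean offset
    x1 <- mvrnorm(n.per.class, mu = rep(0, genes.dim),     Sigma = Sigma)
    x2 <- mvrnorm(n.per.class, mu = rep(delta, genes.dim), Sigma = Sigma)

    ## Add 2000 N(0, 1) and 2000 U(-1, 1) noise genes:
    n <- 2 * n.per.class
    expr.sim  <- cbind(rbind(x1, x2),
                       matrix(rnorm(n * 2000), n),
                       matrix(runif(n * 2000, -1, 1), n))
    class.sim <- factor(rep(c("A", "B"), each = n.per.class))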

Competing methods

We have compared the predictive performance of the variable selection approach with: a) random forest without any variable selection (using mtry = sqrt(number of variables), ntree = 5000, nodesize = 1); b) three other methods that have shown good performance in reviews of classification methods with microarray data [ 7 , 31 ] but that do not include any variable selection (i.e., they use a number of genes decided beforehand); c) two methods that carry out variable selection.

The three methods that do not carry out variable selection are:

Diagonal Linear Discriminant Analysis (DLDA). DLDA is the maximum likelihood discriminant rule, for multivariate normal class densities, when the class densities have the same diagonal variance-covariance matrix (i.e., variables are uncorrelated, and for each variable, its variance is the same in all classes). This yields a simple linear rule, where a sample is assigned to the class k that minimizes ∑_{j=1}^{p} (x_j − x̄_{kj})² / σ̂_j², where p is the number of variables, x_j is the value on variable (gene) j of the test sample, x̄_{kj} is the sample mean of class k and variable (gene) j, and σ̂_j² is the (pooled) estimate of the variance of gene j [ 7 ]. In spite of its simplicity and its somewhat unrealistic assumptions (independent multivariate normal class densities), this method has been found to work very well.
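The rule is simple enough to sketch directly (an illustrative implementation only; the analyses in the paper used the geSignatures package):

    ## DLDA sketch: assign a test sample to the class k minimizing sum_j (x_j - xbar_kj)^2 / s2_j.
    dlda_predict <- function(x.train, y.train, x.test) {
      y.train <- factor(y.train)
      classes <- levels(y.train)

      ## Per-class centroids (one row per class, one column per gene):
      centroids <- t(sapply(classes, function(k)
        colMeans(x.train[y.train == k, , drop = FALSE])))

      ## Pooled within-class variance for each gene:
      ss <- sapply(classes, function(k) {
        xc <- scale(x.train[y.train == k, , drop = FALSE], center = TRUE, scale = FALSE)
        colSums(xc^2)
      })
      s2 <- rowSums(ss) / (nrow(x.train) - length(classes))

      ## Standardized squared distance of each test sample to each centroid:
      scores <- apply(x.test, 1, function(x)
        apply(centroids, 1, function(m) sum((x - m)^2 / s2)))
      classes[apply(scores, 2, which.min)]
    }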

K nearest neighbor (KNN). KNN is a non-parametric classification method that predicts the class of a test case as the majority vote among the k nearest neighbors of the test case [ 15 , 16 ]. To decide on "nearest" we use, as in [ 7 ], the Euclidean distance. The number of neighbors used (k) is chosen by cross-validation as in [ 7 ]: for a given training set, the performance of the KNN for values of k in {1, 3, 5, ..., 21} is determined by cross-validation, and the k that produces the smallest error is used.

Support Vector Machines (SVM). SVM are becoming increasingly popular classifiers in many areas, including microarrays [ 53 – 55 ]. SVM (with linear kernel, as used here) try to find an optimal separating hyperplane between the classes. When the classes are linearly separable, the hyperplane is located so that it has maximal margin (i.e., so that there is maximal distance between the hyperplane and the nearest point of any of the classes), which should lead to better performance on data not yet seen by the SVM. When the data are not separable, there is no separating hyperplane; in this case, we still try to maximize the margin but allow some classification errors subject to the constraint that the total error (distance from the hyperplane on the "wrong side") is less than a constant. For problems involving more than two classes there are several possible approaches; the one used here is the "one-against-one" approach, as implemented in "libsvm" [ 56 ]. Reviews and introductions to SVM can be found in [ 16 , 57 ].

For each of these three methods we need to decide which of the genes will be used to build the predictor. Based on the results of [ 7 ] we have used the 200 genes with the largest F -ratio of between to within groups sums of squares. [ 7 ] found that, for the methods they considered, 200 genes as predictors tended to perform as well as, or better than, smaller numbers (30, 40, 50 depending on data set). The three methods that include gene selection are:

Shrunken centroids (SC). The method of "nearest shrunken centroids" was originally described in [ 33 ]. It uses "de-noised" versions of centroids to classify a new observation to the nearest centroid. The "de-noising" is achieved using soft-thresholding or penalization, so that for each gene, class centroids are shrunken towards the overall centroid. This method is very similar to a DLDA with shrinkage on the centroids. The optimal amount of shrinkage can be found with cross-validation, and used to select the number of genes to retain in the final classifier. We have used two different approaches to determine the best number of features.

     - SC.l : we choose the number of genes that minimizes the cross-validated error rate and, in case of several solutions with minimal error rates, we choose the one with largest likelihood.

     - SC.s : we choose the number of genes that minimizes the cross-validated error rate and, in case of several solutions with minimal error rates, we choose the one with the smallest number of genes (larger penalty).
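As a sketch of the two tie-breaking rules above (illustrative only; the per-threshold cross-validated error rates, numbers of retained genes, and log-likelihoods are assumed to have been computed already, for example from a cross-validation run of the shrunken-centroids classifier):

## cv.error, n.genes and cv.loglik: vectors with one entry per candidate threshold.
pick_threshold <- function(cv.error, n.genes, cv.loglik, rule = c("SC.l", "SC.s")) {
  rule <- match.arg(rule)
  ties <- which(cv.error == min(cv.error))   # all thresholds with minimal CV error
  if (rule == "SC.l") {
    ties[which.max(cv.loglik[ties])]         # SC.l: break ties by largest likelihood
  } else {
    ties[which.min(n.genes[ties])]           # SC.s: break ties by fewest genes
  }
}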

Nearest neighbor + variable selection (NN.vs) We first rank all genes based on their F-ratio, and then run a Nearest Neighbor classifier (KNN with K = 1; using K = 1 is often a successful rule [ 15 , 16 ]) on all subsets of variables that result from eliminating 20% of the genes (the ones with the smallest F-ratio) used in the previous iteration. The final number of genes is the one that leads to the smallest cross-validated error rate.
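The elimination schedule itself is simple; a minimal R sketch (illustrative, reusing the hypothetical f_ratio() helper from the filtering sketch above) generates the nested candidate gene-set sizes and the corresponding subsets:

## Nested candidate subsets: starting from all genes, drop the 20% with the
## smallest F-ratio at each step until only a couple of genes remain.
rank.all <- order(f_ratio(x.train, y.train), decreasing = TRUE)  # best genes first
sizes <- ncol(x.train)
while (tail(sizes, 1) > 2) {
  sizes <- c(sizes, floor(0.8 * tail(sizes, 1)))
}
## Because the subsets are nested, the candidate subset of size s is simply
## the s top-ranked genes, rank.all[1:s], for each s in sizes.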

The ranking of the genes using the F-ratio is done without using the left-out samples. In other words, for a given data set, we first divide it into 10 parts of about the same size; then, we repeat the following 10 times:

Exclude part i, the "left-out" part.

Using the other nine parts, rank the genes using the F-ratio.

Predict the class of the samples in the left-out part at each of the pre-specified numbers of genes (subsets of genes), using the genes as given by the ranking in the previous step.

At the end of the 10 iterations, we average the error rate over the 10 left-out parts and obtain the average cross-validated error rate at each number of genes. These estimates are not affected by "selection bias" [ 34 , 37 ] because the error rate is obtained from the left-out samples, but the left-out samples are not involved in the ranking of genes. (Note that using error rates affected by selection bias to select the optimal number of genes is not necessarily a bad procedure from the point of view of selecting the final number of genes; see [ 38 ].)
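Putting the pieces together, the following R sketch (illustrative only; it reuses the hypothetical f_ratio() helper and evaluates a 1-NN classifier with class::knn) shows how the gene ranking is redone inside each fold so that the left-out part never influences it:

library(class)

## 10-fold CV of 1-NN over candidate gene-set sizes, re-ranking the genes
## within each fold.
cv_error_by_size <- function(x, y, sizes, n.folds = 10) {
  folds <- sample(rep(1:n.folds, length.out = nrow(x)))
  err <- matrix(NA, n.folds, length(sizes))
  for (i in 1:n.folds) {
    train <- folds != i
    rank.i <- order(f_ratio(x[train, ], y[train]), decreasing = TRUE)
    for (s in seq_along(sizes)) {
      genes <- rank.i[1:sizes[s]]
      pred <- knn(x[train, genes, drop = FALSE],
                  x[!train, genes, drop = FALSE], y[train], k = 1)
      err[i, s] <- mean(pred != y[!train])
    }
  }
  colMeans(err)  # average cross-validated error rate at each number of genes
}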

Even if we use, as here, error rates not affected by selection bias, using that cross-validated error rate as the estimated error rate of the rule would lead to a downwardly biased error rate (for reasons analogous to those leading to selection bias). Thus, we do not use these error rates in the tables, but compute the estimated prediction error rate of the rule using the .632+ bootstrap method.
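For reference, the .632+ combination of the apparent (resubstitution) error, the leave-one-out bootstrap error, and the no-information error rate takes roughly the following form (Efron & Tibshirani, 1997); this sketch shows only the final combination step, assuming the three ingredients have been computed elsewhere:

## .632+ estimator: err.resub = resubstitution error, err.boot = leave-one-out
## bootstrap error, gamma = no-information error rate.
err632plus <- function(err.resub, err.boot, gamma) {
  err.boot1 <- min(err.boot, gamma)                     # cap the bootstrap error
  r <- (err.boot1 - err.resub) / (gamma - err.resub)    # relative overfitting rate
  r <- ifelse(err.boot1 > err.resub & gamma > err.resub, r, 0)
  w <- 0.632 / (1 - 0.368 * r)                          # weight on the bootstrap error
  (1 - w) * err.resub + w * err.boot1
}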

This type of approach, in its many variants (changing both the classifier and the ordering criterion), is popular in many microarray papers; a recent example is [ 10 ], and similar general strategies are implemented in the program Tnasas [ 58 ].

Software and data sets

All simulations and analyses were carried out with R [ 59 ], using packages randomForest (from A. Liaw and M. Wiener) for random forest, e1071 (E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel) for SVM, class (B. Ripley and W. Venables) for KNN, PAM [ 33 ] for shrunken centroids, and geSignatures (by R.D.-U.) for DLDA.

The microarray and simulated data sets are available from the supplementary material web page [ 60 ].

Availability and requirements

Our procedure is available both as an R package (varSelRF) and as a web-based application (GeneSrF).

Project name: varSelRF.

Project home page: http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html

Operating system(s): Linux and UNIX, Windows, MacOS.

Programming language: R.

Other requirements: Linux/UNIX and LAM/MPI for parallelized computations.

License: GNU GPL 2.0 or newer.

Any restrictions to use by non-academics: None.

Project name: GeneSrF

Project home page: http://genesrf.bioinfo.cnio.es

Operating system(s): Platform independent.

Programming language: Python and R.

Other requirements: A web browser.

License: Not applicable. Access non-restricted.

Abbreviations

DLDA: Diagonal linear discriminant analysis.

KNN: K-nearest neighbor.

NN: Nearest neighbor (KNN with K = 1).

NN.vs: Nearest neighbor with variable selection.

OOB: Out-of-bag error; error rate from samples not used in the construction of a given tree.

SC.l: Shrunken centroids with minimization of error and maximization of likelihood if ties.

SC.s: Shrunken centroids with minimization of error and minimization of features if ties.

SVM: Support vector machine.

mtry: Number of input variables tried at each split by random forest.

mtryFactor: Multiplicative factor of the default mtry ($\sqrt{\text{number of genes}}$).

nodesize: Minimum size of the terminal nodes of the trees in a random forest.

ntree: Number of trees used by random forest.

se 0, se 1: "0 s.e." (respectively "1 s.e.") rule for choosing the best solution for gene selection (how far the selected solution can be from the minimal-error solution).

Lee JW, Lee JB, Park M, Song SH: An extensive evaluation of recent classification tools applied to microarray data. Computational Statistics and Data Analysis 2005, 48: 869–885.


Yeung KY, Bumgarner RE, Raftery AE: Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 2005, 21: 2394–2402.


Jirapech-Umpai T, Aitken S: Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinformatics 2005, 6: 148.


Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER: Optimal number of features as a function of sample size for various classification rules. Bioinformatics 2005, 21: 1509–1515.

Li Y, Campbell C, Tipping M: Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics 2002, 18: 1332–1339.

Díaz-Uriarte R: Supervised methods with genomic data: a review and cautionary view. In Data analysis and visualization in genomics and proteomics . Edited by: Azuaje F, Dopazo J. New York: Wiley; 2005:193–214.


Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97 (457):77–87.


Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20: 2429–2437.

van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530–536.

Roepman P, Wessels LF, Kettelarij N, Kemmeren P, Miles AJ, Lijnzaad P, Tilanus MG, Koole R, Hordijk GJ, van der Vliet PC, Reinders MJ, Slootweg PJ, Holstege FC: An expression profile for diagnosis of lymph node metastases from primary head and neck squamous cell carcinomas. Nat Genet 2005, 37: 182–186.

Furlanello C, Serafini M, Merler S, Jurman G: An accelerated procedure for recursive feature ranking on microarray data. Neural Netw 2003, 16: 641–648.

Bø TH, Jonassen I: New feature subset selection procedures for classification of expression profiles. Genome Biology 2002, 3 (4):0017.1–0017.11.

Breiman L: Random forests. Machine Learning 2001, 45: 5–32.

Breiman L, Friedman J, Olshen R, Stone C: Classification and regression trees . New York: Chapman & Hall; 1984.


Ripley BD: Pattern recognition and neural networks . Cambridge: Cambridge University Press; 1996.


Hastie T, Tibshirani R, Friedman J: The elements of statistical learning . New York: Springer; 2001.

Breiman L: Bagging predictors. Machine Learning 1996, 24: 123–140.

Alvarez S, Diaz-Uriarte R, Osorio A, Barroso A, Melchor L, Paz MF, Honrado E, Rodriguez R, Urioste M, Valle L, Diez O, Cigudosa JC, Dopazo J, Esteller M, Benitez J: A Predictor Based on the Somatic Genomic Changes of the BRCA1/BRCA2 Breast Cancer Tumors Identifies the Non-BRCAl/BRCA2 Tumors with BRCA1 Promoter Hypermethylation. Clin Cancer Res 2005, 11: 1146–1153.


Izmirlian G: Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Ann NY Acad Sci 2004, 1020: 154–174.

Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003, 19: 1636–1643.

Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP: Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc Natl Acad Sci USA 2003, 100: 9608–9613.


Man MZ, Dyson G, Johnson K, Liao B: Evaluating methods for classifying expression data. J Biopharm Statist 2004, 14: 1065–1084.

Schwender H, Zucknick M, Ickstadt K, Bolt HM: A pilot study on the application of statistical classification procedures to molecular epidemiological data. Toxicol Lett 2004, 151: 291–299.

Liaw A, Wiener M: Classification and regression by randomForest. R News 2002, 2: 18–22.

Dudoit S, Fridlyand J: Classification in microarray experiments. In Statistical analysis of gene expression microarray data . Edited by: Speed T. New York: Chapman & Hall; 2003:93–158.

Svetnik V, Liaw A, Tong C, Wang T: Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9–11 June 2004, Cagliari, Italy. Lecture Notes in Computer Science, Springer 2004, 3077: 334–343.

Somorjai RL, Dolenko B, Baumgartner R: Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 2003, 19: 1484–1491.

Pan KH, Lih CJ, Cohen SN: Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc Natl Acad Sci USA 2005, 102: 8961–8965.

Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21: 171–178.

Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005, 365: 488–492.

Romualdi C, Campanaro S, Campagna D, Celegato B, Cannata N, Toppo S, Valle G, Lanfranchi G: Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Hum Mol Genet 2003, 12 (8):823–836.

Dettling M: BagBoosting for tumor classification with gene expression data. Bioinformatics 2004, 20: 3583–3593.

Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99 (10):6567–6572.

Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 2002, 99 (10):6562–6566.

Efron B, Tibshirani RJ: Improvements on cross-validation: the .632+ bootstrap method. J American Statistical Association 1997, 92: 548–560.

Bureau A, Dupuis J, Hayward B, Falls K, Van Eerdewegh P: Mapping complex traits using Random Forests. BMC Genet 2003, 4 (Suppl 1):S64.

Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 2003, 95: 14–18.

Braga-Neto U, Hashimoto R, Dougherty ER, Nguyen DV, Carroll RJ: Is cross-validation better than resubstitution for ranking genes? Bioinformatics 2004, 20: 253–258.

Faraway J: On the cost of data analysis. Journal of Computational and Graphical Statistics 1992, 1: 251–231.

Harrell JFE: Regression modeling strategies . New York: Springer; 2001.

Efron B, Gong G: A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat 1983, 37: 36–48.

Deutsch JM: Evolutionary algorithms for finding optimal gene sets in microarray prediction. Bioinformatics 2003, 19: 45–52.

Zhou X, Mao KZ: LS Bound based gene selection for DNA microarray data. Bioinformatics 2005, 21: 1559–1564.

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537.

Jiang H, Deng Y, Chen H, Tao L, Sha Q, Chen J, Tsai C, Zhang S: Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics 2004, 5: 81.

Yeung KY, Bumgarner RE: Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biol 2003, 4: R83.

Breiman L: Statistical modeling: the two cultures (with discussion). Statistical Science 2001, 16: 199–231.

Dettling M, Bühlmann P: Finding predictive gene groups from microarray data. J Multivariate Anal 2004, 90: 106–131.

Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y: Design and analysis of DNA microarray investigations . New York: Springer; 2003.

Yu H: Rmpi: Interface (Wrapper) to MPI (Message-Passing Interface). Tech. rep., Department of Statistics, University of Western Ontario; 2004. [ http://www.stats.uwo.ca/faculty/yu/Rmpi ]

Tierney L, Rossini AJ, Li N, Sevcikova H: SNOW: Simple Network of Workstations. Tech. rep 2004. [ http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html ]

McLachlan GJ: Discriminant analysis and statistical pattern recognition . New York: Wiley; 1992.

Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16 (10):906–914.

Lee Y, Lee CK: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 2003, 19 (9):1132–1139.

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 2001, 98 (26):15149–15154.

Chang CC, Lin CJ: LIBSVM: a library for Support Vector Machines. Tech. rep., Department of Computer Science, National Taiwan University; 2003. [ http://www.csie.ntu.edu.tw/~cjlin/libsvm ]

Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 1998, 2: 121–167.

Vaquerizas JM, Conde L, Yankilevich P, Cabezon A, Minguez P, Diaz-Uriarte R, Al-Shahrour F, Herrero J, Dopazo J: GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data. Nucleic Acids Res 2005, 33: W616–20.

R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2004. [ http://www.R-project.org ]

Supplementary material web page. [http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html]

Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, de Rijn MV, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO: Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 2000, 24 (3):227–235.

Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nature Genetics 2003, 33: 49–54.

Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415: 436–442.

Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96: 6745–6750.

Alizadeh AA, Eisen MB, Davis RE, Ma C, Losses IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–511.

Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1: 203–209.

Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7: 673–679.

CRAN, the Comprehensive R Archive Network. [http://cran.r-project.org/src/contrib/PACKAGES.html]

Download references

Acknowledgements

Most of the simulations and analyses were carried out in the Beowulf cluster of the Bioinformatics unit at CNIO, financed by the RTICCC from the FIS; J. M. Vaquerizas provided help with the administration of the cluster. A. Liaw provided discussion, unpublished manuscripts, and code. C. Lázaro-Perea provided many discussions and comments on the ms. A. Sánchez provided comments on the ms. I. Díaz showed R.D.-U. the forest, or the trees, or both. Two anonymous reviewers provided comments that improved the ms. R.D.-U. partially supported by the Ramón y Cajal program of the Spanish MEC (Ministry of Education and Science); S.A.A. supported by project C.A.M. GR/SAL/0219/2004; funding provided by project TIC2003-09331-C02-02 of the Spanish MEC.

Author information

Authors and affiliations.

Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernandez Almagro 3, Madrid, 28029, Spain

Ramón Díaz-Uriarte

Cytogenetics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, Madrid, 28029, Spain

Sara Alvarez de Andrés

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Ramón Díaz-Uriarte .

Additional information

Authors' contributions.

R.D-U developed the gene selection methodology, designed and carried out the comparative study, wrote the code, and drafted the manuscript. S.A.A. brought up the biological problem that prompted the methodological development and verified and provided discussion on the methodology, and co-authored the manuscript. Both authors read and approved the manuscript.

Electronic supplementary material


Additional File 1: A PDF file with additional results, showing error rates and stability for simulated data under various parameters, as well as error rates and stabilities for the real microarray data with other parameters, and further details on the data sets, simulations, and alternative methods. (PDF 343 KB)


Additional File 2: A PDF file with additional plots of OOB error rate vs. mtry for both simulated data and real data under other parameters. (PDF 3 MB)


Additional File 3: Source code for the R package varSelRF. This is a compressed (tar.gz) file ready to be installed with the usual R installation procedure under Linux/UNIX. Additional formats are available from CRAN [ 68 ], the Comprehensive R Archive Network. (GZ 27 KB)


Rights and permissions.

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Cite this article.

Díaz-Uriarte, R., Alvarez de Andrés, S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7 , 3 (2006). https://doi.org/10.1186/1471-2105-7-3


Received : 08 July 2005

Accepted : 06 January 2006

Published : 06 January 2006

DOI : https://doi.org/10.1186/1471-2105-7-3


Keywords: Support Vector Machine, Variable Selection, Random Forest, Gene Selection, Variable Importance


Phenotypic selection in natural populations: what have we learned in 40 years?



Erik I Svensson, Phenotypic selection in natural populations: what have we learned in 40 years?, Evolution , Volume 77, Issue 7, July 2023, Pages 1493–1504, https://doi.org/10.1093/evolut/qpad077


In 1983, Russell Lande and Stevan Arnold published “ The measurement of selection on correlated characters ,” which became a highly influential citation classic in evolutionary biology. This paper stimulated a cottage industry of field studies of natural and sexual selection in nature and resulted in several large-scale meta-analyses, statistical developments, and method papers. The statistical tools they suggested contributed to a breakdown of the traditional dichotomy between ecological and evolutionary time scales and stimulated later developments such as “eco-evolutionary dynamics”. However, regression-based selection analyses also became criticized from philosophical, methodological, and statistical viewpoints and stimulated some still ongoing debates about causality in evolutionary biology. Here I return to this landmark paper by Lande and Arnold, analyze the controversies and debates it gave rise to and discuss the past, present, and future of selection analyses in natural populations. A remaining legacy of Lande & Arnold, 1983 is that studies of selection and inheritance can fruitfully be decoupled and be studied separately, since selection acts on phenotypes regardless of their genetic basis, and hence selection and evolutionary responses to selection are distinct processes.

Natural selection is not evolution ( Fisher, 1930 ). Natural selection acts on phenotypes, regardless of their genetic basis, and produces immediate phenotypic effects within a generation that can be measured without recourse to principles of heredity or evolution. In contrast, evolutionary response to selection, the genetic change that occurs from one generation to the next, does depend on genetic variation. ( Lande & Arnold, 1983 ).

In 1983, Russell Lande and Stevan Arnold published "The measurement of selection on correlated characters" in Evolution, which became one of the most cited papers in this journal ( Lande & Arnold, 1983 ). The first author was the young evolutionary biologist Lande, who had made a name for himself through work beginning with a key paper in 1976 ( Lande, 1976 ). Lande took statistical tools from the plant and animal breeding literature and merged them with the paleontologist George Gaylord Simpson's version of the adaptive landscape for phenotypic characters, thereby giving birth to a new discipline: evolutionary quantitative genetics ( Arnold et al., 2001 ; Lande, 1976 , 1977 , 1979 , 1980a , 1980b ; Svensson & Calsbeek, 2012 ). The second author was Arnold, a herpetologist and field biologist interested in animal behavior ( Arnold, 1983 ). In their paper, Lande and Arnold introduced regression analysis as a novel tool to estimate selection simultaneously on multiple phenotypic characters in the field, provided that individual fitness data could be connected to individual trait variation.

Central to the new approach proposed by Lande and Arnold was that selection and inheritance are empirically separable and distinct processes. Thus, estimating selection on a character does not require any information about the genetic basis of the trait. Indeed, selection can operate on traits without any heritable basis at all, although then there will of course be no evolutionary change. This insight was not entirely new; the first sentence in the mathematical population geneticist Fisher's classical book "The Genetical Theory of Natural Selection" ( Fisher, 1930 ) states something similar (see quotation above): selection is a within-generation process whereby some phenotypes are more successful than others, whereas evolution by natural selection is the transmission of such selection to the next generation, which requires that phenotypes are at least partly heritable ( Lewontin, 1970 ). This key point made by Lande's former doctoral advisor Richard Lewontin was re-emphasized by Lande and Arnold, but it had already been made explicit in 1948, in a pioneering paper in one of the first issues of the new journal Evolution by Michael Lerner and Everett Dempster. They made an explicit analogy between plant and animal breeding and evolution, and they suggested that insights from the former literature could be used to study selection in natural populations ( Lerner & Dempster, 1948 ). However, Lerner and Dempster's paper seemed to have been largely forgotten in 1983 (and was, interestingly, not cited by Lande and Arnold), so Lande and Arnold effectively re-introduced the idea of bringing methods from the breeding literature to the evolutionary biology community. The Lande and Arnold paper also proposed solutions for how to estimate selection on several characters simultaneously when characters are correlated with each other, as well as suggestions for how to estimate nonlinear selection.

Here, I discuss the scientific legacy of the paper by Lande and Arnold, the discussions it gave rise to, and the criticisms their approach encountered. I also briefly suggest some profitable future directions of phenotypic selection studies in natural populations in light of the many methodological and statistical advancements that have been made in the four decades since 1983. The title of the present paper has been inspired by similar titles of Perspectives in Evolution on reproductive isolation and speciation ( Gavrilets, 2003 ; Rice & Hostert, 1993 ). My rationale is that it often takes several decades to evaluate the impact of papers in a slow-moving and largely conceptual field like evolutionary biology.

The importance of Lande and Arnold's paper for studies of selection on multiple characters simultaneously (multivariate selection) cannot be overstated. Before their paper, field biologists had typically estimated selection in a univariate fashion and on a trait-by-trait basis ( Boag & Grant, 1981 ; Endler, 1986 ). Estimating the strength of selection on a single character is relatively straightforward and can be done using the linear selection differential ( Falconer, 1989 ; Figure 1A ). When selection operates on a single trait, the evolutionary response to selection (R) is simply the selection differential (S) times the heritability (h^2), following the classical breeder's equation in quantitative genetics: $R = h^2 S$.

Illustration of the difference between univariate selection and multivariate selection and the effects of correlations between traits on the latter. (A) Univariate selection on a single trait towards a fitness optimum. Here, the selection differential (S) is simply the distance between the population trait mean and the location of the fitness optimum, which can both be estimated. The selection differential can either be expressed as the absolute distance in units of the scale on which the trait is measured (e.g., grams in the case of body mass) or be standardized with either the standard deviation (Lande & Arnold, 1983) or the phenotypic mean (Hereford et al., 2004). (B) Multivariate selection towards a joint fitness optimum (shown in gray shading) determined by two phenotypic traits (Z1 and Z2). Three different populations with different initial locations and multivariate phenotypes are shown (Populations 1–3) and these populations also differ in their trait correlations. In population 1, there is no correlation between Z1 and Z2, which is shown as a spherical ellipse depicting the population variation. In this case, the population evolves as if selection operates independently on the two traits and it climbs straight up towards the fitness peak. In contrast, in populations 2 and 3, Z1 and Z2 are correlated with each other, meaning that both direct selection on each trait and indirect selection operate. When trait covariation is not aligned with the direction of maximum fitness, the consequence of this is that the populations will follow curved trajectories through phenotype space, and evolution towards the optimum will be delayed, compared to the univariate case (Lande & Arnold, 1983; Schluter, 1996).


When selection operates on a single trait, the evolutionary response to selection (R) is therefore perfectly aligned with the direction of selection (S) and the population will move directly to the closest adaptive peak, the rate of evolution being limited only by the additive genetic variance, which is part of h^2 (h^2 is the additive genetic variance V_a divided by the phenotypic variance V_p, i.e., V_a/V_p; Figure 1A ).

However, when traits are correlated with each other, the population will not necessarily follow the straightest path towards the closest adaptive peak, although it might eventually end up there ( Figure 1B ). Instead, when traits are correlated, the rate of adaptive evolution towards the optimum will be delayed and the population will follow a curved trajectory through phenotype space ( Schluter, 1996 ; Figure 1B ). In the case of such multivariate selection on two or more traits, the individual fitness surface (W) can be estimated as (from equation 3 in [ Phillips & Arnold, 1989 ], modified from Equation 16 in [ Lande & Arnold, 1983 ]): $W = \alpha + \sum_i \beta_i z_i + \tfrac{1}{2}\sum_i \sum_j \gamma_{ij} z_i z_j + \varepsilon$.

Here, W is relative fitness (absolute fitness divided by mean absolute fitness), α is a constant (the intercept in a multiple regression), βi is the directional selection gradient for trait zi, γii is the quadratic selection gradient (indicating concave or convex selection) for trait zi, γij is the quadratic selection gradient for trait interactions between zi and zj (indicating correlational selection), and ε is an error term. These selection gradients can be obtained from the partial regression coefficients in a standard parametric multiple regression ( Lande & Arnold, 1983 ). Note that to obtain the quadratic selection coefficients (i.e., stabilizing and disruptive selection, γii), the partial regression coefficient of the squared term should be multiplied by two ( Stinchcombe et al., 2008 ).
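As an illustration of how such gradients are estimated in practice (a minimal R sketch with hypothetical variable names, not code from Lande and Arnold), one standardizes the traits, relativizes fitness, and fits linear and quadratic regressions; following Stinchcombe et al. (2008), the fitted coefficients of the squared terms are doubled to obtain the quadratic gradients:

## d: data frame with absolute fitness W and two phenotypic traits z1, z2.
d$w  <- d$W / mean(d$W)              # relative fitness
d$z1 <- as.numeric(scale(d$z1))      # variance-standardized traits
d$z2 <- as.numeric(scale(d$z2))

## Directional selection gradients (beta) from the linear-terms-only model
lin <- lm(w ~ z1 + z2, data = d)

## Quadratic and correlational gradients (gamma) from the full second-order model
quad <- lm(w ~ z1 + z2 + I(z1^2) + I(z2^2) + z1:z2, data = d)
gamma11 <- 2 * coef(quad)["I(z1^2)"]  # stabilizing/disruptive gradient for z1
gamma22 <- 2 * coef(quad)["I(z2^2)"]  # stabilizing/disruptive gradient for z2
gamma12 <- coef(quad)["z1:z2"]        # correlational gradient for z1 and z2

The directional gradients (βi) are taken from the linear-only model, while the quadratic and correlational gradients come from the full second-order fit, as is standard practice in such analyses.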

An adaptationist programme has dominated evolutionary thought in England and the United States during the past 40 years. It is based on faith in the power of natural selection as an optimizing agent. It proceeds by breaking an organism into unitary ‘traits’ and proposing an adaptive story for each considered separately. ( Gould & Lewontin, 1979 ). Critiques of the “adaptationist program” ( Gould & Lewontin, 1979 ; Lewontin, 1978 ) stress that adaptation and selection are often invoked without strong supporting evidence. We suggest quantitative measurements of selection as the best alternative to the fabrication of adaptive scenarios…The essential fact is that selection and adaptation can be measured. ( Lande & Arnold, 1983 ) .

In motivating their study, Lande and Arnold referred to Stephen Jay Gould's and Richard Lewontin's famous paper "The Spandrels of San Marco and the Panglossian Paradigm: A Critique of the Adaptationist Programme", which had been published only four years earlier ( Gould & Lewontin, 1979 ). They refer to this paper on the first page of their introduction. The Spandrels paper is a highly cited paper in evolutionary biology; more cited than Lande and Arnold (1983) , although it has also had four more years to accumulate citations ( Figure 2A ). Obviously, Lande and Arnold motivated their new approach to estimating selection with the aim of increasing the scientific rigor in evolutionary biology. It was precisely this lack of rigor that Gould and Lewontin had criticized when they argued that many biologists just presented adaptive "Just So" stories without strong evidence ( Gould & Lewontin, 1979 ). Lande and Arnold clearly thought that this new method, in which selection could be estimated and quantified rather than just vaguely inferred, would increase rigor and empirical standards, thereby responding to the criticism by Gould and Lewontin. Thus, Lande and Arnold saw selection analyses as a constructive solution to the problem of documenting adaptation and selection.

Accumulated citation statistics over different years (1979–2022) and influences on various fields (ecology, evolutionary biology, philosophy of science, etc.) for Lande and Arnold (1983) and Gould and Lewontin (1979). Data obtained from a Web of Science (WoS) search in November 2022. (A) Annual number of new citations for both these papers. Note that although Lande and Arnold (1983) was published four years later than Gould and Lewontin (1979) it soon caught up, and for most of the last four decades they have been cited equally many times. (B) Number of citations to Lande and Arnold (1983) from different research areas (as defined by WoS). Note that a single paper can be classified into multiple areas, so the numbers in each category are overlapping. Shown are citations to the same research areas for Gould and Lewontin (1979) for comparison. Lande and Arnold have been cited more than Gould and Lewontin (1979) in evolution, environmental science/ecology, zoology, and in applied research areas like plant sciences and agriculture, in spite of being published four years later. (C–D) Top five scientific journals among the papers that cited Lande and Arnold (1983) vs. Gould and Lewontin (1979). Note that citations to Lande and Arnold (1983) are dominated by five leading journals in ecology and evolution (American Naturalist, Ecology, Evolution, Journal of Evolutionary Biology and Proc. R. Soc. Lond. B.), whereas Gould and Lewontin (1979) has also influenced other fields, as revealed by Biology and Philosophy representing almost a quarter of the citations from the top five journals in which this paper was cited (N = 122; 23%).


Lande and Arnold’s paper had a huge impact, judged by the number of citations following its publication in 1983, particularly in ecology and field studies of selection ( Figure 2A ). Compared to the Spandrels paper published four years earlier, its main impact has been in empirical studies in ecology, evolution, plant sciences, and agriculture. In contrast, Gould and Lewontin’s paper has had less influence on empirical research in ecology and population biology but has instead influenced other areas of evolutionary biology, developmental biology, and philosophy of biology ( Figure 2B–D ). Their paper gave rise to a flurry of selection studies in natural populations. This increasingly popular research approach was even jokingly called “ The Chicago School of Evolutionary Biology ” ( Grafen, 1988 ), alluding to the neoliberal economic school led by Milton Friedman that thrived simultaneously at the same university.

…natural selection is daily and hourly scrutinising, throughout the world, every variation, even the slightest; rejecting that which is bad, preserving and adding up all that is good; silently and insensibly working, whenever and wherever opportunity offers, at the improvement of each organic being in relation to its organic and inorganic conditions of life. We see nothing of these slow changes in progress, until the hand of time has marked the long lapse of ages. ( Darwin, 1859 ).

The success and popularity of the new approach suggested by Lande and Arnold was probably not only because they provided the formal theory behind selection analyses but also because they demonstrated empirical and statistical solutions for how to quantify selection in natural populations. They provided two worked-through empirical examples illustrating their new method. First, they used an old dataset on the mortality of House Sparrows ( Passer domesticus ) collected by Hermon Bumpus, in which birds found dead after a winter storm were compared with live individuals ( Bumpus, 1899 ). They complemented this with a new and similar dataset collected by Arnold along the shores of Lake Michigan on the mortality of pentatomid bugs ( Euschistus variolarius ), also following a storm. Both these datasets were so-called cross-sectional fitness data, as opposed to longitudinal data, where individuals are followed throughout their lives. Selection on different phenotypes is thus estimated by comparing survivors and nonsurvivors or comparing mated and nonmated individuals (in the case of sexual selection). Such cross-sectional selection analyses have been carried out many times afterwards ( Campbell-Staton et al., 2017 ; Svensson & Friberg, 2007 ; Young et al., 2004 ) and they are often the only practical alternative available to estimate selection. In contrast, longitudinal data using life-time reproductive success (LRS) are only possible to obtain for a limited number of species, usually long-lived vertebrates where researchers can mark individuals and follow them over their entire lives ( Grafen, 1988 ).

Using the cross-sectional fitness data on mortality, Lande and Arnold estimated the variance-standardized directional selection gradients (β) on size-related morphological traits to vary between −0.27 and −0.52 for size in the House Sparrows (i.e., selection for smaller birds) and between −0.74 (wing length) and 0.58 (thorax) in the bugs ( Lande & Arnold, 1983 ). These surprising findings indicated unexpectedly strong selection: relative fitness would change by between 27% and 74% for a change of one standard deviation in the traits experiencing selection.

It is important to underscore how unexpected these results were in an era when the neutral theory in population genetics was well-established ( Kimura, 1983 ) and when many were increasingly skeptical of the pervasiveness of natural selection. Neutral theory was preceded by, and partly stimulated by, a paradox discussed by the population geneticist J. B. S. Haldane about the demographic "costs of selection" (Haldane, 1937 , 1957 ). Other findings of strong selection at about the same time as Lande and Arnold's paper was published ( Boag & Grant, 1981 ) raised the question of how populations could persist in the long run in the face of such strong selection. As Lande and Arnold noted themselves, the persistence of a population in the long run requires that the total selective mortality should not exceed the reproductive rate, otherwise the population would go extinct ( Lande & Arnold, 1983 ). They suggested that a solution to this dilemma is that most of the directional selection within a generation may be concentrated in a few relatively short periods of mortality ( Lande & Arnold, 1983 ), whereas during other periods or in other generations selection might be weak or even nonexistent, allowing the population to recover demographically.

Research in the decades after Lande and Arnold has also revealed strong directional selection and sometimes rapid evolutionary change after brief intense selective episodes, such as winter storms ( Campbell-Staton et al., 2017 ), in response to anthropogenic disturbances such as traffic (Brown & Brown, 1998 ; Price et al., 2000 ), hunting or fishing pressure ( Allendorf & Hard, 2009 ; Campbell-Staton et al., 2021 ; Sanderson et al., 2022 ), or when organisms invade novel environments, such as cities ( Santangelo et al., 2022 ). Natural or sexual selection is often strong, driving rapid evolutionary change in response to new predators, or when organisms invade novel selective environments ( Endler, 1980 ; Hendry & Kinnison, 1999 ; Reznick et al., 1997 ; Svensson, 2019 ; Svensson & Gosden, 2007 ).

The later emerging field of "eco-evolutionary dynamics" can also partly be traced back to the influence of Lande and Arnold and a growing awareness that evolutionary and ecological time scales are often similar, and that ecological and evolutionary processes therefore interact and feed back on each other ( Hendry, 2016 ; Schoener, 2011 ; Svensson, 2019 ). This is a marked change from 1983, when ecology and evolution were still largely separate fields. Back then, it was often assumed that ecological processes were fast relative to evolutionary processes, and that ecologists could therefore largely ignore evolutionary processes in their day-to-day research. A central message from Lande and Arnold was that studies of selection, inheritance, and evolutionary response to selection are conceptually different and can be separated, which made it possible for ecologists who only had access to data on fitness components and phenotypic traits of individuals to contribute to the evolutionary literature by estimating selection using the common currency provided by evolutionary quantitative genetics ( Barton & Turelli, 1989 ; Hansen & Pélabon, 2021 ; Lynch & Walsh, 1998 ; Walsh & Lynch, 2018 ). Lande and Arnold might thus have contributed to breaking down the borders between the then still separate fields of ecology and evolutionary biology. It is worth underscoring that Lawrence Slobodkin, author of Growth and Regulation of Animal Populations ( Slobodkin, 1961 ), and Eric Pianka, author of Evolutionary Ecology ( Pianka, 1988 ), were two of several authors of influential ecology textbooks who emphasized the distinction between ecological and evolutionary time scales. This dichotomy only started breaking down several decades after Lande and Arnold (1983) , catalyzed by an influential paper by Thomas Schoener and the growing field of eco-evolutionary dynamics ( Schoener, 2011 ; Svensson, 2019 ).

Before Lande and Arnold, there were very few formal selection studies in natural populations and no studies estimating multivariate selection, simply because biologists did not have the statistical tools to carry out such studies. The major architects of the Modern Synthesis (primarily Mayr and Dobzhansky) argued for the pervasive role of natural selection as a major evolutionary process, but interestingly none of them estimated selection themselves, presumably because they considered selection to be too weak for such an effort to be worthwhile ( Antonovics, 1987 ; Endler & McLellan, 1988 ). Thus, it took almost four decades after the Modern Synthesis, until after 1983, before biologists regularly started to estimate selection in natural populations. Presumably, many biologists—even those confident about the power of natural selection and its ability to evolutionarily transform populations—still implicitly adhered to Darwin's view that natural selection was too slow a process to be observed directly and could only be inferred ( Darwin, 1859 ).

How representative were these strong selection gradients documented by Lande and Arnold? In the first major meta-analysis of published selection gradients in nature, Kingsolver and colleagues found that the average variance-standardized selection gradient across thousands of estimates was 0.16 ( Kingsolver et al., 2001b ). Thus, relative fitness is expected to change by 16% for each standard deviation change in a trait, indicating considerable evolutionary potential of natural populations, under the assumption that traits are at least partly heritable, which they almost always are ( Lynch & Walsh, 1998 ; Mousseau & Roff, 1987 ; Walsh & Blows, 2009 ). Issues have been raised, however, about the utility of the variance-standardized selection gradient, and it has been proposed that the mean-standardized selection gradient is more appropriate ( Hansen & Pélabon, 2021 ; Hereford et al., 2004 ). A related methodological issue is at what spatial scale fitness should be relativized and traits should be standardized when one is interested in comparing selection among groups or populations ( De Lisle & Svensson, 2017 ).

Another methodological issue is sampling error of the selection gradients. A rough estimate of the extent of sampling error can be obtained from temporally replicated selection studies ( Morrissey & Hadfield, 2012 ). Analyses of a small subset of temporally replicated studies suggest that the mean variance-standardized selection gradient could be as low as 0.05 when sampling error is taken into account ( Morrissey & Hadfield, 2012 ). However, in population genetic terms, this is still strong selection and would indicate high evolutionary potential of most populations, especially in combination with the existence of large amounts of additive genetic variance in most phenotypic traits as well as in fitness itself ( Bonnet et al., 2022 ; Mousseau & Roff, 1987 ). Finally, the issue of fluctuating selection that was raised by Lande and Arnold as an explanation of their findings has gained some subsequent empirical support ( Calsbeek et al., 2012 ; Gibbs & Grant, 1987 ; Gosden & Svensson, 2008 ; Grant & Grant, 2002 ; Siepielski et al., 2009 ). However, it is still unclear how much of such observed fluctuation in selection is due to sampling error vs. real fluctuations ( Morrissey & Hadfield, 2012 ).

The question, “what is the causal relationship between fitness and the characters?” cannot be answered conclusively by an observational approach, simply because the paths of causation, particularly for life-history traits and fitness itself, are so numerous. ( Mitchell-Olds & Shaw, 1987 ). The multivariate analysis of selection is insufficient for identifying the causal agents of selection. We discuss how the observational approach of multivariate selection analysis can be complemented by experimental manipulations of the phenotypic distribution and the environment to identify not only how selection is operating on the phenotypic distribution but also why it operates in the observed manner… The biotic and abiotic environment is the context that gives rise to the relationship between phenotype and fitness (selection). The analysis of the causes of selection is in essence a problem in ecology. ( Wade & Kalisz, 1990 ).

In the decade following 1983, influential papers by Thomas Mitchell-Olds, Ruth Shaw, Michael Wade, and Susan Kalisz stand out in criticizing the regression approach to studying selection ( Mitchell-Olds & Shaw, 1987 ; Wade & Kalisz, 1990 ). These and other criticisms ( Kingsolver & Schemske, 1991 ) did not only address technical and statistical issues but also raised the deeper question of causal inference. In particular, how can we know that a trait-fitness covariance reflects a causal influence of the trait? It is important to note that the question of causality cannot be solved by statistical methods alone but requires additional biological and ecological data, ideally complemented with functional analyses, experiments, and natural history information ( Figures 3 – 5 ).

Illustrations of the problems of inferring causality when estimating selection using regression analysis on unmanipulated phenotypic variation. (A) Suppose we observed a positive relationship between male mating success and male tail length in a bird population. Such a positive correlation could indicate sexual selection for longer tails in this species, but ideally one would like to confirm any such putative selection by experimentally manipulating the trait (tail length) as the relationship could be caused by purely environmental covariance. For instance, males in high condition could be able to both grow long tails and achieve high mating success and the observed correlation could then reflect a noncausal spurious relationship (Mitchell-Olds & Shaw, 1987; Price et al., 1988; Rausher, 2000). (B) In the long-tailed widowbird (Euplectes progne) in Africa, such an experiment was actually carried out by Malte Andersson (1982), who experimentally manipulated tail length by cutting and gluing and showed that longer tails did indeed increase male mating success. Photograph from KwaZulu Natal (South Africa) by Erik Svensson.


Hypothetical relationships between three phenotypic traits (z1–z3) and fitness (W). (A) Lande and Arnold's multiple regression approach: These three traits can act at the same level of the biological hierarchy, where they all influence fitness directly (single-headed arrows from zi to W). Such a causal structure makes it possible to estimate directional selection gradients when all three traits are included in a multiple regression analysis, and three separate directional selection gradients can then be estimated (βi). In addition to direct selection on these three traits, traits can also indirectly influence fitness through noncausal covariances between the traits (double-headed arrows). (B–D) Alternative trait configurations that are not captured in the classical regression framework suggested by Lande and Arnold. These causal scenarios would require explicitly different models, which are here visualized as different path models. (B) The three phenotypic traits can be linearly arranged, such as when the same trait is measured at different time points during ontogeny. Traits measured earlier in the ontogeny affect traits measured later in the ontogeny, but only the final trait affects fitness directly. (C) The "morphology-performance-fitness" paradigm proposed by Arnold (1983). Here, two of the traits (e.g., two morphological traits; z1 and z2) affect some aspect of organismal performance or behavior, such as feeding rate (z3), which is the direct target of selection and causally influences fitness. Although only z3 is under direct selection, the two underlying morphological traits are also causally affecting fitness, albeit indirectly through z3. (D) A "diamond" causal structure, where only two traits (z1 and z2) experience direct selection, but the third trait (z3) also indirectly influences fitness through its effect on these two traits (e.g., some trait that operates earlier in ontogeny and with legacies up to the adult stage when selection operates on z1 and z2).


Figure 5. Conceptual illustration of the multiple levels of causality of selection, including the environmental drivers and ecological causes of selection. These multiple levels of causality encompass both the causality of trait-fitness covariances and how local selective agents causally shape those covariances. A hypothetical example is shown in which the selective environment varies along two dimensions: predation risk (vertical axis; shown as an increasing number of birds of prey, in this case kestrels, Falco tinnunculus) and conspecific density (horizontal axis; shown as an increasing number of voles, genus Microtus). The selective environment is typically multidimensional ( White & Butlin, 2021 ), but for simplicity, I here illustrate only two environmental factors and agents of selection. It is assumed that higher predation risk favors smaller individuals, shown as weaker or even negative selection on body size with increasing predation pressure (from top to bottom). In contrast, higher conspecific density favors larger body size due to increased intraspecific competition, shown as steeper and more positive slopes of the fitness functions as one moves from left to right (and higher vole densities). Different combinations of predation and conspecific density can causally interact and shape local selective environments, resulting in different trait-fitness covariances in different populations. In this particular example, the selective environment is thus two-dimensional, but in nature selection is most likely multidimensional. The selective environment can also be described in terms of con- and heterospecific phenotype frequencies and the various social interactions that can arise from such interactions, sometimes in combination with path analytical tools (cf. Figure 4 ; see De Lisle et al., 2022 ; McGlothlin & Fisher, 2022 ; Wolf et al., 2001 ). This example illustrates the importance not only of measuring phenotypic traits and fitnesses, but also of quantifying and (when possible) experimentally manipulating the local selective environments to gain a full understanding of selection. Silhouettes of the kestrels reproduced with permission from Rebecca Groom under the Creative Commons CC-BY 3.0 license ( https://creativecommons.org/licenses/by-sa/3.0/ ); silhouettes of voles obtained from Phylopic ( http://phylopic.org/ ). Example inspired by Wade and Kalisz (1990) .

Mitchell-Olds and Shaw (1987) emphasized that an observed selection gradient—even if statistically significant—would not in itself prove that the trait is a target of selection, especially if the trait is correlated with other characters that are not included in the statistical analyses. They suggested that any documented selection gradient should be considered a provisional hypothesis, in need of experimental verification. Experimental manipulations of suspected targets of selection—such as the sexually selected tail length in male widowbirds ( Andersson, 1982 ) or egg size in lizards ( Sinervo et al., 1992 )—would complement any inferred selection on unmanipulated phenotypic variation ( Figure 3 ). Alternatively, when traits cannot easily be manipulated experimentally, such as beak sizes in birds or body size, functional analyses ( Opedal, 2021 ) and careful natural history observations are needed before any safe conclusions can be drawn ( Mitchell-Olds & Shaw, 1987 ). Of particular concern are environmental covariances, such as when individuals vary in condition in a way that independently affects both traits and fitness ( Rausher, 2000 ). Such environmental covariances can lead to a false impression of directional selection on a trait ( Price et al., 1988 ). One solution is to incorporate condition as a covariate in the selection analyses, in effect as an additional trait ( Rausher, 2000 ; Stinchcombe et al., 2002 ), although this is not always feasible. One can also try to verify causal relationships between traits using a combination of path analysis, causal modeling, and/or structural equation modeling ( Edelaar et al., 2022 ; Kingsolver & Schemske, 1991 ; Otsuka, 2019 ; Shipley, 2002 ). It is important to emphasize that the multiple regression approach proposed by Lande and Arnold covers only a subset of all possible causal relationships between a set of traits and fitness ( Figure 4 ). The underlying assumption in the multiple regression approach is that traits act at the same level in the biological hierarchy ( Figure 4A , cf. Figure 4B–D ). Thus, the multiple regression approach is one specific causal model in a greater universe of alternative causal scenarios that can be captured by different path models ( Figure 4 ).
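
To make the estimation step concrete, the following minimal sketch (in Python, using simulated data with invented trait effects, so all names and numbers are purely illustrative) shows the core of the Lande and Arnold procedure: relative fitness is regressed on standardized trait values, and the partial regression coefficients are read off as directional selection gradients.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for field data: three standardized traits (z1-z3,
# mean 0, sd 1) and absolute fitness W. All effect sizes are invented.
n = 500
Z = rng.normal(size=(n, 3))
true_beta = np.array([0.3, 0.0, -0.1])
W = np.clip(1.0 + Z @ true_beta + rng.normal(scale=0.5, size=n), 0, None)

# Lande-Arnold style analysis: regress *relative* fitness (w = W / mean W)
# on the standardized traits; the partial regression coefficients are the
# directional selection gradients (beta_i).
w = W / W.mean()
X = np.column_stack([np.ones(n), Z])   # intercept plus the three traits
coefs, *_ = np.linalg.lstsq(X, w, rcond=None)
print("estimated directional selection gradients:", coefs[1:].round(3))
```

With real data, the same regression would of course be accompanied by the diagnostic checks and causal caveats discussed above.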

Verifying the causality of trait-fitness covariance relationships is not sufficient, however, for a full understanding of selection. There is also an additional causal layer: the ecology of selection ( MacColl, 2011 ; Wade & Kalisz, 1990 ). We thus also need to know why the trait-fitness covariance relationship looks the way it does, i.e., what is the cause of selection? This is an ecological question: what agents or environmental factors cause this trait-fitness covariance? Natural selection and sexual selection are processes that arise due to interactions between individual phenotypes and their local selective environments ( Hull, 1980 ; MacColl, 2011 ; Wade & Kalisz, 1990 ). A full understanding of selection therefore requires not only knowledge about traits and fitnesses, or even the causality of trait-fitness covariances, but also information about how ecological agents and causes of selection—such as competitors, mates, parasites, pollinators, or abiotic factors such as temperature and precipitation—give rise to trait-fitness covariances ( MacColl, 2011 ; Opedal, 2021 ; Siepielski et al., 2017 ; Svensson & Sinervo, 2000 ; Wade & Kalisz, 1990 ).

Experimentally manipulating or measuring different selective environments across multiple populations presents many logistical hurdles ( Figure 5 ), as the selective environment is typically multidimensional ( White & Butlin, 2021 ). It is considerably more challenging than simply manipulating individual phenotypes within a local population ( Figure 3 ). The ecological causes of selection can be elucidated by using replication across multiple populations in space or time ( MacColl, 2011 ). This can sometimes be achieved, but it requires large sample sizes, often on the order of thousands of individual phenotypes ( Gosden & Svensson, 2008 ; Svensson & Sinervo, 2004 ). In some systems, experimental studies can be designed that manipulate both individual phenotypes and their local selective environments simultaneously, that is, "double-level" manipulations ( Sinervo & Basolo, 1996 ; Svensson & Sinervo, 2000 ). Experimental manipulations of selective agents, such as removing plant herbivores ( Mauricio & Rausher, 1997 ) or plant pollinators ( Sletvold et al., 2016 ), or changing the density or frequency of intra- or interspecific competitors or predators ( Calsbeek & Cox, 2010 ; Schluter, 1994 , 2003 ; Svensson & Sinervo, 2000 ), can sometimes be carried out. In many cases, however, experimental manipulations of selective agents are practically impossible. In these cases, identifying the environmental drivers and causes of selection from temporally or spatially replicated selection studies adds to a deeper understanding of the ecology of selection ( MacColl, 2011 ; Siepielski et al., 2017 ). It is also worth noting that an important source of bias in selection studies could be density, just as has been noted in behavioral ecology ( Stamps, 2011 ): biologists measuring selection are likely to focus on high-density populations simply for practical and logistical reasons and the need for large sample sizes.

In summary, Lande and Arnold stimulated discussions about causality in evolutionary biology and in the philosophy of biology that are still ongoing. These discussions include questions about the level at which selection operates and whether there are “cross-over effects” between different levels ( Heisler & Damuth, 1987 ; Okasha, 2006 ), whether natural selection is a force or only a statistical outcome of lower-level events and the fates of individual organisms ( Endler, 1986 ; Otsuka, 2016 ; Sober, 1984 ; Walsh, 2015 ; Walsh et al., 2002 ), and whether genes (“replicators”) or phenotypes (“vehicles” or “interactors”) are the true targets of selection ( Ågren, 2021 ; Dawkins, 1976 ; Hull, 1980 ; Lewontin, 1970 ). Many evolutionary biologists now view phenotypes as the true targets of selection, regardless of their heritable basis, in the spirit of Lande and Arnold (1983) .

In making their claims for their methods, Arnold, Wade and Lande do not always distinguish clearly between the analysis of adaptation and the detection of selection in progress. It is clear, however, that the design of their methods is to detect selection in progress…I believe that most evolutionists and behaviorists would say they were primarily interested in adaptation, as opposed to selection in progress, once the distinction is brought to their attention. Their primary concern is why male red deer have such big antlers, not whether there are genes now changing in frequency that affect antler size…The methods of analysis of LRS data proposed by Wade & Arnold (1984), Lande & Arnold (1983) , and Arnold and Wade (1984a,b) seem primarily designed to study selection in progress, that is to say, gene frequencies changing now rather than adaptation. ( Grafen, 1988 ).

Did Lande and Arnold succeed in convincing Gould, Lewontin, and other contemporary critics of naïve adaptationism? Not really, according to the British theoretical biologist Alan Grafen ( Grafen, 1988 ).

Grafen criticized the regression approach suggested by Lande and Arnold for failing to address what he claimed most biologists are really interested in: adaptation and the current utility of traits ( Grafen, 1988 ). He argued that their approach was designed more to detect selection in progress than to identify adaptations. Grafen criticized such a purely correlative approach, relying on unmanipulated variation in phenotypic traits and fitness, and he argued that biologists interested in adaptation should instead carry out manipulative experiments to clarify the adaptive significance of traits (if any). Grafen’s criticism in a nutshell was thus that Lande and Arnold had conflated selection in progress (an evolutionary process) with adaptation (an optimum phenotypic state of a population) ( Grafen, 1988 ). Following the logic of G. C. Williams (1966) , he argued that fitness is a property of design, not a property of an individual, and that using overly all-encompassing fitness measures such as Lifetime Reproductive Success (LRS) would not answer questions about the adaptive significance of traits that have their most important function during restricted parts of the life cycle, such as among juveniles or during mating ( Grafen, 1988 ).

Grafen used a hypothetical example of the wing spots on the hindwing of the butterfly Maniola jurtina to illustrate his reasoning. He argued that the really interesting question was the adaptive significance of these hindwing spots, rather than whether they were currently under selection, and he suggested that biologists would gain more insight by experimentally increasing or decreasing the number of spots instead of measuring natural variation in spot number or spot size using Lande and Arnold’s approach ( Grafen, 1988 ). That is, evolutionary biologists should focus on the current utility of traits, rather than on selection in progress.

Although Grafen’s distinction between adaptation and current utility vs. selection in progress is important, it is not always that clear-cut. Current utility of a trait implies that the current population trait mean (presumably located at some intermediate optimum) maximizes fitness, compared to alternative variants. This is just another way of saying that the trait is currently experiencing stabilizing selection; it is thus a claim about selection in progress! Moreover, Lande and Arnold’s regression approach was not only designed to detect directional selection, but could also reveal stabilizing and disruptive selection ( Lande & Arnold, 1983 ), so it is strange that Grafen did not embrace this approach as a complement to experimental manipulations.

Behavioral ecologists in the British research tradition that Grafen represents tend to focus only on evolutionary endpoints and equilibria, asking questions like “ Is this trait adaptive? ”, while tending to ignore the equally interesting question “ How did the trait end up here? ”. This obsession with evolutionary endpoints and the adaptive significance of traits is quite evident in the research tradition Grafen belongs to, where phenotypic models based on optimization theory and game theory are valued more highly than dynamic quantitative and population genetic models aimed at detecting selection in progress. Many evolutionary biologists—the present author included—are more interested in selection in progress than in whether a trait is an adaptation (or not). Thus, Grafen’s value-laden statement that evolutionary biologists are more interested in whether a trait is an adaptation than in selection in progress may well reflect his own cultural and scientific bias more than the views of the majority of evolutionary biologists, but that is ultimately an empirical question for historians and sociologists of science to investigate. The historian Tim Lewens has characterized the British research tradition in behavioral ecology and phenotypic modeling as “Neo-Paleyan biology” ( Lewens, 2019 ), referring to the natural theologian William Paley, who in pre-Darwinian times saw adaptive design everywhere in nature and interpreted it as a sign of God’s designing ability. Paley made famous the analogy with a watchmaker, and Richard Dawkins openly expressed his admiration of Paley in his book “ The Blind Watchmaker ” ( Dawkins, 1986 ). The provocative title demonstrates how Dawkins was largely in agreement with Paley that adaptive design is the important question in evolutionary biology. Neo-Paleyan biology today, according to Lewens and Arvid Ågren, is primarily alive in Britain, with Dawkins, Grafen, and Andy Gardner as its main representatives ( Ågren, 2021 ; Lewens, 2019 ). Neo-Paleyan biology can be characterized as a research program focused on the adaptive design and current utility of traits but with little interest in evolutionary history or selection in progress ( Reeve & Sherman, 1993 ). However, the distinction between adaptation and selection in progress largely disappears if we realize that claims about adaptation and current utility are also implicit claims about selection in progress, namely stabilizing selection around a current local optimum ( Hansen, 1997 ).

Lande and Arnold’s paper had a long-lasting impact on evolutionary biology, particularly in field ecological studies ( Figure 2 ). It stimulated several discussions about the nature and limitations of statistical tools vs. experiments and about general issues of causal inference ( Figures 1 – 5 ), and it revealed both the power and the limitations of studies of selection and adaptation. The main influence of their paper was to provide a useful empirical tool that resulted in hundreds of field studies documenting and quantifying natural selection ( Figure 2 ). This stimulated several influential meta-analyses that have enriched our understanding of the strength and variability of phenotypic selection in natural populations ( Kingsolver & Diamond, 2011 ; Kingsolver et al., 2001b ; Siepielski et al., 2009 , 2011 , 2013 , 2017 ). Given this large body of empirical work, what remains to be done and what is the future of selection studies in natural populations? Five remaining challenges come to my mind.

First, measuring and analyzing individual phenotypes is time-consuming and a major bottleneck. New automated data collection techniques and high-throughput phenotyping (“phenomics”), combining digital data with tools from machine learning and Artificial Intelligence, including Computer Vision, can hopefully overcome some of the bottlenecks of limited sample sizes in selection studies ( Lürig et al., 2021 ). However, formidable challenges remain in quantifying fitness or fitness components in the field.

Second, our knowledge about multivariate selection—including various forms of nonlinear selection—still lags behind our knowledge about directional selection. In particular, how common is stabilizing vs. disruptive selection? How common and strong is correlational selection, i.e., selection for trait combinations, compared to selection on traits in isolation, and what are the genomic, developmental, and evolutionary consequences of such selection ( Sinervo & Svensson, 2002 ; Svensson et al., 2021 )? There is still no major meta-analysis of correlational selection, largely because this form of selection is seldom quantified in field studies, in spite of statistical tools being available ( Blows, 2007 ; Blows et al., 2003 , 2004 ; Phillips & Arnold, 1989 ; Svensson et al., 2021 ).
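
The bookkeeping involved in estimating nonlinear and correlational gradients is not the obstacle. As a purely illustrative sketch (Python, simulated data with invented coefficients), the following extends the same regression to quadratic and cross-product terms and applies the correction emphasized by Stinchcombe et al. (2008): the diagonal elements of the γ matrix are twice the fitted quadratic coefficients, whereas the correlational gradient is the cross-product coefficient itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two standardized traits and simulated relative fitness with built-in
# correlational selection (the 0.3 * z1 * z2 term); all numbers are invented.
n = 1000
z1, z2 = rng.normal(size=(2, n))
W = np.exp(0.2 * z1 - 0.1 * z1**2 - 0.1 * z2**2 + 0.3 * z1 * z2
           + rng.normal(scale=0.3, size=n))
w = W / W.mean()

# First-order model for the directional gradients (beta), then the full
# second-order model for the gamma matrix.
X1 = np.column_stack([np.ones(n), z1, z2])
beta = np.linalg.lstsq(X1, w, rcond=None)[0][1:]

X2 = np.column_stack([np.ones(n), z1, z2, z1**2, z2**2, z1 * z2])
b = np.linalg.lstsq(X2, w, rcond=None)[0]

# Stinchcombe et al. (2008): diagonal (stabilizing/disruptive) gradients are
# TWICE the quadratic coefficients; the off-diagonal (correlational) gradient
# is the cross-product coefficient itself.
gamma = np.array([[2 * b[3], b[5]],
                  [b[5], 2 * b[4]]])
print("beta:", beta.round(3))
print("gamma:\n", gamma.round(3))
```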

Third, what are the demographic and life-history consequences of phenotypic selection on individuals for population growth rate, extinction risk ( Martins et al., 2018 ), or evolutionary rescue ( Bell, 2017 )? This is an area that is largely unexplored empirically, although the theoretical framework has been available for decades ( Lande, 1982 ).

Fourth, how can we better integrate ecological selection studies in the field with research on phenotypic plasticity and development? A theoretical and analytical framework is available to quantify selection on function-valued traits ( Stinchcombe & Kirkpatrick, 2012 ), such as reaction norm slopes and intercepts, but empirical studies are still few, largely because of the need for large sample sizes ( Chevin et al., 2010 ; Kingsolver et al., 2001a ; Lande, 2009 ; Svensson et al., 2020 ).

Finally, how can we connect short-term ecological studies of selection on fitness components to the macroevolutionary time scales that are the focus of phylogenetic comparative studies ( Uyeda et al., 2011 )? In particular, how are phylogenetic signatures of multiple optima that are often interpreted as stabilizing selection ( Beaulieu et al., 2012 ; Hansen, 1997 ) related to the estimates of stabilizing selection in microevolutionary studies? Solving these challenges will require close collaborations between experimental and comparative evolutionary biologists, empiricists, and theoreticians with complementary expertise.

My research has been funded by the Swedish Research Council (VR; grant no. 2020–03123).

The only data in this paper are from the citation analyses in Figure 2 . These citation data are available on Dryad Digital Repository ( https://datadryad.org/stash ): https://datadryad.org/stash/dataset/doi:10.5061/dryad.cjsxksnb6 .

E.I.S. came up with the idea to write this paper, carried out all the analyses, made all the figures and wrote the manuscript.

Conflicts of interest: The author declares no conflict of interest.

I thank Editor Tracy Chapman for encouraging me to write this paper and for kindly inviting me to submit it as a Perspective-article in Evolution . I am grateful for constructive comments by Arvid Ågren, Masahito Tsuboi, Stephen De Lisle, Laura Sophie Hildesheim and several members of “Svensson Lab” on an early version of the manuscript.

Ågren , J. A. ( 2021 ). The Gene’s-eye view of evolution . Oxford University Press .

Allendorf , F. W. , & Hard , J. J. ( 2009 ). Human-induced evolution caused by unnatural selection through harvest of wild animals . Proceedings of the National Academy of Sciences , 106 ( Suppl 1 ), 9987 – 9994 . https://doi.org/10.1073/pnas.0901069106

Andersson , M. ( 1982 ). Female choice selects for extreme tail length in a widowbird . Nature , 299 ( 5886 ), 818 – 820 . https://doi.org/10.1038/299818a0

Antonovics , J. ( 1987 ). The evolutionary dys-synthesis: Which Bottles for which wine? American Naturalist , 129 ( 3 ), 321 – 331 . https://doi.org/10.1086/284639

Arnold , S. J. ( 1983 ). Morphology, performance and fitness . American Zoologist , 23 ( 2 ), 347 – 361 . https://doi.org/10.1093/icb/23.2.347

Arnold , S. J. , Pfrender , M. E. , & Jones , A. G. ( 2001 ). The adaptive landscape as a conceptual bridge between micro- and macroevolution . Genetica , 112–113 , 9 – 32 .

Barton , N. H. , & Turelli , M. ( 1989 ). Evolutionary quantitative genetics: How little do we know? Annual Review of Genetics , 23 , 337 – 370 . https://doi.org/10.1146/annurev.ge.23.120189.002005

Beaulieu , J. M. , Jhwueng , D. -C. , Boettiger , C. , & O’Meara , B. C. ( 2012 ). Modelling stabilizing selection: Expanding the Ornstein-Uhlenbeck model of adaptive evolution . Evolution , 66 ( 8 ), 2369 – 2383 . https://doi.org/10.1111/j.1558-5646.2012.01619.x

Bell , G. ( 2017 ). Evolutionary rescue . Annual Review of Ecology, Evolution, and Systematics , 48 ( 1 ), 605 – 627 . https://doi.org/10.1146/annurev-ecolsys-110316-023011

Blows , M. W. ( 2007 ). A tale of two matrices: Multivariate approaches in evolutionary biology . Journal of Evolutionary Biology , 20 ( 1 ), 1 – 8 . https://doi.org/10.1111/j.1420-9101.2006.01164.x

Blows , M. W. , Brooks , R. , & Kraft , P. G. ( 2003 ). Exploring complex fitness surfaces: Multiple ornamentation and polymorphism in male guppies . Evolution , 57 ( 7 ), 1622 – 1630 . https://doi.org/10.1111/j.0014-3820.2003.tb00369.x

Blows , M. W. , Chenoweth , S. F. , & Hine , E. ( 2004 ). Orientation of the genetic variance-covariance matrix and the fitness surface for multiple male sexually selected traits . American Naturalist , 163 , E329 – E340 .

Boag , P. T. , & Grant , P. R. ( 1981 ). Intense natural selection in a population of Darwin’s Finches (Geospizinae) in the Galápagos . Science , 214 ( 4516 ), 82 – 85 . https://doi.org/10.1126/science.214.4516.82

Bonnet , T. , Morrissey , M. B. , de Villemereuil , P. , Alberts , S. C. , Arcese , P. , Bailey , L. D. , Boutin , S. , Brekke , P. , Brent , L. J. N. , Camenisch , G. , Charmantier , A. , Clutton-Brock , T. H. , Cockburn , A. , Coltman , D. W. , Courtiol , A. , Davidian , E. , Evans , S. R. , Ewen , J. G. , Festa-Bianchet , M. , … Kruuk , L. E. B. ( 2022 ). Genetic variance in fitness indicates rapid contemporary adaptive evolution in wild animals . Science , 376 ( 6596 ), 1012 – 1016 . https://doi.org/10.1126/science.abk0853

Brown , C. R. , & Brown , M. B. ( 1998 ). Intense natural selection on body size and wing and tail asymmetry in Cliff Swallows during severe weather . Evolution , 52 ( 5 ), 1461 – 1475 . https://doi.org/10.1111/j.1558-5646.1998.tb02027.x

Bumpus , H.C. ( 1899 ). The elimination of the unfit as illustrated by the introduced sparrow, Passer domesticus. In Biological Lectures from the Marine Biological Laboratory of Woods Hole, Mass . pp. 209 – 228 .

Calsbeek , R. , & Cox , R. M. ( 2010 ). Experimentally assessing the relative importance of predation and competition as agents of selection . Nature , 465 ( 7298 ), 613 – 616 . https://doi.org/10.1038/nature09020

Calsbeek , R. , Gosden , T. P. , Kuchta , S. R. , & Svensson , E. I. ( 2012 ). Fluctuating selection and dynamic adaptive landscapes. In E. I. Svensson & R. Calsbeek (Eds.), The adaptive landscape in evolutionary biology . Oxford University Press .

Campbell-Staton , S. C. , Arnold , B. J. , Gonçalves , D. , Granli , P. , Poole , J. , Long , R. A. , & Pringle , R. M. ( 2021 ). Ivory poaching and the rapid evolution of tusklessness in African elephants . Science , 374 ( 6566 ), 483 – 487 . https://doi.org/10.1126/science.abe7389

Campbell-Staton , S. C. , Cheviron , Z. A. , Rochette , N. , Catchen , J. , Losos , J. B. , & Edwards , S. V. ( 2017 ). Winter storms drive rapid phenotypic, regulatory, and genomic shifts in the green anole lizard . Science , 357 ( 6350 ), 495 – 498 . https://doi.org/10.1126/science.aam5512

Chevin , L. M. , Lande , R. , & Mace , G. M. ( 2010 ). Adaptation, plasticity, and extinction in a changing environment: Towards a predictive theory . PLoS Biology , 8 ( 4 ), e1000357 . https://doi.org/10.1371/journal.pbio.1000357

Darwin , C. ( 1859 ). On the origin of species by natural selection . Murray .

Dawkins , R. ( 1976 ). The selfish gene . Oxford University Press .

Dawkins , R. ( 1986 ). The blind watchmaker . W. W. Norton & Company .

De Lisle , S. P. , Bolnick , D. I. , Brodie , E. D. III , Moore , A. J. , & McGlothlin , J. W. ( 2022 ). Interacting phenotypes and the coevolutionary process: Interspecific indirect genetic effects alter coevolutionary dynamics . Evolution , 76 ( 3 ), 429 – 444 . https://doi.org/10.1111/evo.14427

De Lisle , S. P. , & Svensson , E. I. ( 2017 ). On the standardization of fitness and traits in comparative studies of phenotypic selection . Evolution , 71 ( 10 ), 2313 – 2326 . https://doi.org/10.1111/evo.13325

Edelaar , P. , Otsuka , J. , & Luque , V. J. ( 2022 ). A generalised approach to the study and understanding of adaptive evolution . Biological Reviews , 98 ( 1 ), 352 – 375 . https://doi.org/10.1111/brv.12910

Endler , J. A. ( 1980 ). Natural selection on color patterns in Poecilia reticulata . Evolution , 34 ( 1 ), 76 – 91 . https://doi.org/10.1111/j.1558-5646.1980.tb04790.x

Endler , J. A. ( 1986 ). Natural selection in the wild . Princeton University Press .

Endler , J. A. , & McLellan , T. ( 1988 ). The processes of evolution: Toward a newer Synthesis . Annual Review of Ecology and Systematics , 19 ( 1 ), 395 – 421 . https://doi.org/10.1146/annurev.es.19.110188.002143

Falconer , D. S. ( 1989 ). An introduction to quantitative genetics . Longman .

Fisher , R. A. ( 1930 ). The Genetical theory of natural selection . Clarendon Press .

Gavrilets , S. ( 2003 ). Perspective: Models of speciation: What have we learned in 40 years? Evolution , 57 ( 10 ), 2197 – 2215 . https://doi.org/10.1111/j.0014-3820.2003.tb00233.x

Gibbs , H. L. , & Grant , P. R. ( 1987 ). Oscillating selection on Darwin’s finches . Nature , 327 ( 6122 ), 511 – 513 . https://doi.org/10.1038/327511a0

Gosden , T. P. , & Svensson , E. I. ( 2008 ). Spatial and temporal dynamics in a sexual selection mosaic . Evolution , 62 ( 4 ), 845 – 856 . https://doi.org/10.1111/j.1558-5646.2008.00323.x

Gould , S. J. , & Lewontin , R. C. ( 1979 ). The spandrels of San Marco and the Panglossian Paradigm: A critique of the adaptationist programme . Proceedings of the Royal Society London, B , 205 , 581 – 598 .

Grafen , A. ( 1988 ). On the use of data on lifetime reproductive success. In T. H. Clutton-Brock (Ed.), Reproductive success (pp. 454 – 471 ). The University of Chicago Press .

Grant , P. R. , & Grant , B. R. ( 2002 ). Unpredictable evolution in a 30-year study of Darwin’s finches . Science , 296 ( 5568 ), 707 – 711 . https://doi.org/10.1126/science.1070315

Haldane , J. B. S. ( 1937 ). The effect of variation on fitness . American Naturalist , 71 ( 735 ), 337 – 349 . https://doi.org/10.1086/280722

Haldane , J. B. S. ( 1957 ). The cost of natural selection . Journal of Genetics , 55 ( 3 ), 511 – 524 . https://doi.org/10.1007/bf02984069

Hansen , T. F. ( 1997 ). Stabilizing selection and the comparative analysis of adaptation . Evolution , 51 ( 5 ), 1341 – 1351 . https://doi.org/10.1111/j.1558-5646.1997.tb01457.x

Hansen , T. F. , & Pélabon , C. ( 2021 ). Evolvability: A quantitative-genetics perspective . Annual Review of Ecology, Evolution, and Systematics , 52 ( 1 ), 153 – 175 . https://doi.org/10.1146/annurev-ecolsys-011121-021241

Heisler , I. L. , & Damuth , J. ( 1987 ). A method for analyzing selection in hierchically structured populations . American Naturalist , 130 ( 4 ), 582 – 602 . https://doi.org/10.1086/284732

Hendry , A. P. ( 2016 ). Eco-evolutionary dynamics . Princeton University Press .

Hendry , A. P. , & Kinnison , M. T. ( 1999 ). The pace of modern life: Measuring rates of contemporary microevolution . Evolution , 53 ( 6 ), 1637 – 1653 . https://doi.org/10.1111/j.1558-5646.1999.tb04550.x

Hereford , J. , Hansen , T. F. , & Houle , D. ( 2004 ). Comparing strengths of directional selection: How strong is strong? Evolution , 58 ( 10 ), 2133 – 2143 . https://doi.org/10.1111/j.0014-3820.2004.tb01592.x

Hull , D. L ( 1980 ). Individuality and selection . Annual Review of Ecology and Systematics , 11 ( 1 ), 311 – 332 . https://doi.org/10.1146/annurev.es.11.110180.001523

Kimura , M. ( 1983 ). The neutral theory of molecular evolution . Cambridge University Press .

Kingsolver , J. G. , & Diamond , S. E. ( 2011 ). Phenotypic selection in natural populations: What limits directional selection? American Naturalist , 177 ( 3 ), 346 – 357 . https://doi.org/10.1086/658341

Kingsolver , J. G. , Gomulkiewicz , R. , & Carter , P. A. ( 2001a ). Variation, selection and evolution of function-valued traits . Genetica , 112 , 87 – 104 .

Kingsolver , J. G. , Hoekstra , H. E. , Hoekstra , J. M. , Berrigan , D. , Vignieri , S. N. , Hill , C. E. , Hoang , A. , Gibert , P. , & Beerli , P. ( 2001b ). The strength of phenotypic selection in natural populations . American Naturalist , 157 , 245 – 261 .

Kingsolver , J. G. , & Schemske , D. W. ( 1991 ). Path analyses of selection . Trends in Ecology and Evolution , 6 ( 9 ), 276 – 280 . https://doi.org/10.1016/0169-5347(91)90004-H

Lande , R. ( 1976 ). Natural selection and random genetic drift in phenotypic evolution . Evolution , 30 ( 2 ), 314 – 334 . https://doi.org/10.1111/j.1558-5646.1976.tb00911.x

Lande , R. ( 1977 ). Statistical tests for natural selection on quantitative genetic characters . Evolution , 31 ( 2 ), 442 – 444 . https://doi.org/10.1111/j.1558-5646.1977.tb01025.x

Lande , R. ( 1979 ). Quantitative genetic analysis of multivariate evolution, applied to brain: Body size allometry . Evolution , 33 ( 1Part2 ), 402 – 416 . https://doi.org/10.1111/j.1558-5646.1979.tb04694.x

Lande , R. ( 1980a ). Genetic variation and phenotypic evolution during allopatric speciation . American Naturalist , 116 ( 4 ), 463 – 479 . https://doi.org/10.1086/283642

Lande , R. ( 1980b ). Sexual dimorphism, sexual selection, and adaptation in polygenic characters . Evolution , 34 ( 2 ), 292 – 305 . https://doi.org/10.1111/j.1558-5646.1980.tb04817.x

Lande , R. ( 1982 ). A quantitative genetic theory of life history evolution . Ecology , 63 ( 3 ), 607 – 615 . https://doi.org/10.2307/1936778

Lande , R. ( 2009 ). Adaptation to an extraordinary environment by evolution of phenotypic plasticity and genetic assimilation . Journal of Evolutionary Biology , 22 ( 7 ), 1435 – 1446 . https://doi.org/10.1111/j.1420-9101.2009.01754.x

Lande , R. , & Arnold , S. J. ( 1983 ). The measurement of selection on correlated characters . Evolution , 37 ( 6 ), 1210 – 1226 . https://doi.org/10.1111/j.1558-5646.1983.tb00236.x

Lerner , I. M. , & Dempster , E. R. ( 1948 ). Some aspects of evolutionary theory in the light of recent work on animal breeding . Evolution , 2 ( 1 ), 19 – 28 . https://doi.org/10.1111/j.1558-5646.1948.tb02728.x

Lewens , T. ( 2019 ). Neo-Paleyan biology . Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 76 , 101185 . https://doi.org/10.1016/j.shpsc.2019.101185

Lewontin , R. C ( 1970 ). The units of selection . Annual Review of Ecology and Systematics , 1 ( 1 ), 1 – 18 . https://doi.org/10.1146/annurev.es.01.110170.000245

Lewontin , R. C. ( 1978 ). Adaptation . Scientific American , 239 , 212 – 230 .

Lürig , M. D. , Donoughe , S. , Svensson , E. I. , Porto , A. , & Tsuboi , M. ( 2021 ). Computer vision, machine learning, and the promise of phenomics in ecology and evolutionary biology . Frontiers in Ecology and Evolution , 9 , 642774 .

Lynch , M. , & Walsh , B. ( 1998 ). Genetics and analysis of quantitative traits . Sinauer Associates, Inc .

MacColl , A. D. ( 2011 ). The ecological causes of evolution . Trends in Ecology and Evolution , 26 ( 10 ), 514 – 522 . https://doi.org/10.1016/j.tree.2011.06.009

Martins , M. J. F. , Puckett , T. M. , Lockwood , R. , Swaddle , J. P. , & Hunt , G. ( 2018 ). High male sexual investment as a driver of extinction in fossil ostracods . Nature , 556 ( 7701 ), 366 – 369 . https://doi.org/10.1038/s41586-018-0020-7

Mauricio , R. , & Rausher , M. D. ( 1997 ). Experimental manipulation of putative selective agents provides evidence for the role of natural enemies in the evolution of plant defense . Evolution , 51 ( 5 ), 1435 – 1444 . https://doi.org/10.1111/j.1558-5646.1997.tb01467.x

McGlothlin , J. W. , & Fisher , D. N. ( 2022 ). Social selection and the evolution of maladaptation . Journal of Heredity , 113 , 61 – 68 .

Mitchell-Olds , T. , & Shaw , R. G. ( 1987 ). Regression analysis of natural selection: Statistical inference and biological interpretation . Evolution , 41 ( 6 ), 1149 – 1161 . https://doi.org/10.1111/j.1558-5646.1987.tb02457.x

Morrissey , M. B. , & Hadfield , J. D. ( 2012 ). Directional selection in temporally replicated studies is remarkably consistent . Evolution , 66 ( 2 ), 435 – 442 . https://doi.org/10.1111/j.1558-5646.2011.01444.x

Mousseau , T. A. , & Roff , D. A. ( 1987 ). Natural selection and the heritability of fitness components . Heredity , 59 ( 2 ), 181 – 197 . https://doi.org/10.1038/hdy.1987.113

Okasha , S. ( 2006 ). Evolution and the levels of selection . Oxford University Press .

Opedal , O. ( 2021 ). A functional view reveals substantial predictability of pollinator-mediated selection . Journal of Pollination Ecology , 30 , 273 – 288 . https://doi.org/10.26786/1920-7603(2021)673

Otsuka , J. ( 2016 ). A critical review of the statisticalist debate . Biology and Philosophy , 31 ( 4 ), 459 – 482 . https://doi.org/10.1007/s10539-016-9528-0

Otsuka , Y. ( 2019 ). Ontology, causality, and methodology of evolutionary research programs. In T. Uller & K. N. Laland (Eds.), Evolutionary causation: Biological and philosophical reflections (pp. 247 – 264 ). The MIT Press .

Phillips , P. C. , & Arnold , S. J. ( 1989 ). Visualizing multivariate selection . Evolution , 43 ( 6 ), 1209 – 1222 . https://doi.org/10.1111/j.1558-5646.1989.tb02569.x

Pianka , E. R. ( 1988 ). Evolutionary ecology . Harper & Row .

Price , T. , Kirkpatrick , M. , & Arnold , S. J. ( 1988 ). Directional selection and the evolution of breeding date in birds . Science , 240 ( 4853 ), 798 – 799 . https://doi.org/10.1126/science.3363360

Price , T. D. , Brown , C. R. , & Brown , M. B. ( 2000 ). Evaluation of selection on Cliff Swallows . Evolution , 54 ( 5 ), 1824 – 1827 . https://doi.org/10.1111/j.0014-3820.2000.tb00727.x

Rausher , M. D. ( 2000 ). The measurement of selection on quantitative traits: Biases due to environmental covariances between traits and fitness . Evolution , 46 , 616 – 626 .

Reeve , H. K. , & Sherman , P. W. ( 1993 ). Adaptations and the goals of evolutionary research . Quarterly Review of Biology , 68 ( 1 ), 1 – 32 . https://doi.org/10.1086/417909

Reznick , D. , Shaw , F. H. , Rodd , F. H. , & Shaw , R. G. ( 1997 ). Evaluation of the rate of evolution in natural populations of guppies ( Poecilia reticulata ) . Science , 275 ( 5308 ), 1934 – 1937 . https://doi.org/10.1126/science.275.5308.1934

Rice , W. R. , & Hostert , E. ( 1993 ). Laboratory experiments on speciation: What have we learned in 40 years? Evolution , 47 , 1637 – 1653 .

Sanderson , S. , Beausoleil , M. -O. , O’Dea , R. E. , Wood , Z. T. , Correa , C. , Frankel , V. , Gorné , L. D. , Haines , G. E. , Kinnison , M. T. , Oke , K. B. , Pelletier , F. , Pérez-Jvostov , F. , Reyes-Corral , W. D. , Ritchot , Y. , Sorbara , F. , Gotanda , K. M. , & Hendry , A. P. ( 2022 ). The pace of modern life, revisited . Molecular Ecology , 31 ( 4 ), 1028 – 1043 . https://doi.org/10.1111/mec.16299

Santangelo , J. S. , Ness , R. W. , Cohan , B. , Fitzpatrick , C. R. , Innes , S. G. , Koch , S. , Miles , L. S. , Munim , S. , Peres-Neto , P. R. , Prashad , C. , Tong , A. T. , Aguirre , W. E. , Akinwole , P. O. , Alberti , M. , Álvarez , J. , Anderson , J. T. , Anderson , J. J. , Ando , Y. , Andrew , N. R. , … Johnson , M. T. J. ( 2022 ). Global urban environmental change drives adaptation in white clover . Science , 375 ( 6586 ), 1275 – 1281 . https://doi.org/10.1126/science.abk0989

Schluter , D. ( 1994 ). Experimental evidence that competition promotes divergence in adaptive radiation . Science , 266 ( 5186 ), 798 – 801 . https://doi.org/10.1126/science.266.5186.798

Schluter , D. ( 1996 ). Adaptive radiation along genetic lines of least resistance . Evolution , 50 ( 5 ), 1766 – 1774 . https://doi.org/10.1111/j.1558-5646.1996.tb03563.x

Schluter , D. ( 2003 ). Frequency dependent natural selection during character displacement in sticklebacks . Evolution , 57 ( 5 ), 1142 – 1150 . https://doi.org/10.1111/j.0014-3820.2003.tb00323.x

Schoener , T. W. ( 2011 ). The newest synthesis: Understanding the interplay of evolutionary and ecological dynamics . Science , 331 ( 6016 ), 426 – 429 . https://doi.org/10.1126/science.1193954

Shipley , B. ( 2002 ). Cause and correlation in biology: A user’s guide to path analysis, structural equations and causal inference . Cambridge University Press .

Siepielski , A. M. , DiBattista , J. D. , & Carlson , S. M. ( 2009 ). It’s about time: The temporal dynamics of phenotypic selection in the wild . Ecology Letters , 12 ( 11 ), 1261 – 1276 . https://doi.org/10.1111/j.1461-0248.2009.01381.x

Siepielski , A. M. , DiBattista , J. D. , Evans , J. A. , & Carlson , S. M. ( 2011 ). Differences in the temporal dynamics of phenotypic selection among fitness components in the wild . Proceedings of the Royal Society B-Biological Sciences , 278 , 1572 – 1580 .

Siepielski , A. M. , Gotanda , K. M. , Morrissey , M. B. , Diamond , S. E. , DiBattista , J. D. , & Carlson , S. M. ( 2013 ). The spatial patterns of directional phenotypic selection . Ecology Letters , 16 ( 11 ), 1382 – 1392 . https://doi.org/10.1111/ele.12174

Siepielski , A. M. , Morrissey , M. B. , Buoro , M. , Carlson , S. M. , Caruso , C. M. , Clegg , S. M. , Coulson , T. , DiBattista , J. , Gotanda , K. M. , Francis , C. D. , Hereford , J. , Kingsolver , J. G. , Augustine , K. E. , Kruuk , L. E. B. , Martin , R. A. , Sheldon , B. C. , Sletvold , N. , Svensson , E. I. , Wade , M. J. , & MacColl , A. D. C. ( 2017 ). Precipitation drives global variation in natural selection . Science , 355 , 959 – 962 .

Sinervo , B. , & Basolo , A. L. ( 1996 ). Testing adaptation using phenotypic manipulation. In M. R. Rose , & G. V. Lauder (Eds.), Adaptation (pp. 149 – 185 ). Academic Press .

Sinervo , B. , Doughty , P. , Huey , R. B. , & Zamudio , K. ( 1992 ). Allometric engineering: A causal analysis of natural selection on offspring size . Science , 258 , 1927 – 1930 .

Sinervo , B. , & Svensson , E. ( 2002 ). Correlational selection and the evolution of genomic architecture . Heredity , 16 , 948 – 955 .

Sletvold , N. , Trunschke , J. , Smit , M. , Verbeek , J. , & Ågren , J. ( 2016 ). Strong pollinator-mediated selection for increased flower brightness and contrast in a deceptive orchid . Evolution , 70 ( 3 ), 716 – 724 . https://doi.org/10.1111/evo.12881

Slobodkin , L. B. ( 1961 ). Growth and regulation of animal populations . Holt, Rinehart and Winston .

Sober , E. ( 1984 ). The nature of selection: Evolutionary theory in philosophical focus . University of Chicago Press .

Stamps , J. A. ( 2011 ). Density bias in behavioral ecology . Behavioral Ecology , 22 ( 2 ), 231 – 232 . https://doi.org/10.1093/beheco/arq174

Stinchcombe , J. R. , Agrawal , A. F. , Hohenlohe , P. A. , Arnold , S. J. , & Blows , M. W. ( 2008 ). Estimating nonlinear selection gradients using quadratic regression coefficients: Double or nothing? Evolution , 62 ( 9 ), 2435 – 2440 . https://doi.org/10.1111/j.1558-5646.2008.00449.x

Stinchcombe , J. R. , & Kirkpatrick , M. ; Function-valued Traits Working Group. ( 2012 ). Genetics and evolution of function-valued traits: Understanding environmentally responsive phenotypes . Trends in Ecology and Evolution , 27 ( 11 ), 637 – 647 . https://doi.org/10.1016/j.tree.2012.07.002

Stinchcombe , J. R. , Rutter , M. T. , Burdick , D. S. , Tiffin , P. , Rausher , M. D. , & Mauricio , R. ( 2002 ). Testing for environmentally induced bias in phenotypic estimates of natural selection: Theory and practice . American Naturalist , 160 , 511 – 523 .

Svensson , E. , & Sinervo , B. ( 2000 ). Experimental excursions on adaptive landscapes: Density-dependent selection on egg size . Evolution , 54 ( 4 ), 1396 – 1403 . https://doi.org/10.1111/j.0014-3820.2000.tb00571.x

Svensson , E. I ( 2019 ). Eco-evolutionary dynamics of sexual selection and sexual conflict . Functional Ecology , 33 , 60 – 72 .

Svensson , E. I. , Arnold , S. J. , Bürger , R. , Csilléry , K. , Draghi , J. A. , Henshaw , J. M. , De Lisle , S. , Marques , D. A. , McGuigan , K. , Simon , M. N. , & Runemark , A. ( 2021 ). Correlational selection in the age of genomics . Nature Ecology and Evolution , 5 , 562 – 572 .

Svensson , E. I. , & Calsbeek , R. ( 2012 ). The adaptive landscape in evolutionary biology . Oxford University Press .

Svensson , E. I. , & Friberg , M. ( 2007 ). Selective predation on wing morphology in sympatric damselflies . American Naturalist , 170 , 101 – 112 .

Svensson , E. I. , Gomez-Llano , M. A. , & Waller , J. T. ( 2020 ). Selection on phenotypic plasticity favors thermal canalization . Proceedings of the National Academy of Sciences of the United States of America , 117 , 29767 – 29774 .

Svensson , E. I. , & Gosden , T. P. ( 2007 ). Contemporary evolution of secondary sexual traits in the wild . Functional Ecology , 16 , 422 – 433 .

Svensson , E. I. , & Sinervo , B. ( 2004 ). Spatial scale and temporal component of selection in side-blotched lizards . American Naturalist , 163 , 726 – 734 .

Uyeda , J. C. , Hansen , T. F. , Arnold , S. J. , & Pienaar , J. ( 2011 ). The million-year wait for macroevolutionary bursts . Proceedings of the National Academy of Sciences of the United States of America , 108 ( 38 ), 15908 – 15913 . https://doi.org/10.1073/pnas.1014503108

Wade , M. J. , & Kalisz , S. M. ( 1990 ). The causes of natural selection . Evolution , 44 , 1947 – 1955 .

Walsh , B. , & Blows , M. W. ( 2009 ). Abundant genetic variation plus strong selection = multivariate genetic constraints: A geometric view of adaptation . Annual Review of Ecology, Evolution, and Systematics , 40 ( 1 ), 41 – 59 . https://doi.org/10.1146/annurev.ecolsys.110308.120232

Walsh , B. , & Lynch , M. ( 2018 ). Evolution and selection of quantitative traits . Oxford University Press .

Walsh , D. , Lewins , T. , & Ariew , A. ( 2002 ). The trials of life: Natural selection and random drift . Philosophy of Science , 69 , 429 – 446 .

Walsh , D. M. ( 2015 ). Organisms, agency, and evolution . Cambridge University Press .

White , N. J. , & Butlin , R. K. ( 2021 ). Multidimensional divergent selection, local adaptation, and speciation . Evolution , 75 ( 9 ), 2167 – 2178 . https://doi.org/10.1111/evo.14312

Williams , G. C. ( 1966 ). Adaptation and natural selection . Princeton University Press .

Wolf , J. B. , Brodie , E. D. III , Cheverud , J. M. , Moore , A. J. , & Wade , M. J. ( 2001 ). Evolutionary consequences of indirect genetic effects . Trends in Ecology and Evolution , 13 , 64 – 69 .

Young , K. V. , Brodie , E. D. , & Brodie , E. D. ( 2004 ). How the horned lizard got its horns . Science , 304 ( 5667 ), 65 . https://doi.org/10.1126/science.1094790


Volume 2 Supplement 2

Special Issue: Transitional Fossils

  • Evolutionary Concepts
  • Open access
  • Published: 09 April 2009

Understanding Natural Selection: Essential Concepts and Common Misconceptions

  • T. Ryan Gregory

Evolution: Education and Outreach volume 2, pages 156–175 (2009)


Natural selection is one of the central mechanisms of evolutionary change and is the process responsible for the evolution of adaptive features. Without a working knowledge of natural selection, it is impossible to understand how or why living things have come to exhibit their diversity and complexity. An understanding of natural selection also is becoming increasingly relevant in practical contexts, including medicine, agriculture, and resource management. Unfortunately, studies indicate that natural selection is generally very poorly understood, even among many individuals with postsecondary biological education. This paper provides an overview of the basic process of natural selection, discusses the extent and possible causes of misunderstandings of the process, and presents a review of the most common misconceptions that must be corrected before a functional understanding of natural selection and adaptive evolution can be achieved.

“There is probably no more original, more complex, and bolder concept in the history of ideas than Darwin's mechanistic explanation of adaptation.” Ernst Mayr ( 1982 , p.481)

Introduction

Natural selection is a non-random difference in reproductive output among replicating entities, often due indirectly to differences in survival in a particular environment, leading to an increase in the proportion of beneficial, heritable characteristics within a population from one generation to the next. That this process can be encapsulated within a single (admittedly lengthy) sentence should not diminish the appreciation of its profundity and power. It is one of the core mechanisms of evolutionary change and is the main process responsible for the complexity and adaptive intricacy of the living world. According to philosopher Daniel Dennett ( 1995 ), this qualifies evolution by natural selection as “the single best idea anyone has ever had.”

Natural selection results from the confluence of a small number of basic conditions of ecology and heredity. Often, the circumstances in which those conditions apply are of direct significance to human health and well-being, as in the evolution of antibiotic and pesticide resistance or in the impacts of intense predation by humans (e.g., Palumbi 2001 ; Jørgensen et al. 2007 ; Darimont et al. 2009 ). Understanding this process is therefore of considerable importance in both academic and pragmatic terms. Unfortunately, a growing list of studies indicates that natural selection is, in general, very poorly understood—not only by young students and members of the public but even among those who have had postsecondary instruction in biology.

As is true with many other issues, a lack of understanding of natural selection does not necessarily correlate with a lack of confidence about one's level of comprehension. This could be due in part to the perception, unfortunately reinforced by many biologists, that natural selection is so logically compelling that its implications become self-evident once the basic principles have been conveyed. Thus, many professional biologists may agree that “[evolution] shows how everything from frogs to fleas got here via a few easily grasped biological processes ” (Coyne 2006 ; emphasis added). The unfortunate reality, as noted nearly 20 years ago by Bishop and Anderson ( 1990 ), is that “the concepts of evolution by natural selection are far more difficult for students to grasp than most biologists imagine.” Despite common assumptions to the contrary by both students and instructors, it is evident that misconceptions about natural selection are the rule, whereas a working understanding is the rare exception.

The goal of this paper is to enhance (or, as the case may be, confirm) readers' basic understanding of natural selection. This first involves providing an overview of the basis and (one of the) general outcomes of natural selection as they are understood by evolutionary biologists Footnote 1 . This is followed by a brief discussion of the extent and possible causes of difficulties in fully grasping the concept and consequences of natural selection. Finally, a review of the most widespread misconceptions about natural selection is provided. It must be noted that specific instructional tools capable of creating deeper understanding among students generally have remained elusive, and no new suggestions along these lines are presented here. Rather, this article is aimed at readers who wish to confront and correct any misconceptions that they may harbor and/or to better recognize those held by most students and other non-specialists.

The Basis and Basics of Natural Selection

Though rudimentary forms of the idea had been presented earlier (e.g., Darwin and Wallace 1858 and several others before them), it was in On the Origin of Species by Means of Natural Selection that Darwin ( 1859 ) provided the first detailed exposition of the process and implications of natural selection Footnote 2 . According to Mayr ( 1982 , 2001 ), Darwin's extensive discussion of natural selection can be distilled to five “facts” (i.e., direct observations) and three associated inferences. These are depicted in Fig.  1 .

The basis of natural selection as presented by Darwin ( 1859 ), based on the summary by Mayr ( 1982 )

Some components of the process, most notably the sources of variation and the mechanisms of inheritance, were, due to the limited available information in Darwin's time, either vague or incorrect in his original formulation. Since then, each of the core aspects of the mechanism has been elucidated and well documented, making the modern theory Footnote 3 of natural selection far more detailed and vigorously supported than when first proposed 150 years ago. This updated understanding of natural selection consists of the elements outlined in the following sections.

Overproduction, Limited Population Growth, and the “Struggle for Existence”

A key observation underlying natural selection is that, in principle, populations have the capacity to increase in numbers exponentially (or “geometrically”). This is a simple function of mathematics: If one organism produces two offspring, and each of them produces two offspring, and so on, then the total number grows at an increasingly rapid rate (1 → 2 → 4 → 8 → 16 → 32 → 64... to 2^n after n rounds of reproduction).

The enormity of this potential for exponential growth is difficult to fathom. For example, consider that beginning with a single Escherichia coli bacterium, and assuming that cell division occurs every 30 minutes, it would take less than a week for the descendants of this one cell to exceed the mass of the Earth. Of course, exponential population expansion is not limited to bacteria. As Nobel laureate Jacques Monod once quipped, “What is true for E. coli is also true for the elephant,” and indeed, Darwin ( 1859 ) himself used elephants as an illustration of the principle of rapid population growth, calculating that the number of descendants of a single pair would swell to more than 19,000,000 in only 750 years Footnote 4 . Keown ( 1988 ) cites the example of oysters, which may produce as many as 114,000,000 eggs in a single spawn. If all these eggs grew into oysters and produced this many eggs of their own that, in turn, survived to reproduce, then within five generations there would be more oysters than the number of electrons in the known universe.
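
These back-of-envelope claims are easy to check. The short calculation below uses assumed round numbers (an E. coli cell mass of roughly 1e-15 kg and an Earth mass of about 6e24 kg, neither taken from the sources cited above) and confirms that, on a 30-minute doubling schedule, the descendants of a single cell would outweigh the planet within about three days.

```python
import math

# Assumed round numbers: one E. coli cell ~ 1e-15 kg; Earth ~ 5.97e24 kg.
cell_mass_kg = 1e-15
earth_mass_kg = 5.97e24

cells_needed = earth_mass_kg / cell_mass_kg
doublings = math.ceil(math.log2(cells_needed))   # ~133 doublings
hours = doublings * 0.5                          # one doubling per 30 minutes

print(f"{doublings} doublings, i.e. about {hours:.0f} hours "
      f"(~{hours / 24:.1f} days)")
```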

Clearly, the world is not overrun with bacteria, elephants, or oysters. Though these and all other species engage in massive overproduction (or “superfecundity”) and therefore could in principle expand exponentially, in practice they do not Footnote 5 . The reason is simple: Most offspring that are produced do not survive to produce offspring of their own. In fact, most population sizes tend to remain relatively stable over the long term. This necessarily means that, on average, each pair of oysters produces only two offspring that go on to reproduce successfully—and that 113,999,998 eggs per female per spawn do not survive (see also Ridley 2004 ). Many young oysters will be eaten by predators, others will starve, and still others will succumb to infection. As Darwin ( 1859 ) realized, this massive discrepancy between the number of offspring produced and the number that can be sustained by available resources creates a “struggle for existence” in which often only a tiny fraction of individuals will succeed. As he noted, this can be conceived as a struggle not only against other organisms (especially members of the same species, whose ecological requirements are very similar) but also in a more abstract sense between organisms and their physical environments.

Variation and Inheritance

Variation among individuals is a fundamental requirement for evolutionary change. Given that it was both critical to his theory of natural selection and directly counter to much contemporary thinking, it should not be surprising that Darwin ( 1859 ) expended considerable effort in attempting to establish that variation is, in fact, ubiquitous. He also emphasized the fact that some organisms—namely relatives, especially parents and their offspring—are more similar to each other than to unrelated members of the population. This, too, he realized is critical for natural selection to operate. As Darwin ( 1859 ) put it, “Any variation which is not inherited is unimportant for us.” However, he could not explain either why variation existed or how specific characteristics were passed from parent to offspring, and therefore was forced to treat both the source of variation and the mechanism of inheritance as a “black box.”

The workings of genetics are no longer opaque. Today, it is well understood that inheritance operates through the replication of DNA sequences and that errors in this process (mutations) and the reshuffling of existing variants (recombination) represent the sources of new variation. In particular, mutations are known to be random (or less confusingly, “undirected”) with respect to any effects that they may have. Any given mutation is merely a chance error in the genetic system, and as such, its likelihood of occurrence is not influenced by whether it will turn out to be detrimental, beneficial, or (most commonly) neutral.

As Darwin anticipated, extensive variation among individuals has now been well established to exist at the physical, physiological, and behavioral levels. Thanks to the rise of molecular biology and, more recently, of genomics, it also has been possible to document variation at the level of proteins, genes, and even individual DNA nucleotides in humans and many other species.

Non-random Differences in Survival and Reproduction

Darwin saw that overproduction and limited resources create a struggle for existence in which some organisms will succeed and most will not. He also recognized that organisms in populations differ from one another in terms of many traits that tend to be passed on from parent to offspring. Darwin's brilliant insight was to combine these two factors and to realize that success in the struggle for existence would not be determined by chance, but instead would be biased by some of the heritable differences that exist among organisms. Specifically, he noted that some individuals happen to possess traits that make them slightly better suited to a particular environment, meaning that they are more likely to survive than individuals with less well suited traits. As a result, organisms with these traits will, on average, leave more offspring than their competitors.

Whereas the origin of a new genetic variant occurs at random in terms of its effects on the organism, the probability of it being passed on to the next generation is absolutely non-random if it impacts the survival and reproductive capabilities of that organism. The important point is that this is a two-step process: first, the origin of variation by random mutation, and second, the non-random sorting of variation due to its effects on survival and reproduction (Mayr 2001 ). Though definitions of natural selection have been phrased in many ways (Table  1 ), it is this non-random difference in survival and reproduction that forms the basis of the process.

Darwinian Fitness

The meaning of fitness in evolutionary biology.

In order to study the operation and effects of natural selection, it is important to have a means of describing and quantifying the relationships between genotype (gene complement), phenotype (physical and behavioral features), survival, and reproduction in particular environments. The concept used by evolutionary biologists in this regard is known as “Darwinian fitness,” which is defined most simply as a measure of the total (or relative) reproductive output of an organism with a particular genotype (Table  1 ). In the most basic terms, one can state that the more offspring an individual produces, the higher is its fitness. It must be emphasized that the term “fitness,” as used in evolutionary biology, does not refer to physical condition, strength, or stamina and therefore differs markedly from its usage in common language.

“Survival of the Fittest” is Misleading

In the fifth edition of the Origin (published in 1869), Darwin began using the phrase "survival of the fittest", which had been coined a few years earlier by the British philosopher Herbert Spencer, as shorthand for natural selection. This was an unfortunate decision, as there are several reasons why "survival of the fittest" is a poor descriptor of natural selection. First, in Darwin's context, "fittest" implied "best suited to a particular environment" rather than "most physically fit," but this crucial distinction is often overlooked in non-technical usage (especially when further distorted to "only the strong survive"). Second, it places undue emphasis on survival: While it is true that dead organisms do not reproduce, survival is only important evolutionarily insofar as it affects the number of offspring produced. Traits that make life longer or less difficult are evolutionarily irrelevant unless they also influence reproductive output. Indeed, traits that enhance net reproduction may increase in frequency over many generations even if they compromise individual longevity. Conversely, differences in fecundity alone can create differences in fitness, even if survival rates are identical among individuals. Third, the phrase places an excessive focus on organisms, when in fact traits or their underlying genes equally can be identified as more or less fit than alternatives. Lastly, this phrase is often misconstrued as being circular or tautological (Who survives? The fittest. Who are the fittest? Those who survive). However, again, this misinterprets the modern meaning of fitness, which can be both predicted in terms of which traits are expected to be successful in a specific environment and measured in terms of actual reproductive success in that environment.

Which Traits Are the Most Fit?

Directional natural selection can be understood as a process by which fitter traits (or genes) increase in proportion within populations over the course of many generations. It must be understood that the relative fitness of different traits depends on the current environment. Thus, traits that are fit now may become unfit later if the environment changes. Conversely, traits that have now become fit may have been present long before the current environment arose, without having conferred any advantage under previous conditions. Finally, it must be noted that fitness refers to reproductive success relative to alternatives here and now—natural selection cannot increase the proportion of traits solely because they may someday become advantageous. Careful reflection on how natural selection actually works should make it clear why this is so.

Natural Selection and Adaptive Evolution

Natural selection and the evolution of populations.

Though each has been tested and shown to be accurate, none of the observations and inferences that underlie natural selection is sufficient individually to provide a mechanism for evolutionary change Footnote 6. Overproduction alone will have no evolutionary consequences if all individuals are identical. Differences among organisms are not relevant unless they can be inherited. Genetic variation by itself will not result in natural selection unless it exerts some impact on organism survival and reproduction. However, any time all of Darwin's postulates hold simultaneously—as they do in most populations—natural selection will occur. The net result in this case is that certain traits (or, more precisely, genetic variants that specify those traits) will, on average, be passed on from one generation to the next at a higher rate than existing alternatives in the population. Put another way, when one considers who the parents of the current generation were, it will be seen that a disproportionate number of them possessed traits beneficial for survival and reproduction in the particular environment in which they lived.

The important points are that this uneven reproductive success among individuals represents a process that occurs in each generation and that its effects are cumulative over the span of many generations. Over time, beneficial traits will become increasingly prevalent in descendant populations by virtue of the fact that parents with those traits consistently leave more offspring than individuals lacking those traits. If this process happens to occur in a consistent direction—say, the largest individuals in each generation tend to leave more offspring than smaller individuals—then there can be a gradual, generation-by-generation change in the proportion of traits in the population. This change in proportion, and not the modification of organisms themselves, is what leads to changes in the average value of a particular trait in the population. Organisms do not evolve; populations evolve.

The term “adaptation” derives from ad + aptus , literally meaning “toward + fit”. As the name implies, this is the process by which populations of organisms evolve in such a way as to become better suited to their environments as advantageous traits become predominant. On a broader scale, it is also how physical, physiological, and behavioral features that contribute to survival and reproduction (“adaptations”) arise over evolutionary time. This latter topic is particularly difficult for many to grasp, though of course a crucial first step is to understand the operation of natural selection on smaller scales of time and consequence. (For a detailed discussion of the evolution of complex organs such as eyes, see Gregory 2008b .)

On first pass, it may be difficult to see how natural selection can ever lead to the evolution of new characteristics if its primary effect is merely to eliminate unfit traits. Indeed, natural selection by itself is incapable of producing new traits, and in fact (as many readers will have surmised), most forms of natural selection deplete genetic variation within populations. How, then, can an eliminative process like natural selection ever lead to creative outcomes?

To answer this question, one must recall that evolution by natural selection is a two-step process. The first step involves the generation of new variation by mutation and recombination, whereas the second step determines which randomly generated variants will persist into the next generation. Most new mutations are neutral with respect to survival and reproduction and therefore are irrelevant in terms of natural selection (but not, it must be pointed out, to evolution more broadly). The majority of mutations that have an impact on survival and reproductive output will do so negatively and, as such, will be less likely than existing alternatives to be passed on to subsequent generations. However, a small percentage of new mutations will turn out to have beneficial effects in a particular environment and will contribute to an elevated rate of reproduction by organisms possessing them. Even a very slight advantage is sufficient to cause new beneficial mutations to increase in proportion over the span of many generations.

Biologists sometimes describe beneficial mutations as “spreading” or “sweeping” through a population, but this shorthand is misleading. Rather, beneficial mutations simply increase in proportion from one generation to the next because, by definition, they happen to contribute to the survival and reproductive success of the organisms carrying them. Eventually, a beneficial mutation may be the only alternative left as all others have ultimately failed to be passed on. At this point, that beneficial genetic variant is said to have become “fixed” in the population.
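
This dynamic is easy to make concrete with a toy simulation. The sketch below is a minimal, illustrative Wright-Fisher-style model in Python; the model structure and all parameter values are assumptions chosen for illustration, not figures from the text. A variant with a modest reproductive advantage arises at low frequency and, generation by generation, tends to increase in proportion until no alternative remains and it is fixed; the same run can also end in loss, reflecting the probabilistic nature of the process.

```python
import random

# Minimal, illustrative Wright-Fisher-style simulation (all values assumed):
# a variant with a modest reproductive advantage starts at low frequency and
# is followed until it is either fixed (frequency 1.0) or lost (frequency 0.0).

random.seed(1)

N = 1000      # population size
s = 0.05      # reproductive advantage of carriers
freq = 0.05   # starting frequency of the beneficial variant
generation = 0

while 0.0 < freq < 1.0:
    # Selection: carriers are somewhat more likely to contribute offspring.
    expected = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
    # Drift: the next generation is a finite random sample of N individuals.
    carriers = sum(random.random() < expected for _ in range(N))
    freq = carriers / N
    generation += 1

outcome = "fixed" if freq == 1.0 else "lost"
print(f"The variant was {outcome} after {generation} generations.")
```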

Again, mutation does not occur in order to improve fitness—it merely represents errors in genetic replication. This means that most mutations do not improve fitness: There are many more ways of making things worse than of making them better. It also means that mutations will continue to occur even after previous beneficial mutations have become fixed. As such, there can be something of a ratcheting effect in which beneficial mutations arise and become fixed by selection, only to be supplemented later by more beneficial mutations which, in turn, become fixed. All the while, neutral and deleterious mutations also occur in the population, the latter being passed on at a lower rate than alternatives and often being lost before reaching any appreciable frequency.

Of course, this is an oversimplification—in species with sexual reproduction, multiple beneficial mutations may be brought together by recombination such that the fixation of beneficial genes need not occur sequentially. Likewise, recombination can juxtapose deleterious mutations, thereby hastening their loss from the population. Nonetheless, it is useful to imagine the process of adaptation as one in which beneficial mutations arise continually (though perhaps very infrequently and with only minor positive impacts) and then accumulate in the population over many generations.

The process of adaptation in a population is depicted in very basic form in Fig.  2 . Several important points can be drawn from even such an oversimplified rendition:

Mutations are the source of new variation. Natural selection itself does not create new traits; it only changes the proportion of variation that is already present in the population. The repeated two-step interaction of these processes is what leads to the evolution of novel adaptive features.

Mutation is random with respect to fitness. Natural selection is, by definition, non-random with respect to fitness. This means that, overall, it is a serious misconception to consider adaptation as happening “by chance”.

Mutations can have any of three broad outcomes: neutral, deleterious, or beneficial. Beneficial mutations may be rare and deliver only a minor advantage, but these can nonetheless increase in proportion in the population over many generations by natural selection. The occurrence of any particular beneficial mutation may be very improbable, but natural selection is very effective at causing these individually unlikely improvements to accumulate. Natural selection is an improbability concentrator.

No individual organisms change as the population adapts. Rather, adaptation involves changes in the proportion of beneficial traits across multiple generations.

The direction in which adaptive change occurs is dependent on the environment. A change in environment can make previously beneficial traits neutral or detrimental and vice versa.

Adaptation does not result in optimal characteristics. It is constrained by historical, genetic, and developmental limitations and by trade-offs among features (see Gregory 2008b ).

It does not matter what an “ideal” adaptive feature might be—the only relevant factor is that variants that happen to result in greater survival and reproduction relative to alternative variants are passed on more frequently. As Darwin wrote in a letter to Joseph Hooker (11 Sept. 1857), “I have just been writing an audacious little discussion, to show that organic beings are not perfect, only perfect enough to struggle with their competitors.”

The process of adaptation by natural selection is not forward-looking, and it cannot produce features on the grounds that they might become beneficial sometime in the future. In fact, adaptations are always to the conditions experienced by generations in the past.

A highly simplified depiction of natural selection ( Correct ) and a generalized illustration of various common misconceptions about the mechanism ( Incorrect ). Properly understood, natural selection occurs as follows: ( A ) A population of organisms exhibits variation in a particular trait that is relevant to survival in a given environment. In this diagram, darker coloration happens to be beneficial, but in another environment, the opposite could be true. As a result of their traits, not all individuals in Generation 1 survive equally well, meaning that only a non-random subsample ultimately will succeed in reproducing and passing on their traits ( B ). Note that no individual organisms in Generation 1 change, rather the proportion of individuals with different traits changes in the population. The individuals who survive from Generation 1 reproduce to produce Generation 2. ( C ) Because the trait in question is heritable, this second generation will (mostly) resemble the parent generation. However, mutations have also occurred, which are undirected (i.e., they occur at random in terms of the consequences of changing traits), leading to both lighter and darker offspring in Generation 2 as compared to their parents in Generation 1. In this environment, lighter mutants are less successful and darker mutants are more successful than the parental average. Once again, there is non-random survival among individuals in the population, with darker traits becoming disproportionately common due to the death of lighter individuals ( D ). This subset of Generation 2 proceeds to reproduce. Again, the traits of the survivors are passed on, but there is also undirected mutation leading to both deleterious and beneficial differences among the offspring ( E ). ( F ) This process of undirected mutation and natural selection (non-random differences in survival and reproductive success) occurs over many generations, each time leading to a concentration of the most beneficial traits in the next generation. By Generation N , the population is composed almost entirely of very dark individuals. The population can now be said to have become adapted to the environment in which darker traits are the most successful. This contrasts with the intuitive notion of adaptation held by most students and non-biologists. In the most common version, populations are seen as uniform, with variation being at most an anomalous deviation from the norm ( X ). It is assumed that all members within a single generation change in response to pressures imposed by the environment ( Y ). When these individuals reproduce, they are thought to pass on their acquired traits. Moreover, any changes that do occur due to mutation are imagined to be exclusively in the direction of improvement ( Z ). Studies have revealed that it can be very difficult for non-experts to abandon this intuitive interpretation in favor of a scientifically valid understanding of the mechanism. Diagrams based in part on Bishop and Anderson ( 1990 )

Natural Selection Is Elegant, Logical, and Notoriously Difficult to Grasp

The extent of the problem.

In its most basic form, natural selection is an elegant theory that effectively explains the obviously good fit of living things to their environments. As a mechanism, it is remarkably simple in principle yet incredibly powerful in application. However, the fact that it eluded description until 150 years ago suggests that grasping its workings and implications is far more challenging than is usually assumed.

Three decades of research have produced unambiguous data revealing a strikingly high prevalence of misconceptions about natural selection among members of the public and in students at all levels, from elementary school pupils to university science majors (Alters 2005; Bardapurkar 2008; Table 2) Footnote 7. A finding that less than 10% of those surveyed possess a functional understanding of natural selection is not atypical. It is particularly disconcerting, and undoubtedly compounds the problem, that confusion about natural selection is common even among those responsible for teaching it Footnote 8. As Nehm and Schonfeld (2007) recently concluded, "one cannot assume that biology teachers with extensive backgrounds in biology have an accurate working knowledge of evolution, natural selection, or the nature of science."

Why is Natural Selection so Difficult to Understand?

Two obvious hypotheses present themselves for why misunderstandings of natural selection are so widespread. The first is that understanding the mechanism of natural selection requires an acceptance of the historical fact of evolution, the latter being rejected by a large fraction of the population. While an improved understanding of the process probably would help to increase overall acceptance of evolution, surveys indicate that rates of acceptance already are much higher than levels of understanding. And, whereas levels of understanding and acceptance may be positively correlated among teachers (Vlaardingerbroek and Roederer 1997 ; Rutledge and Mitchell 2002 ; Deniz et al. 2008 ), the two parameters seem to be at most only very weakly related in students Footnote 9 (Bishop and Anderson 1990 ; Demastes et al. 1995 ; Brem et al. 2003 ; Sinatra et al. 2003 ; Ingram and Nelson 2006 ; Shtulman 2006 ). Teachers notwithstanding, “it appears that a majority on both sides of the evolution-creation debate do not understand the process of natural selection or its role in evolution” (Bishop and Anderson 1990 ).

The second intuitive hypothesis is that most people simply lack formal education in biology and have learned incorrect versions of evolutionary mechanisms from non-authoritative sources (e.g., television, movies, parents). Inaccurate portrayals of evolutionary processes in the media, by teachers, and by scientists themselves surely exacerbate the situation (e.g., Jungwirth 1975a , b , 1977 ; Moore et al. 2002 ). However, this alone cannot provide a full explanation, because even direct instruction on natural selection tends to produce only modest improvements in students' understanding (e.g., Jensen and Finley 1995 ; Ferrari and Chi 1998 ; Nehm and Reilly 2007 ; Spindler and Doherty 2009 ). There also is evidence that levels of understanding do not differ greatly between science majors and non-science majors (Sundberg and Dini 1993 ). In the disquieting words of Ferrari and Chi ( 1998 ), “misconceptions about even the basic principles of Darwin's theory of evolution are extremely robust, even after years of education in biology.”

Misconceptions are well known to be common regarding many (perhaps most) aspects of science, including much simpler and more commonly encountered phenomena such as the physics of motion (e.g., McCloskey et al. 1980; Halloun and Hestenes 1985; Bloom and Weisberg 2007). The source of this larger problem seems to be a significant disconnect between the nature of the world as reflected in everyday experience and the one revealed by systematic scientific investigation (e.g., Shtulman 2006; Sinatra et al. 2008). Intuitive interpretations of the world, though sufficient for navigating daily life, are usually fundamentally at odds with scientific principles. If common sense were more than superficially accurate, scientific explanations would be less counterintuitive, but they also would be largely unnecessary.

Conceptual Frameworks Versus Spontaneous Constructions

It has been suggested by some authors that young students simply are incapable of understanding natural selection because they have not yet developed the formal reasoning abilities necessary to grasp it (Lawson and Thompson 1988 ). This could be taken to imply that natural selection should not be taught until later grades; however, those who have studied student understanding directly tend to disagree with any such suggestion (e.g., Clough and Wood-Robinson 1985 ; Settlage 1994 ). Overall, the issue does not seem to be a lack of logic (Greene 1990 ; Settlage 1994 ), but a combination of incorrect underlying premises about mechanisms and deep-seated cognitive biases that influence interpretations.

Many of the misconceptions that block an understanding of natural selection develop early in childhood as part of “naïve” but practical understandings of how the world is structured. These tend to persist unless replaced with more accurate and equally functional information. In this regard, some experts have argued that the goal of education should be to supplant existing conceptual frameworks with more accurate ones (see Sinatra et al. 2008 ). Under this view, “Helping people to understand evolution...is not a matter of adding on to their existing knowledge, but helping them to revise their previous models of the world to create an entirely new way of seeing” (Sinatra et al. 2008 ). Other authors suggest that students do not actually maintain coherent conceptual frameworks relating to complex phenomena, but instead construct explanations spontaneously using intuitions derived from everyday experience (see Southerland et al. 2001 ). Though less widely accepted, this latter view gains support from the observation that naïve evolutionary explanations given by non-experts may be tentative and inconsistent (Southerland et al. 2001 ) and may differ depending on the type of organisms being considered (Spiegel et al. 2006 ). In some cases, students may attempt a more complex explanation but resort to intuitive ideas when they encounter difficulty (Deadman and Kelly 1978 ). In either case, it is abundantly clear that simply describing the process of natural selection to students is ineffective and that it is imperative that misconceptions be confronted if they are to be corrected (e.g., Greene 1990 ; Scharmann 1990 ; Settlage 1994 ; Ferrari and Chi 1998 ; Alters and Nelson 2002 ; Passmore and Stewart 2002 ; Alters 2005 ; Nelson 2007 ).

A Catalog of Common Misconceptions

Whereas the causes of cognitive barriers to understanding remain to be determined, their consequences are well documented. It is clear from many studies that complex but accurate explanations of biological adaptation typically yield to naïve intuitions based on common experience (Fig.  2 ; Tables  2 and 3 ). As a result, each of the fundamental components of natural selection may be overlooked or misunderstood when it comes time to consider them in combination, even if individually they appear relatively straightforward. The following sections provide an overview of the various, non-mutually exclusive, and often correlated misconceptions that have been found to be most common. All readers are encouraged to consider these conceptual pitfalls carefully in order that they may be avoided. Teachers, in particular, are urged to familiarize themselves with these errors so that they may identify and address them among their students.

Teleology and the “Function Compunction”

Much of the human experience involves overcoming obstacles, achieving goals, and fulfilling needs. Not surprisingly, human psychology includes a powerful bias toward thoughts about the “purpose” or “function” of objects and behaviors—what Kelemen and Rosset ( 2009 ) dub the “human function compunction.” This bias is particularly strong in children, who are apt to see most of the world in terms of purpose; for example, even suggesting that “rocks are pointy to keep animals from sitting on them” (Kelemen 1999a , b ; Kelemen and Rosset 2009 ). This tendency toward explanations based on purpose (“teleology”) runs very deep and persists throughout high school (Southerland et al. 2001 ) and even into postsecondary education (Kelemen and Rosset 2009 ). In fact, it has been argued that the default mode of teleological thinking is, at best, suppressed rather than supplanted by introductory scientific education. It therefore reappears easily even in those with some basic scientific training; for example, in descriptions of ecological balance (“fungi grow in forests to help decomposition”) or species survival (“finches diversified in order to survive”; Kelemen and Rosset 2009 ).

Teleological explanations for biological features date back to Aristotle and remain very common in naïve interpretations of adaptation (e.g., Tamir and Zohar 1991 ; Pedersen and Halldén 1992 ; Southerland et al. 2001 ; Sinatra et al. 2008 ; Table  2 ). On the one hand, teleological reasoning may preclude any consideration of mechanisms altogether if simply identifying a current function for an organ or behavior is taken as sufficient to explain its existence (e.g., Bishop and Anderson 1990 ). On the other hand, when mechanisms are considered by teleologically oriented thinkers, they are often framed in terms of change occurring in response to a particular need (Table  2 ). Obviously, this contrasts starkly with a two-step process involving undirected mutations followed by natural selection (see Fig.  2 and Table  3 ).

Anthropomorphism and Intentionality

A conceptual bias related to teleology is anthropomorphism, in which human-like conscious intent is ascribed either to the objects of natural selection or to the process itself (see below). In this sense, anthropomorphic misconceptions can be characterized as either internal (attributing adaptive change to the intentional actions of organisms) or external (conceiving of natural selection or "Nature" as a conscious agent; e.g., Kampourakis and Zogza 2008; Sinatra et al. 2008).

Internal anthropomorphism or “intentionality” is intimately tied to the misconception that individual organisms evolve in response to challenges imposed by the environment (rather than recognizing evolution as a population-level process). Gould ( 1980 ) described the obvious appeal of such intuitive notions as follows:

Since the living world is a product of evolution, why not suppose that it arose in the simplest and most direct way? Why not argue that organisms improve themselves by their own efforts and pass these advantages to their offspring in the form of altered genes—a process that has long been called, in technical parlance, the “inheritance of acquired characters.” This idea appeals to common sense not only for its simplicity but perhaps even more for its happy implication that evolution travels an inherently progressive path, propelled by the hard work of organisms themselves.

The penchant for seeing conscious intent is often sufficiently strong that it is applied not only to non-human vertebrates (in which consciousness, though certainly not knowledge of genetics and Darwinian fitness, may actually occur), but also to plants and even to single-celled organisms. Thus, adaptations in any taxon may be described as “innovations,” “inventions,” or “solutions” (sometimes “ingenious” ones, no less). Even the evolution of antibiotic resistance is characterized as a process whereby bacteria “learn” to “outsmart” antibiotics with frustrating regularity. Anthropomorphism with an emphasis on forethought is also behind the common misconception that organisms behave as they do in order to enhance the long-term well-being of their species. Once again, a consideration of the actual mechanics of natural selection should reveal why this is fallacious.

All too often, an anthropomorphic view of evolution is reinforced with sloppy descriptions by trusted authorities (Jungwirth 1975a , b , 1977 ; Moore et al. 2002 ). Consider this particularly egregious example from a website maintained by the National Institutes of Health Footnote 10 :

As microbes evolve, they adapt to their environment. If something stops them from growing and spreading—such as an antimicrobial—they evolve new mechanisms to resist the antimicrobials by changing their genetic structure. Changing the genetic structure ensures that the offspring of the resistant microbes are also resistant.

Fundamentally inaccurate descriptions such as this are alarmingly common. As a corrective, it is a useful exercise to translate such faulty characterizations into accurate language Footnote 11 . For example, this could read:

Bacteria that cause disease exist in large populations, and not all individuals are alike. If some individuals happen to possess genetic features that make them resistant to antibiotics, these individuals will survive the treatment while the rest gradually are killed off. As a result of their greater survival, the resistant individuals will leave more offspring than susceptible individuals, such that the proportion of resistant individuals will increase each time a new generation is produced. When only the descendants of the resistant individuals are left, the population of bacteria can be said to have evolved resistance to the antibiotics.
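
The logic of this corrected description can also be illustrated with a toy calculation. The following rough Python sketch uses assumed values throughout (population size, survival rates under treatment, and starting frequency are illustrative, not data): because resistant cells survive treatment far more often than susceptible ones, and survivors found the next generation, the proportion of resistant cells climbs from a fraction of a percent to nearly the whole population within a handful of rounds.

```python
import random

# Toy version of the corrected description above (all numbers are assumed):
# resistance is heritable, resistant cells survive treatment far more often
# than susceptible ones, and survivors found the next generation, so the
# proportion of resistant cells climbs round by round.

random.seed(0)

POPULATION = 100_000
SURVIVAL_IF_RESISTANT = 0.90
SURVIVAL_IF_SUSCEPTIBLE = 0.05

resistant_fraction = 0.001  # resistance starts out rare
for treatment_round in range(1, 6):
    resistant_survivors = susceptible_survivors = 0
    for _ in range(POPULATION):
        is_resistant = random.random() < resistant_fraction
        survival = SURVIVAL_IF_RESISTANT if is_resistant else SURVIVAL_IF_SUSCEPTIBLE
        if random.random() < survival:
            if is_resistant:
                resistant_survivors += 1
            else:
                susceptible_survivors += 1
    # Offspring inherit their parents' resistance status.
    resistant_fraction = resistant_survivors / (resistant_survivors + susceptible_survivors)
    print(f"After treatment round {treatment_round}: "
          f"{resistant_fraction:.1%} of cells are resistant")
```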

Use and Disuse

Many students who manage to avoid teleological and anthropomorphic pitfalls nonetheless conceive of evolution as involving change due to use or disuse of organs. This view, which was developed explicitly by Jean-Baptiste Lamarck but was also invoked to an extent by Darwin (1859), emphasizes changes to individual organisms that occur as they use particular features more or less. For example, Darwin (1859) invoked natural selection to explain the loss of sight in some subterranean rodents, but favored disuse alone as the explanation for the loss of eyes in blind, cave-dwelling animals: "As it is difficult to imagine that eyes, though useless, could be in any way injurious to animals living in darkness, I attribute their loss wholly to disuse." This sort of intuition remains common in naïve explanations for why unnecessary organs become vestigial or eventually disappear. Modern evolutionary theory recognizes several reasons that may account for the loss of complex features (e.g., Jeffery 2005; Espinasa and Espinasa 2008), some of which involve direct natural selection, but none of which is based simply on disuse.

Soft Inheritance

Evolution involving changes in individual organisms, whether based on conscious choice or use and disuse, would require that characteristics acquired during the lifetime of an individual be passed on to offspring Footnote 12 , a process often termed “soft inheritance.” The notion that acquired traits can be transmitted to offspring remained a common assumption among thinkers for more than 2,000 years, including into Darwin's time (Zirkle 1946 ). As is now understood, inheritance is actually “hard,” meaning that physical changes that occur during an organism's lifetime are not passed to offspring. This is because the cells that are involved in reproduction (the germline) are distinct from those that make up the rest of the body (the somatic line); only changes that affect the germline can be passed on. New genetic variants arise through mutation and recombination during replication and will often only exert their effects in offspring and not in the parents in whose reproductive cells they occur (though they could also arise very early in development and appear later in the adult offspring). Correct and incorrect interpretations of inheritance are contrasted in Fig.  3 .

A summary of correct ( left ) and incorrect ( right ) conceptions of heredity as it pertains to adaptive evolutionary change. The panels on the left display the operation of “hard inheritance”, whereas those on the right illustrate naïve mechanisms of “soft inheritance”. In all diagrams, a set of nine squares represents an individual multicellular organism and each square represents a type of cell of which the organisms are constructed. In the left panels, the organisms include two kinds of cells: those that produce gametes (the germline, black ) and those that make up the rest of the body (the somatic line, white ). In the top left panel , all cells in a parent organism initially contain a gene that specifies white coloration marked W ( A ). A random mutation occurs in the germline, changing the gene from one that specifies white to one that specifies gray marked G ( B ). This mutant gene is passed to the egg ( C ), which then develops into an offspring exhibiting gray coloration ( D ). The mutation in this case occurred in the parent (specifically, in the germline) but its effects did not become apparent until the next generation. In the bottom left panel , a parent once again begins with white coloration and the white gene in all of its cells ( H ). During its lifetime, the parent comes to acquire a gray coloration due to exposure to particular environmental conditions ( I ). However, because this does not involve any change to the genes in the germline, the original white gene is passed into the egg ( J ), and the offspring exhibits none of the gray coloration that was acquired by its parent ( K ). In the top right panel , the distinction between germline and somatic line is not understood. In this case, a parent that initially exhibits white coloration ( P ) changes during its lifetime to become gray ( Q ). Under incorrect views of soft inheritance, this altered coloration is passed on to the egg ( R ), and the offspring is born with the gray color acquired by its parent ( S ). In the bottom right panel , a more sophisticated but still incorrect view of inheritance is shown. Here, traits are understood to be specified by genes, but no distinction is recognized between the germline and somatic line. In this situation, a parent begins with white coloration and white-specifying genes in all its cells ( W ). A mutation occurs in one type of body cells to change those cells to gray ( X ). A mixture of white and gray genes is passed on to the egg ( Y ), and the offspring develops white coloration in most cells but gray coloration in the cells where gray-inducing mutations arose in the parent ( Z ). Intuitive ideas regarding soft inheritance underlie many misconceptions of how adaptive evolution takes place (see Fig.  2 )

Studies have indicated that belief in soft inheritance arises early in youth as part of a naïve model of heredity (e.g., Deadman and Kelly 1978; Kargbo et al. 1980; Lawson and Thompson 1988; Wood-Robinson 1994). That it seems intuitive probably explains why the idea of soft inheritance persisted so long among prominent thinkers and why it is so resistant to correction among modern students. Unfortunately, this belief is fundamentally incompatible with an appreciation of evolution by natural selection as a two-step process in which the origin of new variation and its relevance to survival in a particular environment are independent considerations.

Nature as a Selecting Agent

Thirty years ago, widely respected broadcaster Sir David Attenborough ( 1979 ) aptly described the challenge of avoiding anthropomorphic shorthand in descriptions of adaptation:

Darwin demonstrated that the driving force of [adaptive] evolution comes from the accumulation, over countless generations, of chance genetical changes sifted by the rigors of natural selection. In describing the consequences of this process it is only too easy to use a form of words that suggests that the animals themselves were striving to bring about change in a purposeful way–that fish wanted to climb onto dry land, and to modify their fins into legs, that reptiles wished to fly, strove to change their scales into feathers and so ultimately became birds.

Unlike many authors, Attenborough (1979) admirably endeavored not to use such misleading terminology. However, this quote inadvertently highlights an additional challenge in describing natural selection without loaded language. In it, natural selection is described as a "driving force" that rigorously "sifts" genetic variation, which could be misunderstood to imply that it takes an active role in prompting evolutionary change. Much more seriously, one often encounters descriptions of natural selection as a process that "chooses" among "preferred" variants or "experiments with" or "explores" different options. Some expressions, such as "favored" and "selected for", are used commonly as shorthand in evolutionary biology and are not meant to impart consciousness to natural selection; however, these too may be misinterpreted in the vernacular sense by non-experts and must be clarified.

Darwin ( 1859 ) himself could not resist slipping into the language of agency at times:

It may be said that natural selection is daily and hourly scrutinizing, throughout the world, every variation, even the slightest; rejecting that which is bad, preserving and adding up all that is good; silently and insensibly working, whenever and wherever opportunity offers, at the improvement of each organic being in relation to its organic and inorganic conditions of life. We see nothing of these slow changes in progress, until the hand of time has marked the long lapse of ages, and then so imperfect is our view into long past geological ages, that we only see that the forms of life are now different from what they formerly were.

Perhaps recognizing the ease with which such language can be misconstrued, Darwin ( 1868 ) later wrote that “The term ‘Natural Selection’ is in some respects a bad one, as it seems to imply conscious choice; but this will be disregarded after a little familiarity.” Unfortunately, more than “a little familiarity” seems necessary to abandon the notion of Nature as an active decision maker.

Being, as it is, the simple outcome of differences in reproductive success due to heritable traits, natural selection cannot have plans, goals, or intentions, nor can it cause changes in response to need. For this reason, Jungwirth ( 1975a , b , 1977 ) bemoaned the tendency for authors and instructors to invoke teleological and anthropomorphic descriptions of the process and argued that this served to reinforce misconceptions among students (see also Bishop and Anderson 1990 ; Alters and Nelson 2002 ; Moore et al. 2002 ; Sinatra et al. 2008 ). That said, a study of high school students by Tamir and Zohar ( 1991 ) suggested that older students can recognize the distinction between an anthropomorphic or teleological formulation (i.e., merely a convenient description) versus an anthropomorphic/teleological explanation (i.e., involving conscious intent or goal-oriented mechanisms as causal factors; see also Bartov 1978 , 1981 ). Moore et al. ( 2002 ), by contrast, concluded from their study of undergraduates that “students fail to distinguish between the relatively concrete register of genetics and the more figurative language of the specialist shorthand needed to condense the long view of evolutionary processes” (see also Jungwirth 1975a , 1977 ). Some authors have argued that teleological wording can have some value as shorthand for describing complex phenomena in a simple way precisely because it corresponds to normal thinking patterns, and that contrasting this explicitly with accurate language can be a useful exercise during instruction (Zohar and Ginossar 1998 ). In any case, biologists and instructors should be cognizant of the risk that linguistic shortcuts may send students off track.

Source Versus Sorting of Variation

Intuitive models of evolution based on soft inheritance are one-step models of adaptation: Traits are modified in one generation and appear in their altered form in the next. This is in conflict with the actual two-step process of adaptation involving the independent processes of mutation and natural selection. Unfortunately, many students who eschew soft inheritance nevertheless fail to distinguish natural selection from the origin of new variation (e.g., Greene 1990 ; Creedy 1993 ; Moore et al. 2002 ). Whereas an accurate understanding recognizes that most new mutations are neutral or harmful in a given environment, such naïve interpretations assume that mutations occur as a response to environmental challenges and therefore are always beneficial (Fig.  2 ). For example, many students may believe that exposure to antibiotics directly causes bacteria to become resistant, rather than simply changing the relative frequencies of resistant versus non-resistant individuals by killing off the latter Footnote 13 . Again, natural selection itself does not create new variation, it merely influences the proportion of existing variants. Most forms of selection reduce the amount of genetic variation within populations, which may be counteracted by the continual emergence of new variation via undirected mutation and recombination.

Typological, Essentialist, and Transformationist Thinking

Misunderstandings about how variation arises are problematic, but a common failure to recognize that variation plays a role at all represents an even deeper concern. Since Darwin (1859), evolutionary theory has been based strongly on "population" thinking that emphasizes differences among individuals. By contrast, many naïve interpretations of evolution remain rooted in the "typological" or "essentialist" thinking that has existed since the ancient Greeks (Mayr 1982, 2001; Sinatra et al. 2008). In this case, species are conceived of as exhibiting a single "type" or a common "essence," with variation among individuals representing anomalous and largely unimportant deviations from the type or essence. As Shtulman (2006) notes, "human beings tend to essentialize biological kinds and essentialism is incompatible with natural selection." As with many other conceptual biases, the tendency to essentialize seems to arise early in childhood and remains the default for most individuals (Strevens 2000; Gelman 2004; Evans et al. 2005; Shtulman 2006).

The incorrect belief that species are uniform leads to “transformationist” views of adaptation in which an entire population transforms as a whole as it adapts (Alters 2005 ; Shtulman 2006 ; Bardapurkar 2008 ). This contrasts with the correct, “variational” understanding of natural selection in which it is the proportion of traits within populations that changes (Fig.  2 ). Not surprisingly, transformationist models of adaptation usually include a tacit assumption of soft inheritance and one-step change in response to challenges. Indeed, Shtulman ( 2006 ) found that transformationists appeal to “need” as a cause of evolutionary change three times more often than do variationists.

Events and Absolutes Versus Processes and Probabilities

A proper understanding of natural selection recognizes it as a process that occurs within populations over the course of many generations. It does so through cumulative, statistical effects on the proportion of traits differing in their consequences for reproductive success. This contrasts with two major errors that are commonly incorporated into naïve conceptions of the process:

Natural selection is mistakenly seen as an event rather than as a process (Ferrari and Chi 1998 ; Sinatra et al. 2008 ). Events generally have a beginning and end, occur in a specific sequential order, consist of distinct actions, and may be goal-oriented. By contrast, natural selection actually occurs continually and simultaneously within entire populations and is not goal-oriented (Ferrari and Chi 1998 ). Misconstruing selection as an event may contribute to transformationist thinking as adaptive changes are thought to occur in the entire population simultaneously. Viewing natural selection as a single event can also lead to incorrect “saltationist” assumptions in which complex adaptive features are imagined to appear suddenly in a single generation (see Gregory 2008b for an overview of the evolution of complex organs).

Natural selection is incorrectly conceived as being “all or nothing,” with all unfit individuals dying and all fit individuals surviving. In actuality, it is a probabilistic process in which some traits make it more likely—but do not guarantee—that organisms possessing them will successfully reproduce. Moreover, the statistical nature of the process is such that even a small difference in reproductive success (say, 1%) is enough to produce a gradual increase in the frequency of a trait over many generations.
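
This statistical point can be made concrete with a few lines of arithmetic. The sketch below is a rough Python illustration; the 1% advantage comes from the example above, while the starting frequency of 1% is an assumed value. A trait whose carriers leave, on average, just 1% more offspring per generation rises from 1% to more than 99% of the population in under a thousand generations, purely through the compounding of that small edge.

```python
# Deterministic illustration of how a 1% reproductive advantage compounds.
# The starting frequency of 1% is an assumed value for illustration.

s = 0.01      # carriers leave, on average, 1% more offspring
freq = 0.01   # starting frequency of the trait
generation = 0

while freq < 0.99:
    # Each generation, carriers contribute slightly more than their share.
    freq = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
    generation += 1
    if generation % 250 == 0:
        print(f"Generation {generation}: frequency {freq:.3f}")

print(f"The trait exceeds 99% frequency after {generation} generations.")
```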

Concluding Remarks

Surveys of students at all levels paint a bleak picture regarding the level of understanding of natural selection. Though it is based on well-established and individually straightforward components, a proper grasp of the mechanism and its implications remains very rare among non-specialists. The unavoidable conclusion is that the vast majority of individuals, including most with postsecondary education in science, lack a basic understanding of how adaptive evolution occurs.

While no concrete solutions to this problem have yet been found, it is evident that simply outlining the various components of natural selection rarely imparts an understanding of the process to students. Various alternative teaching strategies and activities have been suggested, and some do help to improve the level of understanding among students (e.g., Bishop and Anderson 1986 ; Jensen and Finley 1995 , 1996 ; Firenze 1997 ; Passmore and Stewart 2002 ; Sundberg 2003 ; Alters 2005 ; Scharmann 1990 ; Wilson 2005 ; Nelson 2007 , 2008 ; Pennock 2007 ; Kampourakis and Zogza 2008 ). Efforts to integrate evolution throughout biology curricula rather than segregating it into a single unit may also prove more effective (Nehm et al. 2009 ), as may steps taken to make evolution relevant to everyday concerns (e.g., Hillis 2007 ).

At the very least, it is abundantly clear that teaching and learning natural selection must include efforts to identify, confront, and supplant misconceptions. Most of these derive from deeply held conceptual biases that may have been present since childhood. Natural selection, like most complex scientific theories, runs counter to common experience and therefore competes—usually unsuccessfully—with intuitive ideas about inheritance, variation, function, intentionality, and probability. The tendency, both outside and within academic settings, to use inaccurate language to describe evolutionary phenomena probably serves to reinforce these problems.

Natural selection is a central component of modern evolutionary theory, which in turn is the unifying theme of all biology. Without a grasp of this process and its consequences, it is simply impossible to understand, even in basic terms, how and why life has become so marvelously diverse. The enormous challenge faced by biologists and educators in correcting the widespread misunderstanding of natural selection is matched only by the importance of the task.

Footnotes

Footnote 1: For a more advanced treatment, see Bell (1997, 2008) or consult any of the major undergraduate-level evolutionary biology or population genetics textbooks.

Footnote 2: The Origin was, in Darwin's words, an "abstract" of a much larger work he had initially intended to write. Much of the additional material is available in Darwin (1868) and Stauffer (1975).

Footnote 3: See Gregory (2008a) for a discussion regarding the use of the term "theory" in science.

Footnote 4: Ridley (2004) points out that Darwin's calculations require overlapping generations to reach this exact number, but the point remains that even in slow-reproducing species the rate of potential production is enormous relative to actual numbers of organisms.

Footnote 5: Humans are currently undergoing a rapid population expansion, but this is the exception rather than the rule. As Darwin (1859) noted, "Although some species may now be increasing, more or less rapidly, in numbers, all cannot do so, for the world would not hold them."

Footnote 6: It cannot be overemphasized that "evolution" and "natural selection" are not interchangeable. This is because not all evolution occurs by natural selection and because not all outcomes of natural selection involve changes in the genetic makeup of populations. A detailed discussion of the different types of selection is beyond the scope of this article, but it can be pointed out that the effect of "stabilizing selection" is to prevent directional change in populations.

Footnote 7: Instructors interested in assessing their own students' level of understanding may wish to consult tests developed by Bishop and Anderson (1986), Anderson et al. (2002), Beardsley (2004), Shtulman (2006), or Kampourakis and Zogza (2009).

Footnote 8: Even more alarming is a recent indication that one in six teachers in the USA is a young Earth creationist, and that about one in eight teaches creationism as though it were a valid alternative to evolutionary science (Berkman et al. 2008).

Footnote 9: Strictly speaking, it is not necessary to understand how evolution occurs to be convinced that it has occurred, because the historical fact of evolution is supported by many convergent lines of evidence that are independent of discussions about particular mechanisms. Again, this represents the important distinction between evolution as fact and theory. See Gregory (2008a).

Footnote 10: http://www3.niaid.nih.gov/topics/antimicrobialResistance/Understanding/history.htm , accessed February 2009.

Footnote 11: One should always be wary of the linguistic symptoms of anthropomorphic misconceptions, which usually include phrasing like "so that" (versus "because") or "in order to" (versus "happened to") when explaining adaptations (Kampourakis and Zogza 2009).

Footnote 12: It must be noted that the persistent labeling of the inheritance of acquired characteristics as "Lamarckian" is historically inaccurate: Soft inheritance was commonly accepted long before Lamarck's time (Zirkle 1946). Likewise, mechanisms involving organisms' conscious desires to change are often incorrectly attributed to Lamarck. For recent critiques of the tendency to describe various misconceptions as Lamarckian, see Geraedts and Boersma (2006) and Kampourakis and Zogza (2007). It is unfortunate that these mistakenly attributed concepts serve as the primary legacy of Lamarck, who in actuality made several important contributions to biology (a term first used by Lamarck), including greatly advancing the classification of invertebrates (another term he coined) and, of course, developing the first (albeit ultimately incorrect) mechanistic theory of evolution. For discussions of Lamarck's views and contributions to evolutionary biology, see Packard (1901), Burkhardt (1972, 1995), Corsi (1988), Humphreys (1995, 1996), and Kampourakis and Zogza (2007). Lamarck's works are available online at http://www.lamarck.cnrs.fr/index.php?lang=en .

Footnote 13: One may wonder how this misconception is reconciled with the common admonition by medical doctors to complete each course of treatment with antibiotics even after symptoms disappear—would this not provide more opportunities for bacteria to "develop" resistance by prolonging exposure?

References

Alters B. Teaching biological evolution in higher education. Boston: Jones and Bartlett; 2005.

Alters BJ, Nelson CE. Teaching evolution in higher education. Evolution. 2002;56:1891–901.

Anderson DL, Fisher KM, Norman GJ. Development and evaluation of the conceptual inventory of natural selection. J Res Sci Teach. 2002;39:952–78. doi: 10.1002/tea.10053 .

Asghar A, Wiles JR, Alters B. Canadian pre-service elementary teachers' conceptions of biological evolution and evolution education. McGill J Educ. 2007;42:189–209.

Attenborough D. Life on earth. Boston: Little, Brown and Company; 1979.

Banet E, Ayuso GE. Teaching of biological inheritance and evolution of living beings in secondary school. Int J Sci Educ. 2003;25:373–407.

Bardapurkar A. Do students see the “selection” in organic evolution? A critical review of the causal structure of student explanations. Evo Edu Outreach. 2008;1:299–305. doi: 10.1007/s12052-008-0048-5 .

Barton NH, Briggs DEG, Eisen JA, Goldstein DB, Patel NH. Evolution. Cold Spring Harbor: Cold Spring Harbor Laboratory Press; 2007.

Bartov H. Can students be taught to distinguish between teleological and causal explanations? J Res Sci Teach. 1978;15:567–72. doi: 10.1002/tea.3660150619 .

Bartov H. Teaching students to understand the advantages and disadvantages of teleological and anthropomorphic statements in biology. J Res Sci Teach. 1981;18:79–86. doi: 10.1002/tea.3660180113 .

Beardsley PM. Middle school student learning in evolution: are current standards achievable? Am Biol Teach. 2004;66:604–12. doi: 10.1662/0002-7685(2004)066[0604:MSSLIE]2.0.CO;2 .

Bell G. The basics of selection. New York: Chapman & Hall; 1997.

Bell G. Selection: the mechanism of evolution. 2nd ed. Oxford: Oxford University Press; 2008.

Berkman MB, Pacheco JS, Plutzer E. Evolution and creationism in America's classrooms: a national portrait. PLoS Biol. 2008;6:e124. doi: 10.1371/journal.pbio.0060124 .

Bishop BA, Anderson CW. Evolution by natural selection: a teaching module (Occasional Paper No. 91). East Lansing: Institute for Research on Teaching; 1986.

Bishop BA, Anderson CW. Student conceptions of natural selection and its role in evolution. J Res Sci Teach. 1990;27:415–27. doi: 10.1002/tea.3660270503 .

Bizzo NMV. From Down House landlord to Brazilian high school students: what has happened to evolutionary knowledge on the way? J Res Sci Teach. 1994;31:537–56.

Bloom P, Weisberg DS. Childhood origins of adult resistance to science. Science. 2007;316:996–7. doi: 10.1126/science.1133398 .

Brem SK, Ranney M, Schindel J. Perceived consequences of evolution: college students perceive negative personal and social impact in evolutionary theory. Sci Educ. 2003;87:181–206. doi: 10.1002/sce.10105 .

Brumby M. Problems in learning the concept of natural selection. J Biol Educ. 1979;13:119–22.

Brumby MN. Misconceptions about the concept of natural selection by medical biology students. Sci Educ. 1984;68:493–503. doi: 10.1002/sce.3730680412 .

Burkhardt RW. The inspiration of Lamarck's belief in evolution. J Hist Biol. 1972;5:413–38. doi: 10.1007/BF00346666 .

Burkhardt RW. The spirit of system. Cambridge: Harvard University Press; 1995.

Chinsamy A, Plaganyi E. Accepting evolution. Evolution. 2007;62:248–54.

Clough EE, Wood-Robinson C. How secondary students interpret instances of biological adaptation. J Biol Educ. 1985;19:125–30.

Corsi P. The age of Lamarck. Berkeley: University of California Press; 1988.

Coyne JA. Selling Darwin. Nature. 2006;442:983–4. doi: 10.1038/442983a .

Creedy LJ. Student understanding of natural selection. Res Sci Educ. 1993;23:34–41. doi: 10.1007/BF02357042 .

Curry A. Creationist beliefs persist in Europe. Science. 2009;323:1159. doi: 10.1126/science.323.5918.1159 .

Darimont CT, Carlson SM, Kinnison MT, Paquet PC, Reimchen TE, Wilmers CC. Human predators outpace other agents of trait change in the wild. Proc Natl Acad Sci U S A. 2009;106:952–4. doi: 10.1073/pnas.0809235106 .

Darwin C. On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. London: John Murray; 1859.

Darwin, C. The variation of animals and plants under domestication. London: John Murray; 1868.

Darwin C, Wallace AR. On the tendency of species to form varieties; and on the perpetuation of varieties and species by natural means of selection. Proc Linn Soc. 1858;3:46–62.

Deadman JA, Kelly PJ. What do secondary school boys understand about evolution and heredity before they are taught the topic? J Biol Educ. 1978;12:7–15.

Demastes SS, Settlage J, Good R. Students' conceptions of natural selection and its role in evolution: cases of replication and comparison. J Res Sci Teach. 1995;32:535–50. doi: 10.1002/tea.3660320509 .

Deniz H, Donelly LA, Yilmaz I. Exploring the factors related to acceptance of evolutionary theory among Turkish preservice biology teachers: toward a more informative conceptual ecology for biological evolution. J Res Sci Teach. 2008;45:420–43. doi: 10.1002/tea.20223 .

Dennett DC. Darwin's dangerous idea. New York: Touchstone Books; 1995.

Espinasa M, Espinasa L. Losing sight of regressive evolution. Evo Edu Outreach. 2008;1:509–16. doi: 10.1007/s12052-008-0094-z .

Evans EM, Mull MS, Poling DA, Szymanowski K. Overcoming an essentialist bias: from metamorphosis to evolution. In Biennial meeting of the Society for Research in Child Development , Atlanta, GA; 2005.

Evans EM, Spiegel A, Gram W, Frazier BF, Thompson S, Tare M, Diamond J. A conceptual guide to museum visitors’ understanding of evolution. In Annual Meeting of the American Education Research Association , San Francisco; 2006.

Ferrari M, Chi MTH. The nature of naive explanations of natural selection. Int J Sci Educ. 1998;20:1231–56. doi: 10.1080/0950069980201005 .

Firenze R. Lamarck vs. Darwin: dueling theories. Rep Natl Cent Sci Educ. 1997;17:9–11.

Freeman S, Herron JC. Evolutionary analysis. 4th ed. Upper Saddle River: Prentice Hall; 2007.

Futuyma DJ. Evolution. Sunderland: Sinauer; 2005.

Gelman SA. Psychological essentialism in children. Trends Cogn Sci. 2004;8:404–9. doi: 10.1016/j.tics.2004.07.001 .

Geraedts CL, Boersma KT. Reinventing natural selection. Int J Sci Educ. 2006;28:843–70. doi: 10.1080/09500690500404722 .

Gould SJ. Shades of Lamarck. In: The Panda's Thumb. New York: Norton; 1980. p. 76–84.

Greene ED. The logic of university students' misunderstanding of natural selection. J Res Sci Teach. 1990;27:875–85. doi: 10.1002/tea.3660270907 .

Gregory TR. Evolution as fact, theory, and path. Evo Edu Outreach. 2008a;1:46–52. doi: 10.1007/s12052-007-0001-z .

Gregory TR. The evolution of complex organs. Evo Edu Outreach. 2008b;1:358–89. doi: 10.1007/s12052-008-0076-1 .

Gregory TR. Artificial selection and domestication: modern lessons from Darwin's enduring analogy. Evo Edu Outreach. 2009;2:5–27. doi: 10.1007/s12052-008-0114-z .

Hall BK, Hallgrimsson B. Strickberger's evolution. 4th ed. Sudbury: Jones and Bartlett; 2008.

Halldén O. The evolution of the species: pupil perspectives and school perspectives. Int J Sci Educ. 1988;10:541–52. doi: 10.1080/0950069880100507 .

Halloun IA, Hestenes D. The initial knowledge state of college physics students. Am J Phys. 1985;53:1043–55. doi: 10.1119/1.14030 .

Hillis DM. Making evolution relevant and exciting to biology students. Evolution. 2007;61:1261–4. doi: 10.1111/j.1558-5646.2007.00126.x .

Humphreys J. The laws of Lamarck. Biologist. 1995;42:121–5.

Humphreys J. Lamarck and the general theory of evolution. J Biol Educ. 1996;30:295–303.

Ingram EL, Nelson CE. Relationship between achievement and students' acceptance of evolution or creation in an upper-level evolution course. J Res Sci Teach. 2006;43:7–24. doi: 10.1002/tea.20093 .

Jeffery WR. Adaptive evolution of eye degeneration in the Mexican blind cavefish. J Heredity. 2005;96:185–96. doi: 10.1093/jhered/esi028 .

Jensen MS, Finley FN. Teaching evolution using historical arguments in a conceptual change strategy. Sci Educ. 1995;79:147–66. doi: 10.1002/sce.3730790203 .

Jensen MS, Finley FN. Changes in students' understanding of evolution resulting from different curricular and instructional strategies. J Res Sci Teach. 1996;33:879–900. doi: 10.1002/(SICI)1098-2736(199610)33:8<879::AID-TEA4>3.0.CO;2-T .

Jiménez-Aleixandre MP. Thinking about theories or thinking with theories?: a classroom study with natural selection. Int J Sci Educ. 1992;14:51–61. doi: 10.1080/0950069920140106 .

Jiménez-Aleixandre MP, Fernández-Pérez J. Selection or adjustment? Explanations of university biology students for natural selection problems. In: Novak, JD. Proceedings of the Second International Seminar on Misconceptions and Educational Strategies in Science and Mathematics, vol II. Ithaca: Department of Education, Cornell University; 1987;224–32.

Jørgensen C, Enberg K, Dunlop ES, Arlinghaus R, Boukal DS, Brander K, et al. Managing evolving fish stocks. Science. 2007;318:1247–8. doi: 10.1126/science.1148089 .

Jungwirth E. The problem of teleology in biology as a problem of biology-teacher education. J Biol Educ. 1975a;9:243–6.

Jungwirth E. Preconceived adaptation and inverted evolution. Aust Sci Teachers J. 1975b;21:95–100.

Jungwirth E. Should natural phenomena be described teleologically or anthropomorphically?—a science educator’s view. J Biol Educ. 1977;11:191–6.

Kampourakis K, Zogza V. Students’ preconceptions about evolution: how accurate is the characterization as “Lamarckian” when considering the history of evolutionary thought? Sci Edu 2007;16:393–422.

Kampourakis K, Zogza V. Students’ intuitive explanations of the causes of homologies and adaptations. Sci Educ. 2008;17:27–47. doi: 10.1007/s11191-007-9075-9 .

Kampourakis K, Zogza V. Preliminary evolutionary explanations: a basic framework for conceptual change and explanatory coherence in evolution. Sci Educ. 2009; in press.

Kardong KV. An introduction to biological evolution. 2nd ed. Boston: McGraw Hill; 2008.

Kargbo DB, Hobbs ED, Erickson GL. Children's beliefs about inherited characteristics. J Biol Educ. 1980;14:137–46.

Kelemen D. Why are rocks pointy? Children's preference for teleological explanations of the natural world. Dev Psychol. 1999a;35:1440–52. doi: 10.1037/0012-1649.35.6.1440 .

Kelemen D. Function, goals and intention: children's teleological reasoning about objects. Trends Cogn Sci. 1999b;3:461–8. doi: 10.1016/S1364-6613(99)01402-3 .

Kelemen D, Rosset E. The human function compunction: teleological explanation in adults. Cognition. 2009;111:138–43. doi: 10.1016/j.cognition.2009.01.001 .

Keown D. Teaching evolution: improved approaches for unprepared students. Am Biol Teach. 1988;50:407–10.

Lawson AE, Thompson LD. Formal reasoning ability and misconceptions concerning genetics and natural selection. J Res Sci Teach. 1988;25:733–46. doi: 10.1002/tea.3660250904 .

MacFadden BJ, Dunckel BA, Ellis S, Dierking LD, Abraham-Silver L, Kisiel J, et al. Natural history museum visitors' understanding of evolution. BioScience. 2007;57:875–82.

Mayr E. The growth of biological thought. Cambridge: Harvard University Press; 1982.

Mayr E. What evolution Is. New York: Basic Books; 2001.

McCloskey M, Caramazza A, Green B. Curvilinear motion in the absence of external forces: naïve beliefs about the motion of objects. Science. 1980;210:1139–41. doi: 10.1126/science.210.4474.1139 .

Moore R, Mitchell G, Bally R, Inglis M, Day J, Jacobs D. Undergraduates' understanding of evolution: ascriptions of agency as a problem for student learning. J Biol Educ. 2002;36:65–71.

Nehm RH, Reilly L. Biology majors' knowledge and misconceptions of natural selection. BioScience. 2007;57:263–72. doi: 10.1641/B570311 .

Nehm RH, Schonfeld IS. Does increasing biology teacher knowledge of evolution and the nature of science lead to greater preference for the teaching of evolution in schools? J Sci Teach Educ. 2007;18:699–723. doi: 10.1007/s10972-007-9062-7 .

Nehm RH, Poole TM, Lyford ME, Hoskins SG, Carruth L, Ewers BE, et al. Does the segregation of evolution in biology textbooks and introductory courses reinforce students' faulty mental models of biology and evolution? Evo Edu Outreach. 2009;2: In press.

Nelson CE. Teaching evolution effectively: a central dilemma and alternative strategies. McGill J Educ. 2007;42:265–83.

Nelson CE. Teaching evolution (and all of biology) more effectively: strategies for engagement, critical reasoning, and confronting misconceptions. Integr Comp Biol. 2008;48:213–25. doi: 10.1093/icb/icn027 .

Packard AS. Lamarck, the founder of evolution: his life and work with translations of his writings on organic evolution. New York: Longmans, Green, and Co; 1901.

Palumbi SR. Humans as the world's greatest evolutionary force. Science. 2001;293:1786–90. doi: 10.1126/science.293.5536.1786 .

Passmore C, Stewart J. A modeling approach to teaching evolutionary biology in high schools. J Res Sci Teach. 2002;39:185–204. doi: 10.1002/tea.10020 .

Pedersen S, Halldén O. Intuitive ideas and scientific explanations as parts of students' developing understanding of biology: the case of evolution. Eur J Psychol Educ. 1992;9:127–37.

Pennock RT. Learning evolution and the nature of science using evolutionary computing and artificial life. McGill J Educ. 2007;42:211–24.

Prinou L, Halkia L, Skordoulis C. What conceptions do Greek school students form about biological evolution. Evo Edu Outreach. 2008;1:312–7. doi: 10.1007/s12052-008-0051-x .

Ridley M. Evolution. 3rd ed. Malden: Blackwell; 2004.

Robbins JR, Roy P. The natural selection: identifying & correcting non-science student preconceptions through an inquiry-based, critical approach to evolution. Am Biol Teach. 2007;69:460–6. doi: 10.1662/0002-7685(2007)69[460:TNSICN]2.0.CO;2 .

Rose MR, Mueller LD. Evolution and ecology of the organism. Upper Saddle River: Prentice Hall; 2006.

Rutledge ML, Mitchell MA. High school biology teachers' knowledge structure, acceptance & teaching of evolution. Am Biol Teach. 2002;64:21–7. doi: 10.1662/0002-7685(2002)064[0021:HSBTKS]2.0.CO;2 .

Scharmann LC. Enhancing an understanding of the premises of evolutionary theory: the influence of a diversified instructional strategy. Sch Sci Math. 1990;90:91–100.

Settlage J. Conceptions of natural selection: a snapshot of the sense-making process. J Res Sci Teach. 1994;31:449–57.

Shtulman A. Qualitative differences between naïve and scientific theories of evolution. Cognit Psychol. 2006;52:170–94. doi: 10.1016/j.cogpsych.2005.10.001 .

Sinatra GM, Southerland SA, McConaughy F, Demastes JW. Intentions and beliefs in students' understanding and acceptance of biological evolution. J Res Sci Teach. 2003;40:510–28. doi: 10.1002/tea.10087 .

Sinatra GM, Brem SK, Evans EM. Changing minds? Implications of conceptual change for teaching and learning about biological evolution. Evo Edu Outreach. 2008;1:189–95. doi: 10.1007/s12052-008-0037-8 .

Southerland SA, Abrams E, Cummins CL, Anzelmo J. Understanding students' explanations of biological phenomena: conceptual frameworks or p-prims? Sci Educ. 2001;85:328–48. doi: 10.1002/sce.1013 .

Spiegel AN, Evans EM, Gram W, Diamond J. Museum visitors' understanding of evolution. Museums Soc Issues. 2006;1:69–86.

Spindler LH, Doherty JH. Assessment of the teaching of evolution by natural selection through a hands-on simulation. Teach Issues Experiments Ecol. 2009;6:1–20.

Stauffer RC (editor). Charles Darwin's natural selection: being the second part of his big species book written from 1856 to 1858. Cambridge, UK: Cambridge University Press; 1975.

Stearns SC, Hoekstra RF. Evolution: an introduction. 2nd ed. Oxford, UK: Oxford University Press; 2005.

Strevens M. The essentialist aspect of naive theories. Cognition. 2000;74:149–75. doi: 10.1016/S0010-0277(99)00071-2 .

Sundberg MD. Strategies to help students change naive alternative conceptions about evolution and natural selection. Rep Natl Cent Sci Educ. 2003;23:1–8.

Sundberg MD, Dini ML. Science majors vs nonmajors: is there a difference? J Coll Sci Teach. 1993;22:299–304.

Tamir P, Zohar A. Anthropomorphism and teleology in reasoning about biological phenomena. Sci Educ. 1991;75:57–67. doi: 10.1002/sce.3730750106 .

Tidon R, Lewontin RC. Teaching evolutionary biology. Genet Mol Biol. 2004;27:124–31. doi: 10.1590/S1415-475720054000100021 .

Vlaardingerbroek B, Roederer CJ. Evolution education in Papua New Guinea: trainee teachers' views. Educ Stud. 1997;23:363–75. doi: 10.1080/0305569970230303 .

Wilson DS. Evolution for everyone: how to increase acceptance of, interest in, and knowledge about evolution. PLoS Biol. 2005;3:e364. doi: 10.1371/journal.pbio.0030364 .

Wood-Robinson C. Young people's ideas about inheritance and evolution. Stud Sci Educ. 1994;24:29–47. doi: 10.1080/03057269408560038 .

Zirkle C. The early history of the idea of the inheritance of acquired characters and of pangenesis. Trans Am Philos Soc. 1946;35:91–151. doi: 10.2307/1005592 .

Zohar A, Ginossar S. Lifting the taboo regarding teleology and anthropomorphism in biology education—heretical suggestions. Sci Educ. 1998;82:679–97. doi: 10.1002/(SICI)1098-237X(199811)82:6<679::AID-SCE3>3.0.CO;2-E .


REVIEW article

Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions

Nivedhitha Mahendran

  • 1 School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
  • 2 Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan

Gene expression is the process by which the physical characteristics of living beings are determined through the production of the necessary proteins. It takes place in two steps, transcription and translation: the flow of information from DNA to RNA with the help of enzymes, with proteins and other biochemical molecules as the end products. Many technologies can capture gene expression from DNA or RNA; one such technique is the DNA microarray. Besides being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a minimal sample size. A learning model trained on such data is prone to overfitting, so the dimensionality of the data must be reduced considerably. In recent years, Machine Learning has gained popularity in the field of genomic studies, and many Machine Learning-based gene selection approaches have been proposed to improve the precision of dimensionality reduction. This paper presents an extensive review of the work done on Machine Learning-based gene selection in recent years, along with a performance analysis. The study categorizes feature selection algorithms under Supervised, Unsupervised, and Semi-supervised learning. Recent work on reducing the number of features for diagnosing tumors is discussed in detail, and the performance of several methods from the literature is analyzed. The study also lists and briefly discusses open issues in handling high-dimensional, small-sample data.

Introduction

Deoxyribonucleic acid (DNA) is the hereditary material that carries genetic information and is usually found in the cell’s nucleus. The information inside DNA is written in a code consisting of four bases: Adenine, Guanine, Cytosine, and Thymine. Adenine pairs with Thymine and Cytosine pairs with Guanine to form base pairs. The base pairs, along with their respective sugar and phosphate molecules, form nucleotides, which assemble into a double-helical structure resembling a twisted ladder. A gene is the fundamental unit of heredity and is made of DNA. Genes are responsible for determining characteristics such as height, color, and many others. Some genes encode proteins, and some do not. According to the Human Genome Project, there are approximately 25,000 genes in humans.

Every human carries two copies of each gene, one inherited from each parent. Almost all genes are the same across individuals; the small fraction (less than 1%) that differ occur as alternative forms called alleles, which determine a person’s unique physical features. Genes manufacture proteins, and proteins, in turn, determine what the cell does (cell functions). The flow starts with DNA, then RNA, and then proteins, and this flow of information determines the type of proteins being produced. The process in which the information contained in DNA is transformed into instructions to form proteins and other biochemical molecules is called gene expression. Gene expression helps cells react appropriately to a changing environment. It involves two critical steps in manufacturing proteins, Transcription and Translation ( Raut et al., 2010 ).

• Transcription: The DNA in a gene is copied to form an RNA molecule known as messenger RNA (mRNA). RNA is similar to DNA; however, it is single-stranded and contains Uracil (U) instead of Thymine.

• Translation: The message carried by the mRNA from transcription is read, with the help of transfer RNA (tRNA), in the Translation phase. The mRNA is read three letters (one codon) at a time, and each codon specifies one amino acid (amino acids are the building blocks of proteins).

Proteins play a significant role in cell functioning. Gene expression controls everything about protein production: when to produce a protein, when not to, and in what volume, i.e., increasing or decreasing the amount. It acts as a kind of on/off switch. When this process does not happen as it should, genetic disorders and tumors can occur. A detailed study of gene expression will help find the essential biomarkers that cause genetic disorders and tumors.

There are many techniques available to capture gene expression, such as Northern blot, RNA protection assay, Reverse Transcription – Polymerase Chain Reaction (RT-PCR), Serial Analysis of Gene Expression (SAGE), Subtractive Hybridization, DNA Microarrays, Next-Generation Sequencing (NGS), and many others. Among these, the most widely used today is the DNA Microarray ( Raut et al., 2010 ; Wang and van der Laan, 2011 ). DNA microarray technology can capture the expression of thousands of genes simultaneously; however, the resulting data are enormous and high-dimensional, which makes analysis challenging. It is therefore necessary to perform gene selection to handle the high-dimensionality problem by removing redundant and irrelevant genes. Many computational techniques, such as Pattern Recognition and Data Mining, have been applied in bioinformatics over the years to manage the high-dimensionality issue, yet they have proved ineffective ( Raut et al., 2010 ).

Hence, in recent years, Machine Learning, a branch of Artificial Intelligence, has gained researchers’ attention in genomics and gene expression analysis. Machine Learning is part of Data Science; its primary purpose is to enable a model to learn from data so that it can make decisions on its own in the future. Machine Learning is commonly categorized into Supervised, Unsupervised, and Semi-supervised (or Semi-unsupervised) learning: supervised learning involves labeled data, unsupervised learning involves unlabeled data, and semi-supervised learning handles both labeled and unlabeled data. A typical Machine Learning workflow proceeds from pre-processing to classification or clustering. For gene expression microarray data, machine learning-based feature selection approaches, i.e., gene selection approaches, help select the required genes from the full set.

Feature selection helps preserve the informative attributes. It is primarily applied to high-dimensional data; in simple terms, feature selection is a dimensionality reduction technique ( Kira and Rendell, 1992 ). It is especially helpful in fields that have many features and relatively few samples, for instance, RNA sequencing and DNA Microarray ( Ang et al., 2015b ).

The primary reason feature selection has become popular in the recent past is its ability to extract an informative subset of features from the original feature space ( Ang et al., 2015b ). Feature selection techniques help reduce the risk of model overfitting, handle high dimensionality, improve interpretation of the feature space, maximize prediction accuracy, and reduce model training time ( Halperin et al., 2005 ; Sun et al., 2019b ). The outcome of feature selection is an optimal set of features relevant to the given class label, which contributes to the prediction process.

Another technique for dimensionality reduction is Feature Extraction, of which Feature Selection can be considered a part ( Cárdenas-Ovando et al., 2019 ). Feature Extraction transforms the original feature space into a new space formed from linear or non-linear combinations of the original features ( Anter and Ali, 2020 ). Its major drawback is that it alters the original feature space, so data interpretability is lost; the transformation is also usually expensive ( Bermingham et al., 2015 ).

Gene expression is the flow of genetic information from deoxyribonucleic acid (DNA) to ribonucleic acid (RNA) to protein or other biomolecule syntheses. Gene expression data are a biological representation of the various transcripts and other chemical species found inside a cell at a given time. Because the data are recorded directly from DNA through various experiments, a pertinent computational technique can reveal deep insights about disease or disorder in the cell, and ultimately in the organism to which the cell belongs ( Koul and Manvi, 2020 ).

On the one hand, gene expression data are highly dimensional; on the other, the sample size is very limited. The high dimensionality arises from the vast number of values generated for every gene in a genome, on the order of thousands. Advanced technologies such as Microarray make it possible to analyze thousands of genes in a particular sample; however, Microarray remains expensive ( Wahid et al., 2020 ).

Data with a vast feature space contain redundant features carrying unnecessary information, which leads to overfitting and significantly affects the model’s performance. The primary purpose of implementing feature selection, or gene selection, on gene expression data is to choose the most regulating genes and eliminate the redundant genes that do not contribute to the target class ( Pearson et al., 2019 ).

Gene expression data may be unlabeled, labeled, or partially labeled, which motivates the concepts of Unsupervised, Supervised, and Semi-supervised feature selection. Unlabeled data carry no prior information about gene function, so gene selection must be validated based on data distribution, variance, and separability. Labeled data consist of meaningful class labels and information about functionality, so gene selection can be performed based on the relevance and importance scores of the labeled features. Semi-supervised (or Semi-unsupervised) settings combine a small amount of unlabeled data with labeled data, or vice versa, which acts as additional information ( Yang et al., 2019 ). This paper discusses the importance of feature selection, or gene selection, for obtaining improved results. The remaining sections discuss the background and development of feature selection, the steps involved in feature selection, a detailed discussion of various works on gene selection in the literature, the open issues and future research directions concerning gene expression data, and the conclusion.

Feature selection methods can be categorized into Supervised, Unsupervised, and Semi-supervised learning models. Survey works in the literature tend to concentrate on only one of these models; for example, Kumar et al. (2017) focuses only on supervised gene selection methods. Some works also concentrate on one particular feature selection strategy; for example, Lazar et al. (2012) focuses on filter-based techniques. Table 1 shows the comparison of existing reviews with the current survey. Our study categorizes feature selection strategies into supervised, unsupervised, and semi-supervised methods, discusses the existing approaches in each category, and provides a detailed discussion of their performance.


Table 1. Comparison of existing reviews with the current survey.

Gene Selection – Background and Development

Gene Selection is the technique applied to gene expression datasets, such as DNA Microarray, to reduce the number of genes that are redundant or less informative. It is grounded in Machine Learning-based Feature Selection, which is well suited to applications involving thousands of features ( Dashtban and Balafar, 2017 ). Gene Selection techniques are applied mainly for two reasons: finding the informative and expressive genes, and removing redundant genes from the original space. In theory, an increase in the number of genes will bring down the model’s performance and compromise generalization through overfitting. Present work on Gene Selection concentrates mainly on finding the relevant genes, and there is limited research on removing noisy and redundant genes ( Wang et al., 2005 ).

For significant results, it is critical to account for relevancy, redundancy, and complementarity. A gene is considered relevant when it carries necessary information (individually or combined with other genes) about the given class, for example, tumorous or not. According to Yu and Liu (2004) , features can be classified technically as strongly relevant, weakly relevant, or irrelevant. Weakly relevant features can be further divided into weakly relevant but redundant features and weakly relevant, non-redundant features. Most of the informative features are found among the strongly relevant and the weakly relevant, non-redundant features ( Vergara and Estévez, 2014 ). The same reasoning is followed in Gene Selection from gene expression data. Figure 1 shows the representation of the Gene Selection approach.


Figure 1. Representation of Gene Selection approach.

Many works in the literature ( Hu et al., 2010 ; Hoque et al., 2014 ; Sun and Xu, 2014 ) aim to remove redundant and irrelevant genes from gene expression data with the help of the Mutual Information algorithm, and many variations of Mutual Information have been implemented to tackle these two issues. Along with these two issues, there is a third that many existing works fail to address: complementarity, the degree of feature interaction between a gene subset and an individual gene with respect to a given class.

To solve the issues mentioned above, two approaches are commonly followed in the literature: analyzing individual genes and finding an optimal subset. When analyzing individual genes, the genes are ranked based on their importance scores; genes with similar scores (redundant) and genes with the lowest scores (irrelevant), below a given threshold, are removed. When finding an optimal subset, a search is performed for a minimal subset of genes that satisfies specific criteria while eliminating redundant and irrelevant genes.

In applications such as text and genomic Microarray analysis, the central issue is the “Curse of Dimensionality,” and finding the optimal subset of genes is considered an NP-hard problem. Effective learning will be achieved only when the model is trained with relevant and non-redundant genes. However, as the number of genes grows, the number of possible gene subsets increases exponentially.

In machine learning, the feature space is the n-dimensional space spanned by the feature vectors of the samples. To reduce the dimensionality of such a feature space, feature extraction or feature selection techniques can be used. Feature Selection can be viewed as part of the broader Feature Extraction family; however, in feature selection a subset of the original feature space is retained, whereas in feature extraction a new feature space is created that aims to capture the necessary information from the original one ( Jović et al., 2015 ). The most commonly used feature extraction techniques are Principal Component Analysis (PCA), Independent Component Analysis (ICA), Expectation-Maximization (EM), and Linear Discriminant Analysis (LDA). Some examples of Feature Selection techniques are RELIEF, Conditional Mutual Information Maximization (CMIM), Correlation Coefficient, Information Gain, and Lasso ( Khalid et al., 2014 ).

The major drawback of using Feature Extraction is that the data’s interpretability is lost in the transformation, and the transformation itself can be expensive ( Khalid et al., 2014 ). Therefore, in this paper, we discuss various Feature Selection techniques used in Gene Selection, which are less expensive and preserve the data’s interpretability.
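To make this distinction concrete, the following is a minimal sketch, not drawn from any of the cited studies: the synthetic matrix, class labels, and variable names are illustrative assumptions. It contrasts feature extraction (PCA, which builds new axes from combinations of all genes) with feature selection (a mutual-information filter, which keeps a subset of the original genes).

```python
# Minimal sketch contrasting feature extraction (PCA) with feature selection
# (mutual-information filter) on a synthetic "expression-like" matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))       # 60 samples, 500 hypothetical "genes"
y = rng.integers(0, 2, size=60)      # binary class labels (e.g., tumor vs. normal)

# Feature extraction: new axes mixing all genes (interpretability is lost).
X_pca = PCA(n_components=10).fit_transform(X)

# Feature selection: keep 10 of the original genes (interpretability preserved).
selector = SelectKBest(mutual_info_classif, k=10).fit(X, y)
X_sel = selector.transform(X)
print(X_pca.shape, X_sel.shape, selector.get_support(indices=True))
```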

Machine learning-based Gene Selection can be classified into three types: Supervised, Unsupervised, and Semi-Supervised. Supervised Gene Selection utilizes genes that are already labeled ( Filippone et al., 2006 ); the input and output labels are known in advance. However, as the data continue to grow and overwhelm the process, mislabeling becomes more likely, which makes the labels unreliable. The main issue in deploying Supervised Gene Selection is overfitting, which can be caused by selecting irrelevant genes or by eliminating the most relevant ones ( Ang et al., 2015b ).

Unsupervised Gene Selection, unlike Supervised, has no labels to guide the selection process ( Filippone et al., 2005 ); the data it uses are unlabeled. That makes it unbiased and an effective way to find the insights needed for the classification process ( Ye and Sakurai, 2017 ). Its main issue is that it does not consider the interactions (correlations) among genes, which can make the resulting gene subset insignificant for the discrimination task ( Acharya et al., 2017 ).

Semi-supervised or Semi-unsupervised Gene Selection is an add-on to the Supervised and Unsupervised approaches. Gene Selection is considered semi-supervised when most of the data are labeled and semi-unsupervised when most of the data are unlabeled. The labeled data are used to increase the distance between data points belonging to different classes, whereas the unlabeled data help identify the geometrical structure of the feature space ( Sheikhpour et al., 2017 ). Figure 2 illustrates the overview of the process involved in Gene Selection.


Figure 2. An overview of Gene Selection process.

Steps Involved in Feature Selection

Search Direction

The first stage involved in Feature Selection is to choose a search direction, which serves as a starting point to the process. There are three commonly used search directions:

• Forward Search: The search starts with an empty set, and features are added one by one ( Mohapatra et al., 2016 ); a minimal sketch of this strategy appears after this list.

• Backward Search: Search will be started with the whole set of genes, and the genes will be eliminated one by one with each iteration.

• Bi-directional Search: Combines the advantages of Forward and Backward Search; the search proceeds from both directions by either adding or removing a gene at each iteration ( Abinash and Vasudevan, 2018 ). Other than these, Random Search is also used as a search direction ( Wang et al., 2016 ).
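The sketch below is a toy illustration of the forward direction: it assumes X is a NumPy array, uses an arbitrary 3-nearest-neighbour classifier as the scorer, and stops when adding another gene no longer improves cross-validated accuracy. The function name and parameters are our own, not from any surveyed method.

```python
# Minimal sketch of a forward search: start from an empty set and greedily add
# the gene that most improves cross-validated accuracy, until no gene helps.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_search(X, y, max_genes=10):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    while remaining and len(selected) < max_genes:
        scores = []
        for g in remaining:
            cols = selected + [g]
            score = cross_val_score(clf, X[:, cols], y, cv=3).mean()
            scores.append((score, g))
        score, g = max(scores)
        if score <= best_score:          # stopping criterion: no improvement
            break
        best_score = score
        selected.append(g)
        remaining.remove(g)
    return selected, best_score
```

Note that even this greedy search retrains the classifier many times, which hints at why wrapper-style searches become costly on microarray-scale data.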

Search Strategy

A good search strategy should converge quickly and provide an optimal solution at a reasonable computational cost, with good global search ability ( Halperin et al., 2005 ). The three most widely used search strategies are:

• Sequential: Follows a particular order in finding the best feature subset, for instance Sequential Forward Search, where the search is carried out from start to end ( Chen and Yao, 2017 ). This strategy handles feature interactions poorly and risks getting stuck in local minima ( Wang et al., 2016 ). Examples: Floating Forward or Backward, Linear Forward Search, Beam Search, Greedy Forward Selection, and Backward Elimination.

• Exponential: A full-scale search; it guarantees an optimal solution but proves to be expensive. This approach evaluates all possible feature subsets to choose an optimal one, which is computationally costly, especially for high-dimensional datasets such as Gene Expression Microarray data. Examples of exponential search are Exhaustive Search and Branch-and-Bound (a rough cost comparison with sequential search appears after this list).

• Heuristic Search: It is performed based on a cost measure or a heuristic function, which iteratively improves the solution. Heuristic Search does not always ensure an optimal solution, but it offers an acceptable solution with reasonable time, cost, and memory space ( Ruiz et al., 2005 ). Some examples of Heuristic Search are Best-First Search, Depth-First Search, A ∗ Search, Breadth-First Search, and Lowest-Cost-First Search ( Russell and Norvig, 2016 ).
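To give a sense of why exponential strategies become infeasible while sequential ones remain tractable, here is a back-of-the-envelope comparison; the gene count of 2,000 is an arbitrary illustrative value for a microarray-scale dataset.

```python
# Rough cost comparison: exhaustive search evaluates every non-empty subset
# (2^n - 1), while a greedy sequential forward search that may add up to n
# genes evaluates roughly n*(n+1)/2 candidate subsets.
n_genes = 2000                                # illustrative microarray scale
exhaustive = 2 ** n_genes - 1                 # astronomically large integer
greedy = n_genes * (n_genes + 1) // 2         # about two million model fits
print(f"exhaustive: a {len(str(exhaustive))}-digit number of subsets; "
      f"greedy forward: {greedy} subsets")
```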

Evaluation Criteria

There are currently four types of evaluation methods used widely; they are Filter, Wrapper, Embedded, and Hybrid. Hybrid and Embedded methods are the recent developments in Gene Selection.

(a) Filter Feature Selection Approach:

Filter methods assess the specific abilities of features based on the inherent properties of the data. The best features are identified using a relevance score and a threshold criterion ( Hancer et al., 2018 ); features with a low relevance score are eliminated.

The significant advantages of filter techniques are that they do not depend on the classifier, are fast and straightforward in terms of computation, and scale to very high-dimensional datasets ( Ang et al., 2015b ). The common disadvantage is that they are univariate, meaning the features are processed individually ( Saeys et al., 2007 ). As a result, feature dependencies are easily ignored, which leads to poorer classifier performance compared with other feature selection approaches. Many multivariate filter techniques have been introduced to mitigate this to some extent ( Brumpton and Ferreira, 2016 ; Djellali et al., 2017 ; Zhou et al., 2017 ; Rouhi and Nezamabadi-pour, 2018 ).

Examples of filter techniques are Pearson Correlation, Fisher Score, Model-based Ranking, and Mutual Information; Lazar et al. (2012) conducted a detailed survey of filter techniques applied to Gene Expression Microarray data. Figure 3 is the representation of the process involved in the filter approach in gene selection.


Figure 3. Flow diagram – Filter Feature Selection Approach.
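As a rough illustration of the score-and-threshold idea described above, the sketch below ranks genes by a univariate ANOVA F-score and keeps those above a percentile cutoff. The helper name and the 95th-percentile threshold are illustrative assumptions, not a method from the reviewed papers.

```python
# Minimal sketch of a filter approach: score each gene with a univariate
# relevance measure (ANOVA F-statistic) and keep only high-scoring genes.
import numpy as np
from sklearn.feature_selection import f_classif

def filter_select(X, y, percentile=95):
    scores, _pvalues = f_classif(X, y)             # univariate relevance scores
    threshold = np.percentile(scores, percentile)  # keep the top 5% of genes
    keep = np.where(scores >= threshold)[0]
    return keep, scores[keep]
```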

(b) Wrapper Feature Selection Approach:

Unlike filter approaches, wrapper approaches wrap the feature subset selection process around the induction algorithm, treated as a black box. Once the search procedure over the feature subspace is defined, various feature subsets are generated, and the classification algorithm is used to evaluate each selected subset ( Blanco et al., 2004 ). With this approach, it is possible to select features tailored to the induction algorithm ( Jadhav et al., 2018 ). The classification algorithm’s evaluation measures are optimized while features are eliminated, hence offering better accuracy than the filter approach ( Inza et al., 2004 ; Mohamed et al., 2016 ).

The significant advantage of the wrapper approach is that, because feature subset generation and the induction algorithm are wrapped together, the model can track feature dependencies ( Rodrigues et al., 2014 ). The common drawback is that it becomes computationally intensive for high-dimensional datasets ( Mohamed et al., 2016 ). Examples of Wrapper techniques are Hill Climbing, Forward Selection, and Backward Elimination. Figure 4 is the representation of the process involved in the wrapper approach.


Figure 4. Flow diagram – Wrapper Feature Selection Approach.
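SVM-RFE, which several of the surveyed supervised methods build upon, can be sketched with scikit-learn's recursive feature elimination. The hyperparameter values below (C, step fraction, number of retained genes) are arbitrary illustrative choices rather than settings from any reviewed study.

```python
# Minimal sketch of a wrapper approach: recursive feature elimination (RFE)
# wrapped around a linear SVM, in the spirit of the widely used SVM-RFE scheme.
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_genes=20):
    estimator = LinearSVC(C=1.0, max_iter=5000)
    rfe = RFE(estimator, n_features_to_select=n_genes, step=0.1)
    rfe.fit(X, y)                      # repeatedly trains the SVM, dropping the
    return rfe.support_, rfe.ranking_  # lowest-weighted 10% of genes each round
```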

(c) Embedded Feature Selection Approach:

In a way, embedded approaches resemble wrapper approaches, as both depend on the learning algorithm ( Hernandez et al., 2007 ). However, embedded methods are less computationally intensive than wrapper methods, and the link between the learning algorithm and the feature selection is more robust in embedded methods than in wrapper methods ( Huerta et al., 2010 ). In embedded methods, feature selection is made part of the classification algorithm; in other words, the algorithm has its own built-in mechanism for selecting the essential features ( Hira and Gillies, 2015 ).

In the literature, it is mentioned that embedded methods combine the benefits of filter and wrapper methods to improve accuracy. The significant difference between other gene selection approaches and embedded approaches is how the genes are selected and the interaction with the learning algorithm ( Chandrashekar and Sahin, 2014 ; Vanjimalar et al., 2018 ). Some examples of embedded approaches are ID3, RF, CART, LASSO, L1 Regression, and C4.5. Figure 5 is the representation of the process involved in the embedded approach.


Figure 5. Flow diagram – Embedded Feature Selection Approach.
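Since LASSO and L1 regression are listed above as embedded examples, the following minimal sketch shows how an L1-penalised logistic regression can act as a built-in gene selector: training and selection happen in one step, and the retained genes are those with non-zero weights. The regularisation strength is an illustrative assumption.

```python
# Minimal sketch of an embedded approach: an L1-penalised logistic regression
# performs selection as part of training, keeping genes with non-zero weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_select(X, y, C=0.1):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    return np.where(clf.coef_.ravel() != 0)[0]   # indices of retained genes
```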

(d) Hybrid Feature Selection Approach:

Hybrid methods, as the name suggests, are a combination of two different techniques: two different feature selection approaches, different methods with a similar criterion, or two different strategies. In most cases, the filter and wrapper approaches are combined to form a hybrid approach ( Apolloni et al., 2016 ; Liu et al., 2019 ), which strives to exploit the compatible strengths of both. Hybrid methods offer better accuracy and computational complexity than filter or wrapper methods alone, and they are less susceptible to overfitting ( Almugren and Alshamlan, 2019 ). Figure 6 is the representation of the process involved in the hybrid approach.


Figure 6. Flow diagram – Hybrid Feature Selection Approach.
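A common filter-wrapper hybrid can be sketched as a scikit-learn pipeline in which a cheap univariate filter first trims the gene space before a wrapper refines it. The stage sizes (200 genes after the filter, 20 after the wrapper) are arbitrary illustrative values and assume the input has at least that many genes.

```python
# Minimal sketch of a filter-wrapper hybrid: a univariate filter trims the gene
# space, then RFE around a linear SVM refines it, before the final classifier.
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

hybrid = Pipeline([
    ("filter", SelectKBest(f_classif, k=200)),                            # stage 1: filter
    ("wrapper", RFE(LinearSVC(max_iter=5000), n_features_to_select=20)),  # stage 2: wrapper
    ("clf", LinearSVC(max_iter=5000)),                                    # final classifier
])
# Usage (with user-supplied arrays): hybrid.fit(X_train, y_train); hybrid.score(X_test, y_test)
```

Because the expensive wrapper only ever sees the 200 pre-filtered genes, the hybrid keeps most of the wrapper's accuracy benefit at a fraction of its cost, which is the design rationale the text describes.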

Stopping Criteria

The stopping criteria are a kind of threshold used to tell the search when to stop selecting features ( Wang et al., 2005 ). Appropriate stopping criteria keep a model from overfitting and thus offer better results at a lower computational cost ( Ang et al., 2015b ). Some commonly used stopping criteria are as follows:

(1) The search reaches a specified bound; the bound can be a number of iterations or a number of features.

(2) The results do not improve with a deletion (or addition) of another feature.

(3) An optimal subset is found. A subset is said to be optimal when the classifier’s error rate is below the preferred threshold.

Evaluating the Results

There are many performance evaluation metrics available in the literature to evaluate and validate classifier results. For classification, i.e., predicting a categorical attribute, the commonly used error estimation methods are the Confusion Matrix, Cross-Validation, and the Receiver Operating Characteristic (ROC). For regression, i.e., predicting a continuous attribute, the commonly used error estimation methods are Mean Absolute Error (MAE), Mean Squared Error (MSE), and the Coefficient of Determination (R2). A small sketch illustrating these measures appears after the list below.

(a) Confusion Matrix: For multi-class problems, a confusion matrix is the best option to evaluate the classification model ( Handelman et al., 2019 ). For instance, in a binary classification problem there are four possible results with which the model can be evaluated: True Positive (classified correctly), False Positive (classified erroneously), False Negative (rejected erroneously), and True Negative (rejected correctly) ( Braga-Neto et al., 2004 ). The Confusion Matrix offers measures such as Accuracy, Precision, Sensitivity, Specificity, and F-Measure to validate the results of a classifier.

(b) Cross-Validation (CV): It is the process of partitioning the available data into k-sets. Here, k can be any integer depending on the number of folds one needs for the classification or regression task (for instance, k = 10, k = 20, etc.) ( Schaffer, 1993 ; Braga-Neto et al., 2004 ). CV is most commonly used on the Regression and Classification approaches ( Chandrashekar and Sahin, 2014 ). The main advantage of using CV is that it offers unbiased error estimation, although sometimes it is variable ( Bergmeir and Benítez, 2012 ).

(c) Receiver Operating Characteristic (ROC): ROC graphs and curves are commonly used to visualize the performance of classifiers and to select the one showing better performance ( Landgrebe and Duin, 2008 ). As research increasingly concentrates on classification errors and unbalanced class distributions, ROC has gained a lot of attention ( Flach, 2016 ). It depicts the trade-off between Sensitivity, or benefits (TPR), and the false positive rate, or costs (FPR) ( Fawcett, 2006 ).

(d) Root Mean Square Error (RMSE): RMSE is a metric commonly used to measure the standard deviation of the residuals (prediction errors), i.e., the deviation of the predictions from the regression line. It is given by Elavarasan et al. (2018) as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}_i\right)^2}$$

where $x_i$ are the actual (observed) values, $\bar{x}_i$ the predicted values, and $n$ the total number of samples.

(e) Mean Absolute Error (MAE): MAE is the standard measure of the average magnitude of the residuals (prediction errors), ignoring their direction. It is given by Elavarasan et al. (2018) as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}_i\right|$$

where $x_i$ are the actual (observed) values, $\bar{x}_i$ the predicted values, and $n$ the total number of samples.

(f) Determination Coefficient (R²): it estimates how much of the variation in one variable is explained by another, expressed as a change in percentage of one variable with respect to the other. It is given by Elavarasan et al. (2018) as

$$R = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}, \qquad R^{2} = R \times R$$

where $x$ is the first set of values in the data, $y$ the second set of values, and $R$ the correlation coefficient whose square gives the coefficient of determination.
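The following sketch pulls these evaluation measures together; it is an illustration only. It assumes a binary label vector with enough samples for 10-fold cross-validation, and for brevity it computes the confusion matrix and AUC on the training data, which a real evaluation would replace with a held-out split. The regression errors follow the formulas above, with R² taken as the squared correlation coefficient.

```python
# Minimal sketch of the evaluation measures discussed above. Classification:
# confusion matrix, 10-fold cross-validated accuracy, ROC AUC. Regression:
# RMSE, MAE, and R^2 computed from observed (x) and predicted (x_hat) values.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score

def evaluate_classifier(X, y):
    clf = LogisticRegression(max_iter=1000)
    cv_acc = cross_val_score(clf, X, y, cv=10).mean()    # 10-fold CV accuracy
    clf.fit(X, y)
    cm = confusion_matrix(y, clf.predict(X))             # rows: true, cols: predicted
    auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])   # area under the ROC curve
    return cv_acc, cm, auc

def regression_errors(x, x_hat):
    rmse = np.sqrt(np.mean((x - x_hat) ** 2))            # root mean square error
    mae = np.mean(np.abs(x - x_hat))                     # mean absolute error
    r = np.corrcoef(x, x_hat)[0, 1]                      # correlation coefficient R
    return rmse, mae, r ** 2                             # R^2 = squared correlation
```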

Machine Learning Based Gene Selection Approaches

Supervised Gene Selection

Supervised Gene Selection involves the data with labeled attributes. Most of the studies done in recent years have concentrated mainly on enhancing and improving the existing supervised gene selection methods.

For instance, Devi Arockia Vanitha et al. (2016) enhanced the Mutual Information (MI) filter method for selecting informative genes, and Joe’s Normalized Mutual Information, an improved version of the standard MI approach, was implemented by Maldonado and López (2018). Because filter approaches are independent of the classifier used, many works focus on developing filter techniques; for instance, one novel filter approach is based mainly on the Hilbert-Schmidt Independence Criterion (SHS) and motivated by Singular Value Decomposition (SVD). Table 2 shows some of the filter-based gene selection techniques used in the literature to select informative genes.


Table 2. Filter-based Supervised Gene Selection.

The wrapper approach is more computationally intensive than other feature selection approaches, so there are fewer works on it, and most research on wrappers focuses on reducing the computational cost. For instance, Wang A. et al. (2017) and Wang H. et al. (2017) implemented a wrapper-based gene selection with a Markov Blanket, which reduces computation time. Many approaches try to enhance the widely used Support Vector Machine – Recursive Feature Elimination (SVM-RFE); for example, Shukla et al. (2018) implemented Support Vector Machine – Bayesian t-test – Recursive Feature Elimination (SVM-BT-RFE), in which a Bayesian t-test is combined with SVM-RFE to improve the results. Table 3 shows the works done in recent years on Wrapper-based Supervised Gene Selection.


Table 3. Wrapper-based Supervised Gene Selection.

Hybrid Feature Selection is usually a combination of other approaches; most often, filter and wrapper approaches are combined into hybrids. For instance, Liao et al. (2014) implemented a filter-wrapper hybrid utilizing the Laplacian score and Sequential Forward and Backward Selection. Work also continues on combining nature-inspired algorithms; for example, Alshamlan et al. (2015) implemented a Genetic Bee Colony, combining the Genetic Algorithm and Artificial Bee Colony for gene selection. A hybrid of the Salp Swarm Algorithm (SSA) and the multi-objective spotted hyena optimizer (MOSHO) is implemented in Sharma and Rani (2019); SSA focuses on diversity, and MOSHO concentrates on convergence. Table 4 lists recent works on Hybrid-based Supervised Gene Selection approaches.


Table 4. Hybrid Supervised Gene Selection.

Ensemble Feature Selection combines the outputs of different expert feature selection approaches. Ghosh et al. (2019a, 2019b) combine the outputs of ReliefF, Chi-square, and Symmetrical Uncertainty (SU) using the union and intersection of the top n features. Seijo-Pardo et al. (2016) used a ranking aggregation method to combine ranks from Chi-square, InfoGain, mRmR, and ReliefF. Table 5 shows the different Ensemble-based Supervised Gene Selection approaches used in recent years.


Table 5. Ensemble-based Supervised Gene Selection.

Embedded methods merge the benefits of filter and wrapper methods, with the learning algorithm having a built-in feature selection approach. Ghosh et al. (2019b) implemented a Recursive Memetic Algorithm (RMA) with a wrapper-based approach embedded in it, and Guo et al. (2017) used L1 Regularization along with a feature extraction method for selecting informative genes. Table 6 shows the various Embedded-based Supervised Gene Selection approaches developed in recent years.


Table 6. Embedded-based Supervised Gene Selection.

Unsupervised Gene Selection

Unsupervised Gene Selection involves data without any labels. Compared with Supervised Gene Selection, fewer works address the Unsupervised setting.

There are many novel works on filter-based unsupervised gene selection. Solorio-Fernández et al. (2017) proposed a filter method for both non-numerical and numerical data that combines a kernel approach with spectrum-based feature evaluation, and Liu et al. (2018) developed a Deep Sparse Filtering model that takes deep structures into account, enhancing the results. Nature-inspired gene selection has also been studied; Guo et al. (2017) implemented MGSACO to minimize redundancy, thereby increasing the relevance of the selected genes. Another issue with high-dimensional data is dependency maximization; the work in Boucheham et al. (2015) implemented the Hilbert-Schmidt Independence Criterion to eliminate the most dependent genes. Table 7 collects the works done in recent years on Filter-based Unsupervised Gene Selection approaches.


Table 7. Filter-based Unsupervised Gene Selection.
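Because no labels are available, unsupervised filters typically rely on properties such as variance and inter-gene correlation. The sketch below is a toy illustration with arbitrary thresholds, not one of the surveyed methods: it ranks genes by variance and discards any gene highly correlated with a gene already kept.

```python
# Toy sketch of an unsupervised filter: rank genes by variance (no labels needed),
# then keep a gene only if it is not highly correlated with any gene already kept.
import numpy as np

def unsupervised_filter(X, n_genes=50, corr_threshold=0.9):
    order = np.argsort(X.var(axis=0))[::-1]      # most variable genes first
    kept = []
    for g in order:
        if len(kept) == n_genes:
            break
        if all(abs(np.corrcoef(X[:, g], X[:, k])[0, 1]) < corr_threshold for k in kept):
            kept.append(g)                       # non-redundant with every kept gene
    return kept
```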

Filter-based gene selection approaches do not depend on the learning model; wrapper methods, on the contrary, depend on it entirely. This dependency makes them complicated and computationally costly, so wrapper methods receive less study, and unsupervised wrapper gene selection is studied even less. Xu et al. (2017) implemented SVM-RFE, a wrapper-based gene selection, on unlabeled data to distinguish high-risk from low-risk cancer patients. Table 8 is an example of a wrapper-based Unsupervised Gene Selection approach.


Table 8. Wrapper-based Unsupervised Gene Selection.

Hybrid unsupervised gene selection receives as much attention in the literature as the filter approach. Li and Wang (2016) developed a two-stage, coarse-to-fine hybrid gene selection approach for unlabeled data that applies matrix factorization and the minimum loss principle, and it shows better results than several other approaches compared in that study. Filter-wrapper hybrids are pursued in unsupervised as well as supervised gene selection; for instance, Solorio-Fernández et al. (2017) implemented a hybrid combining Laplacian Score ranking (a filter approach) with a wrapper based on a weighted normalized Calinski-Harabasz index (LS-WNCH) for unsupervised gene selection, which incorporates properties of spectral feature selection. Table 9 shows the hybrid-based Unsupervised Gene Selection approaches.


Table 9. Hybrid Unsupervised Gene Selection.

Ensemble and embedded approaches are studied less than filter and hybrid methods. Elghazel and Aussem (2013) implemented a Random Cluster Ensemble with k-means as the clustering model; the ensemble was constructed with different bootstrap samples at every ensemble partition, and out-of-bag feature importance was calculated for each ensemble member. Li et al. (2017) developed a reconstruction-based unsupervised feature selection model, an embedded approach in which a filter-based method is embedded in k-means clustering. Table 10 gives examples of Ensemble-based and Embedded-based Unsupervised Gene Selection approaches.


Table 10. Ensemble and embedded Unsupervised Gene Selection.

Semi-Supervised Gene Selection

Semi-supervised gene selection is a research area yet to be fully explored; far fewer works exist than for supervised or unsupervised gene selection. Semi-supervised (or semi-unsupervised) settings consist of both labeled and unlabeled data.

Li et al. (2018) combined the benefits of spectral graph theory and Mutual Information to develop Semi-Supervised Maximum Discriminative Local Margin (SemiMM), which takes care of variance, local structure, and MI at the same time. SVM is used widely in supervised and unsupervised gene selection; in the semi-supervised setting, Ang et al. (2015b) implemented a semi-supervised SVM-RFE (S3VM) for selecting informative genes, and it proved successful. Chakraborty and Maulik (2014) developed a hybrid model in which a Kernelized Fuzzy Rough Set (KFRS) and S3VM are combined to select the relevant features; the results show that the proposed algorithm can choose useful biomarkers from the dataset. A semi-supervised embedded approach, Joint Semi-Supervised Feature Selection (JSFS), was developed with a Bayesian approach; the model automatically chooses the informative features and also trains the classifier.

Rajeswari and Gunasekaran (2015) , developed an ensemble-based semi-supervised gene selection to improve the quality of the cluster model. Modified Double Selection based Semi-Supervised Cluster Ensemble (MDSVM-SSCE) assists in selecting the most relevant genes. Table 11 shows the Semi-Supervised Gene Selection approaches developed in recent years.


Table 11. Semi-Supervised Gene Selection approaches.
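The core semi-supervised idea, using labeled samples for discrimination and all samples for structure, can be illustrated with the toy sketch below. It blends a supervised relevance score computed on the labeled subset with an unsupervised variance score computed on every sample; the blending weight, NaN-based label encoding, and function name are our own assumptions, and this is not SemiMM, JSFS, or S3VM-RFE.

```python
# Toy sketch of the semi-supervised idea: a supervised relevance score computed on
# the labeled samples is blended with an unsupervised score (variance) computed on
# all samples. Illustration only; not one of the surveyed algorithms.
import numpy as np
from sklearn.feature_selection import f_classif

def semi_supervised_rank(X, y, alpha=0.5):
    labeled = ~np.isnan(y)                            # NaN marks an unlabeled sample
    sup_score, _ = f_classif(X[labeled], y[labeled])  # supervised part (labeled only)
    unsup_score = X.var(axis=0)                       # unsupervised part (all samples)

    def norm(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)

    combined = alpha * norm(sup_score) + (1 - alpha) * norm(unsup_score)
    return np.argsort(combined)[::-1]                 # genes ranked best first
```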

Performance Analysis and Discussion on the Reviewed Literature

In the literature, the three most widely used datasets are Prostate, Leukemia, and Colon. Tables 12–14 show the respective proposed models’ performance on these datasets, along with the number of genes selected.


Table 12. Performance analysis of prostate dataset.


Table 13. Performance analysis on Leukemia dataset.


Table 14. Performance analysis of colon dataset.

All three gene selection settings discussed in this paper have their own merits and demerits. From the literature, it is clear that Supervised Gene Selection has been researched the most in recent years and Semi-supervised the least. Even though its potential has not been fully tapped yet, Semi-Supervised Gene Selection appears to be the most promising of the three, as it draws on the advantages of both Supervised and Unsupervised approaches. Because it has both labeled and unlabeled data, it can handle overlapping genes with the unsupervised part (unlabeled data) while training the learning model with high accuracy and precision using the supervised part (labeled data). Figures 7–10 show that Supervised Gene Selection performs considerably better than the other two, but this may simply reflect the significantly smaller number of works on Unsupervised and Semi-Supervised Gene Selection. The abbreviations for the acronyms used in the plots can be found in Table 15. Several opportunities remain untapped in these two areas. We can also notice that many works concentrate on Filter approaches, as they are simple and computationally effective; however, hybrid approaches are upcoming and promising.


Figure 7. Performance analysis of Supervised Gene Selection Models – Part A.


Figure 8. Performance analysis of Supervised Gene Selection Models – Part B.


Figure 9. Performance analysis of Unsupervised Gene Selection Models.


Figure 10. Performance analysis of Semi-Supervised Gene Selection Models.


Table 15. Acronyms.

As for the evaluation criteria, filter-based approaches have received the most attention in recent years. Filter methods function independently of the learning model and are thus less computationally intensive; because they are less complicated, many researchers target filter-based approaches for selecting informative genes. Wrapper-based approaches are the least studied: they depend on, and are designed to support, the learning model, and they are usually time-consuming and generate high computational overhead. Although the other methods receive comparable attention, the hybrid approach proves to be the better one among them. A hybrid is a combination of two or more approaches, most commonly the filter-wrapper combination. In the hybrid approach, the limitations of the individual approaches are compensated for; in other words, it inherits the benefits of both methods, which further reduces computational cost. Hybrid approaches appear to provide better accuracy and to reduce the risk of overfitting, and the literature suggests they are well suited to high-dimensional datasets such as gene expression microarrays.

Apart from the literature discussed above, many other works have applied nature-inspired and meta-heuristic algorithms to cancer diagnosis. Dashtban et al. (2018) proposed a bio-inspired method based on the BAT algorithm with refined, effective multi-objectives and a novel local search strategy. Another BAT-inspired algorithm with two-stage gene selection was proposed by Alomari et al. (2017), in which the first stage is a filter (Minimum Redundancy Maximum Relevance) and the second stage is a wrapper combining BAT and SVM. Considerable work has also been done on improving and enhancing Particle Swarm Optimization (PSO). Jain et al. (2018) implemented a two-phase hybrid gene selection method combining improved PSO (iPSO) with Correlation-based Feature Selection (CFS), which controls the early-convergence problem. A recursive PSO was implemented by Prasad et al. (2018) to refine the feature space into a more fine-grained one; the authors also combined existing filter-based feature selection methods with the recursive PSO. KNN and PSO were combined in Kar et al. (2015) to handle the uncertainty involved in choosing the k-value of KNN. Han et al. (2015) proposed a Binary PSO (BPSO) to improve the interpretability of the selected genes and the prediction accuracy of the model. In Shreem et al. (2014), the nature-inspired Harmony Search Algorithm (HSA) was embedded with a Markov Blanket, focusing on symmetrical uncertainty. Sharbaf et al. (2016) implemented Ant Colony Optimization (ACO)-based gene selection with Cellular Learning Automata (CLA) as a wrapper method. In another approach (Lai et al., 2016), a hybrid combining filter and wrapper stages was implemented using Information Gain (IG) and improved simplified swarm optimization to find the optimal gene subset. Information Gain was also combined with SVM in Gao et al. (2017) to remove redundant genes. There are also works on gene selection using Genetic Algorithms with different variations: one combines a Genetic Algorithm with fuzzy systems (Nguyen et al., 2015) to find the optimal gene subset, and another combines a Genetic Algorithm with learning automata (GALA) (Motieghader et al., 2017), improving the time complexity of selecting the gene subset. Statistically grounded models have also been implemented, such as entropy-based measures with neighborhood rough sets (Chen et al., 2017) and approaches testing statistical significance with p-values and fold change (Xiao et al., 2014; Sun et al., 2019c). Decision tree and random forest variants have also been explored, such as random forest-based gene selection methods (Kursa, 2014), a decision tree empowered by PSO (Chen et al., 2014), and a guided regularized random forest (Deng and Runger, 2013). Various works focus on improving the interpretability of the features and reducing the feature space by refining existing models (Zibakhsh and Abadeh, 2013; Cai et al., 2014; García and Sánchez, 2015; Chen et al., 2016; Tang et al., 2018; Cleofas-Sánchez et al., 2019).
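The swarm-based methods above share a common skeleton: a candidate gene subset is encoded as a binary vector, a classifier evaluates it, and the swarm update biases the search towards subsets with higher accuracy. The toy binary PSO below illustrates only that skeleton; it is deliberately simplified (fixed inertia and acceleration constants, small swarm, synthetic data) and is not the iPSO, BPSO, or recursive PSO variants cited above.

```python
# Toy binary PSO for gene selection: each particle is a 0/1 mask over genes and its
# fitness is the cross-validated accuracy of a KNN classifier on the selected genes,
# with a small penalty for large subsets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=60, n_features=200, n_informative=8, random_state=1)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                          X[:, mask == 1], y, cv=3).mean()
    return acc - 0.001 * mask.sum()           # penalise large gene subsets

n_particles, n_genes, n_iter = 10, X.shape[1], 20
pos = rng.integers(0, 2, size=(n_particles, n_genes))
vel = rng.normal(size=(n_particles, n_genes))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((n_particles, n_genes)), rng.random((n_particles, n_genes))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    prob = 1.0 / (1.0 + np.exp(-vel))          # sigmoid transfer for the binary update
    pos = (rng.random((n_particles, n_genes)) < prob).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("best subset size:", gbest.sum(),
      "best penalised fitness: %.3f" % pbest_fit.max())
```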

Machine Learning techniques are widely used in modern bioinformatics research. Machine Learning algorithms can be grouped under different criteria, such as logic-based algorithms (e.g., Decision Trees, Random Forests), perceptron-based algorithms (Neural Networks, Multi-Layer Perceptrons), and statistical learning (Naïve Bayes) ( Kotsiantis et al., 2007 ). The classification or prediction models used most commonly in the literature discussed in this paper are SVM, KNN, Random Forest, Decision Tree, Naïve Bayes, and Logistic Regression. SVM classifies a disease or disorder by finding a hyperplane that separates the binary classes with the help of support vectors, locating the hyperplane through a kernel function; an important advantage of SVM is its ability to tackle outliers ( Brown et al., 2000 ). KNN works on the assumption that similar instances within a dataset lie close to one another; although it is easy to understand and implement, it lacks a principled way of choosing the value of k and is sensitive to the distance or similarity function used. A Decision Tree is made up of nodes and branches and is used mainly because of its effectiveness and computational speed, but Decision Trees are highly prone to overfitting and underfitting the data ( Czajkowski and Kretowski, 2019 ). Random Forests are ensembles of Decision Trees. Naïve Bayes is a statistical classification model based on Bayes' theorem; it assumes that all features in the dataset are independent and contribute equally.
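The snippet below runs the classifiers discussed above side by side under stratified cross-validation on a synthetic, expression-like dataset. It is meant only as a template for such comparisons; the hyperparameters shown are illustrative defaults rather than tuned values.

```python
# Quick comparison of the classifiers most often paired with gene selection in the
# surveyed papers, using stratified cross-validation on a synthetic expression-like set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=500, n_informative=15, random_state=0)

models = {
    "SVM (linear)": SVC(kernel="linear"),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=5000),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print("%-20s %.3f +/- %.3f" % (name, scores.mean(), scores.std()))
```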

In general, for continuous and multi-dimensional features, neural networks and SVM show better performance, whereas for categorical or discrete features, logic-based algorithms such as rule learners and decision trees perform better. SVM and similar models need a large sample size to produce high accuracy, while Naïve Bayes works well on small datasets. Training time also varies across algorithms; for example, Naïve Bayes trains quickly because it makes a single pass over the data and needs little storage space during training and testing. In contrast, KNN-based models require large storage during training and even more during the testing phase.

In terms of interpretability, logic-based models are easy to interpret, whereas SVM and neural networks are difficult to interpret; the latter also have the largest numbers of parameters, which need optimization and tuning. No single algorithm outperforms the others in all cases. One way to determine which algorithm to use is to validate the candidate models, estimate their accuracy, and choose the one with the better accuracy. Recently, combinations of algorithms have been proposed to enhance individual performance. However, gene expression data suffers from the High Dimension and Low Sample Size (HDLSS) problem, for which classical machine learning models are less suited; hence, Deep Learning and Deep Belief Networks, as well as multi-omics datasets, are being researched in recent years.

Among the performance evaluation metrics, the most commonly used are Classification Accuracy, Leave-One-Out Cross-Validation (LOOCV), k-Fold Cross-Validation, and ROC analysis. Of these, most works report Classification Accuracy. However, other performance metrics, such as sensitivity, specificity, and similarity measures, also deserve attention.
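The sketch below shows one way to compute the metrics mentioned in this paragraph (k-fold and LOOCV accuracy, sensitivity, specificity, and ROC AUC) from out-of-fold predictions with scikit-learn; the classifier and the synthetic data are placeholders.

```python
# Sketch of the evaluation metrics mentioned above: k-fold CV accuracy, LOOCV accuracy,
# ROC AUC, and sensitivity / specificity derived from an out-of-fold confusion matrix.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=300, n_informative=10, random_state=0)
clf = SVC(kernel="linear", probability=True)

kfold_acc = cross_val_score(clf, X, y, cv=5).mean()
loocv_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# Out-of-fold predictions give an unbiased confusion matrix and ROC AUC.
y_pred = cross_val_predict(clf, X, y, cv=5)
y_prob = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

print("5-fold accuracy :", round(kfold_acc, 3))
print("LOOCV accuracy  :", round(loocv_acc, 3))
print("sensitivity     :", round(tp / (tp + fn), 3))
print("specificity     :", round(tn / (tn + fp), 3))
print("ROC AUC         :", round(roc_auc_score(y, y_prob), 3))
```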

Open Issues in Gene Expression Data

Gene expression is the biological process by which DNA instructions are transformed into functional products called proteins. The cells of a living organism do not need every protein all the time, so complex molecular mechanisms must turn genes on and off; when this regulation fails, diseases and disorders follow.

The Deoxyribonucleic Acid (DNA) microarray is a technology used widely in biomedical research to analyze gene expression in order to discover, classify, and predict diseases or disorders. DNA microarray data are also used to predict responses to drugs or therapies. There are different types of DNA microarray, such as cDNA (complementary DNA), SNP (Single Nucleotide Polymorphism), and CNV (Copy Number Variation) microarrays ( Arevalillo and Navarro, 2013 ). cDNA is DNA without introns, synthesized from a single-stranded RNA template; a SNP is a variation at a single position in a DNA sequence; and a CNV is a condition in which parts of the genome are repeated, with the number of repeats varying from one individual to another. Many advanced technologies are available to analyze gene expression; the most widely used are the two-color cDNA glass slide and the Affymetrix GeneChip.

Many challenges and limitations need to be addressed before the required knowledge can be extracted from gene expression data with high precision. The most significant difficulties are as follows ( Chan et al., 2016 ; Li and Wang, 2016 ; Li et al., 2018 ):

(a) Curse of Dimensionality: A major issue in machine learning research is the overfitting of the learning model; the work in García et al. (2017) discusses the curse of dimensionality in detail. Microarray data are generally high-dimensional, ranging from hundreds to many thousands of features, which makes them hard to manage, and handling such large volumes of data requires advanced storage systems ( Mramor et al., 2005 ; Abdulla and Khasawneh, 2020 ). A short code illustration of this problem follows this list of issues.

(b) The Gap between Researchers and Biologists: There is a large gap among researchers, biologists, and medical practitioners, which has left many areas of genomic studies unexplored. Because of this gap, the opportunities for finding the best techniques and approaches are limited.

(c) Redundant and Mislabelled Data: Data imbalance and mislabelled data are prevailing issues in microarray data, often caused by irregular scanning. Microarray datasets usually have a class imbalance problem, i.e., one class dominates the entire dataset. When a learning model is trained on mislabelled and imbalanced data, its generalization ability is severely affected. Likewise, redundant and irrelevant data are a main concern in determining the efficiency of the feature set ( Lakshmanan and Jenitha, 2020 ; Rouhi and Nezamabadi-Pour, 2020 ).

(d) Difficulty in Retrieving Biological Information: There are many clinical challenges in retrieving biological information. The main aim of genomic studies is to discover clinically or biologically significant changes in gene expression. One difficulty is that not everyone possesses the high-end equipment needed to capture such changes; moreover, in some biological processes the changes in expression are very subtle and difficult to identify with analytical methods. Because of the wide range of experimental designs, data access policies, studies, and batches of reagents used, the data may also be erroneous and biased.
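As noted under issue (a), the short sketch below makes the "large G, small n" problem tangible: a linear SVM fitted to pure noise with far more genes than samples reaches perfect training accuracy, while cross-validation reveals chance-level performance. The data are synthetic; no real microarray is involved.

```python
# Illustration of the "large G, small n" problem: with far more genes than samples a
# classifier can fit noise perfectly, which only cross-validation exposes.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))      # 40 samples, 5000 "genes" of pure noise
y = rng.integers(0, 2, size=40)      # labels unrelated to the data

clf = SVC(kernel="linear").fit(X, y)
print("training accuracy :", clf.score(X, y))                                    # typically 1.0
print("5-fold CV accuracy:", cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())  # ~0.5
```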

Some of the future directions in which research in this area can proceed are as follows:

(a) Enhanced Models for Better Diagnosis of Rare Genetic Disorders:

There are various genetic disorders, classified as Monogenic or Polygenic. Monogenic disorders are caused by modifications in a single gene, are inherited genetically, and are rare. Polygenic disorders, in contrast, occur commonly and are caused by modifications in several genes. Genetic illnesses of both types have become increasingly prevalent in recent years, and Machine Learning classification and prediction models can help diagnose these disorders with high accuracy.

(b) Cancer Prognosis and Prediction:

Cancer is a heterogeneous disease with various subtypes, and early diagnosis is critical for the further clinical care of patients. The importance of stratifying patients into high- and low-risk groups has led to a large body of research in bioinformatics and machine learning. The ability of machine learning models such as Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Bayesian Networks (BN) to build classification and prediction models for accurate decisions remains to be fully explored.

(c) Collaborative Platforms in Gene Expressions:

Individual Machine Learning models can yield good results when applied to gene expression data, but hybrid methods have proven successful in many instances. Along with hybrid methods, more research should be done on combining different gene expression datasets with clinical reports. This is difficult and exhausting, yet it would offer better results.

(d) Analyzing Drug Response in Gene Expression Data:

Predicting a drug response to a genetic disorder or disease is an important step. Many recent efforts in analyzing sensitivity and response to cancer or other diseases are commendable. Still, the main problem in developing a drug-response model is the high dimensionality and small sample size of the data. Feature selection techniques in Machine Learning help reduce the dimensionality and improve the accuracy of drug-response prediction.

The gene expression microarray is a high-dimensional dataset with a small sample size. It requires powerful techniques that preserve the informative genes while minimizing redundancy and dependency. This paper has discussed work done in recent years on gene expression microarray data: the papers were selected from the past six years, with a focus on supervised, unsupervised, and semi-supervised feature selection. Under these three learning paradigms, we chose papers concentrating on filter, wrapper, hybrid, embedded, and ensemble based gene selection, and we have listed the significant difficulties faced in handling such high-dimensional datasets. To overcome the dimensionality issues, gene selection must be performed carefully.

Although much work has been done on gene expression microarray data, many open opportunities still need attention. Research has mainly focused on supervised gene selection with filter-based evaluation methods; the potential of unsupervised and semi-supervised techniques is yet to be tapped. Semi-supervised techniques combine the benefits of supervised and unsupervised learning, so the chances of improved accuracy are high. Almost all works aim only at achieving higher accuracy, while attention to sensitivity, specificity, stability, and similarity is scarce. Equally important as the dimensionality issue is misclassified or mislabelled data, and there is a promising future in overcoming these two issues. Another important direction for improving gene selection is to develop more ensemble and hybrid evaluation methods: as discussed in the literature, works on hybrid and ensemble methods are considerably fewer than those on filter and wrapper approaches, yet they are capable of providing more accurate results and deserve further development. Finally, research should be done on joint analysis that combines clinical reports with gene expression data; it would help analyze various aspects from a different perspective and could serve as a major breakthrough, however laborious and exhausting it may be.

Author Contributions

PDRV and C-YC did the conceptualization and supervised the data. C-YC carried out the funding acquisition. NM, PDRV, KS, and C-YC investigated the data and performed the methodology. C-YC and KS carried out the project administration and validated the data. NM, PDRV, and KS wrote, reviewed, and edited the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This work was financially supported by the “Intelligent Recognition Industry Service Research Center” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Abdulla, M., and Khasawneh, M. T. (2020). G-Forest: an ensemble method for cost-sensitive feature selection in gene expression microarrays. Artif. Intell. Med. 108:101941. doi: 10.1016/j.artmed.2020.101941


Abinash, M. J., and Vasudevan, V. (2018). “A Study on Wrapper-Based Feature Selection Algorithm for Leukemia Dataset,” in Proceedings of the Intelligent Engineering Informatics , (New York, NY: Springer), 311–321. doi: 10.1007/978-981-10-7566-7_31


Acharya, S., Saha, S., and Nikhil, N. (2017). Unsupervised gene selection using biological knowledge: application in sample clustering. BMC Bioinform. 18:513. doi: 10.1186/s12859-017-1933-0

Algamal, Z. Y., and Lee, M. H. (2015). Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Exp. Syst. Appl. 42, 9326–9332. doi: 10.1016/j.eswa.2015.08.016

Almugren, N., and Alshamlan, H. (2019). A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access 7, 78533–78548. doi: 10.1109/access.2019.2922987

Alomari, O. A., Khader, A. T., Al-Betar, M. A., and Abualigah, L. M. (2017). Gene selection for cancer classification by combining minimum redundancy maximum relevancy and bat-inspired algorithm. Int. J. Data Min. Bioinform. 9, 32–51. doi: 10.1504/ijdmb.2017.10009480

Alshamlan, H. M., Badr, G. H., and Alohali, Y. A. (2015). Genetic Bee Colony (GBC) algorithm: A new gene selection method for microarray cancer classification. Comput. Biol. Chem. 56, 49–60. doi: 10.1016/j.compbiolchem.2015.03.001

Ang, J. C., Haron, H., and Hamed, H. N. A. (2015a). “Semi-supervised SVM-based feature selection for cancer classification using microarray gene expression data,” in Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems , (Cham: Springer), 468–477. doi: 10.1007/978-3-319-19066-2_45

Ang, J. C., Mirzal, A., Haron, H., and Hamed, H. N. A. (2015b). Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE Trans. Comp. Biol. Bioinform. 13, 971–989. doi: 10.1109/tcbb.2015.2478454

Anter, A. M., and Ali, M. (2020). Feature selection strategy based on hybrid crow search optimization algorithm integrated with chaos theory and fuzzy c-means algorithm for medical diagnosis problems. Soft Comp. 24, 1565–1584. doi: 10.1007/s00500-019-03988-3

Apolloni, J., Leguizamón, G., and Alba, E. (2016). Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments. Appl. Soft Comp. 38, 922–932. doi: 10.1016/j.asoc.2015.10.037

Arevalillo, J. M., and Navarro, H. (2013). Exploring correlations in gene expression microarray data for maximum predictive–minimum redundancy biomarker selection and classification. Comput. Biol. Med. 43, 1437–1443. doi: 10.1016/j.compbiomed.2013.07.005

Bergmeir, C., and Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Inform. Sci. 191, 192–213. doi: 10.1016/j.ins.2011.12.028

Bermingham, M. L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., et al. (2015). Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci. Rep. 5:10312.


Blanco, R., Larrañaga, P., Inza, I., and Sierra, B. (2004). Gene selection for cancer classification using wrapper approaches. Int. J. Patt. Recogn. Artif. Intell. 18, 1373–1390. doi: 10.1142/s0218001404003800

Boucheham, A., Batouche, M., and Meshoul, S. (2015). “An ensemble of cooperative parallel metaheuristics for gene selection in cancer classification,” in Proceedings of the International Conference on Bioinformatics and Biomedical Engineering , (Cham: Springer), 301–312. doi: 10.1007/978-3-319-16480-9_30

Braga-Neto, U., Hashimoto, R., Dougherty, E. R., Nguyen, D. V., and Carroll, R. J. (2004). Is cross-validation better than resubstitution for ranking genes? Bioinformatics 20, 253–258. doi: 10.1093/bioinformatics/btg399

Brahim, A. B., and Limam, M. (2018). Ensemble feature selection for high dimensional data: a new method and a comparative study. Adv. Data Anal. Class. 12, 937–952. doi: 10.1007/s11634-017-0285-y

Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., et al. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. 97, 262–267. doi: 10.1073/pnas.97.1.262

Brumpton, B. M., and Ferreira, M. A. (2016). Multivariate eQTL mapping uncovers functional variation on the X-chromosome associated with complex disease traits. Hum. Genet. 135, 827–839. doi: 10.1007/s00439-016-1674-6

Ca, D. A. V., and Mc, V. (2015). Gene expression data classification using support vector machine and mutual information-based gene selection. Proc. Comp. Sci. 47, 13–21. doi: 10.1016/j.procs.2015.03.178

Cai, H., Ruan, P., Ng, M., and Akutsu, T. (2014). Feature weight estimation for gene selection: a local hyperlinear learning approach. BMC Bioinformatics 15:70. doi: 10.1186/1471-2105-15-70

Cárdenas-Ovando, R. A., Fernández-Figueroa, E. A., Rueda-Zárate, H. A., Noguez, J., and Rangel-Escareño, C. (2019). A feature selection strategy for gene expression time series experiments with hidden Markov models. PLoS One 14:e0223183. doi: 10.1371/journal.pone.0223183

Chakraborty, D., and Maulik, U. (2014). Identifying cancer biomarkers from microarray data using feature selection and semisupervised learning. IEEE J. Transl. Eng. Health Med. 2, 1–11. doi: 10.1109/jtehm.2014.2375820

Chan, W. H., Mohamad, M. S., Deris, S., Zaki, N., Kasim, S., Omatu, S., et al. (2016). Identification of informative genes and pathways using an improved penalized support vector machine with a weighting scheme. Comput. Biol. Med. 77, 102–115. doi: 10.1016/j.compbiomed.2016.08.004

Chandrashekar, G., and Sahin, F. (2014). A survey on feature selection methods. Comp. Electr. Eng. 40, 16–28. doi: 10.1016/j.compeleceng.2013.11.024

Chen, H., Zhang, Y., and Gutman, I. (2016). A kernel-based clustering method for gene selection with gene expression data. J. Biomed. Inform. 62, 12–20. doi: 10.1016/j.jbi.2016.05.007

Chen, K. H., Wang, K. J., Tsai, M. L., Wang, K. M., Adrian, A. M., Cheng, W. C., et al. (2014). Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinform. 15:49. doi: 10.1186/1471-2105-15-8

Chen, Y., and Yao, S. (2017). Sequential search with refinement: model and application with click-stream data. Manag. Sci. 63, 4345–4365. doi: 10.1287/mnsc.2016.2557

Chen, Y., Zhang, Z., Zheng, J., Ma, Y., and Xue, Y. (2017). Gene selection for tumor classification using neighborhood rough sets and entropy measures. J. Biomed. Inform. 67, 59–68. doi: 10.1016/j.jbi.2017.02.007

Chinnaswamy, A., and Srinivasan, R. (2016). “Hybrid feature selection using correlation coefficient and particle swarm optimization on microarray gene expression data,” in Proceedings of the Innovations in bio-inspired computing and applications , (Cham: Springer), 229–239. doi: 10.1007/978-3-319-28031-8_20

Cleofas-Sánchez, L., Sánchez, J. S., and García, V. (2019). Gene selection and disease prediction from gene expression data using a two-stage hetero-associative memory. Prog. Artif. Intell. 8, 63–71. doi: 10.1007/s13748-018-0148-6

Czajkowski, M., and Kretowski, M. (2019). Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach. Exp. Syst. Appl. 137, 392–404. doi: 10.1016/j.eswa.2019.07.019

Dashtban, M., and Balafar, M. (2017). Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics 109, 91–107. doi: 10.1016/j.ygeno.2017.01.004

Dashtban, M., Balafar, M., and Suravajhala, P. (2018). Gene selection for tumor classification using a novel bio-inspired multi-objective approach. Genomics 110, 10–17. doi: 10.1016/j.ygeno.2017.07.010

Deng, H., and Runger, G. (2013). Gene selection with guided regularized random forest. Pattern Recogn. 46, 3483–3489. doi: 10.1016/j.patcog.2013.05.018

Devi Arockia Vanitha, C., Devaraj, D., and Venkatesulu, M. (2016). Multiclass cancer diagnosis in microarray gene expression profile using mutual information and support vector machine. Intell. Data Anal. 20, 1425–1439. doi: 10.3233/IDA-150203

Djellali, H., Guessoum, S., Ghoualmi-Zine, N., and Layachi, S. (2017). “Fast correlation based filter combined with genetic algorithm and particle swarm on feature selection,” in Proceedings of the 2017 5th International Conference on Electrical Engineering-Boumerdes (ICEE-B) , (Piscataway, NJ: IEEE), 1–6.

Elavarasan, D., Vincent, D. R., Sharma, V., Zomaya, A. Y., and Srinivasan, K. (2018). Forecasting yield by integrating agrarian factors and machine learning models: a survey. Comp. Electr. Agricult. 155, 257–282. doi: 10.1016/j.compag.2018.10.024

Elghazel, H., and Aussem, A. (2015). Unsupervised feature selection with ensemble learning. Machine Learn. 98, 157–180. doi: 10.1007/s10994-013-5337-8

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874. doi: 10.1016/j.patrec.2005.10.010

Filippone, M., Masulli, F., and Rovetta, S. (2005). “Unsupervised gene selection and clustering using simulated annealing,” in International Workshop on Fuzzy Logic and Applications , (Berlin: Springer), 229–235. doi: 10.1007/11676935_28

Filippone, M., Masulli, F., and Rovetta, S. (2006). “Supervised classification and gene selection using simulated annealing,” in Proceedings of the 2006 IEEE International Joint Conference on Neural Network Proceedings , (Piscataway, NJ: IEEE), 3566–3571.

Flach, P. A. (2016). ROC analysis. In Encyclopedia of Machine Learning and Data Mining. New York, NY: Springer, 1–8.

Gangeh, M. J., Zarkoob, H., and Ghodsi, A. (2017). Fast and scalable feature selection for gene expression data using hilbert-schmidt independence criterion. IEEE Trans. Comp. Biol. Bioinform. 14, 167–181. doi: 10.1109/tcbb.2016.2631164

Gao, L., Ye, M., Lu, X., and Huang, D. (2017). Hybrid method based on information gain and support vector machine for gene selection in cancer classification. Genom. Prot. Bioinform. 15, 389–395. doi: 10.1016/j.gpb.2017.08.002

García, V., and Sánchez, J. S. (2015). Mapping microarray gene expression data into dissimilarity spaces for tumor classification. Inform. Sci. 294, 362–375. doi: 10.1016/j.ins.2014.09.064

García, V., Sánchez, J. S., Cleofas-Sánchez, L., Ochoa-Domínguez, H. J., and López-Orozco, F. (2017). “An insight on the ‘large G, small n’ problem in gene-expression microarray classification,” in Proceedings of the 8th Iberian Conference on Pattern Recognition and Image Analysis , Faro, 483–490. doi: 10.1007/978-3-319-58838-4_53

Ghosh, M., Adhikary, S., Ghosh, K. K., Sardar, A., Begum, S., and Sarkar, R. (2019a). Genetic algorithm based cancerous gene identification from microarray data using ensemble of filter methods. Med. Biol. Eng. Comp. 57, 159–176. doi: 10.1007/s11517-018-1874-4

Ghosh, M., Begum, S., Sarkar, R., Chakraborty, D., and Maulik, U. (2019b). Recursive memetic algorithm for gene selection in microarray data. Exp. Syst. Appl. 116, 172–185. doi: 10.1016/j.eswa.2018.06.057

Guo, S., Guo, D., Chen, L., and Jiang, Q. (2017). A L1-regularized feature selection method for local dimension reduction on microarray data. Comput. Biol. Chem. 67, 92–101. doi: 10.1016/j.compbiolchem.2016.12.010

Halperin, E., Kimmel, G., and Shamir, R. (2005). Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics 21(Suppl. 1), i195–i203.

Han, F., Yang, C., Wu, Y. Q., Zhu, J. S., Ling, Q. H., Song, Y. Q., et al. (2015). A gene selection method for microarray data based on binary PSO encoding gene-to-class sensitivity information. IEEE Trans. Comp. Biol. Bioinform. 14, 85–96. doi: 10.1109/tcbb.2015.2465906

Hancer, E., Xue, B., and Zhang, M. (2018). Differential evolution for filter feature selection based on information theory and feature ranking. Knowl. Based Syst. 140, 103–119. doi: 10.1016/j.knosys.2017.10.028

Handelman, G. S., Kok, H. K., Chandra, R. V., Razavi, A. H., Huang, S., Brooks, M., et al. (2019). Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. AJR 212, 38–43. doi: 10.2214/ajr.18.20224

Hasri, N. N. M., Wen, N. H., Howe, C. W., Mohamad, M. S., Deris, S., and Kasim, S. (2017). Improved support vector machine using multiple SVM-RFE for cancer classification. Int. J. Adv. Sci. Eng. Inform. Technol. 7, 1589–1594. doi: 10.18517/ijaseit.7.4-2.3394

Hernandez, J. C. H., Duval, B., and Hao, J. K. (2007). “A genetic embedded approach for gene selection and classification of microarray data,” in Proceedings of the European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics , (Berlin: Springer), 90–101. doi: 10.1007/978-3-540-71783-6_9

Hira, Z. M., and Gillies, D. F. (2015). A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinform. 2015:198363.

Hoque, N., Bhattacharyya, D. K., and Kalita, J. K. (2014). MIFS-ND: A mutual information-based feature selection method. Exp.Syst. Appl. 41, 6371–6385. doi: 10.1016/j.eswa.2014.04.019

Hu, Q., Pan, W., An, S., Ma, P., and Wei, J. (2010). An efficient gene selection technique for cancer recognition based on neighborhood mutual information. Int. J. Machine Learn. Cybernet. 1, 63–74. doi: 10.1007/s13042-010-0008-6

Huerta, E. B., Hernández, J. C. H., Caporal, R. M., Cruz, J. F. R., and Montiel, L. A. H. (2010). An efficient embedded gene selection method for microarray gene expression data. Res. Comp. Sci. 50, 289–299.

Inza, I., Larrañaga, P., Blanco, R., and Cerrolaza, A. J. (2004). Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med. 31, 91–103. doi: 10.1016/j.artmed.2004.01.007

Jadhav, S., He, H., and Jenkins, K. (2018). Information gain directed genetic algorithm wrapper feature selection for credit rating. Appl. Soft Comp. 69, 541–553. doi: 10.1016/j.asoc.2018.04.033

Jain, I., Jain, V. K., and Jain, R. (2018). Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Appl. Soft Comp. 62, 203–215. doi: 10.1016/j.asoc.2017.09.038

Jiang, B., Wu, X., Yu, K., and Chen, H. (2019). Joint semi-supervised feature selection and classification through Bayesian approach. Proc. AAAI Conf. Artif. Intell. 33, 3983–3990. doi: 10.1609/aaai.v33i01.33013983

Jović, A., Brkić, K., and Bogunović, N. (2015). “A review of feature selection methods with applications,” in Proceedings of the 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) , (Piscataway, NJ: IEEE), 1200–1205.

Kar, S., Sharma, K. D., and Maitra, M. (2015). Gene selection from microarray gene expression data for classification of cancer subgroups employing PSO and adaptive K-nearest neighborhood technique. Exp. Syst. Appl. 42, 612–627. doi: 10.1016/j.eswa.2014.08.014

Khalid, S., Khalil, T., and Nasreen, S. (2014). “A survey of feature selection and feature extraction techniques in machine learning,” in Proceedings of the Science and Information Conference , (Piscataway, NJ: IEEE), 372–378.

Kira, K., and Rendell, L. A. (1992). “A practical approach to feature selection,” in Proceedings of the Machine Learning , (Burlington, MA: Morgan Kaufmann), 249–256. doi: 10.1016/b978-1-55860-247-2.50037-1

Kotsiantis, S. B., Zaharakis, I., and Pintelas, P. (2007). Supervised machine learning: A review of classification techniques. Emerg. Artif. Intell. Appl. Comp. Eng. 160, 3–24.

Koul, N., and Manvi, S. S. (2020). “Machine-Learning Algorithms for Feature Selection from Gene Expression Data,” in Statistical Modelling and Machine Learning Principles for Bioinformatics Techniques, Tools, and Applications , eds K. G. Srinivasa, G. M. Siddesh, and S. R. Manisekhar (Singapore: Springer), 151–161. doi: 10.1007/978-981-15-2445-5_10

Kumar, C. A., Sooraj, M. P., and Ramakrishnan, S. (2017). A comparative performance evaluation of supervised feature selection algorithms on microarray datasets. Proc. Comp. Sci. 115, 209–217. doi: 10.1016/j.procs.2017.09.127

Kursa, M. B. (2014). Robustness of random forest-based gene selection methods. BMC Bioinform. 15:8.

Lai, C. M., Yeh, W. C., and Chang, C. Y. (2016). Gene selection using information gain and improved simplified swarm optimization. Neurocomputing 218, 331–338. doi: 10.1016/j.neucom.2016.08.089

Lakshmanan, B., and Jenitha, T. (2020). Optimized feature selection and classification in microarray gene expression cancer data. Ind. J. Public Health Res. Dev. 11, 347–352. doi: 10.37506/v11/i1/2020/ijphrd/193842

Landgrebe, T. C., and Duin, R. P. (2008). Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis. IEEE Trans. Patt. Anal. Machine Intell. 30, 810–822. doi: 10.1109/tpami.2007.70740

Lazar, C., Taminau, J., Meganck, S., Steenhoff, D., Coletta, A., Molter, C., et al. (2012). A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE Trans. Comp. Biol. Bioinform. 9, 1106–1119. doi: 10.1109/tcbb.2012.33

Li, J., Tang, J., and Liu, H. (2017). “Reconstruction-based Unsupervised Feature Selection: An Embedded Approach,” in Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, Macao, 2159–2165.

Li, J., and Wang, F. (2016). Towards unsupervised gene selection: a matrix factorization framework. IEEE Trans. Comp. Biol. Bioinform. 14, 514–521. doi: 10.1109/tcbb.2016.2591545

Li, Z., Liao, B., Cai, L., Chen, M., and Liu, W. (2018). Semi-supervised maximum discriminative local margin for gene selection. Sci. Rep. 8, 1–11. doi: 10.15373/22778179/oct2013/41

Liaghat, S., and Mansoori, E. G. (2016). Unsupervised selection of informative genes in microarray gene expression data. Int. J. Appl. Pattern Recogn. 3, 351–367. doi: 10.1504/ijapr.2016.082237

Liang, Y., Chai, H., Liu, X. Y., Xu, Z. B., Zhang, H., and Leung, K. S. (2016). Cancer survival analysis using semi-supervised learning method based on cox and aft models with l 1/2 regularization. BMC Med. Genom. 9:11. doi: 10.1201/b16589

Liao, B., Jiang, Y., Liang, W., Zhu, W., Cai, L., and Cao, Z. (2014). Gene selection using locality sensitive Laplacian score. IEEE Trans. Comp. Biol. Bioinform. 11, 1146–1156. doi: 10.1109/tcbb.2014.2328334

Liu, H., Zhou, M., and Liu, Q. (2019). An embedded feature selection method for imbalanced data classification. IEEE J. Autom. Sin. 6, 703–715. doi: 10.1109/jas.2019.1911447

Liu, J., Cheng, Y., Wang, X., Zhang, L., and Wang, Z. J. (2018). Cancer characteristic gene selection via sample learning based on deep sparse filtering. Sci. Rep. 8, 1–13.

Maldonado, S., and López, J. (2018). Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for SVM classification. Appl. Soft Comp. 67, 94–105. doi: 10.1016/j.asoc.2018.02.051

Manbari, Z., AkhlaghianTab, F., and Salavati, C. (2019). Hybrid fast unsupervised feature selection for high-dimensional data. Exp. Syst. Appl. 124, 97–118. doi: 10.1016/j.eswa.2019.01.016

Mazumder, D. H., and Veilumuthu, R. (2019). An enhanced feature selection filter for classification of microarray cancer data. ETRI J. 41, 358–370. doi: 10.4218/etrij.2018-0522

Mishra, S., and Mishra, D. (2015). SVM-BT-RFE: An improved gene selection framework using Bayesian T-test embedded in support vector machine (recursive feature elimination) algorithm. Karbala Int. J. Modern Sci. 1, 86–96. doi: 10.1016/j.kijoms.2015.10.002

Mohamed, E., El Houby, E. M., Wassif, K. T., and Salah, A. I. (2016). Survey on different methods for classifying gene expression using microarray approach. Int. J. Comp. Appl. 975:8887.

Mohapatra, P., Chakravarty, S., and Dash, P. K. (2016). Microarray medical data classification using kernel ridge regression and modified cat swarm optimization based gene selection system. Swarm Evol. Comp. 28, 144–160. doi: 10.1016/j.swevo.2016.02.002

Motieghader, H., Najafi, A., Sadeghi, B., and Masoudi-Nejad, A. (2017). A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata. Inform. Med. Unlocked 9, 246–254. doi: 10.1016/j.imu.2017.10.004

Mramor, M., Leban, G., Demšar, J., and Zupan, B. (2005). “Conquering the curse of dimensionality in gene expression cancer diagnosis: tough problem, simple models,” in Proceedings of the Conference on Artificial Intelligence in Medicine in Europe (Berlin: Springer), 514–523. doi: 10.1007/11527770_68

Nguyen, T., Khosravi, A., Creighton, D., and Nahavandi, S. (2015). Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification. PLoS One 10:e0120364. doi: 10.1371/journal.pone.0120364

Pearson, W., Tran, C. T., Zhang, M., and Xue, B. (2019). “Multi-Round Random Subspace Feature Selection for Incomplete Gene Expression Data,” in Proceedings of the 2019 IEEE Congress on Evolutionary Computation (CEC) , (Piscataway, NJ: IEEE), 2544–2551.

Prasad, Y., Biswas, K. K., and Hanmandlu, M. (2018). A recursive PSO scheme for gene selection in microarray data. Appl. Soft Comp. 71, 213–225. doi: 10.1016/j.asoc.2018.06.019

Rajeswari, R., and Gunasekaran, G. (2015). Semi -Supervised Tumor Data Clustering via Spectral Biased Normalized Cuts. (Salem: IJERT).

Raut, S. A., Sathe, S. R., and Raut, A. (2010). “Bioinformatics: Trends in gene expression analysis,” in Proceedings of the 2010 International Conference on Bioinformatics and Biomedical Technology , (Chengdu: IEEE), 97–100.

Rodrigues, D., Pereira, L. A., Nakamura, R. Y., Costa, K. A., Yang, X. S., Souza, A. N., et al. (2014). A wrapper approach for feature selection based on bat algorithm and optimum-path forest. Exp. Syst. Appl. 41, 2250–2258. doi: 10.1016/j.eswa.2013.09.023

Rouhi, A., and Nezamabadi-pour, H. (2018). “Filter-based feature selection for microarray data using improved binary gravitational search algorithm,” in Proceedings of the 2018 3rd Conference on Swarm Intelligence and Evolutionary Computation (CSIEC) , (Piscataway, NJ: IEEE), 1–6.

Rouhi, A., and Nezamabadi-Pour, H. (2020). Feature Selection in High-Dimensional Data. In Optimization, Learning, and Control for Interdependent Complex Networks. Cham: Springer, 85–128.

Ruiz, R., Riquelme, J. C., and Aguilar-Ruiz, J. S. (2005). Heuristic search over a ranking for feature selection. In International Work-Conference on Artificial Neural Networks. Berlin: Springer, 742–749.

Russell, S. J., and Norvig, P. (2016). Artificial Intelligence: A Modern Approach. London: Pearson Education Limited.

Saeys, Y., Inza, I., and Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517. doi: 10.1093/bioinformatics/btm344

Schaffer, C. (1993). Selecting a classification method by cross-validation. Machine Learn. 13, 135–143. doi: 10.1007/bf00993106

Seijo-Pardo, B., Bolón-Canedo, V., and Alonso-Betanzos, A. (2016). “Using a feature selection ensemble on DNA microarray datasets,” in Proceedings of the ESANN 2016 European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges.

Shanab, A. A., Khoshgoftaar, T. M., and Wald, R. (2014). “Evaluation of wrapper-based feature selection using hard, moderate, and easy bioinformatics data,” in Proceedings of the 2014 IEEE International Conference on Bioinformatics and Bioengineering , (Piscataway, NJ: IEEE), 149–155.

Sharbaf, F. V., Mosafer, S., and Moattar, M. H. (2016). A hybrid gene selection approach for microarray data classification using cellular learning automata and ant colony optimization. Genomics 107, 231–238. doi: 10.1016/j.ygeno.2016.05.001

Sharma, A., and Rani, R. (2019). C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods. Comput. Methods Prog. 178, 219–235. doi: 10.1016/j.cmpb.2019.06.029

Sheikhpour, R., Sarram, M. A., Gharaghani, S., and Chahooki, M. A. Z. (2017). A survey on semi-supervised feature selection methods. Pattern Recogn. 64, 141–158. doi: 10.1016/j.patcog.2016.11.003

Shreem, S. S., Abdullah, S., and Nazri, M. Z. A. (2014). Hybridising harmony search with a Markov blanket for gene selection problems. Inform. Sci. 258, 108–121. doi: 10.1016/j.ins.2013.10.012

Shukla, A. K., Singh, P., and Vardhan, M. (2018). A hybrid gene selection method for microarray recognition. Biocybernet. Biomed. Eng. 38, 975–991. doi: 10.1016/j.bbe.2018.08.004

Shukla, A. K., and Tripathi, D. (2019). Identification of potential biomarkers on microarray data using distributed gene selection approach. Math. Biosci. 315:108230. doi: 10.1016/j.mbs.2019.108230

Solorio-Fernández, S., Carrasco-Ochoa, J. A., and Martínez-Trinidad, J. F. (2016). A new hybrid filter–wrapper feature selection method for clustering based on ranking. Neurocomputing 214, 866–880. doi: 10.1016/j.neucom.2016.07.026

Solorio-Fernández, S., Martínez-Trinidad, J. F., and Carrasco-Ochoa, J. A. (2017). A new unsupervised spectral feature selection method for mixed data: a filter approach. Pattern Recogn. 72, 314–326. doi: 10.1016/j.patcog.2017.07.020

Sun, L., Kong, X., Xu, J., Zhai, R., and Zhang, S. (2019a). A hybrid gene selection method based on ReliefF and ant colony optimization algorithm for tumor classification. Sci. Rep. 9:8978.

Sun, L., and Xu, J. (2014). Feature selection using mutual information based uncertainty measures for tumor classification. Biomed. Mater. Eng. 24, 763–770. doi: 10.3233/bme-130865

Sun, L., Zhang, X., Qian, Y., Xu, J., and Zhang, S. (2019b). Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Inform. Sci. 502, 18–41. doi: 10.1016/j.ins.2019.05.072

Sun, L., Zhang, X. Y., Qian, Y. H., Xu, J. C., Zhang, S. G., and Tian, Y. (2019c). Joint neighborhood entropy-based gene selection method with fisher score for tumor classification. Appl. Intell. 49, 1245–1259. doi: 10.1007/s10489-018-1320-1

Tabakhi, S., Najafi, A., Ranjbar, R., and Moradi, P. (2015). Gene selection for microarray data classification using a novel ant colony optimization. Neurocomputing 168, 1024–1036. doi: 10.1016/j.neucom.2015.05.022

Tang, C., Cao, L., Zheng, X., and Wang, M. (2018). Gene selection for microarray data classification via subspace learning and manifold regularization. Med. Biol. Eng. Comp. 56, 1271–1284. doi: 10.1007/s11517-017-1751-6

Vanjimalar, S., Ramyachitra, D., and Manikandan, P. (2018). “A Review on Feature Selection Techniques for Gene Expression Data,” in Proceedings of the 2018 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), (Piscataway, NJ: IEEE), 1–4.

Vergara, J. R., and Estévez, P. A. (2014). A review of feature selection methods based on mutual information. Neural Comp. Appl. 24, 175–186. doi: 10.1007/s00521-013-1368-0

Wahid, A., Khan, D. M., Iqbal, N., Khan, S. A., Ali, A., Khan, M., et al. (2020). Feature selection and classification for gene expression data using novel correlation based overlapping score method via Chou’s 5-steps rule. Chemometr. Intell. Lab. Syst. 199:103958. doi: 10.1016/j.chemolab.2020.103958

Wang, A., An, N., Yang, J., Chen, G., Li, L., and Alterovitz, G. (2017). Wrapper-based gene selection with Markov blanket. Comput. Biol. Med. 81, 11–23. doi: 10.1016/j.compbiomed.2016.12.002

Wang, H., Jing, X., and Niu, B. (2017). A discrete bacterial algorithm for feature selection in classification of microarray gene expression cancer data. Knowl. Based Syst. 126, 8–19. doi: 10.1016/j.knosys.2017.04.004

Wang, H., and van der Laan, M. J. (2011). Dimension reduction with gene expression data using targeted variable importance measurement. BMC Bioinformatics 12:312.

Wang, L., Wang, Y., and Chang, Q. (2016). Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 111, 21–31. doi: 10.1016/j.ymeth.2016.08.014

Wang, Y., Tetko, I. V., Hall, M. A., Frank, E., Facius, A., Mayer, K. F., et al. (2005). Gene selection from microarray data for cancer classification—a machine learning approach. Comput. Biol. Chem. 29, 37–46. doi: 10.1016/j.compbiolchem.2004.11.001

Xiao, Y., Hsiao, T. H., Suresh, U., Chen, H. I. H., Wu, X., Wolf, S. E., et al. (2014). A novel significance score for gene selection and ranking. Bioinformatics 30, 801–807. doi: 10.1093/bioinformatics/btr671

Xu, G., Zhang, M., Zhu, H., and Xu, J. (2017). A 15-gene signature for prediction of colon cancer recurrence and prognosis based on SVM. Gene 604, 33–40. doi: 10.1016/j.gene.2016.12.016

Xu, J., Sun, L., Gao, Y., and Xu, T. (2014). An ensemble feature selection technique for cancer recognition. Biomed. Mater. Eng. 24, 1001–1008. doi: 10.3233/bme-130897

Yang, J., Zhou, J., Zhu, Z., Ma, X., and Ji, Z. (2016). Iterative ensemble feature selection for multiclass classification of imbalanced microarray data. J. Biol. Res. Thessaloniki 23:13.

Yang, Y., Yin, P., Luo, Z., Gu, W., Chen, R., and Wu, Q. (2019). Informative Feature Clustering and Selection for Gene Expression Data. IEEE Access 7, 169174–169184. doi: 10.1109/access.2019.2952548

Ye, X., and Sakurai, T. (2017). “Unsupervised Feature Learning for Gene Selection in Microarray Data Analysis,” in Proceedings of the 1st International Conference on Medical and Health Informatics 2017 , ed. Y.-C. Ho (New York, NY: Association for Computing Machinery), 101–106.

Yu, L., and Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 5, 1205–1224.

Zare, M., Eftekhari, M., and Aghamollaei, G. (2019). Supervised feature selection via matrix factorization based on singular value decomposition. Chemometr. Intell. Lab. Syst. 185, 105–113. doi: 10.1016/j.chemolab.2019.01.003

Zhang, Y., Deng, Q., Liang, W., and Zou, X. (2018). An efficient feature selection strategy based on multiple support vector machine technology with gene expression data. Biomed. Res. Int. 2018:7538204.

Zhou, Y., Wang, P., Wang, X., Zhu, J., and Song, P. X. K. (2017). Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis. Genet. Epidemiol. 41, 70–80. doi: 10.1002/gepi.22018

Zibakhsh, A., and Abadeh, M. S. (2013). Gene selection for cancer tumor detection using a novel memetic algorithm with a multi-view fitness function. Eng. Appl. Artif. Intell. 26, 1274–1281. doi: 10.1016/j.engappai.2012.12.009

Keywords : gene selection, machine learning, microarray gene expression, supervised gene selection, unsupervised gene selection

Citation: Mahendran N, Durai Raj Vincent PM, Srinivasan K and Chang C-Y (2020) Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front. Genet. 11:603808. doi: 10.3389/fgene.2020.603808

Received: 08 September 2020; Accepted: 29 October 2020; Published: 10 December 2020.


Copyright © 2020 Mahendran, Durai Raj Vincent, Srinivasan and Chang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: P. M. Durai Raj Vincent, [email protected] ; Chuan-Yu Chang, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
