Quantitative Methods for Plant Breeding
Walter Suza, Iowa State University
Kendall Lamkey, Iowa State University
Copyright Year: 2023
Publisher: Iowa State University Digital Press
Language: English
Table of Contents
- About the PBEA Series
- Chapter 1: Basic Principles
- Chapter 2: Distributions and Probability
- Chapter 3: Central Limit Theorem, Confidence Intervals, and Hypothesis Tests
- Chapter 4: Categorical Data - Binary
- Chapter 5: Categorical Data - Multivariate
- Chapter 6: Continuous Data
- Chapter 7: Linear Correlation, Regression and Prediction
- Chapter 8: The Analysis of Variance (ANOVA)
- Chapter 9: Two Factor ANOVAs
- Chapter 10: Mean Comparisons
- Chapter 11: Randomized Complete Block Design
- Chapter 12: Data Transformation
- Chapter 13: Multiple Regression
- Chapter 14: Nonlinear Regression
- Chapter 15: Multivariate Analysis
- Algebra Review Guide
- Contributors
- Applied Learning Activities
About the Book
This open textbook covers common statistics used in agriculture research, including experimental design in plant breeding and genetics, as well as the analysis of variance, regression, and correlation.
About the Contributors
Suza is an Adjunct Associate Professor at Iowa State University. He teaches courses on Genetics and Crop Physiology in the Department of Agronomy. In addition to co-developing courses for the ISU Distance MS in Plant Breeding Program, Suza served as the director of the Plant Breeding e-Learning in Africa Program (PBEA) for 8 years. With PBEA, Suza helped provide access to open educational resources on topics related to the genetic improvement of crops. His research is on the metabolism and physiology of plant sterols. Suza holds a Ph.D. in the plant sciences area (with emphasis in molecular physiology) from the University of Nebraska-Lincoln.
Lamkey is the Associate Dean for Facilities and Operations for the College of Agriculture and Life Sciences at Iowa State University. He works in collaboration with the dean, associate deans, department chairs, college-level centers, and other unit leaders to ensure that operations directly advance the mission of the college and that resources are deployed wisely and efficiently. Previously, he served as the chair for the Department of Agronomy at Iowa State University, where, in addition to advocating for research and the PBEA program, he oversaw the Agronomy Department’s educational direction, its faculty, and Agronomy Extension and Outreach. Dr. Lamkey is a corn breeder and quantitative geneticist and conducts research on the quantitative genetics of selection response, inbreeding depression, and heterosis. He holds a Ph.D. in plant breeding from Iowa State University and a master’s in plant breeding from the University of Illinois. Lamkey is a fellow of the American Society of Agronomy and the Crop Science Society of America and has served as an associate editor, technical editor, and editor for Crop Science.
- Published: 20 May 2022
The grand challenge of breeding by design
Nature Plants volume 8, pages 451–452 (2022)
- Agriculture
- Plant breeding
With plentiful knowledge of gene function and the development of technologies like gene editing, breeders are fully equipped to address grand challenges and eliminate various forms of hunger.
Human society has never stopped worrying about food. The recent outbreak of war in Ukraine is predicted to exacerbate famine in the global South, which receives much of the grain exported from that region, and factors like pandemics or climate change affect food supply everywhere. To feed the world, breeding is an ever-important topic. In this issue, ter Steeg et al. discuss breeding and the factors that affect the decision-making of a commercial breeder 1.
Globally, breeding is a collaborative effort among plant scientists and breeders in both public and private sectors. While there are clear biological and economic restrictions on breeding in both sectors, new opportunities and challenges also emerge that lead predictably to a new breeding landscape. Crop breeding is based on genetic variations, and while this can be created artificially, naturally occurring variations continue to be key resources.
Global institutes, like those under the Consortium of International Agricultural Research Centres (CGIAR), are constantly working to improve their germplasm collections. In this issue, Ramirez-Villegas et al. analyse the landraces of 25 major crops in ex situ repositories and identify the gaps remaining to be filled 2 . In a previous work, they reported the conservation status of the wild relatives of 81 crops 3 . Efforts like this expand the availability of natural variations and directly benefit breeders.
Artificial mutations induced by physical or chemical mutagens tend to be random and hard to use in breeding. Now, with targeted gene-editing approaches, a large number of agronomically desirable variations can be generated quickly. Many researchers have used CRISPR–Cas to target known agronomic genes, modifying coding sequences or gene expression to obtain useful mutations. For example, tomato lines have been edited for the desired architecture 4 and wheat and potato varieties have been edited for better resistance to pathogens 5 , 6 . Base editing has generated new alleles for herbicide resistance in rice and wheat 7 , 8 . Gene editing can tweak not only nuclear genomes but also those of cellular organelles 9 , 10 , 11 . These new methods will markedly enrich the allelic spectrum that can be employed in future breeding.
Breeders are gradually abandoning the conventional breeding model — which is random and reliant on breeders’ experience — and adopting the concept of ‘prior design’. Large-scale genome sequencing and genome-wide association studies, as well as extensive functional genomic studies, have made designed breeding a realistic possibility. By stacking genes known to control grain quality and yield traits, scientists have developed high-yield and superior-quality rice varieties 12 . On the basis of the known function of the MLO gene in the pathogenesis of powdery mildew, mlo wheat with resistance was swiftly developed by gene editing 5 . Knowledge of genes controlling self-incompatibility in potato allowed the generation of diploid self-compatible potato lines, making potato breeding more efficient and propagation through seeds instead of tubers possible 1 , 13 .
In a paper published by Nature Plants last year, researchers showed in wheat breeding trials that climate change in recent decades significantly increased cross-interactions of the varieties tested across different environments, making breeders’ jobs far more difficult 14 . However, they also discovered that germplasm developed under heat stress was better adapted and more stable, indicating that targeting breeding to specific stress environments should help breed climate resilient varieties. Research on plant stress tolerance has identified many genes that confer resistance to environmental stresses, and thus many targets for editing to create climate-resilient crops. At the same time abundant stress-resistance alleles already exist in the wild gene pool 15 , highlighting the importance of landraces and crop wild relatives.
Rising labour costs are another challenge for sustainable agriculture. Although the development of mechanized technologies greatly reduces labour input, there are still many crops or crop varieties not suitable for mechanized farming. Breeding varieties more suitable for mechanized planting and harvesting is one solution; however, scientists in the perennial crop community have been working on another strategy: transforming annual crops to perennial crops. Annual crops require sowing, planting and ploughing every season, while perennial agriculture is closer to natural ecological systems, minimizing the inputs of labour and fertilizer. The Land Institute in the United States has developed a perennial intermediate wheatgrass (Thinopyrum intermedium), trademark name Kernza, that has made its way to niche markets 16. A group of scientists at Yunnan University, China, has developed and released perennial rice varieties that showed continuous high yield across multiple years without the need to replant each season 17.
Grafting represents another route to perennialization. Recently, scientists at Huazhong Agricultural University grafted an aubergine scion to a woody Solanum rootstock, making ‘eggplant trees’ that bear aubergine fruits for 3 years and achieve enhanced yield per season 18.
The advantages of orphan crops have recently gained renewed recognition. They tend to be rich in nutrients, including micronutrients, and are able to grow in suboptimal conditions 19. Without abundant genetic resources for conventional breeding, gene-editing-based approaches are particularly useful for orphan crops, especially when they are related to major crops that have already been extensively studied. More research projects have been initiated for breeding in orphan crops 20, 21.
There are many challenges to feeding the world; however, progress in technologies and public knowledge gives cause for optimism. There are a few factors that limit the participation of commercial breeders 1 , but with more genetic resources becoming available every month, the world is becoming a breeder’s oyster.
References
1. ter Steeg, E. M. S., Struik, P. C., Visser, R. G. F. & Lindhout, P. Nat. Plants https://doi.org/10.1038/s41477-022-01142-w (2022).
2. Ramirez-Villegas, J. et al. Nat. Plants https://doi.org/10.1038/s41477-022-01144-8 (2022).
3. Castañeda-Álvarez, N. P. et al. Nat. Plants 2, 16022 (2016).
4. Rodriguez-Leal, D., Lemmon, Z. H., Man, J., Bartlett, M. E. & Lippman, Z. B. Cell 171, 470–480 e478 (2017).
5. Li, S. et al. Nature 602, 455–460 (2022).
6. Kieu, N. P., Lenman, M., Wang, E. S., Petersen, B. L. & Andreasson, E. Sci. Rep. 11, 4487 (2021).
7. Xu, R., Liu, X., Li, J., Qin, R. & Wei, P. Nat. Plants 7, 888–892 (2021).
8. Li, C. et al. Genome Biol. 19, 59 (2018).
9. Kang, B. C. et al. Nat. Plants 7, 899–905 (2021).
10. Nakazato, I. et al. Nat. Plants 7, 906–913 (2021).
11. Forner, J. et al. Nat. Plants 8, 245–256 (2022).
12. Zeng, D. et al. Nat. Plants 3, 17031 (2017).
13. Ma, L. et al. Nat. Commun. 12, 4142 (2021).
14. Xiong, W. et al. Nat. Plants 7, 1207–1212 (2021).
15. Li, X. M. et al. Nat. Genet. 47, 827–833 (2015).
16. Land Institute. Kernza grain. https://landinstitute.org/our-work/perennial-crops/kernza/ (accessed 6 May 2022).
17. Huang, G. F. et al. Sustainability 10, 1086 (2018).
18. RyAgri. Is the eggplant tree a hoax? Breakthroughs in agricultural science and technology have come true! https://www.ryagrimachinery.com/news/is-the-eggplant-tree-a-hoax-breakthroughs-in-agricultural-science-and-technology-have-come-true-3/ (5 January 2022).
19. Talabi, A. O. et al. Front. Plant Sci. 13, 839704 (2022).
20. Meadow, M. Whitehead Initiative on the Biology and Health of Climate Change undertakes multifaceted project on resilient food crops. Whitehead Institute https://wi.mit.edu/news/whitehead-initiative-biology-and-health-climate-change-undertakes-multifaceted-project (accessed 6 May 2022).
21. Lemmon, Z. H. et al. Nat. Plants 4, 766–770 (2018).
The grand challenge of breeding by design. Nat. Plants 8 , 451–452 (2022). https://doi.org/10.1038/s41477-022-01166-2
Issue Date : May 2022
G3 (Bethesda), v.12(7); 2022 Jul
Breedbase: a digital ecosystem for modern plant breeding
Nicolas Morales
Boyce Thompson Institute, Ithaca, NY 14853, USA
Cornell University, Ithaca, NY 14853, USA
Alex C Ogbonna
Bryan J Ellerbrock, Guillaume J Bauchet, Titima Tantikanjana, Isaak Y Tecle, Adrian F Powell, Naama Menda, Christiano C Simoes, Prashant Hosmani, Mirella Flores, Naftali Panitz, Ryan S Preble, Afolabi Agbona
IITA Ibadan, 200001 Ibadan, Nigeria
Ismail Rabbi
Peter Kulakow, Prasad Peteti, Robert Kawuki
NaCCRI, Namulonge, Uganda
Williams Esuma
Micheal Kanaabi, Doreen M Chelangat, Ezenwanyi Uba
National Root Crops Research Institute (NRCRI), 463109 Umudike, Nigeria
Adeyemi Olojede
Joseph Onyeka, Trushar Shah
IITA Nairobi, 30709-00100 Nairobi, Kenya
Margaret Karanja
Chiedozie Egesi, Agre Paterne, Asrat Asfaw
IITA Abuja, 901101 Abuja, Nigeria
Jean-Luc Jannink
USDA-ARS, Ithaca, NY 14853, USA
Marnin Wolfe
Clay L Birkett, David J Waring, Jenna M Hershberger, Michael A Gore, Kelly R Robbins, Trevor Rife
Kansas State University, Manhattan, KS 66506, USA
Chaney Courtney
Jesse Poland, Elizabeth Arnaud
Bioversity-CIAT Alliance, 34397 Montpellier, France
Marie-Angélique Laporte
Heneriko Kulembeka
TARI, 33518 Ukiriguru, Tanzania
Kasele Salum
Emmanuel Mrema, Allan Brown, Stanley Bayo, Brigitte Uwimana, Violet Akech, Craig Yencho
North Carolina State University (NCSU), Raleigh, NC 27695, USA
Bert de Boeck
CIP, 15000 Lima, Peru
Hugo Campos
Rony Swennen
KU Leuven, 3000 Leuven, Belgium
Jeremy D Edwards
USDA-ARS, Stuttgart, AR 72160, USA
Lukas A Mueller
Associated data
All code is available from GitHub (https://github.com/solgenomics) and Docker Hub (https://hub.docker.com/r/breedbase/breedbase#).
Modern breeding methods integrate next-generation sequencing and phenomics to identify plants with the best characteristics and greatest genetic merit for use as parents in subsequent breeding cycles to ultimately create improved cultivars able to sustain high adoption rates by farmers. This data-driven approach hinges on strong foundations in data management, quality control, and analytics. Of crucial importance is a central database able to (1) track breeding materials, (2) store experimental evaluations, (3) record phenotypic measurements using consistent ontologies, (4) store genotypic information, and (5) implement algorithms for analysis, prediction, and selection decisions. Because of the complexity of the breeding process, breeding databases also tend to be complex, difficult, and expensive to implement and maintain. Here, we present a breeding database system, Breedbase (https://breedbase.org/, last accessed 4/18/2022). Originally initiated as Cassavabase (https://cassavabase.org/, last accessed 4/18/2022) with the NextGen Cassava project (https://www.nextgencassava.org/, last accessed 4/18/2022), and later developed into a crop-agnostic system, it is presently used for dozens of different crops and projects. The system is web based and is available as open source software. It is available on GitHub (https://github.com/solgenomics/, last accessed 4/18/2022) and packaged in a Docker image for deployment (https://hub.docker.com/u/breedbase, last accessed 4/18/2022). The Breedbase system enables breeding programs to better manage and leverage their data for decision making within a fully integrated digital ecosystem.
Introduction
Modern plant breeding is a data-intensive process requiring multiple diverse datasets to be integrated and assessed in decision making. In classical plant breeding, promising individuals are intentionally interbred to generate a diverse population of progeny, from which individuals with the best phenotypic characteristics are selected to be used as elite parents in subsequent breeding cycles or released as improved cultivars (Breseghello and Coelho 2013). Modern plant breeding extends classical breeding with the use of marker-assisted selection and genomic selection (GS) to augment phenotypic selection (Ribaut and Hoisington 1998). Furthermore, with the emergence of high-throughput phenotyping technologies as tools for breeding, the number of potential phenotypes to be tracked has vastly increased (White et al. 2012; Andrade-Sanchez et al. 2013).
The development of inexpensive genotyping technologies allows even small breeding programs to acquire high-density genotyping data for a large portion of their germplasm. The availability of these genomic data has enabled more efficient approaches to evaluate important and complex traits in the breeding process (VanRaden 2008). One such approach is GS, which combines genomic and phenomic data to develop a predictive model that can be used to estimate genotypic or breeding values (Meuwissen and Goddard 2001). Since genotyping is both less expensive and faster than phenotypic selection, GS can result in significant acceleration of the breeding cycle with concomitant faster increases in gain. A challenge for genome-based breeding methods is the establishment of an adequate data management infrastructure to integrate the complex datasets spanning the breeding process (Volk et al. 2021). This represents a severe constraint to mainstreaming predictive breeding to small breeding programs, particularly in developing countries.
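To make the idea of a GS predictive model concrete, the following is a minimal sketch, assuming ridge regression of phenotypes on marker dosages (coded 0, 1, 2) as a simple stand-in for the models used in practice. It is not the solGS or Breedbase implementation; the function names, the toy training data, and the penalty `lam` are all illustrative assumptions.

```python
"""Toy genomic selection: ridge regression of phenotypes on marker dosages.

Everything here (names, data, penalty) is hypothetical and for illustration.
"""


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(markers, phenotypes, lam=1.0):
    """Estimate marker effects by solving (X'X + lam*I) beta = X'y on centered data."""
    n, p = len(markers), len(markers[0])
    col_means = [sum(row[j] for row in markers) / n for j in range(p)]
    y_mean = sum(phenotypes) / n
    xc = [[row[j] - col_means[j] for j in range(p)] for row in markers]
    yc = [y - y_mean for y in phenotypes]
    xtx = [
        [sum(xc[i][a] * xc[i][b] for i in range(n)) + (lam if a == b else 0.0)
         for b in range(p)]
        for a in range(p)
    ]
    xty = [sum(xc[i][a] * yc[i] for i in range(n)) for a in range(p)]
    return y_mean, col_means, solve(xtx, xty)


def predict(model, genotype):
    """Predict the genetic value of an unphenotyped line from its marker dosages."""
    y_mean, col_means, effects = model
    return y_mean + sum(e * (g - m) for e, g, m in zip(effects, genotype, col_means))


# Hypothetical training population: six lines, three markers.
# Phenotypes were generated as 10 + 2*m0 - 1*m2 (marker m1 has no effect).
TRAIN_MARKERS = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [0, 2, 1], [1, 2, 2], [2, 0, 0]]
TRAIN_PHENOTYPES = [8.0, 11.0, 14.0, 9.0, 10.0, 14.0]

if __name__ == "__main__":
    model = fit_ridge(TRAIN_MARKERS, TRAIN_PHENOTYPES)
    for candidate in ([2, 0, 0], [0, 0, 2]):
        print(candidate, round(predict(model, candidate), 2))
```

The penalty shrinks the estimated marker effects toward zero, which is what keeps such models usable when markers far outnumber phenotyped lines. Selection candidates can then be ranked on their predicted values before any field evaluation.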
To address these data management challenges, we initiated a system called Cassavabase (https://cassavabase.org/, last accessed 4/18/2022) for the NextGen Cassava project building on a genomics codebase developed for many years for the Solanaceae called SGN (https://solgenomics.net/, last accessed 4/18/2022) (Mueller, Solow, et al. 2005; Menda et al. 2008; Bombarely et al. 2011; Fernandez-Pozo, Menda, et al. 2015). With an initial focus on tomato and sequencing its genome (Mueller, Tanksley, et al. 2005; Tomato Genome Consortium 2012), SGN already contained a comprehensive genomics database with a strong phenotype management component (Menda et al. 2008), a number of genomics-centric tools (Mueller et al. 2008; Tecle et al. 2010; Fernandez-Pozo, Rosli, et al. 2015), and a rudimentary version of a genotyping storage backend (Fernandez-Pozo, Menda, et al. 2015). Cassavabase is an open-source, web-based breeding data management and analysis system built with the ability to manage the GS process (Tecle et al. 2014). As more instances of the software were deployed for other crops, the system expanded to better meet each project’s needs by adding further breeding-related tools, such as image-based or near-infrared spectroscopy (NIRS)-based phenotyping tools (Hershberger et al. 2021). To reflect that the underlying software and database are amenable to any crop and to promote adoption by new communities, we named the system “Breedbase” (https://breedbase.org/, last accessed 4/18/2022).
Major clonal crops currently using Breedbase are cassava (https://cassavabase.org/, last accessed 4/18/2022), yam (https://yambase.org/, last accessed 4/18/2022), banana (https://musabase.org/, last accessed 4/18/2022), and sweetpotato (https://sweetpotatobase.org/, last accessed 4/18/2022), collectively known as the RTBbases (https://rtbbase.org/, last accessed 4/18/2022). Major nonclonal crops using Breedbase include wheat (https://wheat.triticeaetoolbox.org/, last accessed 4/18/2022) and rice (https://ricebase.org/, last accessed 4/18/2022). Breeding and research groups have adopted the system as well, such as the Gore Lab at Cornell University (https://gorelabbase.sgn.cornell.edu/, last accessed 4/18/2022).
The purpose of Breedbase is to enable a digital ecosystem that contains an integrated breeding workflow. Processes and data comprising germplasm banks, parental selection, crossing design, experimental design, data collection, analyses, and decision-making tools are aggregated into a single system. This improves efficiency and reduces data errors that can happen when using disjointed informatics tools, for instance when transferring and restructuring data for analyses (Cobb et al. 2019). When data are loaded into a database, many checks can be performed to make sure the data are consistent and in line with specified quality control criteria.
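The kind of load-time check described above can be sketched as follows. This is a hypothetical illustration, not Breedbase's validation code: the record layout (`plot`, `trait`, `value`), the function name, and the allowed ranges are assumptions chosen for the example.

```python
def check_phenotype_rows(rows, known_plots, trait_ranges):
    """Return a list of human-readable problems found in phenotype rows.

    rows         -- list of dicts with keys "plot", "trait", "value" (assumed layout)
    known_plots  -- set of plot names already registered in the database
    trait_ranges -- dict mapping each allowed trait to a (low, high) range
    """
    errors = []
    seen = set()  # (plot, trait) pairs already encountered in this upload
    for i, row in enumerate(rows, start=1):
        plot, trait, raw = row["plot"], row["trait"], row["value"]
        if plot not in known_plots:
            errors.append(f"row {i}: unknown plot {plot!r}")
        if trait not in trait_ranges:
            errors.append(f"row {i}: trait {trait!r} is not in the trait catalog")
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            errors.append(f"row {i}: value {raw!r} is not numeric")
            continue
        low, high = trait_ranges[trait]
        if not low <= value <= high:
            errors.append(f"row {i}: value {value} outside [{low}, {high}]")
        if (plot, trait) in seen:
            errors.append(f"row {i}: duplicate measurement for {plot!r} / {trait!r}")
        seen.add((plot, trait))
    return errors


if __name__ == "__main__":
    upload = [
        {"plot": "plot_1", "trait": "plant_height_cm", "value": "152"},
        {"plot": "plot_2", "trait": "plant_height_cm", "value": "tall"},
    ]
    problems = check_phenotype_rows(
        upload, {"plot_1", "plot_2"}, {"plant_height_cm": (0, 500)}
    )
    print("\n".join(problems) or "upload is clean")
```

Rejecting an upload until the list is empty is one way to keep inconsistent records out of the central store, rather than repairing them after the fact.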
Many breeders, especially in smaller programs that cannot allocate resources to data management tools, maintain their data in spreadsheets. While spreadsheets provide a straightforward way to manage data and analyses, they suffer from a number of drawbacks, even with relatively small volumes of data. For example, it is difficult to precisely merge data across different spreadsheets, often resulting in errors and data quality issues, or to visualize or analyze data across spreadsheets. Data in spreadsheets are typically not normalized, resulting in typographical issues, inconsistent identifiers, liberal use of synonyms, and similar issues that make the data hard to aggregate. Nevertheless, the largest problem with spreadsheets is that their storage is not centralized; in fact, they are often stored on personal computers and laptops, often in multiple inconsistent versions, with potentially limited backup strategies and little recourse if accidental data loss occurs or if a person leaves the breeding program, taking all the breeding data with them. Breeding programs can be very large, encompassing many locations with many collaborators; as such, spreadsheets hinder collaboration because data cannot be accessed in a consistent state by many people at once. Furthermore, with genome-based breeding, spreadsheets become unworkable, as it is difficult to maintain and analyze potentially very large genotypic datasets in spreadsheets in any useful way. It is important to note that using a database is not sufficient for managing a modern breeding cycle—the entire breeding process needs to be integrated around the database to create an efficient digital ecosystem.
Breedbase implements a robust system of breeding workflows, data management procedures, and analysis tools to address breeder informatics problems. Here we present the rationale, design, implementation, and major use cases for Breedbase.
Materials and methods
Implementation
The Breedbase data architecture is built around a Postgres (https://postgresql.org/, last accessed 4/18/2022) relational database with a schema that is mainly derived from Chado (Jung et al. 2011), with some historic, pre-Chado tables from SGN, as well as minor customizations (Fernandez-Pozo, Menda, et al. 2015) (Fig. 1a). In relational databases, information is systematically structured into concepts represented as tables (“normalization”), a format that facilitates many aspects of data management. The information in the different tables can be joined based on primary and foreign keys, which are usually numeric values assigned to every row in a table. For some data types, such as genotypic data, Breedbase uses non-SQL extensions built into Postgres, such as JSONB-based data structures (Morales, Bauchet et al. 2020). The application layer is implemented in Perl, using the Moose object system, based on the Model-View-Controller Catalyst web framework (https://metacpan.org/pod/Catalyst::Manual, last accessed 4/18/2022), with Mason as the templating toolkit (https://metacpan.org/pod/Mason). The system uses an object-relational layer based on DBIx::Class, with the main Chado classes organized in the Bio::Chado::Schema namespace. For statistical analyses and some of the data visualizations, the R language and add-on R packages (https://r-project.org/, last accessed 4/18/2022) are used. Image analyses and machine learning models are implemented in Python with TensorFlow (https://www.tensorflow.org/, last accessed 4/18/2022) and OpenCV (https://opencv.org/, last accessed 4/18/2022) (Morales, Kaczmar, et al. 2020). The frontend graphical user interface (GUI) development has recently transitioned away from Mason components to JavaScript, with a heavy reliance on asynchronous JavaScript requests. Almost all functionalities are implemented as RESTful services, allowing for a more interactive user experience and reusable codebase.
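The combination of normalized tables joined on keys with a JSON column for marker data can be illustrated with a small sketch. It makes several loud assumptions: SQLite (from the Python standard library) stands in for Postgres, a JSON string in a TEXT column stands in for JSONB, and the table and column names are hypothetical rather than taken from the Chado schema.

```python
import json
import sqlite3


def build_demo_db():
    """Create an in-memory database with three hypothetical, normalized tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE accession (
            accession_id INTEGER PRIMARY KEY,
            name         TEXT UNIQUE NOT NULL
        );
        CREATE TABLE phenotype (
            phenotype_id INTEGER PRIMARY KEY,
            accession_id INTEGER NOT NULL REFERENCES accession(accession_id),
            trait        TEXT NOT NULL,
            value        REAL NOT NULL
        );
        CREATE TABLE genotype (
            genotype_id  INTEGER PRIMARY KEY,
            accession_id INTEGER NOT NULL REFERENCES accession(accession_id),
            marker_calls TEXT NOT NULL  -- JSON document; JSONB in Postgres
        );
        """
    )
    conn.executemany(
        "INSERT INTO accession (accession_id, name) VALUES (?, ?)",
        [(1, "line_A"), (2, "line_B")],
    )
    conn.executemany(
        "INSERT INTO phenotype (accession_id, trait, value) VALUES (?, ?, ?)",
        [(1, "plant_height_cm", 152.0),
         (1, "plant_height_cm", 148.0),
         (2, "plant_height_cm", 131.0)],
    )
    conn.executemany(
        "INSERT INTO genotype (accession_id, marker_calls) VALUES (?, ?)",
        [(1, json.dumps({"m1": 0, "m2": 2})),
         (2, json.dumps({"m1": 1, "m2": 2}))],
    )
    return conn


def mean_trait_with_markers(conn, trait):
    """Join the three tables on their keys; return (name, mean value, marker dict)."""
    rows = conn.execute(
        """
        SELECT a.name, AVG(p.value), g.marker_calls
        FROM accession a
        JOIN phenotype p ON p.accession_id = a.accession_id
        JOIN genotype  g ON g.accession_id = a.accession_id
        WHERE p.trait = ?
        GROUP BY a.accession_id
        ORDER BY a.name
        """,
        (trait,),
    ).fetchall()
    return [(name, mean, json.loads(calls)) for name, mean, calls in rows]


if __name__ == "__main__":
    for record in mean_trait_with_markers(build_demo_db(), "plant_height_cm"):
        print(record)
```

Storing each accession's marker calls as one document avoids creating a row per marker per accession, which is the motivation for a JSON-style column when marker counts run into the tens of thousands.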
JavaScript frameworks used for the GUI include jQuery (https://openjsf.org/, last accessed 4/18/2022), D3.js (https://d3js.org/, last accessed 4/18/2022), Bootstrap (https://getbootstrap.com/, last accessed 4/18/2022), and BrAPI.js (https://brapi.org/, last accessed 4/18/2022). The entire Breedbase system is built on open source software and is packaged in a Docker image for deployment (https://docker.com/, last accessed 4/18/2022). For interoperability with other breeding databases and tools, Breedbase implements the BrAPI 2.0 specification (Selby et al. 2019).
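As a rough illustration of what BrAPI interoperability looks like from a client's side, the sketch below builds a paged germplasm request and parses a response. The `/brapi/v2/germplasm` path, the `page` and `pageSize` parameters, and the `metadata.pagination` and `result.data` fields follow the BrAPI 2.0 specification; the host `https://example.org`, the helper names, and the sample payload are assumptions written by hand for this example, not output from a real server.

```python
import json
from urllib.parse import urlencode


def germplasm_url(base_url, page=0, page_size=100):
    """Build the URL for one page of a BrAPI v2 germplasm search."""
    query = urlencode({"page": page, "pageSize": page_size})
    return f"{base_url.rstrip('/')}/brapi/v2/germplasm?{query}"


def parse_germplasm_page(payload):
    """Return (germplasm names, whether further pages remain) from a response body."""
    doc = json.loads(payload)
    pagination = doc["metadata"]["pagination"]
    names = [item["germplasmName"] for item in doc["result"]["data"]]
    has_more = pagination["currentPage"] + 1 < pagination["totalPages"]
    return names, has_more


# Handwritten sample in the BrAPI response envelope (illustrative only).
SAMPLE_RESPONSE = json.dumps({
    "metadata": {
        "pagination": {"currentPage": 0, "pageSize": 2,
                       "totalCount": 3, "totalPages": 2},
        "status": [],
        "datafiles": [],
    },
    "result": {
        "data": [
            {"germplasmDbId": "1", "germplasmName": "line_A"},
            {"germplasmDbId": "2", "germplasmName": "line_B"},
        ]
    },
})

if __name__ == "__main__":
    print(germplasm_url("https://example.org", page=0, page_size=2))
    print(parse_germplasm_page(SAMPLE_RESPONSE))
```

Because every BrAPI-compliant server returns the same envelope, a client written against one breeding database should in principle work against another with only the base URL changed.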
Fig. 1. a) Breedbase platform architecture. User interface: To offer a dynamic, highly interactive user interface, several JavaScript libraries are used, including D3, jQuery, and Bootstrap. RESTful APIs, including a full BrAPI 2.0 implementation, handle the communication between the front and back end, allowing fast calculations without reloading the website. HTML5 is used for interactive graphical display, allowing instant reorganization of visual elements. The Bootstrap framework is used for modern and dynamic page templating. Middleware layer: A Perl software stack including Mason components to connect to the user interface; Catalyst, a web application framework; Moose, an object-oriented Perl library; and DBIx::Class, an object-relational mapper to connect to SQL code. In addition, BrAPI libraries are used. Finally, a job cluster scheduler, Slurm, is implemented to allocate server resources and ensure scalability. Data source layer: Breedbase operates on a relational database using Postgres. Postgres 12.0 offers “Big data” solutions including parallel query execution and optimized binary JSON data type handling. Binary JSON (JSONB) is a simple data structure designed to be storage space and scan-speed efficient. In Breedbase, JSONB is used in various data types including genotypic (marker) information. In addition to the relational database, a standard file system space is available for flat files. Finally, other databases can communicate with a Breedbase instance to provide an additional back end for marker data [i.e. Genomic Open Source Informatic Initiative (GOBii)] or to exchange germplasm information, for example. b) Breedbase codevelopment process. User–developer interactions are promoted using various media. Users can get help through online documentation (https://solgenomics.github.io/sgn/, last accessed 4/18/2022), video tutorials, or onsite training. Software development goals are extensively discussed between developers, data managers, breeders, and other appropriate stakeholders. Agile development allows short-term product release. Suggested improvements, issues, and bugs discovered in Breedbase are submitted and tracked on the public GitHub issue tracking software (https://github.com/, last accessed 4/18/2022). Software development progress is tracked using a version control system and Docker releases. c) Cassavabase, a Breedbase instance: data content overview. Cassavabase involves national and international breeding programs (22) from various African and South American countries (15) and currently has 1,131 registered users. Cassavabase hosts various data types including high-density and low-density genotyping assays (35,000), plot-based phenotypic data points (nearly 15 million), images from plants and plots from trials (5107) and locations (435).
In terms of user interface, the goal of Breedbase is to provide a standard, modern web interface for all breeding tools. Breedbase is essentially a cloud-based app, obviating the need for the user to install any software. For anyone with web-browsing experience, the interface should be intuitive and straightforward, and it is continuously improved based on user-driven feedback. In Breedbase, processes are presented in an interactive workflow system, providing step-by-step guidance to breeders and users in accomplishing specific tasks. A few of the widely used interfaces include the Wizard, Lists, and Datasets tools, which will be described in more detail later.
The initial development of Breedbase focused on addressing the data collection and management stages necessary to facilitate GS within a breeding program, including:
- Manage accessions and pedigrees in the database, with ontology-based descriptions and support for rich metadata including images
- Design field layouts and track all field metadata
- Load historical data from breeding programs
- Collect phenotypic data on tablets in the field and upload the subsequent phenotypes
- Manage genotypic data associated with the accessions
- Enable genome-based predictive breeding by calculating correlations between phenotypes and genotypes, and predict phenotypes from genotypes [the solGS tool (Tecle et al. 2014), https://cassavabase.org/solgs/search, last accessed 4/18/2022]
- Support controlled crossing using customized tracking tools
More recently, a number of other use cases were pursued:
- Advanced statistical analyses including principal component analysis (PCA), stability analysis (AMMI) (Duarte and Pinto 2002), heritability calculations (Holland et al. 2010), mixed model analysis, and genome-wide association studies (GWAS)
- Marker-assisted breeding
- Processing and analysis of unoccupied aerial vehicle image data
- Image analysis
- NIRS data storage and analysis
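Of the analyses listed above, the heritability calculation is simple enough to sketch. The function below assumes the common textbook entry-mean form of broad-sense heritability, H² = σ²g / (σ²g + σ²ge/e + σ²ε/(re)), with variance components already estimated elsewhere (for example, by a mixed model). Whether Breedbase uses exactly this estimator is not stated in the text; the function name and the example numbers are hypothetical.

```python
def entry_mean_heritability(var_g, var_ge, var_error, n_env, n_rep):
    """Broad-sense heritability on an entry-mean basis.

    var_g     -- genotypic variance
    var_ge    -- genotype-by-environment interaction variance
    var_error -- residual (plot-level) variance
    n_env     -- number of environments (e)
    n_rep     -- number of replicates per environment (r)
    """
    if n_env < 1 or n_rep < 1:
        raise ValueError("n_env and n_rep must be at least 1")
    if min(var_g, var_ge, var_error) < 0:
        raise ValueError("variance components must be non-negative")
    # Phenotypic variance of an entry mean across environments and replicates.
    var_phenotypic = var_g + var_ge / n_env + var_error / (n_env * n_rep)
    if var_phenotypic == 0:
        raise ValueError("phenotypic variance is zero; heritability is undefined")
    return var_g / var_phenotypic


if __name__ == "__main__":
    # Hypothetical variance components for a yield trait.
    h2 = entry_mean_heritability(4.0, 2.0, 6.0, n_env=2, n_rep=3)
    print(f"H2 = {h2:.3f}")
```

Adding environments or replicates shrinks the non-genetic terms in the denominator, which is why heritability on an entry-mean basis rises with the size of the trial network.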
Plant breeding operations requiring decision support within a growing season include 3 broad activities: crossing, evaluations, and selections. These activities typically include setup of crossing and trial experiments (design, labeling), data and seed collection, genotyping, and subsequent statistical analysis. Breedbase offers support for each of these components through online tools. To streamline accessibility and usage for key routine activities, Breedbase has established workflow components. Each workflow offers the user a guided process for a targeted activity. For example, the trial creation workflow comprises trial creation, planting material and checklist creation, randomization and statistical design selection, field visualization, and storage. During this process, field trial experiment parameters (see Phenotyping Trials section) are input into Breedbase and the relevant experimental design is calculated using open source R libraries such as Agricolae (De Mendiburu et al. n.d.) or Digger (Coombes 2009). The experimental layout is calculated and displayed, and can be reviewed and potentially improved by rerunning randomization before the trial design is stored in Breedbase. Additional parameters such as field management factors (e.g. agronomic management or fertilizer application) can also be entered. Similar workflows exist for other activities, such as phenotyping and genotyping.
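The randomization step of the trial creation workflow can be illustrated with a minimal sketch of a randomized complete block design, in which every entry appears once per block in an independently shuffled order. This is a plain-Python stand-in written for this example, not the routine the R libraries named above actually run; the function name, entry names, and plot numbering scheme are assumptions.

```python
import random


def rcbd_layout(entries, n_blocks, seed=None):
    """Return a list of plots for a randomized complete block design.

    Each block contains every entry exactly once, in its own random order.
    Passing the same seed reproduces the same layout; changing it corresponds
    to rerunning the randomization.
    """
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    rng = random.Random(seed)
    plots = []
    plot_number = 1
    for block in range(1, n_blocks + 1):
        order = list(entries)
        rng.shuffle(order)  # independent randomization within each block
        for entry in order:
            plots.append({"plot": plot_number, "block": block, "entry": entry})
            plot_number += 1
    return plots


if __name__ == "__main__":
    layout = rcbd_layout(["check_1", "line_A", "line_B", "line_C"], n_blocks=3, seed=42)
    for plot in layout:
        print(plot)
```

Keeping the seed alongside the stored design makes the layout reproducible, which is useful when a trial has to be audited or regenerated later.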
Development process
The development process can be broadly described as agile ( Beck and Andres 2004 ; James and Shane 2008 ), in which short-term goals are defined and implemented, and subsequently further refined based on new feedback from users; agile teams provide for short release cycles and continuous improvement to the software ( Fig. 1b ). Progress is tracked using a version control system with built-in issue tracking software (GitHub, https://github.com/ ). New features are discussed with breeders and other stakeholders. Issues and bugs discovered in Breedbase are tracked on the public GitHub issue tracker. A programmer is then assigned to a ticket, creates an issue-specific topical git branch in the relevant code repositories, and implements the required changes in the branch, including tests and edits to the user documentation. When the implementation is ready for release, a pull request is generated on GitHub and a reviewer is assigned. In the review, the code is checked for errors, programming style, tests, and documentation. If the reviewer approves the pull request, the code is merged into the master branch. Test-driven development, based on unit and integration tests, is tightly integrated with our development process. A ticket meeting is held once a week and all open pull requests and important tickets are discussed. If all the pull requests have been merged successfully and no issues are discovered with tests or other checks, a new release tag is created, the new version is deployed in production, and a new Docker image is released. Since Breedbase is open source, programmers outside of the core development team are able to make contributions to the code base via the same process. The Breedbase project has had 40+ contributors addressing various issues and improvements ( https://github.com/solgenomics/sgn/graphs/contributors ).
A key aspect of data integration is the necessity of standardization. Breedbase is based on the Chado database schema, which relies heavily on controlled vocabularies and ontologies to describe its data, and requires numerous ontologies for its internal functioning. In many ways, it can be described as an ontology-based database. For the breeding application, data standardization in the form of trait catalogs is especially important when several sites or breeding programs share data in the database. Without standardization, the data would not be comparable, limiting the utility of an integrated database. The creation and maintenance of trait ontologies is a considerable task. The Crop Ontology (CO) project was developed by CGIAR to define and maintain relevant breeding ontologies ( Shrestha et al. 2012 ). All the RTBbases use the CO vocabularies and collaborate with CO and breeders to improve and expand these vocabularies ( Arnaud et al. 2020 ). If no ontologies are available, they have to be created, which can be a lengthy and arduous task. The Protégé tool ( https://protege.stanford.edu/ ) ( Musen 2015 ) is commonly used by curators for editing ontologies before upload to CO and Breedbase. The Trait Dictionary Template along with the Guidelines ( Pietragalla et al. 2020 ), available in the CO website, remain useful to collect the trait details from the research community and reach consensus. Each species is allocated a code by the CO coordination team to identify the ontology and crop repositories are created in the Planteome Github to secure the ontology version management. An online term submission form is accessible in Breedbase for users wishing to suggest missing traits or modifications to the CO ( https://submit.rtbbase.org, last accessed 4/18/2022 ).
Interoperability and BrAPPs
Databases must interoperate with a variety of tools to perform their functions in data acquisition, analysis, and data export. Recently, a standard called the Breeding Application Programming Interface (BrAPI; https://brapi.org/ ) was developed to exchange breeding data ( Selby et al. 2019 ), which breeding databases can implement to provide a standard interoperability layer. Standardized application programming interfaces (APIs) allow Breedbase to integrate and interface with a broader set of BrAPI-enabled applications, or BrAPPs, that can be written for diverse platforms and programming languages, including Android, R, and JavaScript. The BrAPI R package allows data retrieval from Breedbase for further statistical processing within the R environment. JavaScript-based BrAPPs provide dynamic visualization of plant breeding data, such as pedigree exploration, experimental field maps, and data from multiple trials. BrAPPs can interact with data from any BrAPI-compliant database, such as Breedbase or the Breeding Management System ( Fig. 1a ). Activities such as dynamic data filtering, trial comparison, box plotting, and a comparative genetic map viewer are also implemented with BrAPPs on Breedbase. Breedbase fully supports BrAPI version 2.0 and is committed to updating the system for future versions of this essential infrastructure.
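As a rough illustration of what a BrAPI client does, the Python sketch below builds a request URL for the BrAPI v2 `studies` endpoint and unpacks the standard `metadata`/`result.data` response envelope. The base URL is an assumption made for illustration (check the target instance for its actual BrAPI path), the helper names are hypothetical, and the sample response is hand-written rather than real data; no network call is made.

```python
import json
from urllib.parse import urlencode

# Assumed base URL, for illustration only; confirm it for the instance you use.
BASE_URL = "https://cassavabase.org/brapi/v2"


def build_brapi_url(base_url, endpoint, **params):
    """Return a GET URL such as <base>/studies?page=0&pageSize=2."""
    url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url


def extract_data(response_text):
    """Pull the list of records out of a BrAPI v2 list response."""
    payload = json.loads(response_text)
    return payload["result"]["data"]


# Hand-written sample shaped like a BrAPI v2 list response (not real data).
SAMPLE_RESPONSE = json.dumps({
    "metadata": {"pagination": {"currentPage": 0, "pageSize": 2,
                                "totalCount": 2, "totalPages": 1}},
    "result": {"data": [
        {"studyDbId": "1", "studyName": "demo_trial_A"},
        {"studyDbId": "2", "studyName": "demo_trial_B"},
    ]},
})

url = build_brapi_url(BASE_URL, "studies", page=0, pageSize=2)
study_names = [study["studyName"] for study in extract_data(SAMPLE_RESPONSE)]
print(url)
print(study_names)
```

Because every BrAPI-compliant server returns the same envelope, the same `extract_data` helper would work unchanged against another compliant database; only the base URL differs.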
Querying Breedbase
Breedbase has a number of query options, which are grouped in the “Search” menu. The most important data types each have a search (“Accessions and plots,” “Trials,” “Organisms,” “Crosses,” etc.). A powerful combined search is available in the form of the Search Wizard ( Fig. 2 ).
Fig. 2. Screenshot of the “Search Wizard” interface, a central query function on Breedbase. With the Search Wizard, the data in the database can be intersected by dimensions, such as locations, years, breeding programs, and traits. For each dimension, a number of elements can be selected. The individual selected dimensions can be stored in lists, and the combined selections can be saved as a dataset. Both lists and datasets can be used to feed data into various tools on Breedbase.
The Search Wizard and datasets
The Search Wizard allows users to slice their data in different dimensions, such as breeding programs, locations, years, and so forth. The data in the database can be thought of as a multidimensional cube which is cut along different dimensions, providing an intersection that represents the data of interest. This approach is conceptually related to a query method called Online Analytical Processing ( Celko 2006 ). The current Wizard presents 4 boxes, for 4 different dimensions, which can be selected using pull down menus ( Fig. 2 ). For example, a user who is interested in the performance of cassava clones evaluated by IITA in 2017 and 2018 at the Mokwa station in Nigeria can use the wizard to find this information. Working from left to right, the user selects as the first dimension “Breeding Programs,” which displays all the breeding programs in the database in the first box. The user then selects “IITA” from the individual breeding programs listed in the box. When the user selects “Years” in the second box, all the years for which data for IITA exist are listed. In this example, the user selects 2017 and 2018. Finally, after selecting locations in the third box, the user specifies “Mokwa.” When trials or accessions are selected, phenotypic and genotypic data corresponding to the selection can be downloaded using buttons below the Wizard boxes. The Wizard also allows the combination of current selections to be stored in the database under a user-given name, representing an intersect of data of interest in the database. This stored selection is called a “dataset.” Datasets are used across Breedbase to efficiently reference a complex query with a simple, assigned name. Tools that support the dataset concept in Breedbase include solGS, GWAS, the heritability tool, the stability analysis, and the general mixed model tool.
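The cube-slicing idea can be sketched in a few lines of Python. This is only a conceptual model of the Mokwa example above, not Breedbase's query code: the `Trial` record, the trial names, and both helper functions are hypothetical, and the data are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    name: str
    breeding_program: str
    year: int
    location: str


# Made-up trials for illustration.
TRIALS = [
    Trial("trial_1", "IITA", 2017, "Mokwa"),
    Trial("trial_2", "IITA", 2018, "Mokwa"),
    Trial("trial_3", "IITA", 2018, "Ibadan"),
    Trial("trial_4", "NRCRI", 2017, "Umudike"),
    Trial("trial_5", "IITA", 2016, "Mokwa"),
]


def slice_trials(trials, **selections):
    """Keep trials whose value on every selected dimension is in the chosen set."""
    return [
        t for t in trials
        if all(getattr(t, dim) in chosen for dim, chosen in selections.items())
    ]


def available_values(trials, dimension):
    """Values a Wizard box would list for a dimension, given earlier selections."""
    return sorted({getattr(t, dimension) for t in trials})


# Box 1: choose the breeding program; box 2 then lists only years with IITA data.
iita_trials = slice_trials(TRIALS, breeding_program={"IITA"})
years_for_iita = available_values(iita_trials, "year")

# Boxes 1 to 3 combined: the intersection that could be saved as a "dataset".
dataset = slice_trials(
    TRIALS, breeding_program={"IITA"}, year={2017, 2018}, location={"Mokwa"}
)
print(years_for_iita)
print([t.name for t in dataset])
```

Each additional selection can only narrow the result, which is why the boxes are filled from left to right.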
Quick search
A quick search is provided in the upper right corner of the menu bar that searches a keyword across all data types in the database, and is a fast way to retrieve named objects such as stocks and genes.
Special searches
Topic-specific searches are available from the Search menu, including a trial search, a trait search, searches for genotyping data (including genotyping protocols, projects, and plates), an image search that searches image descriptions and associated tags, and a user search that searches the users of the database. All these searches work in a straightforward and consistent way: a search form is filled in with search criteria, and the search is submitted to the database. A list with matched search results is displayed, from which links are provided to the corresponding detail pages.
Analysis tools
Breedbase is more than a static collection of data, as it enables users to explore and analyze data in the database. Once data is uploaded to the database, users can view summary statistics, evaluate phenotypic variances, and identify observations with missing or outlier data. They can filter observations in a trial based on a range of values for one or more traits. For an experiment phenotyped in multiple environments, they can evaluate trait performance across environments using pairwise comparison scatter plots and histograms.
Breedbase also has tools for ANOVA, correlation, PCA, data partitioning using K-means clustering, genomic prediction, GWAS, selection index calculation, genetic gain visualization, and linear mixed models. With the Search Wizard, as explained above, users can construct datasets that can be used as inputs to various tools. Most tools follow a similar blueprint in terms of user interface: (1) select the dataset of interest from a drop-down menu of all available datasets, (2) adjust parameters for the tool, (3) submit the calculation for analysis, and (4) display the results. For some tools that require heavy computation, an email can be optionally sent to the user with a link to the results. Query implementation is a relatively complex task in the programming of a tool, but the Wizard enables the modularization of algorithms into Breedbase with relatively little glue-code, facilitating tool coverage expansion. Results such as predictions from solGS and adjusted means from mixed models can be saved in the database as analysis results. These results can be used like primary data in downstream analyses such as the selection index tool to help identify favorable germplasm.
Managing a breeding program using Breedbase: accessions, phenotyping, crossing, and genotyping data
General principles.
Plant breeding involves the collection of a wide variety of data types at different time points and locations, and across different scenarios (e.g. field, laboratory, seed storage). To give users flexibility and mobility in data collection, smartphone-based applications are often required. Android applications, such as PhenoApps ( http://phenoapps.org/, last accessed 4/18/2022 ), are developed with this perspective ( Rife and Poland 2014 ). Breedbase has adopted the PhenoApps tool suite created by Kansas State University.
PhenoApps include applications for phenotyping (Field Book), cross management (Intercross), sample collection (Coordinate), and inventory management (Inventory). Breedbase has worked to build in native support for these applications and integrate them into best practices workflows. Since internet access is not available at all field sites, the functionality has been developed to allow configuration of these applications prior to field data collection. Field layouts, plant accessions, and traits to be measured can be loaded onto mobile devices through special interfaces in Breedbase. Following collection, data are imported back into Breedbase. Because all the trial information in the collection device was initially downloaded from Breedbase, required identifiers can easily be matched with the existing data in the database. This process is called “round-tripping,” and is a crucially important concept for high quality data management.
List management
Breeding activities often require the maintenance of lists of various types—for example, a list of accessions to plant, traits to measure, or trials to evaluate—and, consistent with digital ecosystem principles, these lists should be managed entirely through the database. Accordingly, Breedbase implements comprehensive list management functions. By default, lists are associated with the user that creates the list. The main list interface can be reached by clicking on the Lists link on the top right of the toolbar, which appears when logged in. A dialog appears that allows users to view, create, and edit new lists. Each list has a data type from an internal ontology called “list_type,” which includes terms for “accessions,” “trials,” “traits,” “years,” etc. Lists are collections of text elements that correspond to names of database objects. Lists can be validated against names that are already present in the database. A validated list can then be used to submit data to various tools, including the Wizard, right on the website. Sometimes, it can be useful to share a list with other users, and this can be achieved by making a list public by clicking the appropriate checkbox in the list detail view. Public lists are shown in a separate section, and become visible to all users. They can be “unshared” if needed.
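List validation amounts to checking each text element against the names already stored for that list type. The Python sketch below shows the idea under simple assumptions: the `KNOWN_NAMES` lookup stands in for a database query, exact string matching is assumed, and all names are made up.

```python
# Stand-in for names already stored in the database, keyed by list type.
KNOWN_NAMES = {
    "accessions": {"TX303", "B73", "Mo17"},
    "years": {"2017", "2018"},
}


def validate_list(list_type, elements):
    """Split list elements into names found in the database and names missing."""
    known = KNOWN_NAMES.get(list_type, set())
    found = [e for e in elements if e in known]
    missing = [e for e in elements if e not in known]
    return found, missing


found, missing = validate_list("accessions", ["B73", "Tx 303", "Mo17"])
print(found)    # names that matched exactly
print(missing)  # names to correct or add before the list is used
```

A list would count as validated only when `missing` is empty.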
Germplasm management
Germplasm is the foundation of a breeding program and plays a similarly important role in a database such as Breedbase. In plant breeding programs, tracking and characterization of germplasm is a major challenge. Germplasm in this context includes accessions, stocks, varieties, or, in clonal crops, clones. Breedbase commonly uses the term “accession.” Breedbase is prepopulated with the complete plant section of the NCBI taxonomy database, defining all known species with their associated genus, abbreviation, common name, and GenBank taxon identifier. Researchers using Breedbase can usually find their crops of interest within the 100,000+ organisms available. Accessions are always created in association with one of these organisms.
Some instances of Breedbase, such as Cassavabase and Sweetpotatobase, are designed to only contain germplasm of their respective species; however, it is possible for a single instance of Breedbase to be used for a variety of crop species. Combining many crop species into a single instance can complicate the search interfaces and lead to bloated databases; however, aggregating all data allows for more consistent and queryable data. Alternatively, separating instances can lead to potentially duplicated and inconsistent data, but can be beneficial for fostering communities.
In Breedbase, 2 distinct concepts describe accessions. (1) “Long-term use” accessions (e.g. historical germplasm, parental inbred lines) can be ordered from a seed bank and are either lines that may have been selfed and could be genetically quite pure, or landraces; they may be actively maintained and can be obtained easily. (2) “Short-term use” accessions (e.g. intermediate generations) are produced in a breeding program and may go through a few rounds of selection, but most of them will be discarded in the process. Short-term use accessions may also not be genetically pure, as they may result from crosses between relatively distant parents.
To create an accession in Breedbase, only a unique name and the organism species name are required. As with all objects stored in a relational database, Postgres will create a primary key identifier for each object, using a data structure called a sequence, which is used to link the accession to other objects in the database using a foreign key. This means that even if the accession name is modified, it will still retain all the connections to other objects of the original entry. Germplasm can be further annotated with configurable properties from the Multi-Crop Passport Descriptors standards ( Food and Agriculture Organization of the United Nations 2018 ) and BrAPI standards ( Selby et al. 2019 ); these properties include “variety,” “donor,” “donor institute,” “donor PUI,” “country of origin,” “institute code,” “institute name,” “notes,” “accession number,” and “PUI.” Germplasm can be added to the database using the interactive list tool (see previous section) or an Excel file upload; the Excel file upload also allows for storing and updating of all attributes listed above. The first step in the initiation of a breeding program is to load relevant accessions into the database. This is critical, as the naming of accessions is often not uniform between breeding programs and the community at large. In some cases, a single name can refer to several different accessions or a single accession may have many different names or synonyms, often the result of historical transcription error or case inconsistency. Before the first upload, it is therefore essential to define a standard unique name and set of possible synonyms for each accession. Though Breedbase allows for synonyms of accession names, they should also be unique. It is best practice to use synonyms only to find accessions and not when performing routine tasks with the database during the breeding process. 
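The earlier point that renaming an accession preserves its links can be demonstrated with a small relational example. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the two tables are deliberately simplified, hypothetical tables, not the Chado schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE accession (
    accession_id INTEGER PRIMARY KEY,      -- generated identifier
    uniquename   TEXT NOT NULL UNIQUE      -- human-readable name
);
CREATE TABLE plot (
    plot_id      INTEGER PRIMARY KEY,
    plot_name    TEXT NOT NULL UNIQUE,
    accession_id INTEGER NOT NULL REFERENCES accession(accession_id)
);
""")

# Create an accession and a plot that points to it through the foreign key.
cur = conn.execute("INSERT INTO accession (uniquename) VALUES (?)", ("Tx 303",))
accession_id = cur.lastrowid
conn.execute(
    "INSERT INTO plot (plot_name, accession_id) VALUES (?, ?)",
    ("demo_plot_1", accession_id),
)

# Rename the accession; the plot still references the same primary key.
conn.execute(
    "UPDATE accession SET uniquename = ? WHERE accession_id = ?",
    ("TX303", accession_id),
)

row = conn.execute("""
    SELECT p.plot_name, a.uniquename
    FROM plot p JOIN accession a ON a.accession_id = p.accession_id
""").fetchone()
print(row)
```

The link survives because the plot stores the numeric identifier, not the name.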
Whenever new accession names are encountered, Breedbase provides a workflow to compare new names to all existing accessions in the database. In this workflow, a user can consolidate synonyms, for instance to add “Tx 303” as a synonym of “TX303.”
After initial accession upload, it is often necessary to add more accessions, increasing the chance of generating duplicated accessions in the database, or other upload issues. As is the case with synonyms, many of these problems result from poorly defined accession identifiers with capitalization inconsistencies and special characters such as slashes, dots, dashes, underlines, and spaces. Although we recommend avoiding such special characters, especially in primary identifiers, it is not always feasible, notably with legacy data. To ease upload and tracking of such cases, Breedbase has a fuzzy search (also called approximate string matching search) component, enabling an accurate quality control of existing similar germplasm names in the database.
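Approximate string matching of this kind can be sketched with Python's standard `difflib` module. This is a stand-in chosen for illustration; the text does not say which algorithm Breedbase uses, and the normalization rule, the 0.8 cutoff, and the function names are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Upper-case and drop spaces, slashes, dots, dashes, and underscores."""
    return re.sub(r"[\s/._-]+", "", name).upper()


def similar_names(new_name, existing_names, cutoff=0.8):
    """Return existing names whose normalized form resembles the new name."""
    target = normalize(new_name)
    scored = []
    for existing in existing_names:
        score = SequenceMatcher(None, target, normalize(existing)).ratio()
        if score >= cutoff:
            scored.append((existing, score))
    # Best matches first; ties broken alphabetically for a stable order.
    return [name for name, _ in sorted(scored, key=lambda p: (-p[1], p[0]))]


existing = ["TX303", "B73", "Mo17", "TMEB-419"]
print(similar_names("Tx 303", existing))
print(similar_names("TMEB419", existing))
```

A curator would review each reported match and decide whether the new name is a synonym of the existing accession or a genuinely new entry.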
Phenotyping trials
Phenotyping trials are a core activity of plant breeding programs, and must be carefully designed. Trial designs can either be generated directly in Breedbase using the integrated, comprehensive trial design tool or uploaded using Excel files formatted with a Breedbase-provided template. Trial metadata fields include breeding program, location, name, trial type, year, plot dimensions, field size, and trial design type. Supported statistical trial design types currently include alpha lattice, lattice, augmented, split plot, partially replicated, and Westcott designs. Designs should also include the ordinal row and column positions of each plot as it is planted in the field, so Breedbase allows this information to be added either during or after design storage. Once a trial design is finalized, it is stored in the Breedbase schema. Within Breedbase, a field trial links phenotypic observations to the experimental layout under a specific statistical design.
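To make the link between a design and its field positions concrete, the Python sketch below randomizes a simple randomized complete block design and assigns each plot a row and column. The randomized complete block design is used here only because it is easy to follow; it is not one of the designs listed above, the function is hypothetical, and Breedbase itself computes designs with R libraries rather than with this code.

```python
import random


def rcbd_layout(accessions, n_blocks, seed=None):
    """Randomize every accession once per block; each block is one field row."""
    rng = random.Random(seed)
    plots = []
    plot_number = 1
    for block in range(1, n_blocks + 1):
        order = list(accessions)
        rng.shuffle(order)  # independent randomization within each block
        for col, accession in enumerate(order, start=1):
            plots.append({
                "plot_number": plot_number,
                "block": block,
                "row": block,   # ordinal row position in the field
                "col": col,     # ordinal column position in the field
                "accession": accession,
            })
            plot_number += 1
    return plots


layout = rcbd_layout(["acc_A", "acc_B", "acc_C", "acc_D"], n_blocks=3, seed=42)
for plot in layout:
    print(plot)
```

Calling the function again with a different seed corresponds to rerunning the randomization before the design is stored.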
Row crops usually use the plot as the minimal entity for data collection, but many specialty crops (e.g. vegetables) require data collection on a per-plant or per-tissue basis. Breedbase allows plant- and tissue-level entry creation for each plot in a trial, resulting in database entries and identifiers at each level, which can also be encoded in barcode labels for data collection.
To collect data from crosses, Breedbase requires the creation of a top-level crossing experiment; the crossing experiment is defined with a unique name, a breeding program, a location, a year, and a description. The individual crosses performed are then stored under the crossing experiment and defined by a cross unique id, parents, and a cross type. The cross type can be one of the following: biparental, self, sib, open pollinated, bulk, bulk selfed, bulk and open-pollinated, doubled haploid, polycross, reciprocal, or multicross. Depending on the type of cross performed, different metadata must be provided; for example, in a biparental cross, information from both the male and female parent is required, whereas in an open-pollinated cross, information on only the female is required. In the case of an open-pollinated cross, a population name representing a group of male germplasm can be given as the male parent. In addition to cross unique id, which captures specific details of each cross, users have the option to group crosses having the same parental genotypes via family name for downstream progeny analysis.
Breedbase tracks parental information from crosses in 2 ways: (1) through the accession names of the female and male parents, allowing for simple ancestry tracking of AxB pedigrees for the progeny from a cross. When a cross is created in Breedbase, the pedigree between progeny and parental germplasm is automatically created as well. This first form of parental tracking is applied in all cases when a cross is created in Breedbase. (2) Through the plot or plant names of the male and female parents. The plot or plant names of the parents are related to the field trial in which they are planted, as is described in the above field trial section. This approach allows detailed tracking of female and male parents used in crossing, but is optional in Breedbase because of the difficulty in recording this information in many cases.
Recording information on parental plots is facilitated by mobile data collection platforms. Of note are customized Open Data Kit (ODK) Android applications, such as BTract and the PhenoApps app Intercross. BTract assigns and prints a unique cross barcode label after scanning barcodes to track the precise male and female plots or plants involved in the pollination. Through ODK data synchronization, the cross information can be uploaded into Breedbase. Intercross can be used to scan parental barcodes and associate a unique cross id to the performed cross. The output from Intercross can also be uploaded directly into Breedbase.
In crossing experiments that include evaluation of crosses, Breedbase can store annotations regarding properties of the cross. Default properties include pollination date, tag number, number of flowers, number of bags, number of fruits, and number of seeds; however, these properties are set in the configuration file for the Breedbase instance, allowing researchers flexibility in defining these terms. Breedbase also supports tracking of tissue culture samples.
Crosses can be created individually using an interactive interface on Breedbase or can be uploaded in bulk using an Excel spreadsheet by providing cross unique ids, cross types, and parents involved. Once each cross unique id is saved in Breedbase, additional data can be added or uploaded using the cross unique id as an identifier. Progeny of the cross can be saved as new germplasm in the database, automatically creating pedigrees for the new germplasm.
Genotyping data
High-density genotyping data are a complex data type that has become an important resource in modern breeding programs due to the advent of low-cost next-generation sequencing and genotyping technologies ( Thomson 2014 ). Breedbase offers simple laboratory information management functionalities from field tissue sampling to SNP data storage. Functions include tissue sample collection and tracking via plot barcodes and PCR plate formats (i.e. 96 or 384 wells), genotyping protocol definition, data storage, and subsequent analytics ( Tecle et al. 2014 ; Morales, Bauchet, et al. 2020 ).
The primary means of organizing genotyping data between sequencing events is the “genotyping protocol” in Breedbase. A “genotyping protocol” consists of a specific set of genotypic markers and records all metadata about how the genotypes were produced, including the reference genome and specifics about the analytical platform and related variant calling software. The “genotyping protocols” can be grouped in Breedbase under a “genotyping project,” which displays all relevant genotyping data and provides an overview; this is especially useful for very active genotyping programs.
Data from multiple genotyping technologies can be stored in Breedbase, from low-density genotyping (e.g. Kompetitive allele-specific PCR, KASP) to high-density genotyping such as genotyping-by-sequencing or DArT-seq ( Elshire et al. 2011 ; Kilian et al. 2012 ; Semagn et al. 2014 ). The preferred method for uploading high-density genotyping data to Breedbase is through variant call format (VCF) files. VCF provides for compact representation of genotypic scores for large numbers of samples and markers ( Danecek et al. 2011 ). PostgreSQL nonrelational functionalities allow Breedbase to store high-density genotyping data in JavaScript object notation (JSON) structures within the larger relational database schema ( ISO/IEC TR 19075-6:2017 2018 ). Breedbase particularly relies on the binary JSON (JSONb) data type for compressed data storage and faster retrieval ( Morales, Bauchet, et al. 2020 ).
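The step from a VCF file to per-sample JSON can be sketched as follows. The JSON layout here (one object per sample, keyed by marker name, holding the genotype call and an alternate-allele count) is an assumption for illustration and not necessarily the structure Breedbase stores; the VCF content is a made-up two-marker example, and the function names are hypothetical.

```python
import json

# Minimal made-up VCF: two markers, two samples, GT as the only FORMAT field.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tacc_A\tacc_B\n"
    "1\t1000\tS1_1000\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\n"
    "1\t2500\tS1_2500\tC\tT\t.\tPASS\t.\tGT\t1/1\t./.\n"
)


def dosage(sample_field):
    """Count non-reference alleles in a GT call such as '0/1'; None if missing."""
    alleles = sample_field.split(":")[0].replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def vcf_to_json_by_sample(vcf_text):
    """Return {sample_name: JSON string of {marker: {GT, dosage}}}."""
    samples, by_sample = [], {}
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            by_sample = {s: {} for s in samples}
            continue
        marker = fields[2] if fields[2] != "." else f"{fields[0]}_{fields[1]}"
        for sample, call in zip(samples, fields[9:]):
            by_sample[sample][marker] = {
                "GT": call.split(":")[0],  # GT is the first FORMAT field
                "dosage": dosage(call),
            }
    return {s: json.dumps(m, sort_keys=True) for s, m in by_sample.items()}


documents = vcf_to_json_by_sample(VCF_TEXT)
for sample, doc in documents.items():
    print(sample, doc)
```

Each JSON string could then be written to a single JSON-typed column in the row for that sample, which keeps all marker scores for a sample together.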
Genotyping data can be queried alongside relationally stored phenotypic and experimental information for analyses, including computation of a genomic relationship matrix for user-specified germplasm ( VanRaden 2008 ) and computation of a GWAS for user-specified germplasm and phenotypic traits. Queries spanning specific markers or marker sets and experimental information can be readily constructed. Genotyping data results can be downloaded as VCF files from the Search Wizard web interface. The genotyping data are also used in the Genomic Selection tool, solGS, to predict genomic estimated breeding values (GEBVs) of genotyped lines.
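For readers unfamiliar with the genomic relationship matrix, the sketch below implements the first method of VanRaden (2008), G = ZZ' / (2 Σ p_j (1 − p_j)), in plain Python, where Z holds 0/1/2 allele dosages centered by twice the allele frequency p_j of each marker. It assumes complete data with no missing calls, estimates allele frequencies from the sample itself, and uses a made-up 3 × 3 dosage matrix; it is not the code Breedbase runs.

```python
def genomic_relationship_matrix(dosages):
    """VanRaden (2008) method 1 for a list of per-individual 0/1/2 dosage rows."""
    n = len(dosages)        # individuals
    m = len(dosages[0])     # markers
    # Allele frequency of the counted allele at each marker (sample estimate).
    freqs = [sum(row[j] for row in dosages) / (2 * n) for j in range(m)]
    # Scaling term: 2 * sum of p(1 - p) over markers.
    denom = 2 * sum(p * (1 - p) for p in freqs)
    # Center each dosage by its expected value 2p.
    z = [[row[j] - 2 * freqs[j] for j in range(m)] for row in dosages]
    return [
        [sum(a * b for a, b in zip(z[i], z[k])) / denom for k in range(n)]
        for i in range(n)
    ]


# Made-up dosages: rows are individuals, columns are markers.
DOSAGES = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
]

G = genomic_relationship_matrix(DOSAGES)
for row in G:
    print([round(value, 3) for value in row])
```

Because the allele frequencies come from the same individuals, each row of G sums to zero; with frequencies from a separate base population this would not generally hold.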
Authentication and authorization
During breeding processes, a potentially large number of people will need to access the database to download, upload, modify, or delete data. This requires a fine-tuned layer of authentication and authorization management in the database. Breedbase requires a user to log in for most functionalities (authentication). Every user account is associated with “roles” that determine what the user will be allowed to do in the system (authorization). Currently, there are 3 major roles: user, submitter, and curator. The user role allows read-only access. With the submitter role, a user can upload data, and can modify or delete data that they themselves uploaded. The curator role allows a user to modify any type of data. In addition, every breeding program in the database has a corresponding role that controls authorization over specific breeding program activities, such as creating and uploading trial data.
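The three major roles translate into a short authorization rule. The Python sketch below encodes only what is stated above (read-only users, submitters who can change their own uploads, curators who can change anything); the function name and arguments are hypothetical, and breeding-program roles are left out because the text does not specify how they combine with the major roles.

```python
def can_modify(roles, username, record_owner):
    """Decide whether a logged-in user may modify or delete a record."""
    if "curator" in roles:
        return True                      # curators may modify any data
    if "submitter" in roles:
        return username == record_owner  # submitters: only their own uploads
    return False                         # plain users have read-only access


print(can_modify({"user"}, "alice", "alice"))
print(can_modify({"user", "submitter"}, "alice", "alice"))
print(can_modify({"user", "submitter"}, "alice", "bob"))
print(can_modify({"user", "curator"}, "carol", "bob"))
```

Checking the most permissive role first keeps the rule easy to read and to extend with further roles.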
Cassavabase, the flagship Breedbase database
Cassavabase ( https://cassavabase.org/ ) is the breeding database for the NextGen Cassava project ( https://nextgencassava.org/, last accessed 4/18/2022 ). The NextGen Cassava partners, IITA (Ibadan, Nigeria), NRCRI (Umudike, Nigeria), NaCRRI (Namulonge, Uganda), TARI (Ukiriguru, Tanzania), Embrapa (Cruz das Almas, Brazil), and CIAT (Cali, Colombia) use Cassavabase for their breeding programs, starting as early as 2014. To date, Cassavabase has accumulated an immense amount of cassava breeding data ( Fig. 1c ), consisting of information on more than 500,000 cassava accessions, characterized by over 19 million phenotypic measurements in over 4,000 trials, and nearly 35,000 genotyping experiments. This shows that the Breedbase system can scale to fairly large datasets and large, multi-institute and multinational programs.
Other instances of Breedbase
In addition to Cassavabase, Breedbase has been deployed for various crops, notably for other Roots, Tuber and Banana (RTB) crops ( https://rtbbase.org/, last accessed 4/18/2022 ) in the CGIAR: banana ( https://musabase.org/, last accessed 4/18/2022 ), sweetpotato ( https://sweetpotatobase.org/, last accessed 4/18/2022 ), and yam ( https://yambase.org/, last accessed 4/18/2022 ). In addition, several dozen Breedbase instances are currently deployed for other crops, such as rice ( https://ricebase.org/, last accessed 4/18/2022 ), wheat ( https://wheat.triticeaetoolbox.org/, last accessed 4/18/2022 ), oat ( https://oat.triticeaetoolbox.org/, last accessed 4/18/2022 ), kelp ( https://sugarkelpbase.org/, last accessed 4/18/2022 ), potato, and maize. While the aforementioned projects use Breedbase mainly for breeding informatics purposes, other Breedbase instances focus on genomics. These include SGN ( https://solgenomics.net/ ; Fernandez-Pozo, Menta, et al. 2015 ), which focuses on tomato and other Solanaceae , fern ( https://fernabase.org/, last accessed 4/18/2022; Li et al. 2018 ), Erysimum ( https://erysimum.org/ , last accessed 4/18/2022; Züst et al. 2020 ), and milkweed ( https://milkweedbase.org/ , last accessed 4/18/2022). In addition, a Breedbase instance has been deployed to characterize a tritrophic vector-borne disease system, the citrus greening disease ( https://citrusgreening.org/ , last accessed 4/18/2022) ( Saha et al. 2017 ). An instance named ImageBreed has been deployed for high-throughput imaging of maize and alfalfa field experiments ( https://imagebreed.org/ , last accessed 4/18/2022; Morales, Kaczmar, et al. 2020 ). A number of academic labs and breeding companies also use Breedbase for data management within their programs.
The Breeding Insight project ( https://breedinginsight.org/, last accessed 4/18/2022 ), which creates breeding databases for USDA breeding programs, has also adopted the Breedbase system as a foundation for their breeding solutions.
Box 1. Providing data management tools for small grains breeders: the Triticeae Toolbox adaptation of Breedbase
As documented in this article, Breedbase provides many features for working breeding programs. The mission of The Triticeae Toolbox (T3) is to provide these features to a diverse audience of small grains breeding programs, by mandate in the United States, and by extension globally.
The development of T3 is motivated by the belief that larger datasets provide greater power to identify genetic effects that are relevant to all breeders. Across wheat, oat, and barley, T3 stores 5,600 trials, comprising over 1,800,000 phenotypic data points on over 30,000 lines with genotype data. From there, T3 seeks to provide breeders with results from analyses that tap into these data, in the hope that this will help breeders gain insights from their own data. The primary example we have in this area is a function to show marker trait associations identified among all trials submitted to T3 with adequate marker density, and meta-analyzed to determine robust associations across trials. The next milestone on the roadmap of this function is to develop marker imputation functionality on T3 that will present genotype trials with uniform high-density marker scores, enabling meta-analysis over more trials. Indeed, marker data are a critical rationale for T3’s mission: the database contains data on many lines that now are connected to current populations primarily through the marker alleles segregating.
An important advantage of a web-based data management platform is that it links the data to the world of knowledge available on the web. T3 provides that connectivity by providing links to external information on markers, traits, and germplasm. Our primary partners in that regard are GrainGenes, Wheat Expression Browser, and the Wheat KnetMiner ( Hassani-Pak et al. 2021 ). For example, a marker trait association close to a gene can be used to connect that trait to JBrowse ( https://jbrowse.org/, last accessed 4/18/2022 ), to gene expression data ( expVIP and EMBL-EBI ) or to a knowledge network, KnetMiner . Traits in Breedbase are defined using collaborative ontologies crucial to forging these links: the ontologies represent agreements on naming traits and gene functions that enable meaningful bridges across knowledge platforms.
The diversity of T3 users means that they will not operate together as an integrated breeding organization. Rather each breeding program submitting data to T3 will want data privacy and ease in determining what data becomes incorporated into the public production database. Currently, all data on the production database is available to anyone. We plan on implementing privacy settings specifying data visibility as public or restricted. Absent this feature, we now work with a few users by providing them with separate instances of T3 that are not publicly visible but can easily transmit datasets to the T3 production database when ready.
The wide range of T3 users also means that we expect them to have varying degrees of familiarity with the Breedbase platform. To allow users to test the addition and modification of datasets without modifying curated data by mistake, T3 has created sandbox instances for each crop. Users can freely upload data to the sandbox, ensure that the uploaded data added to the database is correct, and then easily publish the data to the production instance. A data curator checks the submitted data before adding it to the production database. The Breedbase system was crucial in establishing these features and reduced duplication of effort.
Box 2. Usage example
Recently, Ogbonna et al. (2021) leveraged legacy breeding data to investigate the genetic architecture of cyanide content in cassava, a key trait in food safety. The authors performed a retrospective analysis, mining historical cyanide data from the African IITA breeding program (18 locations, 23 years, and 393 trials) and the Colombian CIAT program (41 locations, 11 years, and 155 trials) from the Breedbase instance cassavabase.org. Recycling open-source, standardized breeding data in conjunction with novel genotypic data provided high statistical power and allowed the detection of key loci controlling cassava root cyanide content using GWAS. Such loci would otherwise have gone undetected and were identified only because of the availability of the Breedbase digital ecosystem.
Breeding is a complex process involving many different types of data, especially considering genome-based breeding methods at the current state of the art. Creating and maintaining breeding databases is therefore generally considered to be time-consuming and expensive. Many large breeding companies maintain their own databases and software for managing breeding processes and selection, but this is not an option for smaller programs. The lack of bespoke databases is especially true in resource poor areas of the world, where the need for plant improvement is often the greatest. A free, user-driven, and open source platform such as Breedbase that integrates a complete digital ecosystem for breeding will help close the gap for these programs as well as many smaller to mid-sized organizations. Still, Breedbase databases can scale significantly to large breeding programs with hundreds of thousands of accessions and millions of phenotypic scores.
Integration in breeding programs
Even the best breeding data management tools will fail to deliver if breeding programs do not use them or use them incorrectly. A significant effort is required to integrate a breeding database into the workflow of a breeding organization, as data management is central to the work of modern breeding programs but remains a shortcoming. Breeding activities need to be closely tracked; to ensure complete integration, all materials, operations, and operators need to be systematically recorded and reviewed throughout the process. This is important to enable analyses, improve data quality, and to identify sources of errors in real time and post hoc.
It is important for breeding programs to work closely with groups that have significant experience in data management, which can help the breeding programs understand their needs and train staff better in the use of the database. In the RTB breeding programs, we found it helpful to designate specific staff as Data Managers, who receive extensive database training. Data Managers have spent time at the BTI to learn more about the database developments, and can provide additional training and help on the ground in the breeding programs. They also provide timely feedback on the tools and features based on their firsthand experiences, which is vital for the improvement of the database. We have put significant effort into user training through in-person workshops, reciprocal visits, and training materials, such as a complete online manual, slideshows, and most recently a YouTube channel with recorded workshops.
In our experience, one of the bottlenecks in implementing a breeding database is the availability of standardized trait ontologies for the crop in question. Especially in larger projects, it can be difficult for all breeders to agree on a common ontology, including common sample preparation and measurement protocols, as well as measurement units. Without this standardization, a database loses much of its appeal as it becomes impossible to aggregate and reconcile disparate data. This challenge cannot be overstated: it is a major obstacle, especially when phenotypic data are collected across different locations for a variety of crops and have to be stored in a single integrated system. We have focused on developing ontologies and common vocabularies to address this issue, but it can be harder than expected, as there are often diverging and strong opinions on these matters. In addition, breeding programs introduce new traits to be measured, for example, quality traits, and there needs to be a process to integrate such new terms into the ontology. Fortunately, the CO project (Shrestha et al. 2012) has created trait ontologies for a wide range of crops, which we contribute to and on which many Breedbase instances rely. CO has also defined processes for updating and developing the ontologies, which allows new traits and methods to be introduced to breeding programs with relative ease.
Future developments
Progress in the last few years in digital agriculture has been enormous and will continue to be so in the foreseeable future. New genotyping and phenotyping technologies, such as NIRS, are constantly being developed or improved. Breeding databases must coevolve with the technological advances to remain relevant, requiring significant effort in refactoring and implementation. Systems that easily adapt to new technologies will have a distinct advantage; in terms of software development strategies, agile software development will be more efficient than older waterfall-type models. Another area of improvement is that of algorithms and other aspects of methodology. With a strong connection to the R programming language, it is relatively easy to implement new algorithms in Breedbase, as they often require little modification from standalone scripts to work within Breedbase. At its core, Breedbase uses a relational database with integrated JSON data storage, which provides a healthy balance between highly structured, normalized data and flexibility. However, other systems, such as graph databases and highly parallelized solutions like Hadoop, or a combination thereof, are becoming popular and may be integrated into Breedbase in the future.
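The balance between normalized columns and flexible JSON storage described above can be illustrated with a small sketch. It is not Breedbase's schema: SQLite stands in for the relational engine, and the table, column, accession, and marker names are all hypothetical. The accession and protocol are ordinary typed columns that can be indexed and joined, while the marker scores, whose keys differ from one genotyping protocol to the next, travel as a single JSON document.

```python
import json
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE genotype_call (
        accession TEXT NOT NULL,   -- structured, normalized column
        protocol  TEXT NOT NULL,   -- structured, normalized column
        scores    TEXT NOT NULL,   -- JSON document: marker name -> allele dosage
        PRIMARY KEY (accession, protocol)
    )
    """
)

# Hypothetical records; each protocol may report a different set of markers.
records = [
    ("TMS-0001", "GBS-v1", {"S1_1234": 2, "S1_5678": 0}),
    ("TMS-0002", "GBS-v1", {"S1_1234": 1, "S1_5678": 1}),
]
conn.executemany(
    "INSERT INTO genotype_call VALUES (?, ?, ?)",
    [(acc, proto, json.dumps(scores)) for acc, proto, scores in records],
)


def dosage(conn, accession, protocol, marker):
    """Return the allele dosage for one marker, or None if it is not stored."""
    row = conn.execute(
        "SELECT scores FROM genotype_call WHERE accession = ? AND protocol = ?",
        (accession, protocol),
    ).fetchone()
    if row is None:
        return None
    # The flexible part of the record is decoded only when it is needed.
    return json.loads(row[0]).get(marker)


print(dosage(conn, "TMS-0001", "GBS-v1", "S1_1234"))
```

Adding a new marker panel in this layout requires no schema change, which is the flexibility the JSON column buys; the cost is that individual scores cannot be constrained or indexed as easily as regular columns.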
All Breedbase code is open source and readily available on the code-sharing site GitHub (https://github.com/solgenomics).
Conclusions
Breedbase provides a fully open-source, scalable, and feature-rich breeding digital ecosystem that has been in use at the RTB crops breeding centers of the CGIAR for many years, starting with the NextGen Cassava database, Cassavabase ( https://cassavabase.org/ ). The system has now been adopted by various breeding programs including vegetable and grain crops and maintains an open and collaborative approach to software development, allowing database customization for each research community while sustaining a common framework. Our hope is that Breedbase, and the digital ecosystem that it provides, can contribute, in a small way, to solving the world’s big problems with food scarcity and food quality, and thus contribute to improving subsistence farmers’ lives around the world.
Web resources
https://github.com/solgenomics/ —Github repositories for Breedbase code, last accessed 4/18/2022
https://hub.docker.com/r/breedbase/breedbase# —Docker image for Breedbase server, last accessed 4/18/2022
https://breedbase.org/ —Breedbase demo site, last accessed 4/18/2022
https://cassavabase.org/ —Cassavabase, the flagship Breedbase site, last accessed 4/18/2022
https://musabase.org/ —Breedbase site for banana breeding, last accessed 4/18/2022
https://yambase.org/ —Breedbase site for yam breeding, last accessed 4/18/2022
https://sweetpotatobase.org/ —Breedbase site for sweet potato breeding, last accessed 4/18/2022
https://www.youtube.com/channel/UC3jrvvzGKKEHzOriDBgnj0A —YouTube channel for Breedbase, last accessed 4/18/2022
Data availability
Acknowledgments.
We thank the Breeding Insight Project, including Moira Sheehan, Tim Parsons, Nick Palladino, and Chris Tucker, for providing code to the project. We also thank many of the previous members of the breeding leads and staff, including Racheal Mukisa, Dorcus Gemenet, Edward Carey, Jolien Swanckaert, Jan Low, Robert Mwanga, and many others. Thanks to Thomas Hickey, Keo Corak, and Julie Dawson for their many bug reports and suggestions. For the development of food quality-related features, we thank the RTBFoods project, especially Karima Meghar, Thierry Tran, and Dominique Dufour. Thanks to Lynn Johnson, Chris Hernandez, and Peter Selby for their help and suggestions. Thanks to Luka Wanjohi, Reinhard Simon, Ciro Rosales, Ivan Perez, and Elisa Salas for comments and suggestions. Special thanks to Kathy Kahn, Jim Lorenzen, and the entire BMGF for their tireless support of the African RTB breeding programs. This article is dedicated to the memory of Martha Hamblin.
This work was partially supported by the NEXTGEN Cassava project, through a grant to Cornell University by the Bill & Melinda Gates Foundation (Grant INV-007637 http://www.gatesfoundation.org ) and the UK’s Foreign, Commonwealth & Development Office (FCDO). Other funding was provided by BMGF through the Africa Yam project, Better Breeding Bananas, and the sweetpotato-focused GT4SP and SweetGAINS projects.
JLJ, CLB, and DJW received partial support from the US Wheat and Barley Scab Initiative and the National Research Initiative Competitive Grants 2017‐67007‐25939 and 2017‐67007‐25929 from the National Institute of Food and Agriculture, US Department of Agriculture.
The Crop Ontology was financially supported through the CGIAR Platform for Big Data in Agriculture and the CGIAR Agrifood Research Programmes, by the CGIAR Trust Fund ( https://www.cgiar.org/funders/ ), and UKAID. The Crop Ontology was additionally supported through the Planteome Project, led by Pankaj Jaiswal (Oregon State University), by the National Science Foundation, USA (IOS : 1340112).
Conflicts of interest
None declared.
Contributor Information
Nicolas Morales, Boyce Thompson Institute, Ithaca, NY 14853, USA. Cornell University, Ithaca, NY 14853, USA.
Alex C Ogbonna, Boyce Thompson Institute, Ithaca, NY 14853, USA. Cornell University, Ithaca, NY 14853, USA.
Bryan J Ellerbrock, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Guillaume J Bauchet, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Titima Tantikanjana, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Isaak Y Tecle, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Adrian F Powell, Boyce Thompson Institute, Ithaca, NY 14853, USA.
David Lyon, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Naama Menda, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Christiano C Simoes, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Surya Saha, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Prashant Hosmani, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Mirella Flores, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Naftali Panitz, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Ryan S Preble, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Afolabi Agbona, IITA Ibadan, 200001 Ibadan, Nigeria.
Ismail Rabbi, IITA Ibadan, 200001 Ibadan, Nigeria.
Peter Kulakow, IITA Ibadan, 200001 Ibadan, Nigeria.
Prasad Peteti, IITA Ibadan, 200001 Ibadan, Nigeria.
Robert Kawuki, NaCCRI, Namulonge, Uganda.
Williams Esuma, NaCCRI, Namulonge, Uganda.
Micheal Kanaabi, NaCCRI, Namulonge, Uganda.
Doreen M Chelangat, NaCCRI, Namulonge, Uganda.
Ezenwanyi Uba, National Root Crops Research Institute (NRCRI), 463109 Umudike, Nigeria.
Adeyemi Olojede, National Root Crops Research Institute (NRCRI), 463109 Umudike, Nigeria.
Joseph Onyeka, National Root Crops Research Institute (NRCRI), 463109 Umudike, Nigeria.
Trushar Shah, IITA Nairobi, 30709-00100 Nairobi, Kenya.
Margaret Karanja, IITA Nairobi, 30709-00100 Nairobi, Kenya.
Chiedozie Egesi, Boyce Thompson Institute, Ithaca, NY 14853, USA. IITA Ibadan, 200001 Ibadan, Nigeria. National Root Crops Research Institute (NRCRI), 463109 Umudike, Nigeria.
Hale Tufan, Cornell University, Ithaca, NY 14853, USA.
Agre Paterne, IITA Ibadan, 200001 Ibadan, Nigeria.
Asrat Asfaw, IITA Abuja, 901101 Abuja, Nigeria.
Jean-Luc Jannink, Cornell University, Ithaca, NY 14853, USA. USDA-ARS, Ithaca, NY 14853, USA.
Marnin Wolfe, Cornell University, Ithaca, NY 14853, USA.
Clay L Birkett, Cornell University, Ithaca, NY 14853, USA. USDA-ARS, Ithaca, NY 14853, USA.
David J Waring, Cornell University, Ithaca, NY 14853, USA. USDA-ARS, Ithaca, NY 14853, USA.
Jenna M Hershberger, Cornell University, Ithaca, NY 14853, USA.
Michael A Gore, Cornell University, Ithaca, NY 14853, USA.
Kelly R Robbins, Cornell University, Ithaca, NY 14853, USA.
Trevor Rife, Kansas State University, Manhattan, KS 66506, USA.
Chaney Courtney, Kansas State University, Manhattan, KS 66506, USA.
Jesse Poland, Kansas State University, Manhattan, KS 66506, USA.
Elizabeth Arnaud, Bioversity-CIAT Alliance, 34397 Montpellier, France.
Marie-Angélique Laporte, Bioversity-CIAT Alliance, 34397 Montpellier, France.
Heneriko Kulembeka, TARI, 33518 Ukiriguru, Tanzania.
Kasele Salum, TARI, 33518 Ukiriguru, Tanzania.
Emmanuel Mrema, TARI, 33518 Ukiriguru, Tanzania.
Allan Brown, IITA Ibadan, 200001 Ibadan, Nigeria.
Stanley Bayo, IITA Ibadan, 200001 Ibadan, Nigeria.
Brigitte Uwimana, IITA Ibadan, 200001 Ibadan, Nigeria.
Violet Akech, IITA Ibadan, 200001 Ibadan, Nigeria.
Craig Yencho, North Carolina State University (NCSU), Raleigh, NC 27695, USA.
Bert de Boeck, CIP, 15000 Lima, Peru.
Hugo Campos, CIP, 15000 Lima, Peru.
Rony Swennen, KU Leuven, 3000 Leuven, Belgium.
Jeremy D Edwards, USDA-ARS, Stuttgart, AR 72160, USA.
Lukas A Mueller, Boyce Thompson Institute, Ithaca, NY 14853, USA.
Literature cited
- Arnaud E, Laporte M-A, Kim S, Aubert C, Leonelli S, Miro B, Cooper L, et al. The ontologies community of practice: a CGIAR initiative for big data in agrifood systems. Patterns. 2020;1(7):100105.
- Andrade-Sanchez P, Gore MA, Heun JT, Thorp KR, Carmo-Silva AE, French AN, Salvucci ME, White JW. Development and evaluation of a field-based high-throughput phenotyping platform. Funct Plant Biol. 2013;41(1):68–79.
- Beck K, Andres C. 2004. Extreme Programming Explained: Embrace Change. Boston: Addison-Wesley.
- Bombarely A, Menda N, Tecle IY, Buels RM, Strickler S, Fischer-York T, Pujar A, Leto J, Gosselin J, Mueller LA. The Sol Genomics Network (solgenomics.net): growing tomatoes using Perl. Nucleic Acids Res. 2011;39(Database issue):D1149–D1155.
- Breseghello F, Coelho ASG. Traditional and modern plant breeding methods with examples in rice (Oryza sativa L.). J Agric Food Chem. 2013;61(35):8277–8286.
- Celko J. 2006. OLAP basics. In: Celko J, editor. Joe Celko’s Analytics and OLAP in SQL. Amsterdam: Elsevier. https://doi.org/10.1016/b978-012369512-3/50028-7
- Cobb JN, Juma RU, Biswas PS, Arbelaez JD, Rutkoski J, Atlin G, Hagen T, Quinn M, Ng EH. Enhancing the rate of genetic gain in public-sector plant breeding programs: lessons from the Breeder’s equation. Theor Appl Genet. 2019;132(3):627–645.
- Coombes NE. 2009. DiGGeR, a spatial design program. Biometric Bulletin. NSW DPI.
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al.; 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–2158.
- De Mendiburu F, Simon R. n.d. Agricolae - ten years of an open source statistical tool for experiments in breeding. Agric Biol. https://doi.org/10.7287/peerj.preprints.1404v1
- Duarte JB, Pinto RMC. Biplot AMMI graphic representation of specific combining ability. CBAB. 2002;2(2):161–170. https://doi.org/10.12702/1984-7033.v02n02a01
- Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One. 2011;6(5):e19379.
- Fernandez-Pozo N, Menda N, Edwards JD, Saha S, Tecle IY, Strickler SR, Bombarely A, Fisher-York T, Pujar A, Foerster H, et al. The Sol Genomics Network (SGN)—from genotype to phenotype to breeding. Nucleic Acids Res. 2015;43(D1):D1036–D1041. https://doi.org/10.1093/nar/gku1195
- Fernandez-Pozo N, Rosli HG, Martin GB, Mueller LA. The SGN VIGS tool: user-friendly software to design virus-induced gene silencing (VIGS) constructs for functional genomics. Mol Plant. 2015;8(3):486–488.
- Food and Agriculture Organization of the United Nations. Genebank Standards for Plant Genetic Resources for Food and Agriculture. Rome, Italy: Food and Agriculture Organization; 2018.
- Hassani-Pak K, Singh A, Brandizi M, Hearnshaw J, Parsons JD, Amberkar S, Phillips AL, Doonan JH, Rawlings C. KnetMiner: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across species. Plant Biotechnol J. 2021;19(8):1670–1678. https://doi.org/10.1111/pbi.13583
- Hershberger J, Morales N, Simoes CC, Ellerbrock B. Making WAVES in Breedbase: an integrated spectral data storage and analysis pipeline for plant breeding programs. Plant Phenome J. 2021;4(1):1–7.
- Holland JB, Nyquist WE, Cervantes-Martínez CT. Estimating and interpreting heritability for plant breeding: an update. In: Janick J, editor. Plant Breeding Reviews. New York, NY: John Wiley & Sons; 2010. https://doi.org/10.1002/9780470650202.ch2
- ISO/IEC TR 19075-6:2017. Information technology — Database languages — SQL Technical Reports — Part 6: SQL support for JavaScript Object Notation (JSON). ISO. 2018. https://www.iso.org/standard/67367.html
- Shore J, Warden S. The Art of Agile Development. Sebastopol, CA: O’Reilly Media, Inc.; 2008.
- Jung S, Menda N, Redmond S, Buels RM, Friesen M, Bendana Y, Sanderson L-A, Lapp H, Lee T, MacCallum B, et al. The Chado natural diversity module: a new generic database schema for large-scale phenotyping and genotyping data. Database. 2011;2011(November):bar051.
- Kilian A, Wenzl P, Huttner E, Carling J, Xia L, Blois H, Caig V, Heller-Uszynska K, Jaccoud D, Hopper C, et al. Diversity arrays technology: a generic genome profiling technology on open platforms. Methods Mol Biol. 2012;888:67–89.
- Li F-W, Brouwer P, Carretero-Paulet L, Cheng S, de Vries J, Delaux P-M, Eily A, Koppers N, Kuo L-Y, Li Z, et al. Fern genomes elucidate land plant evolution and cyanobacterial symbioses. Nat Plants. 2018;4(7):460–472.
- Menda N, Buels RM, Tecle I, Mueller LA. A community-based annotation framework for linking solanaceae genomes with phenomes. Plant Physiol. 2008;147(4):1788–1799.
- Meuwissen TH, Goddard ME. Prediction of identity by descent probabilities from marker-haplotypes. Genet Sel Evol. 2001;33(6):605–634.
- Morales N, Bauchet GJ, Tantikanjana T, Powell AF, Ellerbrock BJ, Tecle IY, Mueller LA. High density genotype storage for plant breeding in the Chado schema of Breedbase. PLoS One. 2020;15(11):e0240059. https://doi.org/10.1371/journal.pone.0240059
- Morales N, Kaczmar NS, Santantonio N, Gore MA, Mueller LA, Robbins KR. ImageBreed: open‐access plant breeding web–database for image‐based phenotyping. Plant Phenome J. 2020;3(1). https://doi.org/10.1002/ppj2.20004
- Mueller LA, Mills AA, Skwarecki B, Buels RM, Menda N, Tanksley SD. The SGN comparative map viewer. Bioinformatics. 2008;24(3):422–423.
- Mueller LA, Solow TH, Taylor N, Skwarecki B, Buels R, Binns J, Lin C, Wright MH, Ahrens R, Wang Y, et al. The SOL Genomics Network: a comparative resource for Solanaceae biology and beyond. Plant Physiol. 2005;138(3):1310–1317.
- Mueller LA, Tanksley SD, Giovannoni JJ, van Eck J, Stack S, Choi D, Kim BD, Chen M, Cheng Z, Li C, et al. The tomato sequencing project, the first cornerstone of the International Solanaceae Project (SOL). Comp Funct Genomics. 2005;6(3):153–158.
- Musen MA. The Protégé project. AI Matters. 2015;1(4):4–12. https://doi.org/10.1145/2757001.2757003
- Ogbonna AC, Braatz de Andrade LR, Rabbi IY, Mueller LA, de Oliveira EJ, Bauchet GJ. Large-scale GWAS using historical data identifies a conserved genetic architecture of cyanogenic glucosides content in Cassava (Manihot Esculenta Crantz.) root. Plant J. 2021;105(3):754–770.
- Pietragalla J, Valette L, Shrestha R, Laporte M-A, Hazekamp T, Arnaud E. Guidelines for creating crop-specific ontologies to annotate phenotypic data, version 2.0. Alliance Bioversity International-CIAT, December 2020.
- Ribaut J-M, Hoisington D. Marker-assisted selection: new tools and strategies. Trends in Plant Science. 1998;3(6):236–239.
- Rife TW, Poland JA. Field Book: an open-source application for field data collection on Android. Crop Sci. 2014;54:1624–1627. https://doi.org/10.2135/cropsci2013.08.0579
- Saha S, Hosmani PS, Villalobos-Ayala K, Miller S, Shippy T, Flores M, Rosendale A, Cordola C, Bell T, Mann H, et al. Improved annotation of the insect vector of citrus greening disease: biocuration by a diverse genomics community. Database. 2017;2017:bax032. https://doi.org/10.1093/database/bax032
- Selby P, Abbeloos R, Backlund JE, Basterrechea Salido M, Bauchet G, Benites-Alfaro OE, Birkett C, Calaminos VC, Carceller P, Cornut G, et al. BrAPI—an application programming interface for plant breeding applications. Bioinformatics. 2019;35(20):4147–4155.
- Semagn K, Babu R, Hearne S, Olsen M. Single nucleotide polymorphism genotyping using Kompetitive Allele Specific PCR (KASP): overview of the technology and its application in crop improvement. Mol Breeding. 2014;33(1):1. https://doi.org/10.1007/s11032-013-9917-x
- Shrestha R, Matteis L, Skofic M, Portugal A, McLaren G, Hyman G, Arnaud E. Bridging the phenotypic and genetic data useful for integrated breeding through a data annotation using the Crop Ontology developed by the crop communities of practice. Front Physiol. 2012;3(August):326.
- Tecle IY, Edwards JD, Menda N, Egesi C, Rabbi IY, Kulakow P, Kawuki R, Jannink J-L, Mueller LA. solGS: a web-based tool for genomic selection. BMC Bioinformatics. 2014;15(December):398.
- Tecle IY, Menda N, Buels RM, van der Knaap E, Mueller LA. solQTL: a tool for QTL analysis, visualization and linking to genomes at SGN database. BMC Bioinformatics. 2010;11(October):525.
- Thomson MJ. High-throughput SNP genotyping to accelerate crop improvement. Plant Breed Biotech. 2014;2(3):195–212.
- Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012;485(7400):635–641.
- VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414–4423.
- Volk GM, Byrne PF, Coyne CJ, Flint-Garcia S, Reeves PA, Richards C. Integrating Genomic and Phenomic Approaches to Support Plant Genetic Resources Conservation and Use. Plants. 2021;10(11):2260.
- White JW, Andrade-Sanchez P, Gore MA, Bronson KF, Coffelt TA, Conley MM, Feldmann KA, French AN, Heun JT, Hunsaker DJ, et al. Field-based phenomics for plant genetics research. 2012.
- Züst T, Strickler SR, Powell AF, Mabry ME, An H, Mirzaei M, York T, Holland CK, Kumar P, Erb M, et al. Independent evolution of ancestral and novel defenses in a genus of toxic plants (Erysimum, Brassicaceae). eLife. 2020;9(April). https://doi.org/10.7554/eLife.51712
Plant Breeding and Genomics
Analysis of Variance (ANOVA): Experimental Design for Fixed and Random Effects
David M. Francis, The Ohio State University
Heather L. Merk, The Ohio State University
Matthew Robbins, The Ohio State University
Introduction
This page is a continuation of the Overview of Analysis of Variance page and is intended to help plant breeders consider the notions of fixed and random effects and the impacts these can have on ANOVA in the context of plant breeding. Briefly, ANOVA is a statistical test that partitions the total variation among known causes, leaving a residual portion allocated to uncontrolled or unexplained variation, called the experimental error. Because variability is measured as sums of squared deviations from the mean of all observations, the variation assigned to the different controlled causes is additive. It is therefore important to completely define the statistical model. Otherwise, the experimental error may be unnecessarily inflated (McIntosh, 1983).
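The additivity of sums of squares can be checked numerically. The sketch below is a minimal illustration for the simplest case, a one-way layout with a single treatment factor. The function name and the yield values are hypothetical and are not taken from McIntosh (1983).

```python
from statistics import mean


def one_way_anova(groups):
    """Partition total variation for a one-way layout.

    groups maps a treatment label to its list of observations.
    Returns sums of squares, degrees of freedom, and the F-ratio.
    """
    all_obs = [y for obs in groups.values() for y in obs]
    grand_mean = mean(all_obs)

    # Total SS: squared deviations of every observation from the grand mean.
    ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
    # Treatment SS: squared deviations of treatment means from the grand mean,
    # weighted by the number of observations in each treatment.
    ss_treatment = sum(
        len(obs) * (mean(obs) - grand_mean) ** 2 for obs in groups.values()
    )
    # Error SS: squared deviations of observations from their own treatment mean.
    ss_error = sum((y - mean(obs)) ** 2 for obs in groups.values() for y in obs)

    df_treatment = len(groups) - 1
    df_error = len(all_obs) - len(groups)
    f_ratio = (ss_treatment / df_treatment) / (ss_error / df_error)
    return {
        "ss_total": ss_total,
        "ss_treatment": ss_treatment,
        "ss_error": ss_error,
        "df_treatment": df_treatment,
        "df_error": df_error,
        "f_ratio": f_ratio,
    }


# Hypothetical yields (t/ha) for three cultivars, four plots each.
yields = {
    "A": [5.1, 4.9, 5.3, 5.0],
    "B": [5.8, 6.1, 5.9, 6.0],
    "C": [4.5, 4.7, 4.4, 4.6],
}
result = one_way_anova(yields)
print(result)
```

Up to floating-point rounding, `ss_total` equals `ss_treatment + ss_error`. Any source of variation left out of the model has nowhere to go but `ss_error`, which is why an incompletely specified model inflates the experimental error.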
Fixed and Random Effects
In the Overview of Analysis of Variance page, we considered the following linear model:
Y = m + f(treatment) + error
- Y is equal to the trait value
- m is the population mean
- f(treatment) is a function of the treatment
- error represents the residual
Fixed Effects
Intuitively, we may think about the treatments as being under our control and as “fixed.” Usually we are interested in comparing the dependent variable among levels of the fixed effect. For example, we may want to evaluate whether yield (the dependent variable) differs between field locations for some elite cultivars we’ve been developing. To conduct this experiment, we would select the cultivars we want to evaluate and find suitable locations for our trial. We could think of the cultivars and locations as being fixed; we purposely chose to study different cultivars and locations. In this case, we are only interested in the performance of the elite cultivars we’re testing in the specific locations we’re testing.
Random Effects
Random effects, in contrast to fixed effects, are typically used to account for variance in the dependent variable. Also, unlike fixed effects, we aren’t looking to compare one level of the random effect to another. In our example, we could also consider location as a random effect. In the case of random effects, levels are chosen randomly from an infinite population and we want to make inferences that can extend beyond the sample. If this were the case, the cultivars would still be fixed effects, but location would be random. If we felt our locations were representative of all possible locations, we could use the different locations to help us make an evaluation of how well cultivars perform across locations as a whole, not just at the locations we’ve tested. The classification of effects as fixed or random determines the appropriate F-test.
ANOVA tables
McIntosh (1983) provides a set of reference tables for use during experimental design and analysis. These tables are intended for field experiments conducted over two or more locations or years. Some of the tables are replicated below.
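Sources of variation | df | Mean squares | Expected mean squares, RL-RT | Expected mean squares, RL-FT | Expected mean squares, FL-FT
---|---|---|---|---|---
Locations (l) | l-1 | M1 | σ²_e + rσ²_TL + tσ²_R(L) + rtσ²_L | σ²_e + tσ²_R(L) + rtσ²_L | σ²_e + tσ²_R(L) + rtσ²_L
Blocks(Location) (r) | l(r-1) | M2 | σ²_e + tσ²_R(L) | σ²_e + tσ²_R(L) | σ²_e + tσ²_R(L)
Treatment (t) | t-1 | M3 | σ²_e + rσ²_TL + rlσ²_T | σ²_e + rσ²_TL + rlσ²_T | σ²_e + rlσ²_T
Location x treatment | (l-1)(t-1) | M4 | σ²_e + rσ²_TL | σ²_e + rσ²_TL | σ²_e + rσ²_TL
Pooled error | l(r-1)(t-1) | M5 | σ²_e | σ²_e | σ²_e

R = random, F = fixed, L = location, T = treatment; l, r, and t are the numbers of locations, blocks per location, and treatments. Subscripts on σ² denote the source: e = error, TL = location x treatment, R(L) = blocks within locations, L = locations, T = treatments.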
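One practical use of expected mean squares is to estimate variance components by equating each observed mean square to its expectation and solving. The sketch below does this for the all-random case (RL-RT), assuming the standard expectations for locations, blocks within locations, treatments, location x treatment, and pooled error, labelled M1 to M5 in that order. The function name, dictionary keys, and numbers are hypothetical.

```python
def variance_components(ms, r, t, l):
    """Method-of-moments variance component estimates for the RL-RT model.

    ms: mean squares keyed "M1".."M5" (locations, blocks within locations,
        treatments, location x treatment, pooled error).
    r, t, l: blocks per location, treatments, locations.

    Assumed expectations:
        E[M5] = s2_e
        E[M4] = s2_e + r*s2_TL
        E[M3] = s2_e + r*s2_TL + r*l*s2_T
        E[M2] = s2_e + t*s2_RL
        E[M1] = s2_e + r*s2_TL + t*s2_RL + r*t*s2_L
    """
    return {
        "error": ms["M5"],
        "location_x_treatment": (ms["M4"] - ms["M5"]) / r,
        "blocks_in_location": (ms["M2"] - ms["M5"]) / t,
        "treatment": (ms["M3"] - ms["M4"]) / (r * l),
        # M1 - M2 - M4 + M5 cancels every term except r*t*s2_L.
        "location": (ms["M1"] - ms["M2"] - ms["M4"] + ms["M5"]) / (r * t),
    }


# Hypothetical mean squares from a trial with 3 blocks, 10 treatments, 4 locations.
mean_squares = {"M1": 130.0, "M2": 7.0, "M3": 41.0, "M4": 5.0, "M5": 2.0}
components = variance_components(mean_squares, r=3, t=10, l=4)
print(components)
```

Estimates obtained this way can be negative when a mean square happens to fall below the one subtracted from it. A common convention is to report such estimates as zero, but that choice is left to the analyst here.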
Sources of variation | Mean squares | F-ratio, RL-RT | F-ratio, RL-FT | F-ratio, FL-FT
---|---|---|---|---
Locations (l) | M1 | (M1 + M5)/(M2 + M4) | M1/M2 | M1/M2
Blocks(Location) (r) | M2 | | |
Treatment (t) | M3 | M3/M4 | M3/M4 | M3/M5
Location x treatment | M4 | M4/M5 | M4/M5 | M4/M5
Pooled error | M5 | | |
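The choice of denominator can be made explicit in code. The sketch below assumes mean squares labelled M1 to M5 in row order (locations, blocks within locations, treatments, location x treatment, pooled error) and denominators chosen so that numerator and denominator have the same expectation when the tested effect is absent. The function name and numbers are hypothetical.

```python
def f_ratios(ms, model):
    """Return F-ratios for locations, treatments, and their interaction.

    ms: mean squares keyed "M1".."M5".
    model: "RL-RT", "RL-FT", or "FL-FT" (R = random, F = fixed,
           L = location, T = treatment).
    """
    if model not in {"RL-RT", "RL-FT", "FL-FT"}:
        raise ValueError(f"unknown model: {model}")
    m1, m2, m3, m4, m5 = (ms[f"M{i}"] for i in range(1, 6))

    if model == "RL-RT":
        # No single mean square is a valid denominator for locations when both
        # factors are random, so a ratio of sums (a quasi-F) is used instead.
        f_location = (m1 + m5) / (m2 + m4)
    else:
        f_location = m1 / m2

    # Treatments are tested against the interaction whenever locations are
    # random, and against the pooled error when both factors are fixed.
    f_treatment = m3 / m5 if model == "FL-FT" else m3 / m4

    return {
        "location": f_location,
        "treatment": f_treatment,
        "location_x_treatment": m4 / m5,
    }


# Hypothetical mean squares.
mean_squares = {"M1": 130.0, "M2": 7.0, "M3": 41.0, "M4": 5.0, "M5": 2.0}
for model in ("RL-RT", "RL-FT", "FL-FT"):
    print(model, f_ratios(mean_squares, model))
```

With identical data, the treatment F-ratio is larger under FL-FT than under the two random-location models because the denominator no longer carries the location x treatment variance. The quasi-F for locations needs approximate degrees of freedom (for example, Satterthwaite's approximation), which this sketch does not compute.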
In a genetic/breeding experiment, treatments would likely be genotypes or varieties.
When designing experiments, plant breeders must consider the question they want to answer and, consequently, which statistical analyses are appropriate to answer it. With regard to ANOVA, three important points should be considered in this context.
- Unaccounted sources of variation will be pooled into the error term resulting in an inflated error.
- The appropriate F-tests differ depending on whether the effects are fixed or random.
- Fixed effects influence mean and random effects influence variance.
References Cited
- McIntosh, M. S. 1983. Analysis of combined experiments. Agronomy Journal 75: 153–155. (Available online at: http://dx.doi.org/10.2134/agronj1983.00021962007500010041x ) (verified 13 Dec 2010).
Additional Information
Many statistics textbooks provide a good discussion of theory and applications of ANOVA. Two examples are listed below.
- Clewer, A. G., and D. H. Scarisbrick. 2001. Practical statistics and experimental design for plant and crop science. John Wiley & Sons, New York.
- Steel, R.G.D., J. H. Torrie, and D. A. Dickey. 1997. Principles and procedures of statistics: A biometrical approach. McGraw–Hill, New York.
Funding Statement
Development of this page was supported in part by the National Institute of Food and Agriculture (NIFA) Solanaceae Coordinated Agricultural Project, agreement 2009-85606-05673, administered by Michigan State University. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the view of the United States Department of Agriculture.
Validation of cross-progeny variance genomic prediction using simulations and experimental data in winter elite bread wheat
- Original Article
- Open access
- Published: 18 September 2024
- Volume 137 , article number 226 , ( 2024 )
- Claire Oget-Ebrad ORCID: orcid.org/0000-0003-1832-871X 1 ,
- Emmanuel Heumez 2 ,
- Laure Duchalais 3 ,
- Ellen Goudemand-Dugué 4 ,
- François-Xavier Oury 1 ,
- Jean-Michel Elsen ORCID: orcid.org/0000-0001-6773-3151 5 &
- Sophie Bouchet ORCID: orcid.org/0000-0001-5868-3359 1
Key message
Simulations and experimental data show that the quality of genomic predictions of cross-progeny variance can be high, but it depends on trait architecture and requires a sufficient number of progeny.
Genomic predictions are used to select genitors and crosses in plant breeding. The usefulness criterion (UC) is a cross-selection criterion that requires estimates of the parental mean (PM) and the progeny standard deviation (SD). This study uses simulations to evaluate the parameters that affect the predictive ability of UC and its two components. Predictive ability increased with heritability and progeny size and decreased with QTL number, most notably for SD. Comparing scenarios where marker effects were known or estimated using prediction models, SD was strongly impacted by the quality of marker effect estimates. We proposed a new algebraic formula for SD estimation that takes into account the uncertainty of the estimation of marker effects. It improved predictions when the number of QTL was greater than 300, especially when heritability was low. We also compared estimated and observed UC using experimental data for heading date, plant height, grain protein content and yield. PM and UC estimates were significantly correlated for all traits (PM: 0.38, 0.63, 0.51 and 0.91; UC: 0.45, 0.52, 0.54 and 0.74; for yield, grain protein content, plant height and heading date, respectively), while SD was correlated only for heading date and plant height (0.64 and 0.49, respectively). According to simulations, SD estimations in the field would necessitate large progenies. This pioneering study experimentally validates genomic prediction of UC, but the predictive ability depends on trait architecture and on the precision of marker effect estimates. We advise breeders to adjust progeny size to realize the SD potential of a cross.
Introduction
How humans choose candidates for selection has evolved over time. Selection was first based solely on phenotypic observation of progeny and was later assisted by pedigree-based predictions, particularly in animals (Henderson 1975; Falconer and Mackay 1996). Nowadays, genomic predictions of individual genetic values (Whittaker et al. 2000; Meuwissen et al. 2001) and cross values (Schnell and Utz 1975; Zhong and Jannink 2007; Lehermeier et al. 2017; Danguy des Déserts et al. 2023) are promising and open up new prospects for improving mating design and optimizing long-term genetic gain (Allier et al. 2019a, 2019b).
In plants, crosses are essentially selected to secure a high mean yield performance of the progeny. Breeders seek to identify which crosses will produce the best progeny in order to ensure genetic gain in the next selection cycle and to maintain long-term genetic gain. It is thus important that the selected crosses also generate a high progeny genetic variance. A strategy to find these crosses is to select on the usefulness criterion (\(\text{UC}\)) (Schnell and Utz 1975; Zhong and Jannink 2007) defined as: \(\text{UC}=\text{PM}+i\times \text{SD}\) , where \(\text{PM}\) is the parental mean, \(i\) the selection intensity corresponding to the fraction \(q\) of selected progenies, and \(\text{SD}\) the progeny standard deviation (square root of progeny variance). PM can be estimated from the mean of additive parental genetic values using their phenotypes or genomic estimated breeding values (GEBV). Several attempts have been made to predict progeny variance using phenotypic distances (Utz et al. 2001) and genetic distances (Bohn et al. 1999; Hung et al. 2012), but without success. According to recent methodological studies, genomic predictions appear to be a promising tool for optimizing complementarity between parents, as well as short- and long-term genetic gains in breeding schemes.
Many factors can influence genomic prediction accuracy (Liu et al. 2018). This was extensively demonstrated for GEBV (see Elsen 2022 for a review), and to a lesser extent for progeny standard deviation (\(\text{SD}\)), and for the usefulness criterion (\(\text{UC}\)), which combines PM and SD. The accuracy of these predictions influences the relative efficiency of selection using \(\text{UC}\) . A first factor influencing genomic prediction accuracy is the genetic architecture of traits, which includes the number of QTL and the distribution of linkage disequilibrium over the genome. Zhong and Jannink (2007) demonstrated algebraically that genetic gain using \(\text{UC}\) criteria decreases when the number of QTL increases in plants. In dairy cattle, Santos et al. (2019) demonstrated a higher \(\text{UC}\) efficiency relative to classical selection based on GEBV when the trait is controlled by 200 QTL rather than 20. A second factor influencing genomic prediction ability is the number of progenies. The progeny standard deviation (\(\text{SD}\)) estimated algebraically, or by simulation, may be correct for an ideal offspring population of infinite size, but in practice the number of progenies is limited. A third factor is the precision of the marker effect estimates, which depends (1) on the size of the reference population (the set of individuals both phenotyped and genotyped) and (2) on the statistical method used to estimate these marker effects. The effects of the size and characteristics of the reference population have been extensively studied for the estimation of GEBV (Hayes et al. 2009; Calus 2010; Goddard et al. 2011; Erbe et al. 2012; Lee et al. 2017), showing that the precision increases when the reference population is bigger, more diverse and genetically closer to the set of selection candidates. An extensive literature compares statistical methods for predicting marker effects and the resulting GEBV precision. The list includes genomic best linear unbiased prediction (GBLUP) (Habier et al. 2007), RR-BLUP, Bayesian methods such as Bayes A, B, C, etc. (Meuwissen et al. 2001; Gianola et al. 2009), partial least squares and the least absolute shrinkage and selection operator (Usai et al. 2009). While the accuracies of the different genomic prediction methods are similar when the trait depends on a large number of genes with small effects (infinitesimal model), Bayesian approaches perform better when the trait is controlled by QTL with very different effects.
The precision of the progeny standard deviation (\(\text{SD}\)) estimates depends not only on the marker effect estimation method, but also on the way the estimated effects are processed. An algebraic formula to predict the variance of inbred progeny derived from a cross between two inbred lines was provided by Lehermeier et al. (2017), based on marker effect estimates and their co-segregation in progeny. In that paper, this prediction, described as the “variance of posterior mean” (VPM), was compared with a “posterior mean variance” (PMV) estimator which, using an MCMC method, takes into account the error in marker effect estimates and gives higher genetic gain. In the present paper, we use an algebraic interpretation of their PMV and we propose a more advanced estimator which also considers the fact that the uncertainty of the estimation of marker effects is modulated by the genomic composition of each parent.
Those elements influencing estimation precisions were considered in the simulations we performed to better understand the parameters impacting our predictive ability of \(\text{PM}\) , \(\text{SD}\) , and \(\text{UC}\) .
Several other studies showed that selecting on criteria based on progeny variance such as UC could actually increase the genetic gain in plants and in animals (Tiede et al. 2015; Allier et al. 2019a, 2019b; Bijma et al. 2020). However, to our knowledge, the ability to predict the value of a cross has mostly been validated by simulations: only a few publications have validated UC predictions against real data (Lian et al. 2015; Neyhart et al. 2019; Osthushenrich et al. 2017), none of them on wheat, and none tested the ability to predict SD.
This study was therefore carried out to (i) estimate the ability of genomic prediction of three cross value components (PM, SD, and UC) in experimental data, (ii) identify the factors influencing this predictive ability based on simulations, and (iii) test whether SD estimation could be improved by taking into account the error of the estimation of marker effects and the genomic composition of each parent. We were able to compare cross value component predictions and experimental observations for 101 crosses with an average of 55 unselected progenies each.
Materials and methods
The objective of the study was to compare genomic predicted values of parental/progeny mean (PM), parental gametic variance/standard deviation (SD) and mean of top progenies (UC: usefulness criterion) to experimental observations of winter bread wheat. We evaluated four traits (yield, grain protein content, plant height, and heading date) for 101 crosses with 55 progenies on average for 3 years.
In order to compare our results to a reference and to better understand the parameters influencing our predictive ability, we conducted simulations based on the same crosses, using the same parental genotypes.
The training population (TP)
TP composition
The TP was composed of 2,146 INRAE-AO (the French research institute Institut National de la Recherche pour l’Agriculture, l’Alimentation et l’Environnement and its subsidiary company Agri-Obtentions) breeding lines and 650 French registered varieties evaluated by GEVES ( Groupe d'Étude et de contrôle des Variétés Et des Semences ). Data were collected between 2000 and 2022 in two French breeding networks (North and South). Each year, the lines were phenotyped in up to 11 locations for INRAE-AO and 15 for GEVES in split-plot trials with two repetitions. In each location, four registered varieties were used as controls for several consecutive years. These controls changed gradually over the 20 years (at most one control could be replaced in a given year). In total, up to 169 (42) lines were observed each year for the INRAE-AO (GEVES) dataset. For two consecutive years, up to 76 (40) lines were in common for the INRAE-AO (GEVES) dataset. We report here data for yield, grain protein content, plant height and heading date under high-yield crop management (optimized pesticide, fungicide and nitrogen amount) (Supplementary Tables S1–S9).
TP adjusted phenotypes
Trial experts removed blocks presenting obvious experimental problems and outliers. In each trial, we modeled spatial heterogeneity when field coordinates were available. Rows and columns were modeled as random effects, and field trends were modeled as a smooth surface as implemented in the 2-dimensional P-spline mixed model of the SpATS v1.0.16 R package (Rodríguez-Álvarez et al. 2018) (Supplementary Table S1). Genotype was set as a fixed term to estimate genotypic BLUEs. Note that, in the absence of spatial variation, this model has been shown to perform no worse than a model without a spatial effect (Selle et al. 2019). In the other trials, we averaged replicates to obtain genotypic values. We then calculated best linear unbiased estimator (BLUE) values for each line of the TP using the following model:
\(y_{{ij}}=\mu +{g}_{i}+{e}_{j}+{\varepsilon }_{ij}\)
where \(y_{{ij}}\) is the adjusted mean for line \(i\) in environment (year and location intersection) \(j\) , \(\mu\) the grand mean, \({g}_{i}\) the fixed effect of line \(i\) , \({e}_{j}\) the random effect of environment \(j\) , and \({\varepsilon }_{ij}\) the random effect of residual. The model was implemented using the lme4 R package. We considered these BLUE values as phenotypes in further analyses (Supplementary Table S1 ).
TP genotypes
The 2,146 lines were genotyped using a 35 K SNP (single nucleotide polymorphism) chip representative of the TaBW280K array (Rimbert et al. 2018; Ben-Sadoun et al. 2020). After quality filtering for minor allele frequency below 0.01 and for missing call rate above 0.9, we retained 19,065 markers. We imputed missing genotypes with the algorithm implemented in BEAGLE v5.3 (Browning et al. 2018) using the genetic positions previously estimated for a West-European bread wheat population (Danguy des Déserts et al. 2021). The total number of lines with genotypes and phenotypes was 2,146, 2,062, 2,126 and 2,145 for yield, grain protein content, plant height and heading date, respectively (Supplementary Table S1).
TP variance components
First, we estimated repeatability of each trait, using the following model:
\(y_{{ijk}}=\mu +{g}_{i}+{e}_{j}+{\left(g\times e\right)}_{ij}+{\varepsilon }_{ijk}\)
where \(y_{{ijk}}\) is the raw observation for line \(i\) in environment (year and location intersection) \(j\) and repetition \(k\) , \(\mu\) the grand mean, \({g}_{i}\) the effect of line \(i\) , \({e}_{j}\) the effect of environment \(j\) , \({\left(g\times e\right)}_{ij}\) the effect of the interaction between line \(i\) and environment \(j\) , and \({\varepsilon }_{ijk}\) the residual effect. All effects were considered as random. The model was implemented using the lme4 R package. We then computed repeatability (Piepho and Möhring 2007) at the plot and design levels as:
\(\text{repeatability}_{\text{plot}}=\frac{{\widehat{\sigma }}_{g}^{2}}{{\widehat{\sigma }}_{g}^{2}+{\widehat{\sigma }}_{\left(g\times e\right)}^{2}+{\widehat{\sigma }}_{\varepsilon }^{2}}\) and \(\text{repeatability}_{\text{design}}=\frac{{\widehat{\sigma }}_{g}^{2}}{{\widehat{\sigma }}_{g}^{2}+\frac{{\widehat{\sigma }}_{\left(g\times e\right)}^{2}}{\text{nb}\_\text{env}}+\frac{{\widehat{\sigma }}_{\varepsilon }^{2}}{\text{nb}\_\text{env}\times \text{nb}\_\text{rep}}}\)
where \({\widehat{\sigma }}_{g}^{2}\) , \({\widehat{\sigma }}_{\left(g\times e\right)}^{2}\) and \({\widehat{\sigma }}_{\varepsilon }^{2}\) are the line, the interaction between line and environment and the residual estimated variances, respectively, and \(\text{nb}\_\text{env}\) and \(\text{nb}\_\text{rep}\) are the average number of environments and replications per line, respectively.
Next, we estimated genomic heritability of each trait with a RR-BLUP (Ridge Regression Best Linear Unbiased Predictor) approach, using the following model, implemented in the rrBLUP v4.6.1 R package (Endelman 2011 ):
\(\varvec{y}={\varvec{\mu}}+{\varvec{X}}{\varvec{\beta}}+{\varvec{r}}\)
where \(\varvec{y}\) is the vector of BLUE values, \({\varvec{\mu}}\) the grand mean, \({\varvec{\beta}}\) a vector of random SNP effects assumed to be normally distributed \(N\left(0,{{\varvec{I}}\sigma }_{\beta }^{2}\right)\) , with its matrix of incidence \({\varvec{X}}\) , and \({\varvec{r}}\) the vector of random residual effects assumed to be normally distributed \(N\left(0,{{\varvec{I}}\sigma }_{r}^{2}\right)\) . We then computed genomic heritability (Nyquist and Baker 1991 ; Falconer and Mackay 1996 ) as:
\({h}_{g}^{2}=\frac{{\widehat{\sigma }}_{g}^{2}}{{\widehat{\sigma }}_{g}^{2}+{\widehat{\sigma }}_{r}^{2}}\)
where \(\hat{\sigma }_{g}^{2} = \hat{\sigma }_{\beta }^{2} \times 2\mathop \sum \limits_{{i = 1}}^{n} p_{i} q_{i}\) , with \({p}_{i}\) being the allele frequency of SNP \(i\) , \({ q}_{i}=1-{p}_{i}\) , and \(n\) the total number of SNP.
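As an illustration, the conversion from the marker-effect variance to a genomic heritability can be sketched in a few lines of Python. The function name and the numbers are hypothetical, and the ratio of genetic variance to genetic plus residual variance is assumed here as the usual definition; only the scaling by \(2\sum p_i q_i\) is taken directly from the text.

```python
def genomic_heritability(sigma_beta2, sigma_r2, allele_freqs):
    """Return (sigma_g2, h2_g) from RR-BLUP variance components.

    sigma_beta2  : estimated variance of SNP effects
    sigma_r2     : estimated residual variance
    allele_freqs : allele frequency p_i of each SNP
    """
    # sigma_g2 = sigma_beta2 * 2 * sum(p_i * q_i), with q_i = 1 - p_i
    scale = 2.0 * sum(p * (1.0 - p) for p in allele_freqs)
    sigma_g2 = sigma_beta2 * scale
    # Assumed definition: genetic variance over genetic + residual variance
    h2_g = sigma_g2 / (sigma_g2 + sigma_r2)
    return sigma_g2, h2_g


# Hypothetical values, for illustration only (not taken from the study)
sigma_g2, h2_g = genomic_heritability(0.01, 30.0, [0.5] * 8000)
```

With these made-up inputs, every SNP contributes \(2pq = 0.5\), so the genetic variance is 40 and the heritability is 40/70.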
Predictive ability of genome estimated breeding values (GEBV) in TP
We assessed the predictive ability in the TP using a tenfold cross-validation procedure repeated 10 times. In each repetition, the TP was split into 10 equal sets, each containing one tenth of the TP, and each set was predicted in turn from the remaining nine tenths. Predictive ability is expressed as the correlation between the GEBV and the observed phenotypic values in the validation set. Ten new independent sets were sampled in each repetition, and the reported predictive abilities are averages across all folds and repetitions. This cross-validation procedure is implemented in the PopVar v1.3.0 R package (Mohammadi et al. 2015; Neyhart et al. 2019).
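The repeated tenfold scheme can be sketched as follows. This is a minimal Python illustration and not the PopVar implementation: `fit_predict` is a hypothetical callback standing in for any genomic prediction model, and the fold assignment by shuffling is an assumption.

```python
import random
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def repeated_kfold_ability(n, fit_predict, observed, k=10, reps=10, seed=1):
    """Average validation-set correlation over k folds and `reps` repetitions.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains
    on the lines in train_idx and returns predictions for test_idx.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(reps):
        idx = list(range(n))
        rng.shuffle(idx)                           # new random split each repetition
        folds = [idx[f::k] for f in range(k)]      # k near-equal folds
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            preds = fit_predict(train_idx, test_idx)
            scores.append(pearson(preds, [observed[i] for i in test_idx]))
    return mean(scores)
```

Any model can be plugged in through `fit_predict`; the function only handles the splitting and the averaging of correlations across folds and repetitions.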
The experimental population (EP)
In this study, we generated 101 crosses evaluated in six environments for three years. The objective was to compare cross value components estimated using genomic predictions and the same components observed in the field. Details about the experimental design are provided in Supplementary Table S10.
EP description
Within the FSOV PrediCropt project, a total number of 101 crosses using 106 different parents included in the TP were generated between 2020 and 2022. These crosses were generated by a private company (Florimond-Desprez) (16), and INRAE-AO (85). The number of progenies per cross ranged from 8 to 122 (mean: 54.8 progenies per cross, Supplementary Table S11). The 16 crosses generated by Florimond-Desprez were evaluated in two locations (Cappelle and Houville, France), and the 85 remaining were evaluated in 4 other locations in France (Auzeville: INRAE UE GC, Clermont-Ferrand: INRAE UE PHACC, Lusignan: INRAE UE FERLUS, and Mons: INRAE UE GCIE). Crop management methods corresponded to high-yield objectives (optimized pesticide, fungicide and nitrogen amount). Two registered varieties and the parents of the generated crosses were used as controls and repeated across trials (year x location). At least 10% of the progenies were replicated in each trial. For each partner, all crosses were observed in all locations.
Lines were evaluated for yield, grain protein content, plant height and heading date. For grain protein content, only crosses generated during the first two years (73) were analyzed in this study. We discarded from further analyses plots that were affected by lodging or that were still segregating (presenting different types of plants). We kept all crosses for the study after checking congruency between parents and progeny (by genotyping 2 progenies per cross and applying quick visual observations such as presence/absence of awns, common stems and leaf color). After cleaning raw observations, the total number of phenotyped progenies was 5565, 3522, 5591 and 5590 for yield, grain protein content, plant height and heading date, respectively (Supplementary Table S10).
Adjusted EP phenotypes
For each trial, we adjusted phenotypes for spatial effects using the SpATS R package. We then computed Pearson’s correlations between sites and years for each trait (Supplementary Tables S12 and S13).
We then calculated BLUE values for each line using the following model:
\(y_{{ij}}=\mu +{g}_{i}+{e}_{j}+{\varepsilon }_{ij}\)
where \(y_{{ij}}\) is the adjusted phenotype for line \(i\) in environment (year and location intersection) \(j\) , \(\mu\) the grand mean, \({g}_{i}\) the fixed effect of line \(i\) , \({e}_{j}\) the random effect of environment \(j\) , and \({\varepsilon }_{ij}\) the random effect of residual. The total number of environments was 11 for grain protein content and 15 for the three other traits. The model was implemented using the lme4 R package. We considered these BLUE values as phenotypes in further analyses. The total number of BLUE values was 5658, 3596, 5684 and 5683 for yield, grain protein content, plant height and heading date, respectively (Supplementary Table S10).
Cross value components predictive ability
In this study, we considered three components for the value of a cross: the parental mean (PM), the progeny standard deviation (SD) which is the square root of gametic variance and the UC, corresponding in our study to the expected mean of the 7% best progeny of a cross (Schnell and Utz 1975 ; Zhong and Jannink 2007 ; Lehermeier et al. 2017 ).
Models and formulae to predict cross value components
The three cross value components can be predicted by analytic formulae using information obtained from the TP.
Marker effects estimation The estimation of marker effects was performed with the PopVar R package using the following model (Meuwissen et al. 2001 ):
\(\varvec{y}={\varvec{\mu}}+{\varvec{X}}{\varvec{\beta}}+{\varvec{r}}\)
where \(\varvec{y}\) is the vector of TP BLUEs described above, \({\varvec{\mu}}\) the grand mean, \({\varvec{\beta}}\) a vector of random SNP effects assumed to be normally distributed \(N\left(0,{{\varvec{I}}\sigma }_{\beta }^{2}\right)\) , with its matrix of incidence \({\varvec{X}}\) for SNPs (TP), and \({\varvec{r}}\) the vector of random residual effects assumed to be normally distributed \(N\left(0,{{\varvec{I}}\sigma }_{r}^{2}\right)\) . We tested genomic prediction models using different hypotheses of marker effect variance distribution (Meuwissen et al. 2001; Habier et al. 2007, 2011; VanRaden 2008; Park and Casella 2008): “BRR” (Bayesian ridge regression, Bayesian approach with a Gaussian prior for SNP effects), “BayesA” (Bayesian approach with a scaled-t prior), “BayesB” (Bayesian approach with a two-component mixture prior with a point mass at zero and a scaled-t slab), “BayesC” (Bayesian approach with a two-component mixture prior with a point mass at zero and a Gaussian slab), and “BL” (Bayesian Lasso, Bayesian approach with a double-exponential prior). These models are implemented in the R package BGLR v1.0.9 (Pérez and de los Campos 2014) available in PopVar .
PM estimation We predicted PM as the average parental GEBV computed as the matrix product between parental genotypes and estimated marker effects:
\(\mu _{{P_{1} \times P_{2} }}=\frac{{{\varvec{X}}}_{{{\varvec{P}}}_{1}}\widehat{{\varvec{\beta}}}+{{\varvec{X}}}_{{{\varvec{P}}}_{2}}\widehat{{\varvec{\beta}}}}{2}\)
where \(\mu _{{P_{1} \times P_{2} }}\) is the PM for cross \({P}_{1}\times {P}_{2}\) , \(\widehat{{\varvec{\beta}}}\) the vector of estimated marker effects, and \({{\varvec{X}}}_{{{\varvec{P}}}_{1}}\) and \({{\varvec{X}}}_{{{\varvec{P}}}_{2}}\) are the vectors of genotypes for parent \({P}_{1}\) and parent \({P}_{2}\) , respectively.
Gametic variance estimation The formula takes into account marker effects and phase in parents as well as recombination rates. We used the formula proposed by Lehermeier et al. 2017 :
\(\hat{\sigma }_{{P_{1} \times P_{2} }}^{2}={\widehat{{\varvec{\beta}}}}^{\prime}{{\varvec{V}}}_{{{\varvec{P}}}_{1}\times {{\varvec{P}}}_{2}}\widehat{{\varvec{\beta}}}\)
where \(\hat{\sigma }_{{P_{1} \times P_{2} }}^{2}\) is the gametic variance for cross \({P}_{1}\times {P}_{2}\) , and \({{\varvec{V}}}_{{{\varvec{P}}}_{1}\times {{\varvec{P}}}_{2}}\) the genotypic variance–covariance matrix for biparental RIL progeny calculated as follows:
where \({D}_{jl}^{*}\) is the linkage disequilibrium (LD) between alleles at loci \(j\) and \(l\) among both parental lines (either 0 if parents carry the same alleles at loci \(j\) and/or \(l\) , or 0.25 if alleles are different at both loci between parents and in coupling phase, that is, one parent carries the two beneficial alleles, while the other carries the two deleterious alleles, or -0.25 if the alleles are in repulsion phase), \(k\) the number of generations, and \({c}_{jl}\) the recombination rate between loci \(j\) and \(l\) . The recombination rates were computed from the West-European bread wheat population genetic map (Danguy des Déserts et al. 2021).
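The three possible values of \({D}_{jl}^{*}\) can be reproduced with a short Python sketch. It assumes that each inbred parent is coded 0 or 1 at every locus (a coding choice made here for illustration), so "coupling" is defined relative to that coding; it matches the description in terms of beneficial alleles only when allele 1 is the favorable one at both loci.

```python
def d_star(p1_j, p1_l, p2_j, p2_l):
    """Parental LD between loci j and l for two inbred parents.

    Arguments are the 0/1 allele codes of parent 1 and parent 2 at
    loci j and l (the 0/1 coding is an assumption of this sketch).
    """
    # Zero if the parents share an allele at j and/or l,
    # +0.25 in coupling phase, -0.25 in repulsion phase.
    return 0.25 * (p1_j - p2_j) * (p1_l - p2_l)


coupling = d_star(1, 1, 0, 0)     # parent 1 carries allele 1 at both loci
repulsion = d_star(1, 0, 0, 1)    # each parent carries one allele 1
monomorphic = d_star(1, 1, 1, 0)  # parents identical at locus j
```

Writing the rule as a product makes it easy to fill a whole LD matrix with two nested loops over loci.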
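Alternative approaches were used to compute the gametic variance. We called Vg1 (standing for variance of gametes) the formula proposed by Lehermeier et al. (2017), using the RR-BLUP model (Meuwissen et al. 2001) to estimate marker effects:
\(\text{Vg1}={\widehat{{\varvec{\beta}}}}^{\prime}{{\varvec{V}}}_{{{\varvec{P}}}_{1}\times {{\varvec{P}}}_{2}}\widehat{{\varvec{\beta}}}\)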
For the model called Vg2, we added a first term taking into account the error in marker effect estimation given by Lehermeier et al. 2017 (Eq. 10) in their algebraic version of the posterior mean variance (PMV):
with \(\text{Var}\left( {\beta |X,y} \right) = \hat{\sigma }_{\beta }^{2} \left( {I - X^{\prime} \left( {XX^{\prime} + I\frac{{\hat{\sigma }_{r}^{2} }}{{\hat{\sigma }_{\beta }^{2} }}} \right)^{{ - 1}} X} \right)\) , where \({\widehat{\sigma }}_{\beta }^{2}\) and \({\widehat{\sigma }}_{r}^{2}\) are the estimated marker and residual variances, and \({\varvec{X}}\) is the matrix of TP genotypes.
For the model called Vg3, we added a second term that aims to consider the fact that the uncertainty of the estimation of marker effects is modulated by the genomic composition of each parent:
where \({{\varvec{X}}}_{{{\varvec{P}}}_{1}\times {{\varvec{P}}}_{2}}\) is the vector of genotypes for the F1 of cross \({P}_{1}\times {P}_{2}\) . Details are provided in Supplementary Information S1 .
UC estimation The UC was calculated as follows (Schnell and Utz 1975 ; Zhong and Jannink 2007 ):
\({\text{UC}}_{{P}_{1}\times {P}_{2}}=\mu _{{P_{1} \times P_{2} }}+i\times {\widehat{\sigma }}_{{P}_{1}\times {P}_{2}}\)
where \(i\) is the selection intensity corresponding in our study to a 7% selection rate ( \(i \sim 1.91\) , computed as the inverse Mills ratio), and \({\widehat{\sigma }}_{{P}_{1}\times {P}_{2}}\) the estimated progeny SD for cross \({P}_{1}\times {P}_{2}\) .
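The selection intensity for a given selected fraction can be checked numerically. The sketch below uses only the Python standard library and assumes normally distributed progeny values; the function names are illustrative.

```python
from statistics import NormalDist


def selection_intensity(q):
    """Selection intensity for a selected fraction q (0 < q < 1).

    Computed as the inverse Mills ratio of a standard normal
    distribution truncated at the (1 - q) quantile.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - q)   # truncation point
    return std_normal.pdf(threshold) / q


def usefulness_criterion(pm, sd, q=0.07):
    """UC = PM + i * SD for predicted parental mean and progeny SD."""
    return pm + selection_intensity(q) * sd


i_7_percent = selection_intensity(0.07)
```

For a 7% selection rate, `i_7_percent` should land close to the value quoted above; a stricter selection rate (smaller `q`) gives a larger intensity and therefore more weight on SD in the UC.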
EP cross value components
We calculated the experimental cross value components for each cross from the distribution of the progeny BLUE values: mean (PM), standard deviation (SD) and mean of the 7% best progeny (observed UC), or the value of the best progeny when progeny size was below 15.
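A minimal Python sketch of these observed components is given below. Two details are assumptions made for illustration and are not specified in the text: larger values are treated as better, and the number of progenies in the top 7% is obtained by rounding.

```python
from statistics import mean, stdev


def observed_cross_components(values, q=0.07, min_size=15):
    """Return (mean, SD, observed UC) for the progeny values of one cross.

    values   : progeny BLUE values (at least two)
    q        : selected fraction used for the observed UC
    min_size : below this progeny size, the best progeny is used instead
    """
    n = len(values)
    ranked = sorted(values, reverse=True)   # assumes larger is better
    if n < min_size:
        uc = ranked[0]                      # value of the best progeny
    else:
        n_top = max(1, round(q * n))        # rounding rule is an assumption
        uc = mean(ranked[:n_top])
    return mean(values), stdev(values), uc


# Hypothetical progeny values, for illustration only
pm, sd, uc = observed_cross_components([float(v) for v in range(1, 101)])
```

For a trait where lower values are preferred, the ranking direction would need to be reversed.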
Cross value components’ predictive ability
We compared the three cross value components (PM, SD and UC) genomic estimates with the corresponding experimental values using a weighted Pearson’s correlation ( weights v1.0.4 R package (Pasek and Schwemmle 2021 )) to adjust for the number of progenies per cross. These correlations were considered as predictive ability.
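A weighted Pearson correlation can be written out explicitly as below. This is a plain Python sketch of the general technique, with the number of progenies per cross passed as weights; it is not the code of the weights R package.

```python
def weighted_pearson(x, y, w):
    """Weighted Pearson correlation between x and y with weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total   # weighted means
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / (vx * vy) ** 0.5


# Hypothetical predicted and observed values for four crosses,
# weighted by made-up progeny sizes
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [2.0, 4.0, 6.0, 8.0]
progeny_sizes = [8, 55, 122, 30]
r = weighted_pearson(predicted, observed, progeny_sizes)
```

Crosses with few progenies, whose observed values are noisier, contribute less to the correlation than crosses with many progenies.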
Factors influencing predictive ability of the cross value components (simulations)
In order to evaluate the different parameters impacting the predictive ability of the three cross value components (PM, SD and UC) and to have a reference for the values we can get in optimal/suboptimal conditions (perfect/imperfect prediction of marker effects, limited/infinite progeny size), we developed a simulation study.
Simulation design description
The general scheme of the simulation study is provided in Fig. 1. We estimated cross value components using analytic formulae (step 1 in Fig. 1) or by simulating progenies (step 2 in Fig. 1) for the same 73 crosses used in the experiment.
Simulation scheme. TP training population, TBV true breeding value, pred_true: predicted true cross value components, pred_estimated: predicted estimated cross value components, progeny_phenotypes: observed (simulated phenotypes) cross value components
We ran different scenarios, varying the number of QTL (from 10 to 1,000), heritability (from 0.2 to 0.8) and progeny size (from the size in the experiment to 2,000). We simulated 20 traits for each number of QTL. We positioned the QTL randomly along the genetic map (Danguy des Déserts et al. 2021) and assigned each an effect drawn from a normal distribution \(N\left(0,1\right)\). These true marker effects (step 1.2 in Fig. 1) were then rescaled to provide a variance of true breeding values (TBV) similar to the GEBV variance observed for yield in the TP.
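Drawing the effects and rescaling them to a target TBV variance can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: QTL are sampled among the columns of a genotype table rather than along a genetic map, and all names are hypothetical.

```python
import random
from statistics import pvariance


def simulate_scaled_effects(genotypes, n_qtl, target_var, seed=0):
    """Draw QTL effects from N(0, 1) and rescale them to a target TBV variance.

    genotypes  : list of lines, each a list of allele codes per locus
    n_qtl      : number of loci chosen at random as QTL
    target_var : desired variance of the true breeding values
    """
    rng = random.Random(seed)
    n_loci = len(genotypes[0])
    qtl = rng.sample(range(n_loci), n_qtl)              # random QTL positions
    effects = {j: rng.gauss(0.0, 1.0) for j in qtl}     # effects ~ N(0, 1)
    tbv = [sum(line[j] * a for j, a in effects.items()) for line in genotypes]
    scale = (target_var / pvariance(tbv)) ** 0.5        # rescaling factor
    effects = {j: a * scale for j, a in effects.items()}
    tbv = [t * scale for t in tbv]
    return effects, tbv
```

Because breeding values are linear in the effects, multiplying every effect by the same factor multiplies the TBV variance by the square of that factor, which is why a single scaling step is enough.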
Analytic formulae (step 1). To estimate cross value components, we used either the true marker effects (step 1.2; TRUE scenario), to mimic a scenario where the error in marker effect prediction is negligible (TP size is infinite and the prediction model perfect), or estimated marker effects (step 1.5; ESTIMATED scenario), as when the size of the TP is limited. The TBV of the TP were calculated as the cross-product between true marker effects and genotypes (step 1.3). We simulated phenotypes (step 1.4) for the TP by adding noise to their TBV. This noise was normally distributed with a variance corresponding to the residual variance implied by a specified heritability. For instance, if the TBV variance is 80 and we specify a heritability of 0.7, we would draw the noise for phenotype computation from \(N\left(0,34\right)\) according to the following equation: \({h}^{2}=\frac{80}{80+34}\approx 0.7\) . We used the genotypes (excluding QTL) and simulated phenotypes of the TP to estimate marker effects (step 1.5) using different genomic prediction models. The pred_true ( pred_estimated ) cross value components were calculated from true (estimated) marker effects with the formulae described in 2.2.1.1. Recombination rates were estimated in a Western Europe diversity panel (Danguy des Déserts et al. 2021).
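The residual variance implied by a given heritability follows from rearranging \({h}^{2}={\sigma }_{\text{TBV}}^{2}/\left({\sigma }_{\text{TBV}}^{2}+{\sigma }_{r}^{2}\right)\). The Python sketch below reproduces the numerical example; function names are illustrative.

```python
import random


def residual_variance(tbv_var, h2):
    """Residual variance such that tbv_var / (tbv_var + residual) equals h2."""
    return tbv_var * (1.0 - h2) / h2


def simulate_phenotypes(tbv, tbv_var, h2, seed=0):
    """Add normally distributed noise to true breeding values."""
    rng = random.Random(seed)
    noise_sd = residual_variance(tbv_var, h2) ** 0.5
    return [t + rng.gauss(0.0, noise_sd) for t in tbv]


# Example from the text: TBV variance of 80 and heritability of 0.7
res_var = residual_variance(80.0, 0.7)   # about 34.3, i.e. 34 after rounding
```

Note that the second argument of \(N(\cdot,\cdot)\) is a variance, so the noise standard deviation passed to the random generator is its square root.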
Simulated progenies (step 2). We simulated progeny genotypes (step 2.2) for the 73 experimental crosses (step 2.1) phenotyped for all four traits, using the genetic map from Danguy des Déserts et al. (2021) and the MoBPS v1.6.64 R package (Pook et al. 2020). The number of progenies per cross varied between the one observed in the experiment (Supplementary Table S11) and 2,000. We simulated each set of progenies 25 times to account for variation due to Mendelian gamete sampling; the results presented in this study are averages over these 25 simulations. Progeny TBV (step 2.3) were calculated as the cross-product between true simulated marker effects and the simulated progeny genotypes. Progeny phenotypes were simulated (step 2.4) by adding noise to their TBV. This noise was normally distributed with a variance corresponding to the residual variance observed in the TP (34 if we keep the example of the previous section with a heritability of 0.7).
Finally, we calculated the cross value components for each cross from the distribution of the progeny phenotypes ( progeny_phenotypes in Fig. 1): mean, standard deviation and mean of the 7% best progeny for the UC (or the value of the best progeny when progeny size was below 15).
Predictive ability For each scenario, we compared the genomic predictions of the three cross value components obtained with the analytic formulae using true or estimated marker effects, with the ones obtained in simulated progenies using a weighted Pearson’s correlation to adjust for the number of progenies per cross. These correlations were considered as predictive ability.
Parameters influencing the predictive ability
For each scenario, 20 trait architectures were simulated. To compare scenarios, we considered for each cross value component the median values of correlations across the 20 runs.
We explored different parameters that could impact the predictive ability of the three cross value components: genetic architecture (number of QTL and heritability of the trait), number of progenies per cross, genomic prediction models, three different formulae to estimate the gametic variance (Vg1, Vg2, Vg3) and mating design (similar to the experiment or maximizing diversity).
Genetic architecture We tested different numbers of QTL for the simulation of the genetic determinism of agronomic traits: 10, 30, 300 and 1,000. The scenarios with 10 or 30 QTL represent oligogenic traits (with major genes), whereas the 300 and 1,000 QTL scenarios correspond to quantitative traits. Heritability ranged from 0.2 to 0.8 (in steps of 0.1). Note that this is the true heritability. In our simulations, the corresponding genomic heritability estimated using an RR-BLUP approach was lower. For instance, a true heritability of 0.7 corresponded to a genomic heritability of ~ 0.5 (Supplementary Figure S1), consistent with the genomic heritability observed for yield in experimental data.
Progeny size To evaluate whether the variance observed in the field with a limited progeny size is reliable, we tested different numbers of progenies per cross, from the real numbers observed in our experiment (ranging from 5 to 122 progenies; mean = 47.6 progenies per cross, Supplementary Table S11) up to a maximum of 2,000 progenies, using the same number for all crosses (50, 100, 200, 1,000 and 2,000).
Precision of marker effect estimation To evaluate the importance of the quality of marker effect estimation in genomic predictions, we implemented a true and an estimated scenario. For the latter, several genomic selection methods were compared: different Bayesian approaches and an RR-BLUP approach.
Precision of SD estimation Using the RR-BLUP approach described earlier, we compared the formula of Lehermeier et al. (2017) (“Vg1”) with two algebraic variants of their PMV approach (“Vg2” and “Vg3”).
Genetic composition of the mating plan The experimental mating plan of the PrediCropt FSOV project was chosen by breeders to maximize genetic gain. To evaluate the bias possibly generated by expert choice (some parents were used in several crosses and some were genetically similar, as described in Supplementary Table S11 and Supplementary Figure S2, respectively), we tested a different set of parents. This new set was chosen to be representative of the total genetic diversity present in the TP and to be as genetically dissimilar as possible. We chose 146 different parents to generate 73 crosses. To do so, we first extracted the first 10 principal components (PCs) from the variance-standardized genomic relationship matrix (GRM) of the TP lines, using the PLINK v2.00a2LM toolset. We computed Euclidean distances between lines using those 10 PCs and performed a hierarchical cluster analysis (146 groups) with Ward’s minimum variance method (Murtagh and Legendre 2014) ( dist and hclust functions, respectively, from the stats v4.1.1 R package (R Core Team 2021)). In each cluster, we picked the most distant line. Finally, we generated 73 crosses by randomly mating this new set of 146 parents, and repeated this 10 times. We considered the median values of correlations across the 10 random mating plans to compare with the crosses from the experiment. The plots of the first four PCs for the two sets of parents (“PrediCropt” crosses and this new set of “Unrelated” parents) show that “Unrelated” parents cover a larger diversity (Supplementary Figure S2).
Quality of the training population (TP)
We used as TP the historical French registration data (GEVES) and the INRAE-AO breeding program, including 2000–2022 evaluation trials, both Southern and Northern France networks. The average Pearson’s correlations between years in each dataset were 0.69 for yield, 0.78 for grain protein content, 0.87 for plant height and 0.91 for heading date (Supplementary Tables S2 to S9).
The repeatabilities at the plot level were 0.32 for yield, 0.52 for grain protein content, 0.73 for plant height and 0.85 for heading date. Repeatabilities at the design level were very high for all traits (ranging from 0.91 to 0.99). These values are consistent with those from other trials and reflect higher GxE interactions for grain protein content and yield compared to plant height and heading date. Genomic heritabilities were 0.53 for yield, 0.51 for grain protein content, 0.56 for plant height and 0.73 for heading date. Comparing the ranking of traits by repeatability with their ranking by genomic heritability reveals a high missing heritability for plant height, possibly due to epistasis. Finally, the predictive ability using an additive model ranged from 0.56 for plant height to 0.70 for heading date (Table 1).
Simulation study
In our simulation study, we explored different ranges of parameters that could impact the predictive ability of the three cross value components (PM, SD and UC): the number of QTL (10–1000), the heritability (0.2–0.8), the progeny size (experimental progeny size ~ 48–2000), the precision of the estimation of marker effect (true or estimated) and the genetic composition of the mating plan. For all scenarios, the predictive ability was higher for PM and UC compared to SD.
Genetic architecture
The predictive ability of the three cross value components increased with heritability (Fig. 2 ). For 300 QTL, for instance, predictive abilities increased from 0.64 to 0.93 for PM, from 0.48 to 0.86 for UC and from 0.04 to 0.36 for SD between heritability 0.2 and heritability 0.8.
Fig. 2 Impact of genetic architecture on the predictive ability of the three cross value components, based on simulations. The number of progenies per cross was the same as in experimental data, and the marker effects were estimated with the “BayesA” method. PM parental mean, SD standard deviation, UC usefulness criterion
The predictive ability of the three cross value components decreased when QTL number increased (Fig. 2). While PM predictions were only slightly affected (0.91 to 0.84 between 10 and 1,000 QTL with h2 = 0.6, for instance), the predictive ability of SD decreased markedly (from 0.69 to 0.18). UC, which is calculated from both PM and SD ( \(\text{UC}=\text{PM}+i\times \text{SD}\) ), was still correctly estimated (its predictive ability dropped only from 0.82 to 0.77). This suggests that UC is driven more by PM than by SD in this elite material.
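To make the relation \(\text{UC}=\text{PM}+i\times \text{SD}\) concrete, the following Python sketch computes the selection intensity i for a given selected fraction and ranks three crosses. The cross values and the 10% selected fraction are hypothetical, and i is derived under the usual assumption of truncation selection on a normally distributed trait.

```python
from statistics import NormalDist

def selection_intensity(p):
    """Standardized selection differential i for a selected fraction p
    (truncation selection on a normally distributed trait)."""
    std = NormalDist()
    z = std.inv_cdf(1.0 - p)  # truncation point on the standard normal scale
    return std.pdf(z) / p

def usefulness(pm, sd, p):
    """UC = PM + i * SD, as defined in the text."""
    return pm + selection_intensity(p) * sd

# Hypothetical crosses: (PM, SD) in arbitrary trait units
crosses = {"A": (100.0, 2.0), "B": (99.0, 2.5), "C": (97.0, 3.0)}
uc = {name: usefulness(pm, sd, p=0.10) for name, (pm, sd) in crosses.items()}
ranking = sorted(uc, key=uc.get, reverse=True)
print(ranking)
```

In this toy case the differences in SD between crosses are small relative to the differences in PM, so the UC ranking follows the PM ranking, which is the kind of situation the text describes for elite material.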
Progeny size
To evaluate if the observed variance in the field using a limited progeny size (< 100) is reliable, we tested different numbers of progenies per cross (Fig. 3 ).
Fig. 3 Impact of the number of progenies per cross on the predictive ability of the three cross value components, based on simulations. A number of progenies equal to 0 means that the number of progenies was the same as in the experimental data (see Materials and methods section). Marker effects were estimated with the “BayesA” method. PM parental mean, SD standard deviation, UC usefulness criterion
The predictive ability of the three cross value components increased when the number of progenies per cross increased from the number of progenies in the experiment (~ 48) to 2,000 (Fig. 3). PM and UC were only slightly affected: predictive abilities increased from 0.90 to 0.91 for PM and from 0.82 to 0.90 for UC when h2 = 0.7 and QTL number = 300, for instance. The benefit of producing large progenies was much greater for SD estimation: predictive abilities increased from 0.29 to 0.57. The slope of the curve flattened a first time when progeny size exceeded 250 and a second time when it exceeded 1,000, where it was close to a plateau.
The relative increase in predictive ability for SD was markedly stronger in scenarios with a high number of QTL (Fig. 3 and Supplementary Figure S3), with a plateau when the number of QTL exceeded 300.
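One reason progeny size matters so much for SD is that an observed standard deviation is itself a noisy statistic. The Python sketch below is an illustration under simplifying assumptions (independent, normally distributed progeny values, no phenotyping error) and is not the simulation design of the study. It compares a Monte Carlo estimate of the relative sampling error of a sample SD with the classical approximation \(1/\sqrt{2(n-1)}\).

```python
import numpy as np

def sd_relative_error(n_progeny, n_rep=2000, seed=0):
    """Monte Carlo standard error of the sample SD of n_progeny normal draws.

    The true SD is 1, so the returned value is a relative error.
    """
    rng = np.random.default_rng(seed)
    draws = rng.normal(size=(n_rep, n_progeny))
    sample_sd = draws.std(axis=1, ddof=1)  # one observed SD per simulated cross
    return float(sample_sd.std(ddof=1))

def approx_relative_error(n_progeny):
    """Large-sample approximation for normal data: 1 / sqrt(2 (n - 1))."""
    return float(1.0 / np.sqrt(2.0 * (n_progeny - 1)))

for n in (48, 250, 1000, 2000):
    print(n, round(sd_relative_error(n), 3), round(approx_relative_error(n), 3))
```

Under these assumptions, the sampling error of an observed SD is roughly 10% of its value with 48 progenies and falls below about 2.5% from 1,000 progenies onward, which is consistent with the diminishing returns reported for large progeny sizes.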
Marker effect estimation precision
We compared two scenarios to test the impact of marker effect estimation precision on the prediction of cross value components. We considered the TRUE scenario as a reference ( pred_true in Fig. 1). In this optimal situation, the TP is of infinite size and ideal genetic composition, so that marker effects are perfectly estimated: we use the true simulated marker effects to estimate PM, SD and UC. In the more realistic ESTIMATED scenario ( pred_estimated in Fig. 1), marker effects are estimated with a genomic prediction model trained on our TP, which is of limited size.
In the TRUE scenario ( pred_true ), the predictive abilities were very high for PM and UC whatever the number of QTL or the heritability. However, SD predictive ability was high only when the number of QTL was below 300. For instance, the predictive ability was 0.97, 0.89 and 0.81 for PM, UC and SD, respectively, when QTL number was 10 and the number of progenies was the same as in the experimental data (Fig. 3 and Supplementary Figure S3), and 0.99, 0.95 and 0.95 when the number of progenies was 2,000 (Fig. 4 B). We observed a strong decrease in predictive ability for SD when the number of QTL increased (0.37 for QTL number = 1,000 versus 0.81 for QTL number = 10, with the number of progenies of the experimental data; Fig. 4 A). This decrease was much smaller when the number of progenies per cross was 2,000 (0.86 for QTL number = 1,000 versus 0.95 for QTL number = 10; Fig. 4 B). Together, these results show that if marker effects were perfectly estimated and progeny size were sufficient, SD could also be estimated very well, even when the number of QTL is high.
Fig. 4 Impact of the quality of marker effect estimates on the predictive ability of the three cross value components, based on simulations. The number of progenies per cross was the same as in experimental data for plot A and equal to 2,000 for plot B. The heritability was 0.7. The true cross value components (pred_true) were calculated with true marker effects, whereas the estimated cross value components (pred_estimated) were calculated with marker effects estimated using the “BayesA” method. PM parental mean, SD standard deviation, UC usefulness criterion
As we observed an important gap of predictive abilities between the TRUE and the ESTIMATED scenarios for SD (Fig. 4 ), we tested different genomic prediction models with different hypotheses on trait architecture (BayesA, BayesB, BayesC, BL, BRR and RR-BLUP (Vg1)). We also tested 2 different SD estimators deriving from the algebraic formulae of SD: Vg2 = Vg1 + term that takes into account the error of estimation of marker effects, and Vg3 = Vg2 + term that takes into account the variance in parents (Supplementary Information S1 ) (Fig. 5 ).
Fig. 5 Impact of the genomic selection method on the predictive ability of SD, based on simulations. The number of progenies per cross was the same as in experimental data for plots A and B and equal to 2,000 for plots C and D. In A and C, the heritability was 0.7. In B and D, the number of QTL was 300. PM parental mean, SD standard deviation, UC usefulness criterion
When progeny size was equal to that of the experimental design (~ 48) and QTL number was less than or equal to 300, Bayesian models performed better than the RR-BLUP approach, particularly the Bayes A and B models. For instance, predictive abilities were 0.76 for both Bayes A and B models versus 0.62 for Vg1 when QTL number = 10 and heritability = 0.7 (Fig. 5 A and B). Taking into account the error in marker effect estimates slightly increased the predictive abilities in scenarios with 1,000 QTL and heritability = 0.7 (0.26 for Vg2 and Vg3 compared to 0.23 for Vg1) (Fig. 5 B).
When progeny size was 2,000 and QTL number was below 300, Bayes B still gave better predictions of SD (for instance, 0.91 for Bayes B versus 0.73 for Vg1 when QTL number = 10 and heritability = 0.7; Fig. 5 C). When the number of QTL was ≥ 300, Vg3 increased predictive abilities (Fig. 5 C), especially when heritability was low (0.25 for Vg1 versus 0.33 for Vg3 when QTL number = 300 and heritability = 0.2; Fig. 5 D). Vg3 performed as well as Bayes B when heritability = 0.8 and number of QTL = 300 (0.64) (Fig. 5 D).
All these results show that the quality of marker effect estimates has a strong impact on the predictive ability of SD. For traits with small number of QTL, Bayesian models (Bayes A or B) should be preferred. When the number of QTL is high, Vg3 should be used, especially if heritability is low. (It is equivalent to Bayes B when heritability is very high.)
Genetic composition of the mating plan
To evaluate the bias possibly generated by the choice of crosses in our experimental design, we compared the results obtained with the experimental crosses, called “PrediCropt”, with a set of crosses called “Unrelated.” We selected the “Unrelated” parents to be representative of the total genetic diversity present in the TP and to have low genetic similarity with one another, and we generated the crosses by mating them at random (see Materials and methods). The first 4 PC for the two sets of parents are provided in Supplementary Figure S2 and show that the “Unrelated” parents are more scattered throughout the TP than the “PrediCropt” parents (for instance, there is no “PrediCropt” parent in the bottom left of graphic A, PC 1 versus PC 2).
The predictive abilities obtained with ten random mating plans of the “Unrelated” set of parents are provided in Supplementary Figure S4 for the scenario with 300 QTL and a heritability of 0.7. Little difference was observed between the two sets of parents: the predictive abilities were 0.90 (“PrediCropt”) versus 0.88 (“Unrelated”) for PM, 0.82 for both sets for UC, and 0.29 (“PrediCropt”) versus 0.24 (“Unrelated”) for SD. Based on these results, we considered the bias generated by the choice of our experimental crosses to be negligible.
Experimental data application
In our experimental data, we phenotyped yield, plant height and heading date for the progenies of 101 crosses and grain protein content for the progenies of 73 crosses (Supplementary Table S11). On average, correlations between trials (year x location) were 0.48, 0.43, 0.79 and 0.88 for yield, grain protein content, plant height and heading date, respectively, which is consistent with other trials (Supplementary Tables S12 and S13).
We compared the experimental cross value components to the genomic predictions described above. Note that both are predictors of the true values, and we do not know a priori if one is better than the other. We only examine their correlations here (Table 2). Observed and predicted PM were significantly correlated for the 4 traits. Median correlation values across all prediction models were 0.38, 0.63, 0.51 and 0.91 for yield, grain protein content, plant height and heading date, respectively. Observed and predicted SD were significantly correlated for plant height and heading date (0.59 and 0.38, respectively) but not for yield and grain protein content (0.01 and 0.13, respectively). UC was, however, significantly correlated for all 4 traits (0.45, 0.52, 0.54 and 0.74 for yield, grain protein content, plant height and heading date, respectively).
Note that for plant height and heading date, SD was better predicted with non-Gaussian Bayesian approaches (0.63–0.64 and 0.46–0.49, respectively) than with Gaussian approaches (0.53–0.54 and 0.33–0.34, respectively), suggesting that those traits are polygenic with some major QTL (Ellis et al. 2002 ).
The present study aimed at investigating the genomic predictive ability of the cross value called UC (usefulness criterion; \(\text{UC}=\text{PM}+i\times \text{SD}\) ) and its two components, the mean (PM; parental mean) and the variance (SD; standard deviation) of the progeny. We used simulations to test the impact of the precision of marker effect estimation in predictions. We compared two scenarios, one using perfectly estimated marker effects (true marker effects) and one using more realistically estimated marker effects. We tested two new estimators of SD that take into account the error in marker effect estimation. We also tested the impact of trait heritability (varying from 0.2 to 0.8), trait architecture (number of QTL varying between 10 and 1000) and progeny size (varying from the number of progenies per cross observed in the field to 2000). We used those figures as a reference grid for the comparison of predictions with experimental data composed of yield, grain protein content, plant height and heading date evaluated for 101 crosses with 55 progenies on average. Our results shed light on the factors influencing cross value predictive ability and provide insights into the prerequisites for using such methods in precision breeding programs.
A high-quality training population (TP)
In this study, we took advantage of a historical TP of more than 20 years of data collection to predict three cross value components (PM, SD and UC) of 73 or 101 crosses depending on the trait using genomic prediction models. We compared these predictions to experimental observations of progenies. All the results concerning the TP (correlations between years, repeatabilities at plot and design levels) suggest a high-quality dataset. The predictive abilities obtained by cross-validation ranging from 0.56 (plant height) to 0.70 (heading date) were higher than previously obtained with a smaller TP (G. Charmet, personal communication). We checked if we could optimize the training population (data not shown) by an iterative approach that removes the environments (year x location) one by one and that computes predictive ability at each iteration (Heslot et al. 2013 ). We finally kept all the environments of the TP because we did not detect any environment deteriorating the quality of the prediction. Even trials from the South network were improving predictions of North trials. We also checked (data not shown) the predictive ability within each sub-dataset of the TP (GEVES, INRAE-AO, North and South) and the predictive abilities of the 73 crosses obtained for the three cross-value components using these sub-datasets as TP. We observed better predictive abilities using all datasets combined as TP. All these preliminary results led us to retain the entire dataset as the TP in this study.
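The environment-screening check mentioned above (drop one environment at a time and recompute the predictive ability) can be sketched as a small loop. This Python sketch is only illustrative: the function `predictive_ability` stands for whatever cross-validation routine is used, the environment names and the toy scoring function are hypothetical, and the procedure of Heslot et al. (2013) may differ in its details.

```python
def leave_one_environment_out(environments, predictive_ability):
    """Return the baseline predictive ability and, for each environment,
    the change observed when that environment is removed from the TP."""
    baseline = predictive_ability(list(environments))
    change = {}
    for env in environments:
        reduced = [e for e in environments if e != env]
        change[env] = predictive_ability(reduced) - baseline
    return baseline, change

# Toy stand-in for a cross-validation routine (hypothetical values only)
def toy_ability(envs):
    contribution = {"2019_North": 0.05, "2020_North": 0.04,
                    "2020_South": 0.03, "2021_South": -0.02}
    return 0.5 + sum(contribution[e] for e in envs)

envs = ["2019_North", "2020_North", "2020_South", "2021_South"]
baseline, change = leave_one_environment_out(envs, toy_ability)
harmful = [e for e, delta in change.items() if delta > 0]  # removal improves prediction
print(baseline, harmful)
```

An environment would be a candidate for removal only if dropping it increases the predictive ability; in the study no such environment was detected, so all were kept.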
Due to the size and complexity of the TP design (not uniform spatial information), for computational reasons, and as done in previous studies (Tiede et al. 2015 ; Danguy des Déserts et al. 2023 ), we separated the steps of correction for the fixed effects such as spatial coordinates and environmental effects (location and year) from the estimation of marker effects. It would be interesting to evaluate the impact of this pre-correction on the estimation of marker effects step.
Parameters impacting predictive abilities based on simulations
To evaluate different parameters impacting predictive abilities, we simulated progenies using the same crosses as in our experiment, varying heritability, QTL number, progeny size, prediction models, and genetic composition of the mating plan.
The first comment on this simulation study concerns the difference observed between true heritability and estimated heritability, with a missing heritability ranging from 0.1 to 0.2 in estimated heritabilities, whatever the number of QTL simulated. This could be explained by a lack of genetic information, such as the absence of rare variants or structural variants (Manolio et al. 2009), epistasis and GxE. Given this observation, we focused our results on scenarios where heritability = 0.7, which corresponds to an estimated heritability of 0.5, consistent with yield observations.
First, the predictive ability of the three cross value components increased when heritability increased. QTL number mainly affected SD predictions: their predictive ability decreased when QTL number increased. These two results agree with those obtained in previous studies (Wimmer et al. 2013; Tiede et al. 2015; Yao et al. 2018). The first conclusion of this simulation study is that trait architecture is the most important factor impacting predictive abilities, especially for the prediction of SD.
Second, the predictive ability of the three cross value components increased when progeny size increased, SD being strongly affected. To our knowledge, no study has yet investigated this parameter. Our results suggest that more than 1,000 progenies per cross should be sufficient to maximize SD predictive ability, as little difference was observed between the scenarios with 1,000 and 2,000 progenies per cross.
Third, we observed that if we perfectly knew marker effects (TRUE scenario), we were able to estimate the three cross value components almost perfectly when QTL number was low and progeny size was large. However, SD predictive ability decreased strongly with the number of QTL, especially when marker effects were estimated. This could be due to two factors: the TP is not informative enough with respect to our validation population, and/or the prediction model is not optimal. We could not explore the first factor because no other TP was available for this study. However, we investigated the second factor by testing different prediction models. We observed small differences between prediction models, mostly in the small progeny size scenarios. This result agrees with several previous studies (Yao et al. 2018; Neyhart et al. 2019; Santos et al. 2019). RR-BLUP is known to have a similar or slightly better performance for GEBV prediction than models with differential shrinkage when the number of QTL is large (Daetwyler et al. 2010). We did observe that Bayesian models, Bayes A and B in particular, outperformed models assuming a Gaussian distribution of marker effects when the number of QTL was low, in accordance with several previous studies (Meuwissen et al. 2001; Shepherd et al. 2010; Legarra et al. 2011).
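The contrast drawn above between RR-BLUP and models with differential shrinkage rests on the fact that ridge regression applies the same penalty to every marker. The following Python sketch is a minimal illustration on simulated data, not the models or software used in the study: the data, dimensions and function name are hypothetical, and the penalty is set from the simulated heritability under the assumption of equal marker variances.

```python
import numpy as np

def rr_blup_effects(X, y, lam):
    """Ridge (RR-BLUP-type) marker effects with a single shrinkage parameter lam."""
    Xc = X - X.mean(axis=0)  # centre marker codes
    yc = y - y.mean()        # centre phenotypes (intercept only)
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(0)
n_lines, n_markers, n_qtl, h2 = 300, 200, 10, 0.7
X = rng.choice([-1.0, 1.0], size=(n_lines, n_markers))  # inbred lines coded -1/+1
qtl = rng.choice(n_markers, n_qtl, replace=False)
beta = np.zeros(n_markers)
beta[qtl] = rng.normal(size=n_qtl)                      # only 10 markers have an effect
g = X @ beta
y = g + rng.normal(scale=np.sqrt(g.var() * (1 - h2) / h2), size=n_lines)

# lam = sigma_e^2 / sigma_beta^2; with unit-variance marker codes and equal
# marker variances this is n_markers * (1 - h2) / h2 (an assumption of this sketch)
lam = n_markers * (1 - h2) / h2
beta_hat = rr_blup_effects(X, y, lam)

# Share of the estimated squared effects carried by the true QTL
share_on_qtl = float((beta_hat[qtl] ** 2).sum() / (beta_hat ** 2).sum())
print(round(share_on_qtl, 2))
```

Because all markers are shrunk equally, part of the signal of the few true QTL is spread over non-causal markers; models such as Bayes A or B allow marker-specific shrinkage, which is one plausible reason for their advantage when the number of QTL is low.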
In this study, we computed SD using the algebraic formula proposed by Lehermeier et al. 2017 for gametic variance. As the precision of marker effect estimates strongly impacted predictive abilities, we proposed two variants for the computation of SD that account for the error in marker effect estimation. The first variant (Vg2) was the algebraic version of the posterior mean variance (PMV) (Lehermeier et al. 2017 ). The second variant (Vg3) aimed at considering the fact that the uncertainty of the estimation of marker effects is modulated by the genomic composition of each parent. These two variants were computed only with marker effects estimated by a RR-BLUP approach in our study. They increased SD predictive abilities when progeny size was large and the number of QTL was high, and to a greater extent when heritability was low. This result is in concordance with the 300 QTL simulation scenario described in Lehermeier et al. 2017 .
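The idea behind the first variant can be illustrated with the expectation of a quadratic form: if marker effects have estimate \(\hat{\beta}\) and error covariance \(V\), then \(E[\beta'\Sigma\beta]=\hat{\beta}'\Sigma\hat{\beta}+\text{tr}(\Sigma V)\), where \(\Sigma\) is the covariance of marker genotypes in the progeny. The Python sketch below is a hedged illustration, not the formulae of Supplementary Information S1: it assumes doubled-haploid progeny of two inbred parents, a single chromosome, genotypes coded -1/+1 and Haldane's map function, and it does not attempt Vg3. All function names are hypothetical.

```python
import numpy as np

def progeny_covariance(parent1, parent2, pos_morgan):
    """Covariance of marker genotypes among DH progeny of two inbred parents.

    Assumptions of this sketch: genotypes coded -1/+1, one chromosome,
    Haldane's map function for the recombination rate c.
    """
    s = (np.asarray(parent1, float) - np.asarray(parent2, float)) / 2.0  # 0 if not segregating
    d = np.abs(np.subtract.outer(pos_morgan, pos_morgan))               # map distances (Morgans)
    c = 0.5 * (1.0 - np.exp(-2.0 * d))                                  # recombination rates
    return np.outer(s, s) * (1.0 - 2.0 * c)

def vg1(beta_hat, sigma):
    """Plug-in progeny variance: beta_hat' Sigma beta_hat."""
    return float(beta_hat @ sigma @ beta_hat)

def vg2(beta_hat, sigma, beta_cov):
    """Plug-in variance plus tr(Sigma V), the contribution of marker effect error."""
    return vg1(beta_hat, sigma) + float(np.trace(sigma @ beta_cov))

effects = np.array([1.0, 1.0])
coupling = progeny_covariance([1, 1], [-1, -1], np.array([0.0, 0.0]))
repulsion = progeny_covariance([1, -1], [-1, 1], np.array([0.0, 0.0]))
unlinked = progeny_covariance([1, 1], [-1, -1], np.array([0.0, 50.0]))
print(vg1(effects, coupling), vg1(effects, repulsion), vg1(effects, unlinked))
print(vg2(effects, unlinked, 0.1 * np.eye(2)))
```

Two completely linked loci in coupling phase maximize the progeny variance, the same loci in repulsion cancel out, and the error term in `vg2` adds variance even at loci whose estimated effect is small, which is why such a correction matters most when many small effects are poorly estimated.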
Finally, SD predictive ability was high only in the TRUE scenario, when the number of QTL was 10. We did not take into account the interaction between genotype and environment or epistasis in our prediction models. This could improve predictive ability of traits such as yield, grain protein content and plant height in particular. We could also test if our predictive ability is sufficient to improve our mating design and short/long-term genetic gain in elite and more diverse material.
Validation in experimental data
The predictive ability of cross value components based on experimental progenies’ phenotypes was validated for maize yield, moisture and test weight SD (0.18 for yield, 0.49 for moisture and 0.52 for test weight) (Lian et al. 2015 ). For flowering time (Osthushenrich et al. 2017 ), the correlations were around 0.90 for PM and of 0.46–0.65 for SD. For barley (Tiede et al. 2015 ; Neyhart et al. 2019 ), the predictive abilities for PM were moderate to high (0.46–0.69), whereas those for SD were lower (0.01–0.48). They were higher for heading date (the most heritable trait) and lower for FHB severity (the least heritable trait).
In our study, observed and predicted PM and UC were significantly correlated for all traits (PM: 0.38, 0.63, 0.51 and 0.91; UC: 0.45, 0.52, 0.54 and 0.74; for yield, grain protein content, plant height and heading date, respectively), while SD was significantly correlated only for plant height and heading date (up to 0.64 and 0.49, respectively, with non-Gaussian Bayesian approaches). We were not able to validate our ability to predict SD for yield because progeny sizes were too small according to simulations. As noted by Santos et al. (2019), inferences using SD “should be regarded as a bet”: The value of the best individual among 40 may differ from its parental UC calculated with a of 10 or 2%. For grain protein content, plant height and heading date, non-Gaussian Bayesian approaches outperformed Gaussian approaches for SD prediction, indicating that considering more flexible distributions for marker effects can improve SD estimation accuracy.
It would now be interesting to validate the value of UC in terms of genetic gain using different levels of diversity in the starting germplasm (elite breeding program vs pre-breeding program) under a long-term perspective, using different traits of interest. Progeny size must be carefully calibrated to ensure that the SD potential of crosses is realized. It would also be interesting to test whether including marker-by-environment interaction effects improves SD predictive ability.
Recommendations for breeding schemes
Our results contribute to practical guidance for precision breeding programs. The study highlights the importance of accurate marker effect estimates, obtained from large and optimized TP datasets, as well as of optimizing prediction models to estimate the value of a cross. As progeny size is decisive for realizing the potential of a cross, we advise breeders to produce larger populations for crosses with outstanding SD. Optimizing the mating design based on UC might be even more valuable with more diverse material, where the variance of SD between crosses is larger, as in pre-breeding programs, for instance.
Finally, breeders are interested in predicting several traits of agronomic interest simultaneously in progenies, the more challenging case being those with negative genetic correlations, yield and grain protein content, for instance, for bread wheat (Thorwarth et al. 2018 ). One interesting perspective is now to optimize the predictive ability of this correlation (Neyhart et al. 2019 ) in order to predict crosses with outstanding progenies for the main trait and satisfying values for the secondary trait.
Data availability
The data that support the findings of this study are openly available in repository FSOV PrediCropt at https://doi.org/10.57745/F6230V , reference number F6230V.
References
Allier A, Lehermeier C, Charcosset A, Moreau L, Teyssèdre S (2019a) Improving short- and long-term genetic gain by accounting for within-family variance in optimal cross-selection. Front Genet 10:1006
Allier A, Moreau L, Charcosset A, Teyssèdre S, Lehermeier C (2019b) Usefulness criterion and post-selection parental contributions in multi-parental crosses: application to polygenic trait introgression. G3 Genes Genomes Genetics 9(5):1469–1479. https://doi.org/10.1534/g3.119.400129
Ben-Sadoun S, Rincent R, Auzanneau J, Oury FX, Rolland B, Heumez E, Ravel C, Charmet G, Bouchet S (2020) Economical optimization of a breeding scheme by selective phenotyping of the calibration set in a multi-trait context: application to bread making quality. Theor Appl Genet 133:2197–2212
Bijma P, Wientjes YCJ, Calus MPL (2020) Breeding top genotypes and accelerating response to recurrent selection by selecting parents with greater gametic variance. Genetics 214:91–107
Bohn M, Utz HF, Melchinger AE (1999) Genetic similarities among winter wheat cultivars determined on the basis of RFLPs, AFLPs, and SSRs and their use for predicting progeny variance. Crop Sci 39:228–237
Browning BL, Zhou Y, Browning SR (2018) A one-penny imputed genome from next-generation reference panels. Am J Human Genet 103:338–348
Calus MPL (2010) Genomic breeding value prediction: methods and procedures. Animal 4:157–164
Daetwyler HD, Pong-Wong R, Villanueva B, Woolliams JA (2010) The impact of genetic architecture on genome-wide evaluation methods. Genetics 185:1021–1031
Danguy des Déserts A, Bouchet S, Sourdille P, Servin B (2021) Evolution of recombination landscapes in diverging populations of bread wheat. Genom Biol Evolut 13. https://doi.org/10.1093/gbe/evab152
Danguy des Déserts A, Durand N, Servin B, Goudemand-Dugué E, Alliot J-M, Ruiz D, Charmet G, Elsen J-M, Bouchet S (2023) Comparison of genomic-enabled cross selection criteria for the improvement of inbred line breeding populations. G3 Genes Genom Genet jkad195
Ellis M, Spielmeyer W, Gale K, Rebetzke G, Richards R (2002) “Perfect” markers for the Rht-B1b and Rht-D1b dwarfing genes in wheat. Theor Appl Genet 105:1038–1042
Elsen J-M (2022) Genomic prediction of complex traits, principles, overview of factors affecting the reliability of genomic prediction, and algebra of the reliability. In: Ahmadi N, Bartholomé J (eds) Genomic prediction of complex traits: methods and protocols. Springer US, New York, pp 45–76. https://doi.org/10.1007/978-1-0716-2205-6_2
Endelman JB (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. https://doi.org/10.3835/plantgenome2011.08.0024
Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME (2012) Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95:4114–4129
Falconer DS, Mackay TFC (1996) Introduction to quantitative genetics. Prentice Hall, Harlow, England
Gianola D, de Los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability and the bayesian alphabet. Genetics 183(1):347–363
Goddard ME, Hayes BJ, Meuwissen THE (2011) Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet 128:409–421
Habier D, Fernando RL, Dekkers JCM (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177:2389–2397
Habier D, Fernando RL, Kizilkaya K, Garrick DJ (2011) Extension of the bayesian alphabet for genomic selection. BMC Bioinf 12:186
Hayes BJ, Visscher PM, Goddard ME (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res (camb) 91:47–60
Henderson CR (1975) Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423–447
Heslot N, Jannink J-L, Sorrells ME (2013) Using genomic prediction to characterize environments and optimize prediction accuracy in applied breeding data. Crop Sci 53:921–933
Hung H-Y, Browne C, Guill K, Coles N, Eller M, Garcia A, Lepak N, Melia-Hancock S, Oropeza-Rosas M, Salvo S et al (2012) The relationship between parental genetic or phenotypic divergence and progeny variation in the maize nested association mapping population. Heredity (edinb) 108:490–499
Lee SH, Clark S, van der Werf JHJ (2017) Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship. PLoS ONE 12:e0189775
Legarra A, Robert-Granié C, Croiseau P, Guillaume F, Fritz S (2011) Improved Lasso for genomic selection. Genet Res 93:77–87
Lehermeier C, Teyssèdre S, Schön C-C (2017) Genetic gain increases by applying the usefulness criterion with improved variance prediction in selection of crosses. Genetics 207:1651–1661
Lian L, Jacobson A, Zhong S, Bernardo R (2015) Prediction of genetic variance in biparental maize populations: Genomewide marker effects versus mean genetic variance in prior populations. Crop Sci 55:1181–1188
Liu X, Wang H, Wang H, Guo Z, Xu X, Liu J, Wang S, Li W-X, Zou C, Prasanna BM et al (2018) Factors affecting genomic selection revealed by empirical evidence in maize. Crop J 6:341–352
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A et al (2009) Finding the missing heritability of complex diseases. Nature 461:747–753
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
Mohammadi M, Tiede T, Smith KP (2015) PopVar: a genome-wide procedure for predicting genetic variance and correlated response in biparental breeding populations. Crop Sci 55:2068–2077
Murtagh F, Legendre P (2014) Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? J Classif 31:274–295
Neyhart JL, Lorenz AJ, Smith KP (2019) Multi-trait improvement by predicting genetic correlations in breeding crosses. G3 Genes Genom Genet 9(10):3153–3165. https://doi.org/10.1534/g3.119.400406
Nyquist WE, Baker RJ (1991) Estimation of heritability and prediction of selection response in plant populations. Crit Rev Plant Sci. https://doi.org/10.1080/07352689109382313
Osthushenrich T, Frisch M, Herzog E (2017) Genomic selection of crossing partners on basis of the expected mean and variance of their derived lines. PLoS ONE 12:e0188839
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103:681–686
Pasek J (2021) weights: Weighting and Weighted Statistics. R package. https://cran.r-project.org/web/packages/weights/index.html
Pérez P, de Los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198(2):483–495
Piepho H-P, Möhring J (2007) Computing heritability and selection response from unbalanced plant breeding trials. Genetics 177:1881–1888
Pook T, Schlather M, Simianer H (2020) MoBPS—modular breeding program simulator. G3 Genes Genom Genet 10(6):1915–1918. https://doi.org/10.1534/g3.120.401193
R Core Team (2021) R: A language and environment for statistical computing. https://www.r-project.org/
Rimbert H, Darrier B, Navarro J, Kitt J, Choulet F, Leveugle M, Duarte J, Rivière N, Eversole K (2018) High throughput SNP discovery and genotyping in hexaploid wheat. PLoS ONE 13(1):e0186329. https://doi.org/10.1371/journal.pone.0186329
Rodríguez-Álvarez MX, Boer MP, van Eeuwijk FA, Eilers PHC (2018) Correcting for spatial heterogeneity in plant breeding experiments with P-splines. Spat Stat 23:52–71
Santos DJA, Cole JB, Lawlor TJ, VanRaden PM, Tonhati H, Ma L (2019) Variance of gametic diversity and its application in selection programs. J Dairy Sci 102:5279–5294
Schnell FW, Utz HF (1975) F1-Leistung und Elternwahl in der Züchtung von Selbstbefruchtern. In: Bericht über die Arbeitstagung der Vereinigung Österreichischer Pflanzenzüchter, pp 243–248, Gumpenstein, Austria
Selle ML, Steinsland I, Hickey JM, Gorjanc G (2019) Flexible modelling of spatial variation in agricultural field trials with the R package INLA. Theor Appl Genet 132:3277–3293
Shepherd RK, Meuwissen TH, Woolliams JA (2010) Genomic selection and complex trait prediction using a fast EM algorithm applied to genome-wide markers. BMC Bioinf 11:529
Thorwarth P, Piepho HP, Zhao Y, Ebmeyer E, Schacht J, Schachschneider R, Kazman E, Reif JC, Würschum T, Longin CFH (2018) Higher grain yield and higher grain protein deviation underline the potential of hybrid wheat for a sustainable agriculture. Plant Breed 137:326–337
Tiede T, Kumar L, Mohammadi M, Smith KP (2015) Predicting genetic variance in bi-parental breeding populations is more accurate when explicitly modeling the segregation of informative genomewide markers. Mol Breed 35:199
Usai MG, Goddard ME, Hayes BJ (2009) LASSO with cross-validation for genomic selection. Genet Res (camb) 91:427–436
Utz HF, Bohn M, Melchinger AE (2001) Predicting progeny means and variances of winter wheat crosses from phenotypic values of their parents. Crop Sci 41:1470–1478
VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423
Whittaker JC, Thompson R, Denham MC (2000) Marker-assisted selection using ridge regression. Genet Res 75:249–252
Wimmer V, Lehermeier C, Albrecht T, Auinger H-J, Wang Y, Schön C-C (2013) Genome-wide prediction of traits with different genetic architecture through efficient variable selection. Genetics 195:573–587
Yao J, Zhao D, Chen X, Zhang Y, Wang J (2018) Use of genomic selection and breeding simulation in cross prediction for improvement of yield and quality in wheat ( Triticum aestivum L.). The Crop Journal 6:353–365
Zhong S, Jannink J-L (2007) Using quantitative trait loci results to discriminate among crosses on the basis of their progeny mean and variance. Genetics 177:567–576
Acknowledgements
The authors thank the Fonds de Soutien à l’Obtention Végétale (FSOV) for financing the FSOV 2020 I PrediCropt project. The authors acknowledge the partners of the project (Agri-Obtentions and Florimond-Desprez Veuve & Fils) and the INRAE personnel responsible for experimental evaluation (Laurent Falchetto and Sandrine Berges, INRAE UE PHACC; Clement Debiton, INRAE UMR GDEC, CRB group; Kevin Bargoin, INRAE UMR GDEC, DIGEN group; Patrice Walczak, INRAE UE Ferlus; Paul Bataillon, INRAE UE Auzeville; and Emmanuel Heumez, INRAE UE GCIE). They also thank Marie-Hélène Bernicot and Solène Barrais for providing the GEVES training population dataset (corresponding to the experimental evaluation data for French national registration) and Justin Blancon for helping to adjust for spatial heterogeneity in experimental measures.
This work was supported by the Fonds de Soutien à l’Obtention Végétale (FSOV 2020 I—PrediCropt project). Genotyping was supported by the Breedwheat grant (ANR-10-BTBR-0003) and INRAE IVD program.
Author information
Authors and affiliations.
UMR1095, GDEC, INRAE-Université Clermont-Auvergne, Clermont-Ferrand, France
Claire Oget-Ebrad, François-Xavier Oury & Sophie Bouchet
INRAE-UE Lille, 2 Chaussée Brunehaut, Estrées Mons, BP50136, 80203, Peronne Cedex, France
Emmanuel Heumez
Agri-Obtentions, Ferme de Gauvilliers, 78660, Orsonville, France
Laure Duchalais
Florimond-Desprez Veuve & Fils SAS, Cappelle-en-Pévèle, France
Ellen Goudemand-Dugué
UMR1388, GenPhySE, INRAE-Université de Toulouse, Castanet-Tolosan, France
Jean-Michel Elsen
Contributions
COE collected and formatted the data, performed the analyses, interpreted the results and wrote the manuscript. EH, LD, EGD and FXO provided crosses’ progenies for the FSOV PrediCropt project. SB conceived and coordinated the FSOV project. JME and SB supervised COE, interpreted the results and helped write the manuscript. All authors reviewed and approved the final version of the manuscript.
Corresponding author
Correspondence to Sophie Bouchet.
Ethics declarations
Conflict of interest.
The authors have no relevant financial or non-financial interests to disclose.
Ethical approval
This is an observational study. No ethical approval is required.
Additional information
Communicated by Huihui Li.
Jean-Michel Elsen and Sophie Bouchet have contributed equally to this work.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file1 (DOCX 2034 KB)
Rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .
About this article
Oget-Ebrad, C., Heumez, E., Duchalais, L. et al. Validation of cross-progeny variance genomic prediction using simulations and experimental data in winter elite bread wheat. Theor Appl Genet 137, 226 (2024). https://doi.org/10.1007/s00122-024-04718-6
Received: 26 September 2023
Accepted: 16 August 2024
Published: 18 September 2024
DOI: https://doi.org/10.1007/s00122-024-04718-6
Experimental units must be defined during experimental design. The experimental unit is an individual, object, or plot subjected to treatment independently of other units. ... Genetic and developmental differences, as well as differences in species abundance and diversity, can vary between experimental units. In plant breeding, clones and ...
This manual provides guidelines on the design and statistical analysis of plant breeding field trials with a focus on b... do not have access to permanent biometric support. Published on 08/02/2022. ExcellenceinBreeding.org.
November 14, 2019 by plant-breeding-genomics. Experimental Design. These tutorials provide an introduction to experimental design including different layouts and considerations for statistical inference. Experimental design is the process of choosing treatments, responses, and controls, defining experimental and sample units, and determining ...
Part 1 - Introduction to augmented designs. Part 2 - Background to sample analysis using Dr. Jennifer Kling's meadowfoam breeding program. Part 3 - Experiment design and field layout. Part 4 - Analysis treating genotypes as fixed. Part 5 - Analysis treating genotypes as random. Part 6 - Introduction to two way control of ...
This open textbook covers common statistics used in agriculture research, including experimental design in plant breeding and genetics, as well as the analysis of variance, regression, and correlation. About the Contributors Editors. Suza is an Adjunct Associate Professor at Iowa State University. He teaches courses on Genetics and Crop ...
Plant breeding is fundamental to developing new cultivars with higher yield, improved quality, and tolerance or resistance to several abiotic and biotic stresses. ... In this scenario, an optimal experimental design could be designed as follows: (1) determine the subset from CS to undergo field-testing, thereby forming the TRS (TRS optimization ...
The basic principles of a statistically valid experimental design comprise: 1. Replication 2. Randomization 3. ... 1926) nearly 100 years ago and still remain valid today. Although there is a vast number of experimental designs used in plant breeding trials at different stages and for different purposes (e.g., Cochran and Cox 1992; Hinkelmann ...
... concept of factorial structures has less relevance to plant breeding experiments. All five concepts are covered extensively in other books, such as Mead (1988). 4.2.2 Model assumptions. The primary purpose of a plant breeding trial is to assess differences among ...
Plant breeding; With plentiful knowledge of gene function and the development of technologies like gene editing, breeders are fully equipped to address grand challenges and eliminate various forms ...
In this chapter a short overview of experimental designs for tests that can be used when developing new varieties is presented. Problems with regard to the estimation of differences between candidates and testing their significance are not considered. These topics belong to a special branch of statistics, i.e. design and analysis of experiments.
Kent M. Eskridge University of Nebraska Lincoln, NE. Abstract: Field trial design is a crucial consideration in efficient, cost-effective plant breeding programs. Design properties and specific designs relevant to plant breeding are considered. Good field designs are important at all major stages of a plant breeding program from the earlier ...
The model used for the design of field trials for plant breeding programs is usually a linear mixed model, which is consistent with the linear mixed model used for the analysis. Early work on model-based design for plant breeding selection trials focussed on methods to find optimal designs for spatially dependent data (Martin 1986; Martin et ...
This online programme covers statistical principles and experimental designs for breeding trials, including concepts such as randomisation, replication, blocking, and the use of controls. You will learn about different experimental designs, including completely randomised design, randomised complete block design, and split-plot designs. You will also learn about linear regression, analysis of ...
Abstract. It describes designs and analysis in relation to problems encountered in plant breeding research. It covers issues like estimation of heritability, progeny row trials, incomplete block ...
The purpose of Breedbase is to enable a digital ecosystem that contains an integrated breeding workflow. Processes and data comprising germplasm banks, parental selection, crossing design, experimental design, data collection, analyses, and decision-making tools are aggregated into a single system.
Experiment: Designs used in plant breeding experiments, analysis of Randomized Block Design (RBD). Experimental design is the process of choosing treatments, responses, and controls, defining experimental and sample units, and determining the physical ...
Consequently, there has been considerable emphasis on improving experimental design for field testing. The implementation of genomic selection and prediction in plants is becoming common, and there are opportunities for its incorporation into plant breeding programmes which differ from those in animals. Ultimately, however, the link between ...
With limited resources, plant breeders must make trade-offs in resource allocation. While the randomized complete block design (RCBD) has been popular for agricultural research, it has been recognized that alternative experimental designs may better meet the goals of plant breeding.
The approach presented in the current paper for the analysis of multi-phase plant breeding experiments builds on that of Brien (1983) and Wood et al. (1988). In both of those papers the analysis of multi-phase experiments is conducted by determining the experimental structure then including appropriate ...
Randomization and Layout. Randomization, or random distribution of treatments into experimental units, helps ensure that measurements of experimental variation are unbiased by destroying correlations among errors. When an entire treatment is grouped together, for example on the sunny side of a greenhouse, lighting becomes a confounding factor to ...
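As an illustration of the randomization idea in the snippet above, the following minimal Python sketch assigns treatments to plots at random in a completely randomized design. The function name randomize_crd, the genotype labels, the number of replicates, and the seed are hypothetical and chosen only for this example; the sketch is not taken from any of the sources listed here.

```python
import random


def randomize_crd(treatments, n_reps, seed=None):
    """Return a random plot order for a completely randomized design.

    Each treatment appears n_reps times; the shuffled list gives the
    treatment assigned to plot 1, plot 2, and so on.
    """
    rng = random.Random(seed)  # a local generator keeps the plan reproducible
    plots = [t for t in treatments for _ in range(n_reps)]
    rng.shuffle(plots)  # every ordering of the plots is equally likely
    return plots


# Hypothetical example: four genotypes, three replicates each, twelve plots.
genotypes = ["G1", "G2", "G3", "G4"]
plan = randomize_crd(genotypes, n_reps=3, seed=2024)
for plot_number, genotype in enumerate(plan, start=1):
    print(f"Plot {plot_number:2d}: {genotype}")
```

Because the assignment is shuffled rather than grouped, no genotype is systematically placed in one part of the field or greenhouse, which is what guards against the confounding described above. Fixing the seed only makes the plan repeatable; a real trial would normally record the seed used.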
Introduction. This page is a continuation of the Overview of Analysis of Variance page and is intended to help plant breeders consider the notions of fixed and random effects and the impacts these can have on ANOVA in the context of plant breeding. Briefly, ANOVA is a statistical test that takes the total variation and assigns it to known causes, leaving a residual portion allocated to ...
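The partitioning of variation described in the snippet above can be sketched for the simplest case, a one-way layout. The Python example below splits the total sum of squares into a treatment part and a residual part. The function name partition_sums_of_squares and the yield values are made up for illustration, and the sketch leaves out the F test and any fixed-versus-random considerations.

```python
def partition_sums_of_squares(groups):
    """Split total variation into treatment and residual sums of squares.

    groups maps a treatment label to the list of observations for it.
    Returns (ss_total, ss_treatment, ss_residual).
    """
    all_values = [y for values in groups.values() for y in values]
    grand_mean = sum(all_values) / len(all_values)

    # Total variation: squared deviations of every value from the grand mean.
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)

    # Variation assigned to treatments: group means around the grand mean.
    ss_treatment = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )

    # Residual variation: observations around their own group mean.
    ss_residual = sum(
        (y - sum(values) / len(values)) ** 2
        for values in groups.values()
        for y in values
    )
    return ss_total, ss_treatment, ss_residual


# Hypothetical yields (arbitrary units) for three genotypes, three plots each.
yields = {
    "A": [4.0, 5.0, 6.0],
    "B": [6.0, 7.0, 8.0],
    "C": [8.0, 9.0, 10.0],
}
ss_total, ss_treatment, ss_residual = partition_sums_of_squares(yields)
print(f"total = {ss_total}, treatment = {ss_treatment}, residual = {ss_residual}")
```

For these made-up data the treatment and residual sums of squares add up to the total, which is the identity that ANOVA relies on.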
Key message: From simulations and experimental data, the quality of cross progeny variance genomic predictions may be high, but depends on trait architecture and necessitates a sufficient number of progenies. Abstract: Genomic predictions are used to select genitors and crosses in plant breeding. The usefulness criterion (UC) is a cross-selection criterion that necessitates the estimation of ...