CN117558341A - American black poplar whole genome breeding chip and construction method and application thereof - Google Patents


Publication number
CN117558341A
CN117558341A CN202311612769.4A CN202311612769A CN117558341A CN 117558341 A CN117558341 A CN 117558341A CN 202311612769 A CN202311612769 A CN 202311612769A CN 117558341 A CN117558341 A CN 117558341A
Authority
CN
China
Prior art keywords
snp
sites
poplar
breeding
whole genome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311612769.4A
Other languages
Chinese (zh)
Inventor
韦素云
尹佟明
郭臣臣
吴怀通
戴晓港
陈赢男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Forestry University
Priority to CN202311612769.4A
Publication of CN117558341A
Legal status: Pending


Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/6895 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00 Libraries per se, e.g. arrays, mixtures
    • C40B40/04 Libraries containing only organic compounds
    • C40B40/06 Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00 Methods of creating libraries, e.g. combinatorial synthesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/13 Plant traits
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Genetics & Genomics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Immunology (AREA)
  • General Chemical & Material Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Medicinal Chemistry (AREA)
  • Microbiology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Botany (AREA)
  • Mycology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses an American black poplar whole-genome breeding chip, together with its construction method and application, and belongs to the field of poplar biological breeding. Polymorphic sites are identified from whole-genome resequencing data of a large poplar population, and high-quality SNP sites are screened by setting test parameters such as the maximum missing rate, the minimum allele frequency and Hardy-Weinberg equilibrium. Functional sites significantly associated with important economic traits such as growth and wood properties are obtained by genome-wide association analysis, and a 40K SNP breeding chip suited to the important economic traits of poplar is finally developed. The invention not only provides an efficient, low-cost and high-precision genotyping chip but also significantly improves the accuracy of genomic selection for important poplar traits. It can therefore improve the efficiency of early selection in forest trees, accelerate the breeding of fine forest tree varieties, and provide an efficient molecular breeding tool with broad application prospects.

Description

American black poplar whole genome breeding chip and construction method and application thereof
Technical Field
The invention belongs to the field of poplar biological breeding and relates to an American black poplar whole-genome breeding chip and to its construction method and application.
Background
Forestry is fundamental to sustainable economic and social development, and improved varieties are the lifeblood of forestry and the driving force behind the development of the forestry industry. Poplar is among the most widely distributed and most extensively cultivated fast-growing timber species in the world. Because it grows fast, yields well, propagates easily and adapts readily, it plays an indispensable role in forestry production and ecological construction in China. In addition, poplar was the first woody plant to have its whole genome sequenced, and its complete genome sequence provides a solid foundation for functional gene mining and forest tree genetic breeding.
Forest trees have long generation times, suffer from inbreeding depression and have complex genetic backgrounds. These factors limit progress in genetic improvement and have become a bottleneck for raising breeding efficiency and accelerating germplasm innovation. The key to breaking the bottleneck of the long breeding cycle is the transition from phenotypic selection to genotypic selection. Many important economic traits of forest trees, such as growth, wood properties and resistance, are quantitative traits controlled by many genes of small effect. With the development of molecular biology and genomics, genetic mapping and association analysis based on marker-assisted selection (MAS) have overcome the limitations of traditional quantitative genetic methods and markedly improved the accuracy of quantitative trait gene mapping. In the post-genomic era, high-density genetic linkage maps and genome-wide association studies (GWAS) have laid a foundation for revealing the genetic mechanisms of quantitative traits in forest trees and provide important gene resources for their genetic improvement. In recent years, the rapid development of high-throughput sequencing and of genome-wide genetic markers has driven modern selective breeding technologies. Genomic selection (GS), which computes breeding values from classical quantitative genetics and molecular markers, can rapidly select genotypes with superior traits from large germplasm collections and improves selection efficiency for complex traits under polygenic control and for traits of low heritability. GS has been highly successful in breeding for complex quantitative traits, can shorten the breeding cycle, helps achieve targeted and efficient improvement, and has become a basic method of modern breeding.
The core of genomic selection is an efficient, low-cost genome-wide marker genotyping technology. Single nucleotide polymorphisms (SNPs), as genetic markers distributed widely across the genome, are numerous, genetically stable, highly diverse and easy to detect, and have become the most commonly used marker type in studies of genetic variation. Whole-genome resequencing can yield genome-wide SNP markers, but the high cost of large-scale sequencing-based genotyping remains a challenge; reduced-representation genome sequencing greatly lowers genotyping cost but captures only markers near restriction enzyme cut sites. SNP breeding chips are widely used in livestock and crop breeding because of their high accuracy and good repeatability. Genomic selection in forest trees is still at an early stage: work has focused on building GS prediction models for traits such as growth and wood properties and on theoretical study of the main factors affecting prediction accuracy, and no whole-genome selection breeding chip for forest trees has yet been reported.
Disclosure of Invention
In view of the shortcomings of the prior art, one object of the invention is to provide an American black poplar whole-genome breeding chip for screening fast-growing fine varieties of poplar. A second object is to provide a construction method for the American black poplar whole-genome breeding chip. A further technical problem to be solved by the invention is to provide an application of the American black poplar whole-genome breeding chip.
To solve these technical problems, the invention adopts the following technical scheme:
The construction method of the American black poplar whole-genome breeding chip comprises the following steps:
1) Performing genome resequencing and genotyping on 296 poplars to obtain 855,807 high-quality SNP loci;
2) Based on the 855,807 high-quality SNP loci, carrying out genome-wide association analysis of poplar growth and wood property traits using the phenotype data for diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content of the 296 poplars, to obtain 23,791 SNP functional loci;
3) Carrying out gene annotation and region screening on the 855,807 high-quality SNP loci with ANNOVAR software, and screening out 16,442 SNP backbone loci;
4) Combining and screening the 23,791 SNP functional loci and the 16,442 SNP backbone loci to obtain a final set of 40,213 SNP loci, which forms a 40K SNP breeding chip for poplar growth and wood property traits.
In step 1), WGS genome resequencing is used: paired-end PE150 sequencing is performed on 296 poplar plants on an Illumina HiSeq6000 high-throughput sequencing platform. The sequencing data are aligned to the American black poplar reference genome with the BWA tool to obtain alignments in BAM format. To improve the accuracy of subsequent variant detection, the alignments are preprocessed, including removal of PCR duplicates, quality control, local realignment and base quality score recalibration. Single nucleotide variants and insertions/deletions are then detected with the HaplotypeCaller tool in GATK. The variant calls are first filtered on quality and depth metrics with the VariantFiltration tool in GATK to remove false-positive variants. Genotypes are then stringently filtered with PLINK and VCFtools; the criteria include a sequencing depth greater than 3X, completeness greater than 0.8, a minor allele frequency of at least 0.05, a missing rate below 20%, and conformity with Hardy-Weinberg equilibrium. Finally, the filtered genotype data are imputed with BEAGLE to predict SNP sites that may have been missed by sequencing, and these SNP sites are annotated and functionally predicted with ANNOVAR, giving 855,807 high-quality SNP loci.
In step 2), based on the 855,807 SNP sites, principal component analysis and kinship analysis are performed with PLINK and GEMMA respectively, giving a PCA eigenvector matrix for all individuals and a pairwise kinship coefficient matrix. Combined with the phenotype data for poplar diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, a mixed linear model is fitted in GEMMA with population structure as a fixed effect and kinship as a random effect, giving an association P value between each SNP and each trait. For each trait the SNPs are sorted by P value from smallest to largest and the first 5,000 SNP sites are taken; the selections are then merged into a set of significant SNP sites associated with diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, containing 23,791 SNP functional sites in total.
In step 3), ANNOVAR is used for gene annotation and region screening of the 855,807 SNP sites across the whole genome. The sites are assigned to intergenic regions, untranslated regions, the 1 kb regions upstream or downstream of genes, intronic regions, splice sites and exonic regions; SNP sites in exonic regions are further subdivided into nonsynonymous mutations, synonymous mutations and stop-codon gain or loss variants. After the set of SNP sites closely associated with poplar growth and wood property traits is excluded, SNP sites that uniformly cover all chromosome segments are selected, with the proportion of exonic nonsynonymous SNP sites increased, and 16,442 SNP backbone sites are finally obtained.
In step 4), the 23,791 SNP functional sites and the 16,442 SNP backbone sites are combined and screened to obtain a final set of 40,213 SNP sites, which forms a 40K SNP breeding chip for poplar growth and wood property traits.
The construction method of the American black poplar whole-genome breeding chip comprises the following specific steps:
1) WGS genome resequencing is used: paired-end PE150 sequencing is performed on 296 poplar plants on an Illumina HiSeq6000 high-throughput sequencing platform. The sequencing data are aligned to the American black poplar reference genome with the BWA tool to obtain alignments in BAM format. To improve the accuracy of subsequent variant detection, the alignments are preprocessed, including removal of PCR duplicates, quality control, local realignment and base quality score recalibration. Single nucleotide variants and insertions/deletions are then detected with the HaplotypeCaller tool in GATK. The variant calls are first filtered on quality and depth metrics with the VariantFiltration tool in GATK to remove false-positive variants. Genotypes are then stringently filtered with PLINK and VCFtools; the criteria include a sequencing depth greater than 3X, completeness greater than 0.8, a minor allele frequency of at least 0.05, a missing rate below 20%, and conformity with Hardy-Weinberg equilibrium. Finally, the filtered genotype data are imputed with BEAGLE to predict SNP sites that may have been missed by sequencing, and these SNP sites are annotated and functionally predicted with ANNOVAR, giving 855,807 high-quality SNP loci;
2) Based on the 855,807 SNP sites, principal component analysis and kinship analysis are performed with PLINK and GEMMA respectively, giving a PCA eigenvector matrix for all individuals and a pairwise kinship coefficient matrix. Combined with the phenotype data for poplar diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, a mixed linear model is fitted in GEMMA with population structure as a fixed effect and kinship as a random effect, giving an association P value between each SNP and each trait. For each trait the SNPs are sorted by P value from smallest to largest and the first 5,000 SNP sites are taken; the selections are then merged into a set of significant SNP sites associated with diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, containing 23,791 SNP functional sites in total;
3) ANNOVAR is used for gene annotation and region screening of the 855,807 SNP sites across the whole genome. The sites are assigned to intergenic regions, untranslated regions, the 1 kb regions upstream or downstream of genes, intronic regions, splice sites and exonic regions; SNP sites in exonic regions are further subdivided into nonsynonymous mutations, synonymous mutations and stop-codon gain or loss variants. After the set of SNP sites closely associated with poplar growth and wood property traits is excluded, SNP sites that uniformly cover all chromosome segments are selected, with the proportion of exonic nonsynonymous SNP sites increased, and 16,442 SNP backbone sites are finally obtained;
4) The 23,791 SNP functional loci and the 16,442 SNP backbone loci are combined and screened to obtain a final set of 40,213 SNP loci, which forms a 40K SNP breeding chip for poplar growth and wood property traits.
The American black poplar whole-genome breeding chip constructed by the construction method described above consists of 40,213 SNP loci.
The SNP molecular marker combination for breeding fast-growing fine varieties of American black poplar consists of 40,213 SNP loci.
The application of the American black poplar whole-genome breeding chip in whole-genome selective breeding of American black poplar.
The application of the American black poplar whole-genome breeding chip in fine variety breeding of American black poplar.
The invention has the beneficial effects that:
The invention differs in principle from traditional solid-phase chips and sequencing-based liquid-phase chips. Whole-genome resequencing of a large poplar population is carried out first, polymorphic sites are identified from the sequencing data, and high-quality SNP sites are screened by setting test parameters such as the maximum missing rate, the minimum allele frequency and Hardy-Weinberg equilibrium. Functional sites significantly associated with important economic traits such as growth and wood properties are then obtained by genome-wide association analysis, and a 40K SNP breeding chip suited to the important economic traits of poplar is finally developed. The invention not only provides an efficient, low-cost and high-precision genotyping chip but also significantly improves the accuracy of genomic selection for important poplar traits. It can improve the efficiency of early selection in forest trees, accelerate the breeding of fine forest tree varieties, and provide an efficient molecular breeding tool with broad application prospects.
Drawings
FIG. 1 shows the distribution of the 855,807 SNP loci across the chromosomes of the poplar whole genome according to an embodiment of the invention;
FIG. 2 is a Manhattan plot of the genome-wide association analysis of poplar growth and wood property traits according to an embodiment of the invention;
FIG. 3 is a statistical chart of the annotation distribution of SNP loci across the poplar whole genome according to an embodiment of the invention;
FIG. 4 is a box plot of the genomic selection prediction accuracy of the poplar 40K SNP breeding chip for a poplar growth trait (diameter at breast height) and a wood property trait (wood basic density) according to an embodiment of the invention.
Detailed Description
The present invention will be further described with reference to specific embodiments for the purpose of making the objects, technical solutions and advantages of the present invention more apparent. Unless otherwise indicated, all technical means used in the following examples are conventional means well known to those skilled in the art.
The 296 poplar test materials used in this application all grow in the American black poplar germplasm repository of Nanjing Forestry University (Chen Wei Forest Farm, Sihong County, Jiangsu Province). They were drawn from 1,000 clones that come from 12 different provenances and have no direct kinship with one another.
An increment borer with an inner diameter of 5 mm was used at a height of 1.3 m, in the north-south direction, to bore from the bark through the pith and extract a complete, defect-free wood core sample.
The public website for the American black poplar whole-genome sequence referred to in this application is: https://www.ncbi.nlm.nih.gov/data/genome/GCA_014884945.1/.
Example 1
1. The basic density of poplar wood was measured on the collected wood cores by the water displacement method, as follows: the volume of the sample in the water-saturated state (in cm³) was determined by calculating the difference between the mass of the sample and its mass after soaking in distilled water to saturation; the sample was then placed in an oven at 103 ± 3 °C and dried to constant weight, and its oven-dry weight (in g) was measured on an electronic balance with an accuracy of 0.0001 g; the wood basic density of each sample was calculated by the formula ρ = m/v, where ρ is the wood basic density in g/cm³, m is the oven-dry weight of the sample in g, and v is the volume of the sample at saturation in cm³.
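The formula ρ = m/v can be illustrated with a minimal Python sketch. The function name and the sample values below are hypothetical and are not measurements from the application.

```python
def basic_density(oven_dry_mass_g: float, saturated_volume_cm3: float) -> float:
    """Wood basic density rho = m / v, in g/cm^3.

    m is the oven-dry mass (g); v is the water-saturated volume (cm^3).
    """
    if saturated_volume_cm3 <= 0:
        raise ValueError("saturated volume must be positive")
    return oven_dry_mass_g / saturated_volume_cm3


# Hypothetical wood core: 0.7800 g oven-dry mass, 2.0 cm^3 saturated volume.
rho = basic_density(0.7800, 2.0)
print(f"basic density: {rho:.2f} g/cm^3")
```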
2. The Van Soest detergent method was used to determine the cellulose, hemicellulose and lignin content of each sample. The experimental procedure was as follows:
1) Sample preparation: the wood core samples used for the wood density measurement were dried to constant weight, then crushed and ground with an FW-100 high-speed universal grinder; the sieved wood powder was mixed thoroughly.
2) Weighing: 1 g of wood powder was weighed accurately on a precision electronic balance.
3) Fiber component measurement: the neutral detergent fiber (NDF), acid detergent fiber (ADF), acid detergent lignin (ADL) and acid-insoluble ash (AIA) contents of the weighed samples were measured with a FIWE 6 fiber analyzer. Each sample was measured in triplicate to improve the reliability of the results.
4) The content of wood chemical components was calculated according to the following formula:
Cellulose content (%) = ADF (%) - ADL (%)
Hemicellulose content (%) = NDF (%) - ADF (%)
Lignin content (%) = ADL (%) - AIA (%)
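The three formulas above can be written as a short Python sketch. The function name, the averaging of three replicate readings, and all numeric values are illustrative assumptions, not data from the application.

```python
from statistics import fmean


def fiber_components(ndf: float, adf: float, adl: float, aia: float) -> dict:
    """Derive wood chemical composition (%) from detergent-fiber fractions (%)."""
    return {
        "cellulose": adf - adl,       # ADF - ADL
        "hemicellulose": ndf - adf,   # NDF - ADF
        "lignin": adl - aia,          # ADL - AIA
    }


# Hypothetical triplicate readings (%) for one sample, averaged before use.
ndf = fmean([90.0, 90.5, 89.5])
adf = fmean([66.0, 66.5, 65.5])
adl = fmean([13.0, 13.5, 12.5])
aia = fmean([0.5, 0.5, 0.5])

composition = fiber_components(ndf, adf, adl, aia)
for name, percent in composition.items():
    print(f"{name}: {percent:.2f}%")
```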
3. Statistical analysis of the phenotypic data, including calculation of the mean, minimum, maximum, standard deviation and coefficient of variation, was performed with R 4.1.2. The skewness and kurtosis of the data set were calculated with the moments package in R, and a test for normality was carried out. Phenotypic and genetic correlations were calculated with the cov and var functions in R. The broad-sense heritability of poplar wood basic density and of wood cellulose, hemicellulose and lignin content was calculated using lme.
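The application performs these statistics in R. As a rough Python illustration of the same quantities, the sketch below computes them with the standard library; it uses moment-based skewness and kurtosis (central moments divided by n), which is an assumption about the definitions, and the input values are hypothetical.

```python
import statistics


def describe(values: list[float]) -> dict:
    """Descriptive statistics for one trait."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    # Central moments with a 1/n divisor, used for skewness and kurtosis.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "sd": sd,
        "cv_percent": 100 * sd / mean,   # coefficient of variation
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,        # Pearson kurtosis, not excess
    }


# Hypothetical wood basic density values (g/cm^3) for a handful of trees.
densities = [0.31, 0.36, 0.39, 0.40, 0.42, 0.47]
for key, value in describe(densities).items():
    print(f"{key}: {value:.4f}")
```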
4. The results are shown in Table 1 and FIG. 1; all traits showed some degree of variation. The mean wood basic density of the poplar breeding population was 0.39 g/cm³, ranging from 0.26 g/cm³ to 0.51 g/cm³; the mean cellulose content was 53.78%, ranging from 47.85% to 60.77%; the mean hemicellulose content was 24.17%, ranging from 20.36% to 30.22%; and the mean lignin content was 12.7%, ranging from 6.51% to 17.64%. The smallest and largest coefficients of variation were those of cellulose (3.22%) and lignin (12.2%), respectively, indicating that lignin content is more strongly affected by the environment. The phenotypic values of all traits were approximately normally distributed, indicating that poplar wood property traits are typical quantitative traits under polygenic control. Correlation analysis showed highly significant correlations among the wood property traits (p < 0.001), indicating that wood properties influence one another during tree development. The broad-sense heritabilities (h²) of wood basic density, cellulose, hemicellulose and lignin were 0.43, 0.82, 0.17 and 0.001, respectively. The heritability of cellulose content was markedly higher than that of the other wood property traits, so cellulose content is under stronger genetic control than the other wood properties.
TABLE 1 Poplar training population lumber trait phenotype data descriptive statistics
Example 2
1. SNP loci from poplar whole-genome resequencing data
Paired-end PE150 sequencing was performed on 296 poplar plants on an Illumina HiSeq6000 high-throughput sequencing platform using WGS (whole genome sequencing) genome resequencing. The sequencing data were aligned to the American black poplar reference genome with BWA (Burrows-Wheeler Aligner), giving alignments in BAM format. To improve the accuracy of subsequent variant detection, the alignments were preprocessed, including removal of PCR duplicates, quality control, local realignment, and base quality score recalibration (BQSR). Single nucleotide variants (SNVs) and insertions/deletions (Indels) were then detected with the HaplotypeCaller tool in GATK. The variant calls were first filtered on quality and depth metrics with the VariantFiltration tool in GATK to remove false-positive variants. Genotypes were then stringently filtered with PLINK and VCFtools; the criteria included a sequencing depth greater than 3X, completeness greater than 0.8, a minor allele frequency of at least 0.05, a missing rate below 20%, and conformity with Hardy-Weinberg equilibrium (p-value greater than 0.00001). Finally, the filtered genotype data were imputed with BEAGLE to predict single nucleotide polymorphism (SNP) sites that may have been missed by sequencing, and these SNP sites were annotated and functionally predicted with ANNOVAR. In total, 855,807 high-quality SNP loci were obtained for subsequent genetic analysis.
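To make three of the site-level thresholds above concrete (minor allele frequency at least 0.05, missing rate below 20%, Hardy-Weinberg p-value above 0.00001), here is a minimal Python sketch. It is not the PLINK/VCFtools procedure: the function names are hypothetical, genotypes are assumed to be coded as alternate-allele counts 0/1/2 with `None` for missing calls, and the Hardy-Weinberg check uses a one-degree-of-freedom chi-square approximation chosen for brevity.

```python
import math


def hwe_p_value(called: list[int]) -> float:
    """Chi-square (1 df) approximation of the Hardy-Weinberg test p-value."""
    n = len(called)
    p = sum(called) / (2 * n)          # alternate-allele frequency
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [called.count(0), called.count(1), called.count(2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def site_passes(genotypes, min_maf=0.05, max_missing=0.20, min_hwe_p=1e-5) -> bool:
    """Apply the missing-rate, MAF and HWE thresholds to one SNP site."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if missing_rate >= max_missing:
        return False
    alt_freq = sum(called) / (2 * len(called))
    if min(alt_freq, 1 - alt_freq) < min_maf:
        return False
    return hwe_p_value(called) > min_hwe_p


# Toy site: 100 samples in exact Hardy-Weinberg proportions for p = 0.5.
balanced_site = [0] * 25 + [1] * 50 + [2] * 25
print(site_passes(balanced_site))
```

The depth and completeness criteria are per-genotype or per-sample checks and are left out of this sketch.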
2. Genome-wide association analysis of poplar growth and wood property traits
Based on the 855,807 SNP sites obtained above, principal component analysis and kinship analysis were performed with PLINK and GEMMA respectively, giving a PCA eigenvector matrix for all individuals and a pairwise kinship coefficient matrix. Combined with the phenotype data for poplar diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, a mixed linear model (MLM) was fitted in GEMMA with population structure (PCA) as a fixed effect and kinship as a random effect, giving an association P value between each SNP and each trait. For each trait the SNPs were sorted by P value from smallest to largest and the first 5,000 SNP sites were taken; the selections were then merged into a set of significant SNP sites associated with diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, containing 23,791 SNP functional sites (FIG. 2).
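The ranking-and-merging step can be sketched in Python as follows. The function name, the SNP identifiers and the P values are hypothetical, and the sketch assumes the 5,000-site cut is applied to each trait separately before merging. Under that reading, five traits give 25,000 selections, and a merged set of 23,791 sites would mean that 1,209 selections were repeats across traits.

```python
def top_k_union(pvalues_by_trait: dict[str, dict[str, float]], k: int = 5000) -> set[str]:
    """Take the k SNPs with the smallest P value for each trait and merge them."""
    selected: set[str] = set()
    for pvalues in pvalues_by_trait.values():
        ranked = sorted(pvalues, key=pvalues.get)  # smallest P value first
        selected.update(ranked[:k])
    return selected


# Toy example with two traits and k = 2 (hypothetical SNP IDs and P values).
toy = {
    "dbh": {"snp1": 1e-8, "snp2": 1e-3, "snp3": 0.5},
    "wood_density": {"snp1": 0.9, "snp2": 1e-6, "snp3": 1e-4},
}
functional_sites = top_k_union(toy, k=2)
print(sorted(functional_sites))
```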
3. Core backbone SNP sites at the poplar whole-genome scale
The 855,807 SNP variant sites across the whole genome were annotated and screened by region using ANNOVAR, which assigned each variant site to an intergenic region, an untranslated region, the 1 kb region upstream or downstream of a gene, an intronic region, a splice site, or an exonic region. SNP sites in exonic regions were further subdivided into nonsynonymous mutations, synonymous mutations, and stop codon gain or loss variants. After excluding the set of SNP sites closely associated with poplar growth and wood-property traits, SNP sites evenly covering every chromosome segment were selected, with the proportion of nonsynonymous SNP sites in exonic regions increased, and 16,442 SNP backbone sites were finally screened out (Figure 3).
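One way to picture the backbone selection is a window-based pick that prefers nonsynonymous sites. The patent does not state the actual algorithm or window size, so everything below is an assumption for illustration: the function name, the 25 kb window, the tuple layout, and the rule of keeping one site per window.

```python
def select_backbone(snps, excluded, window=25_000):
    """Pick one SNP per fixed-size window, preferring nonsynonymous sites.

    snps: iterable of (snp_id, chrom, pos, annotation) tuples.
    excluded: set of SNP IDs already chosen as trait-associated sites.
    window: hypothetical window size in base pairs.
    """
    best = {}
    for snp_id, chrom, pos, annotation in snps:
        if snp_id in excluded:
            continue  # trait-associated sites are handled separately
        key = (chrom, pos // window)
        rank = 0 if annotation == "nonsynonymous" else 1
        candidate = (rank, pos, snp_id)
        if key not in best or candidate < best[key]:
            best[key] = candidate
    # Report in chromosome and window order.
    return [snp_id for _, (_, _, snp_id) in sorted(best.items())]

toy_snps = [
    ("a", "Chr01", 100, "intronic"),
    ("b", "Chr01", 200, "nonsynonymous"),
    ("c", "Chr01", 30_000, "intergenic"),
    ("d", "Chr02", 50, "synonymous"),
    ("e", "Chr02", 60, "nonsynonymous"),
]
backbone = select_backbone(toy_snps, excluded={"e"})
```

Shrinking the window raises marker density, while the ranking rule controls how strongly nonsynonymous sites are favored within each window.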
4. Poplar 40K SNP breeding chip
The SNP sites closely associated with poplar growth and wood-property traits (23,791) and the poplar genome-wide core backbone SNP sites (16,442) were merged and screened, finally yielding a set of 40,213 SNP sites. These SNP sites constitute the 40K SNP breeding chip for poplar growth and wood-property traits, as shown in Table 2.
TABLE 2 SNP locus information of 40K SNP breeding chip
Example 3
1. Preparation of phenotype and genotype files
Before whole-genome selection (GS) prediction, missing values in the phenotype and genotype data of the training population were first imputed and the files were formatted. The diameter at breast height, wood basic density, cellulose content, hemicellulose content and lignin content of the 296 poplar plants were each used as phenotype data for whole-genome selection prediction of poplar growth and wood-property traits. Genotype data were converted to 0/1/2 format using PLINK, with the homozygous non-mutant genotype coded as 0, the heterozygous genotype coded as 1, and the homozygous mutant genotype coded as 2.
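The 0/1/2 coding can be shown with a short plain-Python sketch. It illustrates the coding scheme only and is not the PLINK command; the function name and the `ref`/`alt` arguments are hypothetical, and genotypes are assumed to be written as two alleles separated by `/` or `|`.

```python
def encode_genotype(genotype, ref, alt):
    """Encode a diploid genotype string as the number of mutant (alt) alleles.

    Returns 0 (homozygous non-mutant), 1 (heterozygous), 2 (homozygous mutant),
    or None when the call is missing or contains an unexpected allele.
    """
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in (ref, alt) for a in alleles):
        return None
    return sum(a == alt for a in alleles)

# One SNP (reference A, mutant G) scored in four individuals.
row = [encode_genotype(g, "A", "G") for g in ["A/A", "A/G", "G|G", "./."]]
```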
2. Genome-wide selection model parameter settings
Sixteen different whole-genome selection statistical models were used, including best linear unbiased prediction models (GBLUP, rrBLUP), Bayesian models (BRR, BayesA, BayesB, BayesC, Bayes Lasso), and several machine learning models (Ridge, Linear Lasso, Elastic Net, Linear Regression, Kernel Ridge, PLS Regression, Random Forest, SVR linear, SVR poly). The 2 best linear unbiased prediction models were implemented with the R package rrBLUP, the 5 Bayesian models with the R package BGLR, and the 9 machine learning models with the Python package scikit-learn (sklearn).
Using 5-fold cross-validation, 80% of the poplar population was used as the training population and the remaining 20% as the test population. In the training population, whole-genome selection prediction models for poplar growth and wood-property traits were built from the phenotype data, the 40K SNP genotype data and the 16 whole-genome selection statistical models. The breeding values for poplar growth and wood-property traits were then estimated in the test population from the genotype data and the prediction models. To reduce sampling error, the process was repeated 500 times, and the mean Pearson correlation coefficient (r) between the estimated breeding values and the actual observations in the test population was used as the measure of whole-genome selection prediction accuracy. Finally, the optimal whole-genome selection statistical model and the optimal set of SNP genotype sites were determined according to this evaluation criterion.
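The evaluation loop can be sketched with NumPy alone. Several points are assumptions made for illustration: "repeated 500 times" is read as 500 reshuffled rounds of 5-fold cross-validation, ridge regression with a hypothetical penalty `alpha=1.0` stands in for any of the 16 models, and the function names are not from the patent.

```python
import numpy as np

def ridge_fit_predict(x_train, y_train, x_test, alpha=1.0):
    """Ridge regression on centered data, solved in its dual form."""
    x_mean, y_mean = x_train.mean(axis=0), y_train.mean()
    xc = x_train - x_mean
    # Dual form: an n-by-n system, cheaper when markers outnumber individuals.
    gram = xc @ xc.T
    dual = np.linalg.solve(gram + alpha * np.eye(len(y_train)), y_train - y_mean)
    beta = xc.T @ dual  # marker effects
    return (x_test - x_mean) @ beta + y_mean

def repeated_kfold_accuracy(x, y, n_splits=5, n_repeats=500, seed=0):
    """Mean Pearson r between predictions and observations over all test folds."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(y))
        for fold in np.array_split(order, n_splits):
            train = np.setdiff1d(order, fold)  # the other 80% of individuals
            pred = ridge_fit_predict(x[train], y[train], x[fold])
            scores.append(np.corrcoef(pred, y[fold])[0, 1])
    return float(np.mean(scores))
```

With a 0/1/2 genotype matrix `x` (individuals by markers) and a phenotype vector `y`, `repeated_kfold_accuracy(x, y)` returns the accuracy measure r described above.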
3. Determining the optimal whole-genome selection model for poplar growth and wood-property traits
The 16 whole-genome selection statistical models were used to carry out whole-genome selection prediction for a poplar growth trait (diameter at breast height) and a wood-property trait (wood basic density).
As shown in Figure 4, in combination with the poplar 40K SNP breeding chip, the 9 machine learning models showed markedly higher prediction accuracy (highest prediction accuracy r=0.84) than the best linear unbiased prediction models and the Bayesian models (highest prediction accuracy r=0.7). This result highlights the potential of artificial-intelligence-based prediction models for improving the accuracy of whole-genome selection prediction in trees. In summary, with the 40K SNP breeding chip for poplar growth and wood-property traits designed in the invention, the Ridge, Linear Regression and SVR linear statistical models gave the highest prediction accuracy: the accuracy of breeding values reached 0.77 for the growth trait and 0.84 for the wood-property trait.
4. Based on the poplar 40K SNP breeding chip, the Ridge, Linear Regression and SVR linear statistical models were used to estimate the genomic estimated breeding values (GEBV) of the growth trait (diameter at breast height) and the wood-property traits (wood basic density, cellulose content, hemicellulose content and lignin content) in the poplar breeding population. To screen elite poplar varieties, the 10 top-ranked poplar plants by breeding value in the population (Tables 3-7) were selected as elite materials for subsequent elite variety selection and molecular mechanism research.
TABLE 3 Fast-growing elite materials with the top ten breeding values for the growth trait (diameter at breast height) in the poplar breeding population
TABLE 4 Fast-growing elite materials with the top ten breeding values for the wood-property trait (wood basic density) in the poplar breeding population
TABLE 5 Fast-growing elite materials with the top ten breeding values for the wood-property trait (cellulose content) in the poplar breeding population
TABLE 6 Fast-growing elite materials with the top ten breeding values for the wood-property trait (hemicellulose content) in the poplar breeding population
TABLE 7 Fast-growing elite materials with the top ten breeding values for the wood-property trait (lignin content) in the poplar breeding population
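The top-ten selection behind Tables 3-7 amounts to ranking individuals by their estimated breeding value. A minimal sketch follows; the function name, the sample IDs and the `higher_is_better` switch are hypothetical, and the patent does not state which direction is preferred for each trait.

```python
def top_candidates(gebv, n=10, higher_is_better=True):
    """Return the n sample IDs with the best genomic estimated breeding values."""
    ranked = sorted(gebv.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    return [sample for sample, _ in ranked[:n]]

# Toy GEBV values for four hypothetical individuals.
toy_gebv = {"P1": 0.2, "P2": 1.5, "P3": -0.3, "P4": 0.9}
best_two = top_candidates(toy_gebv, n=2)
```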

Claims (10)

1. A method for constructing an American black poplar whole genome breeding chip, characterized by comprising the following steps:
1) Performing genome resequencing and genotyping on 296 poplar plants to obtain 855,807 high-quality SNP sites;
2) Based on the 855,807 high-quality SNP sites obtained, performing genome-wide association analysis of poplar growth and wood-property traits in combination with phenotype data for diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content of the 296 poplar plants, to obtain 23,791 SNP functional sites;
3) Performing gene annotation and region screening on the 855,807 high-quality SNP sites using ANNOVAR software, and screening out 16,442 SNP backbone sites;
4) Merging and screening the 23,791 SNP functional sites and the 16,442 SNP backbone sites to finally obtain a set containing 40,213 SNP sites, which forms the 40K SNP breeding chip for poplar growth and wood-property traits.
2. The method for constructing an American black poplar whole genome breeding chip according to claim 1, wherein step 1) comprises: using WGS genome resequencing technology, performing paired-end PE150 sequencing on 296 poplar plants on an Illumina HiSeq6000 high-throughput sequencing platform; aligning the sequencing data to the American black poplar reference genome using the BWA tool to obtain alignments in BAM format; to improve the accuracy of subsequent variant detection, preprocessing the alignments, including removal of PCR duplicates, quality control, local realignment and base quality score recalibration; then detecting single nucleotide variants and insertions/deletions using the HaplotypeCaller tool in GATK; first filtering the variant calls on quality and depth metrics with the VariantFiltration tool in GATK to remove false-positive variants; next, stringently filtering the genotypes using PLINK and VCFtools software, wherein the filtering criteria include a sequencing depth greater than 3X, a completeness greater than 0.8, a minor allele frequency of no less than 0.05, a missing rate below 20%, and conformity to Hardy-Weinberg equilibrium; finally, imputing the filtered genotype data using BEAGLE software to predict single nucleotide polymorphism sites that may have been missed by sequencing, and annotating and functionally predicting these SNP sites using ANNOVAR software; 855,807 high-quality SNP sites are finally obtained.
3. The method for constructing an American black poplar whole genome breeding chip according to claim 1, wherein step 2) comprises: based on the 855,807 SNP variant sites obtained, performing principal component analysis and kinship analysis using PLINK software and GEMMA software, respectively, to obtain the eigenvector PCA matrix of all individuals and the pairwise kinship coefficient matrix; combining phenotype data for poplar diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, fitting a mixed linear model in GEMMA software with population structure as a fixed effect and kinship as a random effect to obtain the association P value of each SNP with each trait; sorting by P value in ascending order and taking the top 5,000 SNP sites; and merging them to obtain the set of significant SNP sites associated with diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, which contains 23,791 SNP functional sites in total.
4. The method for constructing an American black poplar whole genome breeding chip according to claim 1, wherein step 3) comprises: performing gene annotation and region screening on the 855,807 SNP variant sites across the whole genome using ANNOVAR software, and assigning these variant sites to intergenic regions, untranslated regions, the 1 kb regions upstream or downstream of genes, intronic regions, splice sites and exonic regions, wherein the SNP sites in exonic regions are further subdivided into nonsynonymous mutations, synonymous mutations and stop codon gain or loss variants; after excluding the set of SNP sites closely associated with poplar growth and wood-property traits, selecting SNP sites evenly covering every chromosome segment while increasing the proportion of nonsynonymous SNP sites in exonic regions; and finally screening out 16,442 SNP backbone sites in total.
5. The method for constructing an American black poplar whole genome breeding chip according to claim 1, wherein step 4) comprises merging and screening the 23,791 SNP functional sites and the 16,442 SNP backbone sites to obtain a set containing 40,213 SNP sites, which forms the 40K SNP breeding chip for poplar growth and wood-property traits.
6. The method for constructing an American black poplar whole genome breeding chip according to claim 1, comprising the following specific steps:
1) Using WGS genome resequencing technology, performing paired-end PE150 sequencing on 296 poplar plants on an Illumina HiSeq6000 high-throughput sequencing platform; aligning the sequencing data to the American black poplar reference genome using the BWA tool to obtain alignments in BAM format; to improve the accuracy of subsequent variant detection, preprocessing the alignments, including removal of PCR duplicates, quality control, local realignment and base quality score recalibration; then detecting single nucleotide variants and insertions/deletions using the HaplotypeCaller tool in GATK; first filtering the variant calls on quality and depth metrics with the VariantFiltration tool in GATK to remove false-positive variants; next, stringently filtering the genotypes using PLINK and VCFtools software, wherein the filtering criteria include a sequencing depth greater than 3X, a completeness greater than 0.8, a minor allele frequency of no less than 0.05, a missing rate below 20%, and conformity to Hardy-Weinberg equilibrium; finally, imputing the filtered genotype data using BEAGLE software to predict single nucleotide polymorphism sites that may have been missed by sequencing, and annotating and functionally predicting these SNP sites using ANNOVAR software; 855,807 high-quality SNP sites are finally obtained;
2) Based on the 855,807 SNP variant sites obtained, performing principal component analysis and kinship analysis using PLINK software and GEMMA software, respectively, to obtain the eigenvector PCA matrix of all individuals and the pairwise kinship coefficient matrix; combining phenotype data for poplar diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, fitting a mixed linear model in GEMMA software with population structure as a fixed effect and kinship as a random effect to obtain the association P value of each SNP with each trait; sorting by P value in ascending order and taking the top 5,000 SNP sites; and merging them to obtain the set of significant SNP sites associated with diameter at breast height, wood density, cellulose content, hemicellulose content and lignin content, which contains 23,791 SNP functional sites in total;
3) Performing gene annotation and region screening on the 855,807 SNP variant sites across the whole genome using ANNOVAR software, and assigning these variant sites to intergenic regions, untranslated regions, the 1 kb regions upstream or downstream of genes, intronic regions, splice sites and exonic regions, wherein the SNP sites in exonic regions are further subdivided into nonsynonymous mutations, synonymous mutations and stop codon gain or loss variants; after excluding the set of SNP sites closely associated with poplar growth and wood-property traits, selecting SNP sites evenly covering every chromosome segment while increasing the proportion of nonsynonymous SNP sites in exonic regions; and finally screening out 16,442 SNP backbone sites;
4) Merging and screening the 23,791 SNP functional sites and the 16,442 SNP backbone sites to finally obtain a set containing 40,213 SNP sites, which forms the 40K SNP breeding chip for poplar growth and wood-property traits.
7. An American black poplar whole genome breeding chip constructed by the method for constructing an American black poplar whole genome breeding chip according to any one of claims 1 to 6, wherein the American black poplar whole genome breeding chip consists of 40,213 SNP sites.
8. An SNP molecular marker combination for breeding fast-growing elite varieties of American black poplar, characterized by comprising 40,213 SNP sites.
9. Use of the American black poplar whole genome breeding chip of claim 7 in whole-genome selective breeding of American black poplar.
10. Use of the American black poplar whole genome breeding chip of claim 7 in breeding fast-growing elite varieties of American black poplar.
CN202311612769.4A 2023-11-29 2023-11-29 American black poplar whole genome breeding chip and construction method and application thereof Pending CN117558341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311612769.4A CN117558341A (en) 2023-11-29 2023-11-29 American black poplar whole genome breeding chip and construction method and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311612769.4A CN117558341A (en) 2023-11-29 2023-11-29 American black poplar whole genome breeding chip and construction method and application thereof

Publications (1)

Publication Number Publication Date
CN117558341A true CN117558341A (en) 2024-02-13

Family

ID=89812479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311612769.4A Pending CN117558341A (en) 2023-11-29 2023-11-29 American black poplar whole genome breeding chip and construction method and application thereof

Country Status (1)

Country Link
CN (1) CN117558341A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746979A (en) * 2024-02-21 2024-03-22 中国科学院遗传与发育生物学研究所 Animal variety identification method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination