CN117133354A - Method for efficiently identifying key breeding gene modules of forest trees

Info

Publication number: CN117133354A
Application number: CN202311097273.8A
Authority: CN
Inventors: 权明洋; 张德强; 杜庆章
Original assignee: Beijing Forestry University
Current assignee: Beijing Forestry University
Priority date: 2023-08-29
Filing date: 2023-08-29
Publication date: 2023-11-28
Anticipated expiration: 2043-08-29
Also published as: CN117133354B

Abstract

The application provides a method for efficiently identifying key breeding gene modules of forest trees and relates to the technical field of molecular genetics. The method can accurately and efficiently identify the key breeding gene module of an important forest tree trait, systematically evaluate the genetic effect of each genotype combination in the breeding module on phenotypic variation, and determine the optimal genotype combination of the key breeding gene module. It can be widely applied to precise seedling-stage selection for important forest tree traits and to molecular genetic improvement of forest trees.

Description

Method for efficiently identifying key breeding gene modules of forest trees

Technical Field

The application relates to the technical field of molecular genetics, in particular to a method for efficiently identifying key breeding gene modules of forest trees.

Background

The important economic traits of forest trees are complex quantitative traits regulated by multiple genes. Forest trees are perennial, largely undomesticated, and widely distributed, so the genetic basis of their important traits remains unclear and the underlying regulatory mechanisms are unknown. In recent years, as molecular genetics and genomics research has deepened, a series of key genes with important breeding value have been discovered. However, these key genes are difficult to apply widely to molecular genetic improvement of forest trees and to precise seedling-stage selection for important traits, mainly for the following three reasons: (1) The natural resources of forest trees are widely distributed, so germplasm resources show abundant phenotypic variation and genomic allelic variation, and the prior art cannot fully characterize the genetic effects of genomic allelic loci on phenotypic variation of important traits; (2) Complex forest tree traits are jointly regulated by multiple genes through a very complex genetic regulatory mechanism, yet most current studies focus only on the biological function of single genes and neglect the joint genetic effect on phenotypic variation of a breeding module composed of multiple genes; (3) Existing molecular genetic techniques lack accurate identification of key breeding modules for important traits and in-depth analysis of their genetic effects, so molecular genetic improvement of important economic traits of forest trees develops slowly and seedling-stage screening has low efficiency and precision.
Therefore, the prior art lacks a strategy for efficiently identifying key breeding gene modules of forest trees, which hinders the establishment of a molecular design breeding technology system for forest trees and the effective implementation of their molecular genetic improvement.

Disclosure of Invention

The application aims to provide a method for efficiently identifying key breeding gene modules of forest trees. The method can efficiently identify key breeding gene modules for important forest tree traits, accurately analyze the genetic effect of each genotype combination in the breeding module on phenotypic variation, and be widely applied to efficient seedling-stage screening for important forest tree traits, thereby providing important technical support for molecular design breeding of forest trees.

The application provides a method for efficiently identifying key breeding gene modules of forest trees, which comprises the following steps:

1) Carrying out genome-wide association analysis on the genomic single nucleotide polymorphism (SNP) genotype data of each individual in the forest tree germplasm resource population to be tested and the phenotypic value of the specific trait of each individual in that population, and determining the SNP loci significantly associated with the specific trait;

the determination condition includes: the SNP genotype locus in the genome is significantly associated with the phenotypic trait, reaching the significance level in biostatistics;

2) Performing functional annotation of the transcription module in which each SNP locus significantly associated with the specific trait in step 1) is located, and defining that transcription module as a candidate gene;

3) Determining the expression-level data of the candidate genes of step 2) in each individual of the forest tree germplasm resource population to be tested;

4) Calculating the Pearson correlation coefficient r between the population expression level of each candidate gene of step 3) and the population phenotypic value of the specific trait of step 1) to test the "expression-phenotype" correlation between the candidate gene and the specific trait, identifying the candidate genes whose expression patterns are highly correlated with phenotypic variation of the specific trait, and defining them as key genes, the expression levels of which strongly influence phenotypic variation of the specific trait;

the determination condition includes: the Pearson correlation coefficient r > 0.4 or r < -0.4;

5) Based on the genotype data of the significantly associated SNPs within the key genes of step 4), combined with the population phenotypic values of the specific trait of step 1), detecting epistatic interaction effects on phenotypic variation among the SNP loci in the key genes, determining the key breeding gene module affecting the specific phenotypic trait, evaluating the genetic effect of each genotype combination in the key breeding gene module on phenotypic variation, and identifying the optimal genotype combination in the key breeding gene module;

the determination condition includes: the epistatic interaction affecting the specific phenotypic variation must reach the significance level in biostatistics.

Preferably, the forest tree germplasm resource population used in each step comprises more than 200 individuals.

Preferably, the SNP genotype frequency in steps 1), 2) and 5) is greater than 10%.

Preferably, the genome-wide association analysis in step 1) uses the mixed linear model in TASSEL v5.0, and the software gives a P value for the significance of the association between each SNP locus and the specific phenotype; the P values are corrected for multiple hypothesis testing with a threshold of 1/n (n is the total number of SNPs in the whole genome; Bonferroni method), and SNP loci with P < 1/n are selected as the SNP loci significantly associated with the specific trait.

Preferably, the annotated transcription module in step 2) includes protein-coding genes, long non-coding RNAs and microRNAs.

Preferably, the software for calculating the Pearson correlation coefficient r in step 4) includes SPSS v19.0.

Preferably, the software for detecting the epistatic interaction effect in step 5) is the EPISNP1 package in the epiSNP software, which calculates the significance P value of the phenotype association of each SNP-SNP pair; SNP-SNP interaction pairs significantly associated with the specific trait are determined using a screening criterion of P ≤ 0.001.

Preferably, in step 5), a key gene is incorporated into the key breeding gene module only if it contains an SNP involved in a significant epistatic interaction.

Preferably, in step 5), when evaluating the phenotypic genetic effect of each genotype combination in the key breeding gene module, the frequency of each genotype combination in the germplasm resource population is greater than 10%.

Another object of the present application is to provide use of the above method in molecular design breeding of forest trees.

The application provides a method for efficiently identifying key breeding gene modules of forest trees. Key breeding genes identified by the prior art are difficult to apply to molecular genetic improvement of forest trees and to precise seedling-stage screening for important traits, because previous studies have not fully characterized the patterns of allelic variation in forest tree germplasm resource populations and lack systematic identification and in-depth analysis of key breeding modules with strong genetic effects on phenotypic variation of important traits. The method provided by the application can accurately and efficiently identify the key breeding gene module for an important forest tree trait and analyze in depth the genetic effect of each genotype combination in the breeding module on phenotypic variation. It can be widely applied to precise seedling-stage screening for important traits and to molecular genetic improvement of forest trees, and provides important technical support for molecular design breeding of forest trees.

Using the method provided by the application, the key breeding gene module for xylem xylose content of Populus tomentosa was identified as PtoGAO1-PtoCAMTA5-PtoC3H3-PtoDOF2. For the SNP combination chr1_34278210-chr6_5959112-chr10_12844616-chr2_21224438, the genotype combination AA/AG/GT/AT corresponds to the highest xylem xylose content and the genotype combination GA/AG/TT/AT to the lowest, so xylem xylose content can be rapidly screened at the seedling stage of Populus tomentosa.

Drawings

FIG. 1 shows the phenotypic effect values of each genotype in the Populus tomentosa xylose content breeding gene module;

FIG. 2 is an analytical flow chart of the identification method of the present application.

Detailed Description

The application first obtains genome-wide SNP genotype data for each individual of the forest tree germplasm resource population to be tested; these data are obtained by whole-genome resequencing of the forest trees. The germplasm resource population comprises more than 200 individuals; the same applies below and is not repeated. The application requires an SNP genotype frequency greater than 10%; the same applies below and is not repeated.

The application also needs to obtain the phenotypic value of the specific trait for each individual of the germplasm resource population; the method for obtaining the phenotypic value of the specific trait is not particularly limited.

Genome-wide association analysis is carried out on the genomic SNP genotype data of each individual of the forest tree germplasm resource population to be tested and the phenotypic values of the specific trait of each individual, and the SNP loci significantly associated with the specific trait are determined; the determination condition includes: the SNP genotype locus in the genome is significantly associated with the phenotypic trait, reaching the significance level in biostatistics. The genome-wide association analysis preferably uses the mixed linear model in TASSEL v5.0, and the software gives a P value for the significance of the association between each SNP locus and the specific phenotype; the P values are corrected for multiple hypothesis testing with a threshold of 1/n (n is the total number of SNPs in the whole genome; Bonferroni method), and SNP loci with P < 1/n are selected as the SNP loci significantly associated with the specific trait.

The application performs functional annotation of the transcription module in which each SNP locus significantly associated with the specific trait is located and defines that transcription module as a candidate gene. The method for obtaining the annotated transcription module is not limited in the present application; the annotated transcription module preferably includes, but is not limited to, protein-coding genes, long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and the like.

The application determines the expression-level data of the candidate genes in each individual of the forest tree germplasm resource population to be tested. The method for obtaining the population expression level of the candidate genes is not particularly limited in the present application.

The application calculates the Pearson correlation coefficient r between the population expression level of each candidate gene and the population phenotypic value of the specific trait to test the "expression-phenotype" correlation between the candidate gene and the specific trait, identifies the candidate genes whose expression patterns are highly correlated with phenotypic variation of the specific trait, and defines them as key genes, the expression levels of which strongly influence phenotypic variation of the specific trait; the determination condition includes: the Pearson correlation coefficient r > 0.4 or r < -0.4. The application preferably uses SPSS v19.0 to calculate the Pearson correlation coefficient r.
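The correlation screen described above can be sketched in a few lines of Python. This is only an illustrative sketch: the application itself uses SPSS v19.0, and the function names `pearson_r` and `select_key_genes`, the gene labels and the toy values below are hypothetical, not taken from the application.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def select_key_genes(expression, phenotype, cutoff=0.4):
    """Keep candidate genes with r > cutoff or r < -cutoff (hypothetical helper)."""
    key_genes = {}
    for gene, values in expression.items():
        r = pearson_r(values, phenotype)
        if abs(r) > cutoff:
            key_genes[gene] = r
    return key_genes


# Toy data (made up for illustration): 6 individuals, 3 candidate genes.
phenotype = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # rises with the trait
    "geneB": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # falls with the trait
    "geneC": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],  # weakly related (r is about -0.29)
}
key_genes = select_key_genes(expression, phenotype)
print(sorted(key_genes))  # ['geneA', 'geneB']
```

Both strongly positive and strongly negative correlations pass the screen, which matches the two-sided condition r > 0.4 or r < -0.4.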

Based on the genotype data of the significantly associated SNPs within the key genes, combined with the population phenotypic values of the specific trait, the application detects epistatic interaction effects on phenotypic variation among the significantly associated SNP loci in the key genes, determines the key breeding gene module affecting the specific phenotypic trait, evaluates the genetic effect of each genotype combination in the key breeding gene module on phenotypic variation, and identifies the optimal genotype combination in the key breeding gene module; the determination condition includes: the epistatic interaction affecting the specific phenotypic variation must reach the significance level in biostatistics.

In the application, the software for detecting the epistatic interaction effect is preferably the EPISNP1 package in the epiSNP software, which calculates the significance P value of the phenotype association of each SNP-SNP pair; SNP-SNP interaction pairs significantly associated with the specific trait are determined using a screening criterion of P ≤ 0.001. In the present application, a key gene can be incorporated into the key breeding gene module only if it contains an SNP involved in a significant interaction. When evaluating the phenotypic genetic effect of each genotype combination within the key breeding gene module, the frequency of each genotype combination in the germplasm resource population is preferably greater than 10%.
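As a rough illustration of the inclusion rule above, the following Python sketch assembles a module from pairwise interaction P values that are assumed to have been computed already (for example by EPISNP1; the test itself is not reimplemented here). The function name `build_breeding_module`, the SNP and gene labels, and the P values are all hypothetical.

```python
def build_breeding_module(pair_p_values, snp_to_gene, p_cutoff=0.001):
    """Collect the genes whose SNPs appear in at least one significant pair.

    pair_p_values maps (snp_a, snp_b) tuples to interaction P values;
    snp_to_gene maps each significantly associated SNP to its key gene.
    """
    significant_pairs = [
        pair for pair, p in pair_p_values.items() if p <= p_cutoff
    ]
    module_snps = sorted({snp for pair in significant_pairs for snp in pair})
    module_genes = sorted({snp_to_gene[snp] for snp in module_snps})
    return significant_pairs, module_snps, module_genes


# Made-up example: four SNPs in four key genes, four tested pairs.
snp_to_gene = {
    "snp1": "geneA",
    "snp2": "geneB",
    "snp3": "geneC",
    "snp4": "geneD",
}
pair_p_values = {
    ("snp1", "snp2"): 0.0004,  # significant (P <= 0.001)
    ("snp2", "snp3"): 0.0009,  # significant
    ("snp1", "snp4"): 0.02,    # not significant
    ("snp3", "snp4"): 0.3,     # not significant
}
pairs, snps, genes = build_breeding_module(pair_p_values, snp_to_gene)
print(genes)  # ['geneA', 'geneB', 'geneC']
```

In this toy input, geneD is left out because its SNP takes part in no pair at P ≤ 0.001, mirroring the rule that only genes with an interacting SNP enter the module.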

The method for efficiently identifying key breeding gene modules of forest trees according to the application is described in further detail below with reference to a specific example; the technical solution of the application includes, but is not limited to, the following example.

Example 1

Using the method for efficiently identifying key breeding gene modules of forest trees, the key breeding gene module for xylem xylose content of Populus tomentosa was identified, and the phenotypic effect of each genotype combination in the gene module on xylem xylose content was analyzed, for use in seedling-stage screening for this trait and in establishing a molecular design breeding technology system for forest trees.

Step S1, obtaining genome-wide SNP genotype data of the Populus tomentosa germplasm resource population (303 individuals) by whole-genome resequencing, with the following specific steps:

Leaf DNA was extracted from each of the 303 individuals of the Populus tomentosa germplasm resource population and used for genome resequencing. The reads were aligned to the Populus tomentosa reference genome to obtain genome-wide SNPs and their genomic positions, and SNPs with a genotype frequency above 10% were retained for subsequent analysis, giving 12,800,000 SNPs.
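The application does not define "genotype frequency" more precisely. The Python sketch below shows one possible reading, stated here as an assumption: an SNP is kept only if it is polymorphic and every observed genotype class is carried by more than 10% of the genotyped individuals. The function name `passes_genotype_frequency` and the toy genotypes are hypothetical.

```python
from collections import Counter


def passes_genotype_frequency(genotypes, min_freq=0.10):
    """Return True if every observed genotype class exceeds min_freq.

    ASSUMPTION: "genotype frequency > 10%" is read as a per-class filter;
    missing calls are given as None and ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    counts = Counter(called)
    if len(counts) < 2:
        return False  # monomorphic site, nothing to associate
    return all(count / len(called) > min_freq for count in counts.values())


# Made-up genotype calls for 10 individuals at three SNPs.
snp_x = ["AA"] * 5 + ["AG"] * 3 + ["GG"] * 2   # 50% / 30% / 20%
snp_y = ["CC"] * 9 + ["CT"] * 1                # CT is exactly 10%, not above it
snp_z = ["TT"] * 10                            # monomorphic

kept = [name for name, calls in
        [("snp_x", snp_x), ("snp_y", snp_y), ("snp_z", snp_z)]
        if passes_genotype_frequency(calls)]
print(kept)  # ['snp_x']
```

If the intended filter were instead a minor allele frequency cutoff, only the counting step would change; the threshold logic would stay the same.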

Step S2, obtaining the xylem xylose content of the Populus tomentosa germplasm resource population (303 individuals), with the following specific steps:

Mature xylem was collected from each individual of the Populus tomentosa germplasm resource population and placed in liquid nitrogen (-196 ℃) immediately after collection. Xylem xylose content was determined by high performance liquid chromatography according to national standard methods GB 2677.7-81, GB 2677.8-81 and GB 2677.10-81. The analysis showed that xylem xylose content in the population ranged from 2.03% to 31.95% with a mean of 14.08% and followed a normal distribution, making it suitable for genome-wide association analysis.

Step S3, carrying out a genome-wide association study (GWAS) on the genome-wide SNP data of the Populus tomentosa germplasm resource population from step S1 and the xylose content data from step S2, using the mixed linear model (MLM) in TASSEL v5.0, to obtain a P value for the association between each SNP and the population xylose content. With P < 7.81E-08 as the screening criterion (1/n, where n is the number of genome-wide SNPs, following the Bonferroni method), SNP loci with P values below this threshold were taken as significantly associated loci. In total, 14 SNP loci were significantly associated with xylem xylose content of Populus tomentosa (P < 7.81E-08); specific results are shown in Table 1.
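The threshold used in this step follows directly from the SNP count in step S1: 1/12,800,000 = 7.8125E-08, reported as 7.81E-08. A minimal Python sketch of the screen is given below; the P values themselves would come from TASSEL, and the function names, SNP labels and P values here are made up for illustration.

```python
def significance_threshold(n_snps):
    """Genome-wide cutoff 1/n, as stated in the application."""
    return 1.0 / n_snps


def significant_snps(p_values, n_snps):
    """Keep SNPs whose association P value is below 1/n."""
    cutoff = significance_threshold(n_snps)
    return {snp: p for snp, p in p_values.items() if p < cutoff}


n_snps = 12_800_000                      # SNPs retained in step S1
cutoff = significance_threshold(n_snps)
print(f"{cutoff:.2E}")                   # 7.81E-08

# Hypothetical P values for three SNPs.
p_values = {"snp1": 3.2e-09, "snp2": 7.9e-08, "snp3": 0.04}
hits = significant_snps(p_values, n_snps)
print(sorted(hits))                      # ['snp1']
```

Note that "snp2" at 7.9E-08 narrowly fails the screen, since the unrounded cutoff is 7.8125E-08.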

Table 1 Information on genes significantly associated with xylem xylose content of Populus tomentosa

Step S4, annotating the significantly associated SNP loci using the protein-coding gene annotation of the Populus tomentosa genome, that is, locating the gene in which each significantly associated SNP locus lies and annotating it, which gave 9 candidate genes; specific results are shown in Table 1.

Step S5, since the trait of interest in this example is xylem xylose content, the expression levels of the 9 candidate genes obtained in step S4 need to be measured in the xylem of the Populus tomentosa germplasm resource population, with the following specific steps:

Mature xylem was collected from the Populus tomentosa germplasm resource population (303 individuals) and stored in liquid nitrogen after collection. RNA was extracted from the mature xylem with a Plant RNeasy kit (Qiagen China, Shanghai, China) and, after quality assessment, sent to a biotechnology company for transcriptome sequencing, giving the xylem expression levels of the 9 candidate genes from step S4 across the population.

Step S6, using SPSS v19.0 to calculate the expression-phenotype Pearson correlation coefficient r between the population expression levels of the 9 candidate genes from step S5 and the population xylose content from step S2. The expression levels of 6 genes were highly correlated with population xylose content (r > 0.4 or r < -0.4). These 6 key genes are PtoARF8, PtoWRKY41, PtoGAO1, PtoDOF2, PtoCAMTA5 and PtoC3H3; specific information is shown in Table 1.

Table 2 Epistatic interactions between significantly associated marker loci (P < 0.001)

Table 3 Effect analysis of the key breeding module for xylem xylose content of Populus tomentosa

Step S7, combining the genotype data of the significantly associated SNPs in the 6 key genes from step S6 with the population xylem xylose content data, and detecting SNP-SNP interaction effects with the EPISNP1 package in the epiSNP software. Significant epistatic interactions (P < 0.001) were found among four of the significantly associated SNPs, chr1_34278210, chr6_5959112, chr10_12844616 and chr2_21224438, indicating epistatic interactions among the 4 key genes PtoGAO1, PtoCAMTA5, PtoC3H3 and PtoDOF2 (Table 2). These 4 key genes were therefore incorporated into the key breeding gene module; that is, the key breeding gene module for xylem xylose content of Populus tomentosa is PtoGAO1-PtoCAMTA5-PtoC3H3-PtoDOF2. Further, the genetic effect on xylem xylose content was evaluated for each genotype combination (each with a frequency above 10%) of the corresponding significantly associated SNPs (chr1_34278210-chr6_5959112-chr10_12844616-chr2_21224438) in the breeding module composed of these 4 key genes. The AA/AG/GT/AT genotype combination had the highest xylem xylose content and the GA/AG/TT/AT genotype combination the lowest; specific information is shown in FIG. 1 and Table 3.

From the above, the breeding gene module for xylem xylose content of Populus tomentosa is PtoGAO1-PtoCAMTA5-PtoC3H3-PtoDOF2. For the significantly associated SNPs chr1_34278210-chr6_5959112-chr10_12844616-chr2_21224438 of these 4 breeding genes, the genotype combination AA/AG/GT/AT has the highest xylem xylose content and the genotype combination GA/AG/TT/AT has the lowest.
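The evaluation of genotype combinations in step S7 can be illustrated with a short Python sketch. It assumes, as a simplification, that the genetic effect of a combination is summarized by the mean phenotype of the individuals carrying it, and it keeps only combinations with a frequency above 10%. The function name `combination_effects`, the two-SNP genotype labels and the phenotype values are hypothetical and are not data from the application.

```python
from collections import defaultdict


def combination_effects(combinations, phenotype, min_freq=0.10):
    """Mean phenotype per genotype combination with frequency > min_freq."""
    groups = defaultdict(list)
    for combo, value in zip(combinations, phenotype):
        groups[combo].append(value)
    n = len(phenotype)
    return {
        combo: sum(values) / len(values)
        for combo, values in groups.items()
        if len(values) / n > min_freq
    }


# Made-up data: 10 individuals, combinations written as "SNP1/SNP2" genotypes.
combinations = ["AA/CC"] * 3 + ["AG/CT"] * 3 + ["GG/CT"] * 3 + ["GG/TT"] * 1
phenotype = [20.0, 22.0, 24.0, 8.0, 10.0, 12.0, 14.0, 15.0, 16.0, 30.0]

effects = combination_effects(combinations, phenotype)
best = max(effects, key=effects.get)
worst = min(effects, key=effects.get)
print(best, worst)  # AA/CC AG/CT
```

The single "GG/TT" individual has the largest phenotype value but is excluded because its combination occurs in exactly 10% of the population, which shows why the frequency filter matters before ranking combinations.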

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and these are also intended to fall within the scope of protection of the present application.

Claims

1. A method for efficiently identifying key breeding gene modules of forest trees, comprising the following steps:

1) Carrying out genome-wide association analysis on the genomic single nucleotide polymorphism (SNP) genotype data of each individual in the forest tree germplasm resource population to be tested and the phenotypic value of the specific trait of each individual in that population, and determining the SNP loci significantly associated with the specific trait;

2. The method of claim 1, wherein the forest tree germplasm resource population used in each step comprises more than 200 individuals.

3. The method of claim 1, wherein the SNP genotype frequency in steps 1), 2) and 5) is greater than 10%.

4. The method according to claim 1, wherein the genome-wide association analysis in step 1) uses the mixed linear model in TASSEL v5.0, and the software gives a P value for the significance of the association between each SNP locus and the specific phenotype; the P values are corrected for multiple hypothesis testing with a threshold of 1/n (n is the total number of SNPs in the whole genome; Bonferroni method), and SNP loci with P < 1/n are selected as the SNP loci significantly associated with the specific trait.

5. The method of claim 1, wherein the annotated transcription module of step 2) comprises protein-coding genes, long non-coding RNAs and microRNAs.

6. The method according to claim 1, wherein the software for calculating the Pearson correlation coefficient r in step 4) comprises SPSS v19.0.

7. The method according to claim 1, wherein the software for detecting the epistatic interaction effect in step 5) is the EPISNP1 package in the epiSNP software, which calculates the significance P value of the phenotype association of each SNP-SNP pair; SNP-SNP interaction pairs significantly associated with the specific trait are determined using a screening criterion of P ≤ 0.001.

8. The method according to claim 1, wherein in step 5), a key gene is incorporated into the key breeding gene module only if it contains an SNP involved in a significant epistatic interaction.

9. The method of claim 1, wherein in step 5), when evaluating the phenotypic genetic effect of each genotype combination in the key breeding gene module, the frequency of each genotype combination in the germplasm resource population is greater than 10%.

10. Use of the method of any one of claims 1 to 9 in molecular design breeding of forest trees.