CN107058525B - Method for predicting unknown gene function of corn based on dynamic correlation of gene expression quantity and character

Info

Publication number: CN107058525B
Application number: CN201710169145.8A
Authority: CN
Inventors: 李慧; 许秀勤; 车荣会; 李鹏; 裴腊明; 高幸幸; 何琳琳
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2017-03-21
Filing date: 2017-03-21
Publication date: 2020-12-29
Anticipated expiration: 2037-03-21
Also published as: CN107058525A

Abstract

The invention belongs to the field of plant molecular biotechnology and genetic engineering, and particularly relates to a method for predicting the functions of unknown corn genes based on the dynamic correlation between gene expression level and traits. The method comprises the following steps: collecting kernels from maize inbred lines 15 days after pollination and sequencing their transcripts to obtain gene expression data; establishing a dynamic association analysis (LA) model; assessing LA significance; mining dynamic associations in the co-expression patterns of genes across the whole corn genome; and functionally annotating the genes with significant LA results to predict the functions of unknown genes. The invention uses the dynamic association of gene co-expression patterns in corn kernels as the entry point for predicting the functions of unknown genes. Compared with traditional co-expression network construction, dynamic association analysis can quickly identify the regulatory genes that modulate a co-expression pattern.

Description

Method for predicting unknown gene function of corn based on dynamic correlation of gene expression quantity and character

Technical Field

The invention belongs to the field of plant molecular biotechnology and genetic engineering, and particularly relates to a method for predicting the functions of unknown corn genes based on the dynamic correlation between gene expression level and traits.

Background

Corn is one of the three major crops in the world. In the 1990s, total world corn production surpassed that of rice and wheat for the first time, and corn became the leading food crop. Corn kernels accumulate large amounts of storage materials, including starch, oil, and protein. With rising living standards, changing diets, and the growth of the starch and oil processing industries, corn varieties have gradually shifted from yield-oriented to quality-oriented types, and corn quality and end-use specialization have become increasingly important.

Complex traits are regulated by multiple gene loci, and interactions between genes form complex gene regulatory networks that control the various biological reactions in the cell. The development of high-throughput sequencing technology has made it possible to obtain large-scale omics data, such as genotype data, gene expression data, and protein interaction data. Research shows that the expression patterns of genes with similar functions are correlated, so constructing a co-expression network offers a way to predict the functions of unknown genes. However, in the course of constructing co-expression networks, many genes with similar functions are found whose expression patterns are not correlated. Co-expression analysis is therefore limited as a means of predicting the function of an unknown gene. Research also shows that a single gene or protein has limited influence on a complex quantitative trait and often must act through higher-order cellular organization, and that the expression levels of many functionally related genes are not correlated. The genetic loci identified as controlling phenotypic traits are relatively independent of one another, and the regulatory relationships among them are unknown. In addition, traditional analysis methods require years of multi-site phenotypic evaluation and are time-consuming and labor-intensive.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a method for predicting the unknown gene function of corn based on the dynamic correlation between the gene expression quantity and the character.

The invention is realized by the following technical scheme:

The invention provides a method for predicting unknown gene functions of corn based on dynamic correlation of gene expression quantity and traits, which comprises the following steps:

(1) collecting kernels from maize inbred lines 15 days after pollination and sequencing their transcripts to obtain gene expression data;

(2) establishing a dynamic association analysis (LA) model;

(3) assessing LA significance;

(4) mining dynamic associations in the co-expression patterns of genes across the whole corn genome;

(5) functionally annotating the genes with significant LA results to predict the functions of unknown genes.

Further, the maize inbred lines are divided into 2 groups: one group is tropical, and the other is subtropical and temperate. Within each group a randomized complete block design with 2 replicates is used, and each inbred line is sown in 1 row per replicate. All materials are self-pollinated, and immature kernels are harvested 15 days after pollination. For each inbred line, 3-4 ears are taken from each of the two replicates and 1-2 kernels are taken from each ear; total RNA is extracted from the pooled kernels, and samples corresponding to the number of inbred lines are randomly selected for RNA-seq.

The RNA-seq comprises the following steps: first, all RNA bearing a poly(A) tail, mainly mRNA, is isolated from total RNA using poly(T) oligonucleotides; the captured mRNA is randomly fragmented; first-strand cDNA is synthesized using random hexamer primers; reverse transcriptase is added to synthesize second-strand cDNA; the cDNA fragments are purified with a kit, end-repaired, and ligated to sequencing adapters; fragments of the target size are recovered by agarose gel electrophoresis and amplified by PCR; and sequencing and analysis are carried out on an Illumina GA II gene analysis system to obtain gene expression data.

Further, the dynamic association analysis LA model is specifically established by the following method: the mathematical definition of LA is as follows:

$$\mathrm{LA}(X, Y \mid Z) = E\,g'(Z) \qquad \text{(Formula 1)}$$

where X, Y, and Z are gene expression levels in corn kernels. X, Y, and Z are assumed to be continuous random variables with mean 0 and variance 1, so the correlation of X and Y is expressed as E(XY). Conditional on Z = z, define g(z) = E(XY | Z = z); g(z) describes the co-expression pattern of the gene pair (X, Y) when Z = z. The derivative of g(z) is denoted g'(z), and its expected value is used to measure the change in the co-expression pattern.

When Z follows a standard normal distribution, the LA value simplifies to LA(X, Y | Z) = E(XYZ).
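The simplification for a standard normal Z follows from a standard identity for normal variables (Stein's lemma). The short derivation below is a sketch added for clarity; it assumes that g is differentiable and that E|g'(Z)| is finite, conditions the text does not state explicitly.

```latex
\begin{align*}
\mathrm{LA}(X, Y \mid Z) &= E\,g'(Z) \\
&= E\bigl[Z\,g(Z)\bigr] && \text{(Stein's lemma, } Z \sim N(0,1)\text{)} \\
&= E\bigl[Z\,E(XY \mid Z)\bigr] && \text{(definition of } g\text{)} \\
&= E(XYZ).
\end{align*}
```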

If X, Y, and Z represent three genes with normally distributed expression profiles, LA(X, Y | Z) is computed over the m samples as:

$$\mathrm{LA}(X, Y \mid Z) = E(XYZ) = \frac{x_1 y_1 z_1 + x_2 y_2 z_2 + \cdots + x_m y_m z_m}{m} \qquad \text{(Formula 2)}$$
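As an illustration of the averaged-product formula above, the following Python sketch estimates the LA value for one gene triplet. The function names `standardize` and `la_score` and the toy vectors are hypothetical, and NumPy is assumed to be available; the text does not specify an implementation.

```python
import numpy as np

def standardize(v):
    """Rescale a vector to mean 0 and variance 1."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def la_score(x, y, z):
    """Estimate LA(X, Y | Z) as the mean of x_i * y_i * z_i over the m samples."""
    x, y, z = standardize(x), standardize(y), standardize(z)
    return float(np.mean(x * y * z))

# Toy expression profiles for three genes in m = 4 samples (illustrative values only).
x = [-1.0, 1.0, -1.0, 1.0]
y = [-1.0, 1.0, 1.0, -1.0]
z = [-1.0, -1.0, 1.0, 1.0]

la = la_score(x, y, z)
print(la)
```

In this toy input the product of x and y is positive in the two samples where z is low and negative in the two samples where z is high, so the estimate is negative.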

LA reflects the dynamic change in the co-expression pattern of a gene pair. When the expression level of gene Z is high, the expression levels of the gene pair (X, Y) are positively correlated (co-regulated) and E(XY | Z = 1) is positive; when the expression level of gene Z is low, the expression levels of the gene pair are negatively correlated (contra-regulated) and E(XY | Z = 0) is negative. As the expression of Z goes from high to low, the regulation pattern of the gene pair thus changes from positive correlation (co-regulated) to negative correlation (contra-regulated), and the LA value is recorded as positive. Conversely, when the regulation pattern of the gene pair changes from negative correlation (contra-regulated) to positive correlation (co-regulated), the LA value is recorded as negative.
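The sign convention can be checked on simulated data. The sketch below is only an illustration under assumed conditions: the data are synthetic, the helper `la_score` is a hypothetical name, and the construction of Y (following X when Z is above its mean and opposing X otherwise) is chosen purely to produce the two cases described.

```python
import numpy as np

def la_score(x, y, z):
    """Mean of the standardized product x*y*z (sample estimate of E(XYZ))."""
    x, y, z = [(v - v.mean()) / v.std() for v in (x, y, z)]
    return float(np.mean(x * y * z))

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
m = 5000                         # number of simulated samples
z = rng.standard_normal(m)       # regulator gene Z
x = rng.standard_normal(m)       # gene X

# Y tracks X when Z is high and opposes X when Z is low.
y = np.sign(z) * x + 0.5 * rng.standard_normal(m)

corr_high = float(np.corrcoef(x[z > 0], y[z > 0])[0, 1])  # X-Y correlation at high Z
corr_low = float(np.corrcoef(x[z < 0], y[z < 0])[0, 1])   # X-Y correlation at low Z
la_forward = la_score(x, y, z)    # positive correlation at high Z
la_reverse = la_score(x, -y, z)   # mirrored case: negative correlation at high Z
print(corr_high, corr_low, la_forward, la_reverse)
```

The first case gives a positive LA estimate and the mirrored case a negative one, matching the convention described in the text.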

The dynamic association analysis model established by the invention is evaluated as follows: the expression values of all genes are pooled; in each simulation, expression values for a gene pair (X, Y) are drawn from the pool by random sampling with replacement, Z is taken in turn to be every gene in the whole genome, and the LA values of the pair (X, Y) across the whole genome are calculated, yielding the largest positive LA value and the most negative LA value; the simulation is repeated one million times to obtain a positive and a negative reference distribution of LA, and the 99% quantiles of the positive and negative reference distributions are taken as the positive and negative LA significance thresholds.

Further, the results of the genome-wide dynamic association analysis are filtered by LA value, the genes with significant LA are functionally annotated, and the functions of unknown genes are predicted.
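A minimal sketch of this simulation is given below, under several stated assumptions: the function names are hypothetical, the data are synthetic stand-ins, the number of simulations is reduced from one million to 200 so the example runs quickly, and the "99% quantile" of the negative reference distribution is read as its lower 1% tail.

```python
import numpy as np

def standardize_rows(a):
    """Standardize each row (gene) to mean 0 and variance 1."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)

def la_thresholds(expr, n_sim=200, q=0.99, seed=0):
    """Simulate reference distributions of the extreme LA values.

    expr is a genes x samples matrix; returns (positive, negative) thresholds.
    """
    rng = np.random.default_rng(seed)
    zmat = standardize_rows(expr)                  # every gene serves as Z in turn
    pool = np.asarray(expr, dtype=float).ravel()   # pooled expression values
    m = zmat.shape[1]
    max_pos = np.empty(n_sim)
    min_neg = np.empty(n_sim)
    for s in range(n_sim):
        # Draw a random gene pair (X, Y) from the pool, with replacement.
        x = rng.choice(pool, size=m, replace=True)
        y = rng.choice(pool, size=m, replace=True)
        x = (x - x.mean()) / x.std()
        y = (y - y.mean()) / y.std()
        la = zmat @ (x * y) / m                    # LA(X, Y | Z) for all Z at once
        max_pos[s] = la.max()
        min_neg[s] = la.min()
    # Assumption: the negative threshold is the lower (1 - q) tail of the minima.
    return float(np.quantile(max_pos, q)), float(np.quantile(min_neg, 1.0 - q))

# Synthetic stand-in data (60 genes x 40 samples); not the dataset of the invention.
demo = np.random.default_rng(1).standard_normal((60, 40))
pos_thr, neg_thr = la_thresholds(demo)
print(pos_thr, neg_thr)
```

Computing the LA values for all Z genes with a single matrix product keeps each simulation cheap, which matters when the number of repetitions is large.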
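The filtering and annotation step might be organized as in the sketch below. Everything in it is hypothetical: the gene names, LA values, thresholds, and annotation labels are toy inputs, and the rule of taking the most frequent annotation among partner genes is an assumed simplification of the annotation-based inference described here.

```python
from collections import Counter

def predict_function(results, annotations, pos_thr, neg_thr):
    """For each regulator Z, tally annotations of genes in its significant LA pairs."""
    tallies = {}
    for x, y, z, la in results:
        if la >= pos_thr or la <= neg_thr:       # keep significant LA values only
            counter = tallies.setdefault(z, Counter())
            for gene in (x, y):
                if gene in annotations:          # skip unannotated partner genes
                    counter[annotations[gene]] += 1
    # Assumed rule: the predicted function is the most frequent partner annotation.
    return {z: c.most_common(1)[0][0] for z, c in tallies.items() if c}

# Hypothetical toy records: (gene X, gene Y, gene Z, LA value).
results = [
    ("geneA", "geneB", "geneU", 0.62),
    ("geneA", "geneC", "geneU", -0.58),
    ("geneD", "geneB", "geneU", 0.10),   # below both thresholds, ignored
]
annotations = {
    "geneA": "translation",
    "geneB": "translation",
    "geneC": "phosphorylation",
    "geneD": "transport",
}
predicted = predict_function(results, annotations, pos_thr=0.5, neg_thr=-0.5)
print(predicted)
```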

Research suggests two main hypotheses for why genes with similar functions can have uncorrelated expression patterns: first, that the expression of such genes is not regulated at the mRNA level; second, that their expression patterns are correlated only in a specific cellular environment, that is, the co-expression pattern is dynamically associated. Dynamic association analysis (LA) provides theoretical support for testing the second hypothesis. Based on the scientific hypothesis that genes with similar functions have correlated expression patterns, the invention uses the LA method to identify dynamic associations in the co-expression patterns of genes across the whole corn genome, predicts the function of an unknown gene from the functional annotations of the genes in significant LA results, and verifies the LA prediction against the function of the unknown gene's homolog in Arabidopsis thaliana. The approach is novel and has not been reported in plant science research.

The invention has the beneficial effects that:

(1) the invention uses the dynamic association of gene co-expression patterns in corn kernels as the entry point for predicting the functions of unknown genes. Compared with traditional co-expression network construction, dynamic association analysis can quickly identify the regulatory genes that modulate a co-expression pattern;

(2) the invention infers the function of an unknown gene from the functional annotations of the genes with significant LA results and verifies the prediction against the function of the homologous gene, and is thus an effective method for predicting the functions of unknown genes.

Drawings

Figure 1 shows the LA values generated by random simulation, which are used to assess the significance of the LA analysis.

Detailed Description

The invention is further described with reference to the following figures and specific examples, which are intended to be illustrative only and do not limit the scope of the invention.

Example 1

The invention discloses a method for predicting unknown gene functions of corn based on dynamic correlation analysis.

(1) Collecting gene expression amount data:

368 inbred lines (any maize variety may be used in the invention), including 35 high-oil maize inbred lines (Yang et al., 2010b) bred by Song and Mingzhiu of China Agricultural University, were planted in Jingzhou, Hubei, in 2010. Based on pedigree information they were divided into 2 groups (one tropical, the other subtropical and temperate); within each group a randomized complete block design with 2 replicates was used, and each inbred line was sown in 1 row per replicate. All materials were self-pollinated, and immature kernels were harvested 15 days after pollination (15 DAP). For each inbred line, 3-4 ears were taken from each of the two replicates and 1-2 kernels from each ear; total RNA was extracted from the pooled kernels, and 368 samples were randomly selected for RNA-seq. RNA-seq of the samples was performed by BGI (Shenzhen Huada Gene Research Institute). The sequencing method is briefly as follows: first, all RNA bearing a poly(A) tail, mainly mRNA, was isolated from total RNA using poly(T) oligonucleotides; the captured mRNA was randomly fragmented; first-strand cDNA was synthesized using random hexamer primers; second-strand cDNA was synthesized by adding reverse transcriptase and other reagents; the cDNA fragments were purified with a kit (Ampure XP beads), end-repaired, and ligated to sequencing adapters; fragments of the target size were recovered by agarose gel electrophoresis and amplified by PCR, completing library construction; and the constructed library was sequenced and analyzed on an Illumina GA II gene analysis system.

Transcript sequencing yielded expression data for 28,769 genes in the 368 maize inbred lines. Missing values in the gene expression dataset were preprocessed as follows. Expression values can be missing because of experimental noise, limitations of the detection technology, and similar causes. For each gene in the dataset, if its expression value is missing in more than 30% of the samples, the gene is discarded from subsequent analyses. Expression data for 24,907 genes (part of the data can be downloaded directly from a database as required) were retained for the subsequent genome-wide LA analysis;
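The missing-value rule can be sketched as follows. This is a minimal illustration, assuming the expression data are held in a genes x samples NumPy array in which NaN marks a missing value; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def filter_genes_by_missing(expr, max_missing=0.30):
    """Keep genes (rows) whose expression is missing in at most max_missing of samples."""
    expr = np.asarray(expr, dtype=float)
    missing_frac = np.isnan(expr).mean(axis=1)   # fraction of missing samples per gene
    keep = missing_frac <= max_missing           # drop genes missing in more than 30%
    return expr[keep], keep

# Toy matrix: 3 genes x 10 samples, NaN marks a missing value (an assumed encoding).
toy = np.ones((3, 10))
toy[1, :2] = np.nan    # gene 1: 20% missing, kept
toy[2, :4] = np.nan    # gene 2: 40% missing, discarded
filtered, keep = filter_genes_by_missing(toy)
print(keep)
```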

(2) establishing a dynamic association analysis LA model:

the dynamic association analysis LA model is specifically established by adopting the following method: the mathematical definition of LA is as follows:

$$\mathrm{LA}(X, Y \mid Z) = E\,g'(Z) \qquad \text{(Formula 1)}$$

(3) LA significance assessment

The expression values of all genes are pooled; in each simulation, expression values for a gene pair (X, Y) are drawn from the pool by random sampling with replacement, Z is taken in turn to be every gene in the whole genome, and the LA values of the pair (X, Y) across the whole genome are calculated, yielding the largest positive LA value and the most negative LA value; the simulation is repeated one million times to obtain a positive and a negative reference distribution of LA, and the 99% quantiles of the positive and negative reference distributions are taken as the positive and negative LA significance thresholds. The results are shown in Figure 1.

(4) Whole genome LA analysis

LA analysis was performed with X, Y, and Z each ranging over all genes of the whole genome, focusing on the 50 co-expressed gene pairs (LAPs) with the largest absolute LA values. X, Y, and Z were functionally annotated. Table 1 lists the genes regulated by GRMZM5G858880 that are involved in the protein translation process. The gene GRMZM5G858880 regulates multiple co-expressed gene pairs (LAPs), and the Maize Genome Database annotates this gene as "encoding a protein comprising the WW domain". In the list of LAPs regulated by GRMZM5G858880, several genes involved in the protein translation process, including ribosomal protein synthesis, initiation of protein translation, and protein phosphorylation, were found, some of them many times: GRMZM2G092663 (encoding ribosomal S5 protein family, 4 times), GRMZM2G099352 (encoding the ribosomal S3 protein family), GRMZM2G168149 (encoding the ribosomal S5 protein family), GRMZM2G129015 (encoding ribosomal S26e protein family, 2 times), GRMZM2G164352 (encoding protein phosphatase 2A subunit A2, 4 times), GRMZM2G122135 (encoding protein phosphatase 2A subunit A2, 2 times), and GRMZM2G064133 (encoding eukaryotic translation initiation factor 3G1). The regulatory gene GRMZM5G858880 was therefore presumed to be involved in the protein translation process as well. Research shows that the homolog of GRMZM5G858880 in Arabidopsis (AT3G13225) regulates the protein translation process through ribosome deceleration and reduced reinitiation efficiency (Tran et al., BMC Genomics, 2008).

Table 1. List of genes regulated by GRMZM5G858880 that are involved in the protein translation process

These results demonstrate the effectiveness of the invention: predicting the functions of unknown genes by genome-wide dynamic association analysis of co-expression patterns, combined with functional annotation, provides a new approach and method for corn functional genomics research.
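The genome-wide scan of step (4) can be sketched for a single candidate regulator Z as follows. This is a hedged illustration rather than the procedure actually used: the function name, gene identifiers, and data are hypothetical, NumPy is assumed, and excluding Z itself from the candidate pairs is an assumption. Repeating the call for every gene as Z would give the full scan, which is computationally heavy for tens of thousands of genes.

```python
import numpy as np

def top_la_pairs(expr, gene_ids, z_index, top_k=50):
    """Rank gene pairs (X, Y) by |LA(X, Y | Z)| for one candidate regulator Z."""
    a = np.asarray(expr, dtype=float)
    a = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
    m = a.shape[1]
    # la[i, j] = mean over samples of x_i * y_j * z, i.e. E(XYZ) for every pair.
    la = (a * a[z_index]) @ a.T / m
    i_idx, j_idx = np.triu_indices(len(gene_ids), k=1)   # each unordered pair once
    mask = (i_idx != z_index) & (j_idx != z_index)       # assumed: Z is not its own partner
    i_idx, j_idx = i_idx[mask], j_idx[mask]
    vals = la[i_idx, j_idx]
    order = np.argsort(-np.abs(vals))[:top_k]            # largest |LA| first
    return [(gene_ids[i_idx[t]], gene_ids[j_idx[t]], float(vals[t])) for t in order]

rng = np.random.default_rng(0)
ids = [f"gene{n}" for n in range(30)]     # hypothetical identifiers
expr = rng.standard_normal((30, 100))     # synthetic data, not the dataset of the invention
pairs = top_la_pairs(expr, ids, z_index=0, top_k=5)
for gx, gy, value in pairs:
    print(gx, gy, round(value, 3))
```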

Claims

1. A method for predicting the unknown gene function of corn based on the dynamic correlation between gene expression level and character is characterized by comprising the following steps:

(2) establishing a dynamic correlation analysis LA model;

$$\mathrm{LA}(X, Y \mid Z) = E\,g'(Z) \qquad \text{(Formula 1)}$$

wherein X, Y, and Z are gene expression levels in corn kernels; X, Y, and Z are assumed to be continuous random variables with mean 0 and variance 1, and the correlation of X and Y is expressed as E(XY); conditional on Z = z, g(z) = E(XY | Z = z), where g(z) describes the co-expression pattern of the gene pair (X, Y) when Z = z; the derivative of g(z) is denoted g'(z), and its expected value is used to measure the change in the co-expression pattern; when Z follows a standard normal distribution, the LA value simplifies to LA(X, Y | Z) = E(XYZ);

LA reflects the dynamic change in the co-expression pattern of a gene pair, namely: when the expression level of gene Z is high, the expression levels of the gene pair (X, Y) are positively correlated (co-regulated) and E(XY | Z = 1) is positive; when the expression level of gene Z is low, the expression levels of the gene pair are negatively correlated (contra-regulated) and E(XY | Z = 0) is negative, so that the regulation pattern of the gene pair changes from positive correlation (co-regulated) to negative correlation (contra-regulated), and the LA value is recorded as positive; conversely, when the regulation pattern of the gene pair changes from negative correlation (contra-regulated) to positive correlation (co-regulated), the LA value is recorded as negative;

(3) LA significance assessment;

(5) performing functional annotation on the genes with significant LA results, and predicting the functions of unknown genes;

the maize inbred lines are divided into two groups: one group is tropical, and the other is subtropical and temperate; within each group a randomized complete block design with 2 replicates is used, and each inbred line is sown in 1 row per replicate; all materials are self-pollinated, and immature kernels are harvested 15 days after pollination; for each inbred line, 3-4 ears are taken from each of the two replicates and 1-2 kernels are taken from each ear; total RNA is extracted from the pooled kernels, and 368 samples are randomly selected for RNA-seq.

2. The method of claim 1, wherein the RNA-seq comprises the steps of: first, isolating all RNA bearing a poly(A) tail (mainly mRNA) from total RNA using poly(T) oligonucleotides; randomly fragmenting the captured mRNA; synthesizing first-strand cDNA using random hexamer primers; adding reverse transcriptase to synthesize second-strand cDNA; purifying the cDNA fragments with a kit; end-repairing the cDNA fragments and ligating sequencing adapters; recovering fragments of the target size by agarose gel electrophoresis; performing PCR amplification; and performing sequencing and analysis on an Illumina GA II gene analysis system to obtain gene expression data.

3. The method of claim 1, wherein the dynamic association analysis LA model is evaluated as follows: the expression values of all genes are pooled; in each simulation, expression values for a gene pair (X, Y) are drawn from the pool by random sampling with replacement, Z is taken in turn to be every gene in the whole genome, and the LA values of the pair (X, Y) across the whole genome are calculated, yielding the largest positive LA value and the most negative LA value; the simulation is repeated one million times to obtain a positive and a negative reference distribution of LA, and the 99% quantiles of the positive and negative reference distributions are taken as the positive and negative LA significance thresholds.

4. The method of claim 1, wherein the results of the dynamic association analysis of the genome-wide gene co-expression pattern are filtered according to the magnitude of the LA value, and the genes with significant LA are functionally annotated to predict the functions of unknown genes.