CN112837751B

CN112837751B - High-throughput transcriptome sequencing data and trait association analysis system and method

Info

Publication number: CN112837751B
Application number: CN202110081269.7A
Authority: CN
Inventors: 康慧敏; 李华
Original assignee: Foshan University
Current assignee: Foshan University
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2024-02-09
Anticipated expiration: 2041-01-21
Also published as: CN112837751A

Abstract

The invention provides a high-throughput transcriptome sequencing data and trait association analysis method, comprising the following steps: obtaining high-throughput transcriptome sequencing data of a subject; obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data; fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene with a linear regression model; and solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait. The invention can effectively mine candidate genes for important traits, improving gene mining efficacy and reducing the false positive rate. Correspondingly, the invention also provides a high-throughput transcriptome sequencing data and trait association analysis system.

Description

High-throughput transcriptome sequencing data and trait association analysis system and method

Technical Field

The invention relates to the technical field of biological information, in particular to a high-throughput transcriptome sequencing data and trait association analysis system and method.

Background

In a broad sense, the transcriptome is the collection of all transcripts in a cell under a given physiological condition, including messenger RNA, ribosomal RNA, transfer RNA, and non-coding RNA; in a narrow sense, it is the collection of all mRNAs. Proteins are the main contributors to cellular function, and the proteome is the most direct description of cellular function and state. The transcriptome is the main means of studying gene expression and the necessary link connecting the genetic information of the genome with the biological function of the proteome. Regulation at the transcriptional level is currently the most studied, and also the most important, mode of regulation in organisms. High-throughput sequencing, also called "next-generation" sequencing, is characterized by sequencing hundreds of thousands to millions of DNA molecules in parallel in a single run, generally with short read lengths.

Mining candidate genes for important traits is a main research topic in animal and plant genetics and breeding, and is of great significance for molecular-assisted breeding, including genomic selection and gene editing. Currently, high-throughput transcriptome sequencing has become one of the mainstream methods used in the field of genetic breeding to mine candidate genes for important traits.

For quantitative traits, the prior art does not make full use of the phenotype information of individuals: continuously varying data are simply treated as categorical traits, which reduces gene mining efficacy and increases the false positive rate. Therefore, it is necessary to develop a high-throughput transcriptome sequencing data and trait association analysis method that improves gene mining efficacy and reduces the false positive rate.

Disclosure of Invention

Based on the above, to solve the problem that the prior art does not make full use of the phenotype information of individuals and simply treats continuously varying data as categorical traits, thereby reducing gene mining efficacy and increasing the false positive rate, the invention provides a high-throughput transcriptome sequencing data and trait association analysis system and method. The specific technical solution is as follows:

A high-throughput transcriptome sequencing data and trait association analysis system, comprising:

a data acquisition module for acquiring high-throughput transcriptome sequencing data and trait phenotype values of a subject;

an expression level acquisition module for obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data;

a fitting module for fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene with a linear regression model;

and a solving and analysis module for solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait.

Further, the linear regression model is expressed as $y = \mu\mathbf{1} + \sum_{i=1}^{m} X_i b_i + e$, where $y$ is the vector of trait phenotype values, $\mu\mathbf{1}$ is the population mean, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual.

Further, the solving and analysis module comprises an algorithm unit for solving the linear regression model with an elastic net algorithm.

Further, the objective function minimized by the elastic net algorithm is a penalized residual sum of squares in which $\lambda$ and $\alpha$ are both tuning parameters.

The invention also provides a high-throughput transcriptome sequencing data and trait association analysis method, which comprises the following steps:

obtaining high-throughput transcriptome sequencing data and trait phenotype values of a subject;

obtaining normalized expression levels of each gene of the subject from the high throughput transcriptome sequencing data;

fitting a relationship between the trait phenotype value of the subject and the normalized expression level of each gene by a linear regression model;

solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait.
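The second step above does not state which normalization is applied to the sequencing data. As a minimal sketch only, the following assumes log-transformed counts per million (CPM), one common choice; the function name `log_cpm`, the pseudocount, and the toy read counts are all hypothetical and are not taken from the invention.

```python
import math

def log_cpm(counts, pseudocount=1.0):
    """Convert raw read counts (samples x genes) to log2 counts per million.

    Assumption: CPM with a log2 transform stands in for the unspecified
    normalization; other schemes (for example TPM) would be equally possible.
    """
    normalized = []
    for sample in counts:
        library_size = sum(sample)
        if library_size <= 0:
            raise ValueError("each sample needs at least one mapped read")
        normalized.append(
            [math.log2(c / library_size * 1e6 + pseudocount) for c in sample]
        )
    return normalized

# Hypothetical toy data: 2 samples x 3 genes.
raw_counts = [[10, 90, 900], [20, 180, 1800]]
expr = log_cpm(raw_counts)
```

Because the second sample has exactly twice the reads of the first for every gene, both samples receive the same normalized values, which is the point of correcting for sequencing depth.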

In this high-throughput transcriptome sequencing data and trait association analysis method, the normalized expression level of each gene of the subject is obtained from the high-throughput transcriptome sequencing data and related to the trait phenotype value, so that the phenotype information of the subject is fully used. This addresses the problem that the prior art does not make full use of individual phenotype information and simply treats continuously varying data as categorical traits, which reduces gene mining efficacy and increases the false positive rate. Candidate genes for important traits can therefore be mined effectively, with improved gene mining efficacy and a reduced false positive rate.

Further, the linear regression model is solved with an elastic net algorithm.

Further, before fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene, the trait phenotype value and the expression level of each gene are standardized so that the expression level of each gene has mean 0 and variance 1.

The invention also provides a computer readable storage medium storing a computer program which when executed by a processor implements a high throughput transcriptome sequencing data and trait association analysis method as described above.

Drawings

The invention will be further understood from the following description taken in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Like reference numerals designate corresponding parts throughout the different views.

FIG. 1 is a schematic overall flow chart of a method for high throughput transcriptome sequencing data and trait association analysis in accordance with an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples thereof in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.

It will be understood that when an element is referred to as being "fixed to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for illustrative purposes only and are not meant to be the only embodiment.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

The terms "first" and "second" in this specification do not denote a particular quantity or order, but rather are used for distinguishing between similar or identical items.

In one embodiment of the present invention, a high throughput transcriptome sequencing data and trait association analysis system comprises:

In one embodiment, the linear regression model is expressed as $y = \mu\mathbf{1} + \sum_{i=1}^{m} X_i b_i + e$, where $y$ is the vector of trait phenotype values, $\mu\mathbf{1}$ is the population mean, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual.
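For orientation, the per-gene sum in this model can be written in matrix form, which is the form the residual term $(y - \mu 1 - Xb)$ uses elsewhere in this description. The restatement below is a sketch: the symbol $n$ (number of individuals) and the explicit dimensions are introduced here for illustration and do not appear in the source.

```latex
\begin{aligned}
y &= \mu\mathbf{1} + \sum_{i=1}^{m} X_i b_i + e \;=\; \mu\mathbf{1} + Xb + e, \\
X &= [\,X_1,\; X_2,\; \dots,\; X_m\,] \in \mathbb{R}^{n \times m}, \qquad
b = (b_1, b_2, \dots, b_m)' .
\end{aligned}
```

Each column of $X$ holds the expression level of one gene across the $n$ individuals, and $\mathbf{1}$ is taken to be a vector of $n$ ones.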

In one embodiment, the solving and analysis module includes an algorithm unit for solving the linear regression model with an elastic net algorithm.

In one example, the trait of interest is chicken breast muscle weight. The experiment is designed as follows: 400 test chickens are randomly selected, the breast muscle weight of each of the 400 chickens is measured, and high-throughput transcriptome sequencing is performed using the chicken breast muscle as the sample.

In one embodiment, the objective function minimized by the elastic net algorithm is a penalized form of the residual sum of squares, where $\lambda$ and $\alpha$ are both tuning parameters, $Xb$ denotes the sum $\sum_{i=1}^{m} X_i b_i$, and $(y - \mu\mathbf{1} - Xb)'(y - \mu\mathbf{1} - Xb)$ is the product of the transpose of the residual vector with the residual vector.
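One widely used parameterization of an elastic net objective built on this residual term is given below as a sketch. It is an assumption that the invention uses this exact scaling and penalty weighting; the factor $1/(2n)$ and the symbol $n$ (number of individuals) are introduced here and are not taken from the source.

```latex
\min_{\mu,\, b}\;
\frac{1}{2n}\,(y - \mu\mathbf{1} - Xb)'(y - \mu\mathbf{1} - Xb)
\;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert b \rVert_2^2 + \alpha\,\lVert b \rVert_1\right]
```

Under this form, $\lambda \ge 0$ sets the overall strength of the penalty and $\alpha \in [0, 1]$ sets the mix between the two penalties: $\alpha = 1$ gives Lasso regression and $\alpha = 0$ gives ridge regression.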

In one embodiment, as shown in fig. 1, the present invention provides a method for high throughput transcriptome sequencing data and trait association analysis, comprising the steps of:

In one embodiment, the linear regression model is solved with an elastic net algorithm. The elastic net is a regression algorithm that combines Lasso regression and ridge regression: it controls the influence of any single coefficient on the result by adding an L1 regularization term and an L2 regularization term to the loss function. The elastic net inherits the stability of ridge regression under rotation, which can further improve gene mining efficacy and reduce the false positive rate.
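The following is a minimal sketch of how such a model could be solved and how genes with non-zero effects could then be read off. It assumes the objective $(1/2n)\lVert y - \mu 1 - Xb\rVert^2 + \lambda[(1-\alpha)/2\,\lVert b\rVert_2^2 + \alpha\lVert b\rVert_1]$ and uses cyclic coordinate descent, which is one standard solver and not necessarily the one used by the invention. The function names, the toy data, and the values of `lam` and `alpha` are hypothetical; the code is plain Python for self-containment, whereas a practical analysis would more likely rely on an established statistical package.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator arising from the L1 penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam, alpha, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for the assumed objective
    (1/2n)*||y - mu - Xb||^2 + lam*((1-alpha)/2*||b||_2^2 + alpha*||b||_1).
    X is a list of n rows with m gene columns; returns (mu, b)."""
    n, m = len(X), len(X[0])
    b = [0.0] * m
    mu = sum(y) / n
    resid = [y[i] - mu for i in range(n)]  # y - mu - Xb with b = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(m)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(m):
            # Covariance of gene j with the partial residual
            # (the current contribution of gene j is added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * b[j]
            denom = col_sq[j] + lam * (1.0 - alpha)
            new_bj = soft_threshold(rho, lam * alpha) / denom if denom > 0 else 0.0
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
                max_change = max(max_change, abs(delta))
        # Re-centre the (unpenalized) intercept on the current residuals.
        shift = sum(resid) / n
        mu += shift
        resid = [r - shift for r in resid]
        if max_change < tol:
            break
    return mu, b

# Hypothetical toy data: 4 individuals x 3 genes; the trait is driven mainly by gene 0.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [7.1, 6.9, 2.9, 3.1]
mu, b = elastic_net(X, y, lam=0.5, alpha=0.5)

# Genes whose estimated effect is non-zero are taken as trait-associated.
associated = [j for j, bj in enumerate(b) if bj != 0.0]
```

In this toy example the L1 part of the penalty sets the small effects exactly to zero, so only gene 0 remains in `associated`; this exact-zero behaviour is what makes "all genes with non-zero effects" a usable selection rule.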

In one embodiment, the objective function minimized by the elastic net algorithm is a penalized residual sum of squares in which $\lambda$ and $\alpha$ are both tuning parameters.

In one embodiment, $\lambda$ and $\alpha$ are determined by cross-validation: the selected tuning parameter values are those whose cross-validation error is no greater than the minimum error plus one standard error.
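The selection rule above can be sketched as follows for a grid of candidate $\lambda$ values at a fixed $\alpha$. It is an assumption, following common practice, that among all candidates within one standard error of the minimum the largest $\lambda$ (the most strongly penalized model) is chosen; the function name `select_one_se` and the cross-validation figures are hypothetical.

```python
def select_one_se(lambdas, cv_mean, cv_se):
    """Return the largest lambda whose mean cross-validation error is no
    greater than the minimum mean error plus one standard error."""
    best = min(range(len(lambdas)), key=lambda k: cv_mean[k])
    threshold = cv_mean[best] + cv_se[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return max(eligible)

# Hypothetical cross-validation summary for five candidate lambda values.
lambdas = [0.01, 0.05, 0.10, 0.50, 1.00]
cv_mean = [1.30, 1.15, 1.00, 1.08, 1.60]
cv_se = [0.12, 0.10, 0.10, 0.11, 0.15]
chosen = select_one_se(lambdas, cv_mean, cv_se)
```

Here the minimum error is 1.00 with a standard error of 0.10, so every candidate with error at most 1.10 is eligible and the largest of them is returned. The same comparison could be repeated over a grid of $\alpha$ values.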

In one embodiment, before fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene, the trait phenotype value and the expression level of each gene are standardized so that the expression level of each gene has mean 0 and variance 1.
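The standardization described above can be sketched as follows. The sketch assumes the population variance (division by the number of individuals rather than by one less); the function names and the toy values are hypothetical.

```python
import math

def standardize(values):
    """Centre a vector to mean 0 and scale it to variance 1.

    Assumption: the population variance (divide by n) is used.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        raise ValueError("cannot standardize a constant vector")
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in values]

def standardize_columns(X):
    """Standardize every gene (column) of a samples x genes table."""
    columns = [standardize(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*columns)]

# Hypothetical toy data: 4 individuals x 2 genes, with one phenotype value each.
expr = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
pheno = [5.0, 7.0, 6.0, 9.0]
Z = standardize_columns(expr)
y_std = standardize(pheno)
```

Putting every gene on the same scale matters for a penalized regression, because the penalty acts on the size of each coefficient and would otherwise favour genes measured on larger scales.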

In one embodiment, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the high-throughput transcriptome sequencing data and trait association analysis method described above.

The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination involves no contradiction, it should be considered within the scope of this description.

The above examples illustrate only a few embodiments of the invention. Although they are described specifically and in detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims

1. A high-throughput transcriptome sequencing data and trait association analysis system, comprising:

a solving and analysis module for solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait;

the linear regression model is expressed as $y = \mu\mathbf{1} + \sum_{i=1}^{m} X_i b_i + e$, where $y$ is the vector of trait phenotype values, $\mu\mathbf{1}$ is the population mean, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual;

the solving and analysis module comprises an algorithm unit for solving the linear regression model with an elastic net algorithm.

2. The high-throughput transcriptome sequencing data and trait association analysis system of claim 1, wherein the objective function minimized by the elastic net algorithm is a penalized residual sum of squares in which $\lambda$ and $\alpha$ are both tuning parameters.

3. A high-throughput transcriptome sequencing data and trait association analysis method, comprising the steps of:

solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait;

the linear regression model is expressed as $y = \mu\mathbf{1} + \sum_{i=1}^{m} X_i b_i + e$, where $y$ is the vector of trait phenotype values, $\mu\mathbf{1}$ is the population mean, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual;

and the linear regression model is solved with an elastic net algorithm.

4. The high-throughput transcriptome sequencing data and trait association analysis method according to claim 3, wherein the objective function minimized by the elastic net algorithm is a penalized residual sum of squares in which $\lambda$ and $\alpha$ are both tuning parameters.

5. The method according to claim 4, wherein, before fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene, the trait phenotype value and the expression level of each gene are standardized so that the expression level of each gene has mean 0 and variance 1.

6. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the high-throughput transcriptome sequencing data and trait association analysis system according to any one of claims 1 to 2.