CN112837751B - High-throughput transcriptome sequencing data and trait association analysis system and method - Google Patents

High-throughput transcriptome sequencing data and trait association analysis system and method

Info

Publication number
CN112837751B
CN112837751B CN202110081269.7A CN202110081269A CN112837751B CN 112837751 B CN112837751 B CN 112837751B CN 202110081269 A CN202110081269 A CN 202110081269A CN 112837751 B CN112837751 B CN 112837751B
Authority
CN
China
Prior art keywords
gene
sequencing data
transcriptome sequencing
trait
linear regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110081269.7A
Other languages
Chinese (zh)
Other versions
CN112837751A (en)
Inventor
康慧敏
李华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan University
Original Assignee
Foshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan University
Priority to CN202110081269.7A
Publication of CN112837751A
Application granted
Publication of CN112837751B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a high-throughput transcriptome sequencing data and trait association analysis method, comprising the following steps: obtaining high-throughput transcriptome sequencing data of a subject; obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data; fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene with a linear regression model; and solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait. The invention can effectively mine candidate genes for important traits, improving gene mining efficacy and reducing the false positive rate. Correspondingly, the invention also provides a high-throughput transcriptome sequencing data and trait association analysis system.

Description

High-throughput transcriptome sequencing data and trait association analysis system and method
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a high-throughput transcriptome sequencing data and trait association analysis system and method.
Background
In a broad sense, the transcriptome is the collection of all transcripts in a cell under a given physiological condition, including messenger RNA, ribosomal RNA, transfer RNA, and non-coding RNA; in a narrow sense, it is the collection of all mRNAs. Proteins are the main executors of cellular function, and the proteome is the most direct description of cellular function and state. Transcriptome analysis is the main means of studying gene expression, and the transcriptome is the necessary link connecting the genetic information of the genome with the biological function of the proteome. Regulation at the transcriptional level is currently the most studied, and is also the most important, mode of regulation in organisms. High-throughput sequencing, also called "next-generation" sequencing, is characterized by sequencing hundreds of thousands to millions of DNA molecules in parallel in a single run, generally with relatively short read lengths.
Mining candidate genes for important traits is a major research topic in animal and plant genetics and breeding, and is important for molecular-assisted breeding, including genomic selection and gene editing. High-throughput transcriptome sequencing has become one of the mainstream methods used in genetics and breeding to mine candidate genes for important traits.
For quantitative traits, the prior art does not make full use of the phenotype information of individuals: continuously varying data are simply treated as categorical traits, which reduces gene mining efficacy and increases the false positive rate. It is therefore necessary to develop a high-throughput transcriptome sequencing data and trait association analysis method that improves gene mining efficacy and reduces the false positive rate.
Disclosure of Invention
Based on the above, to solve the problem that the prior art does not make full use of the phenotype information of individuals and simply treats continuously varying data as categorical traits, thereby reducing gene mining efficacy and increasing the false positive rate, the invention provides a high-throughput transcriptome sequencing data and trait association analysis system and method. The specific technical scheme is as follows:
A high-throughput transcriptome sequencing data and trait association analysis system, comprising:
a data acquisition module for acquiring high-throughput transcriptome sequencing data and trait phenotype values of a subject;
an expression level acquisition module for obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data;
a fitting module for fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene by a linear regression model;
and a solving and analysis module for solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait.
Further, the expression of the linear regression model is $y = \mu 1 + \sum_{i=1}^{m} X_i b_i + e$, wherein $y$ is the vector of trait phenotype values, $\mu 1$ is the population mean term, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual.
Further, the solving and analysis module comprises an algorithm unit for solving the linear regression model according to an elastic net algorithm.
Further, the objective function minimized by the elastic net algorithm is [formula not reproduced in this text], wherein λ and α are both tuning parameters.
The invention also provides a high-throughput transcriptome sequencing data and trait association analysis method, comprising the following steps:
obtaining high-throughput transcriptome sequencing data and trait phenotype values of a subject;
obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data;
fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene by a linear regression model;
solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait.
In this method, the normalized expression level of each gene of the subject is obtained from the high-throughput transcriptome sequencing data and is regressed against the trait phenotype values, so the subject's phenotype information is fully used. This addresses the problem in the prior art that individual phenotype information is not fully utilized and continuously varying data are simply treated as categorical traits, which reduces gene mining efficacy and increases the false positive rate. Candidate genes for important traits can therefore be mined effectively, with improved gene mining efficacy and a lower false positive rate.
Further, the expression of the linear regression model is $y = \mu 1 + \sum_{i=1}^{m} X_i b_i + e$, wherein $y$ is the vector of trait phenotype values, $\mu 1$ is the population mean term, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual.
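To make the roles of the symbols concrete, the following is a minimal sketch that generates trait values from an expression matrix under the model as read from the symbol definitions, y = mu*1 + sum_i X_i b_i + e. The function name `simulate_trait`, the effect sizes, the mean, and the noise level are all hypothetical illustrations, not values from the patent.

```python
import random

def simulate_trait(X, b, mu, noise_sd, rng):
    """Return y = mu*1 + sum_i X_i b_i + e for an n x m expression matrix X.

    X is a list of n rows (subjects), each with m gene expression levels.
    b holds the m partial regression coefficients, mu is the population mean,
    and e is drawn as independent Gaussian noise (an assumption of this sketch).
    """
    y = []
    for row in X:
        expression_part = sum(x_ij * b_j for x_ij, b_j in zip(row, b))
        y.append(mu + expression_part + rng.gauss(0.0, noise_sd))
    return y

rng = random.Random(0)
n, m = 5, 3
X = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
b = [0.8, 0.0, -0.5]  # hypothetical effects; the second gene has no effect
y = simulate_trait(X, b, mu=10.0, noise_sd=0.0, rng=rng)
```

With `noise_sd=0.0` the residual vanishes, so each trait value is exactly the mean plus the weighted sum of that subject's expression levels.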
Further, the linear regression model is solved according to an elastic net algorithm.
Further, the objective function minimized by the elastic net algorithm is [formula not reproduced in this text], wherein λ and α are both tuning parameters.
Further, before fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene, the trait phenotype value and the expression level of each gene are standardized so that the mean and the variance of the expression level of each gene are 0 and 1, respectively.
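The standardization step can be sketched as follows. This is a minimal illustration, assuming the population variance (divisor n) and assuming that a gene with constant expression is mapped to zeros; the text does not specify either choice, and the function names are hypothetical.

```python
import math

def standardize(values):
    """Center a list of numbers to mean 0 and scale it to variance 1."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance (assumed)
    if var == 0.0:
        # A constant gene carries no information; avoid dividing by zero.
        return [0.0] * n
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in values]

def standardize_columns(X):
    """Standardize each gene (column) of an n x m expression matrix."""
    columns = [standardize(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*columns)]
```

The trait phenotype vector can be passed through `standardize` in the same way, so that trait and expression values are on comparable scales before fitting.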
The invention also provides a computer readable storage medium storing a computer program which when executed by a processor implements a high throughput transcriptome sequencing data and trait association analysis method as described above.
Drawings
The invention will be further understood from the following description taken in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a schematic overall flow chart of a method for high throughput transcriptome sequencing data and trait association analysis in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples thereof in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
It will be understood that when an element is referred to as being "fixed to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for illustrative purposes only and are not meant to be the only embodiment.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The terms "first" and "second" in this specification do not denote a particular quantity or order, but rather are used for distinguishing between similar or identical items.
In one embodiment of the present invention, a high-throughput transcriptome sequencing data and trait association analysis system comprises:
a data acquisition module for acquiring high-throughput transcriptome sequencing data and trait phenotype values of a subject;
an expression level acquisition module for obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data;
a fitting module for fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene by a linear regression model;
and a solving and analysis module for solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait.
In one embodiment, the expression of the linear regression model is $y = \mu 1 + \sum_{i=1}^{m} X_i b_i + e$, wherein $y$ is the vector of trait phenotype values, $\mu 1$ is the population mean term, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual.
In one embodiment, the solving and analysis module includes an algorithm unit for solving the linear regression model according to an elastic net algorithm.
In one example, the trait is chicken breast muscle weight. The experiment randomly selects 400 test chickens, measures the breast muscle weight of each of the 400 chickens, and performs high-throughput transcriptome sequencing using chicken breast muscle as the sample.
In one embodiment, the objective function minimized by the elastic net algorithm is [formula not reproduced in this text], where λ and α are both tuning parameters, $Xb$ denotes the sum $\sum_{i=1}^{m} X_i b_i$, and $(y - \mu 1 - Xb)'(y - \mu 1 - Xb)$ is the product of the transpose of the residual vector and the residual vector.
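For orientation only, one widely used elastic net parameterization (the form used by the glmnet software) that contains the residual term and the two tuning parameters described here is shown below. The factor $1/(2n)$, with $n$ the number of subjects, and the way λ and α split the L1 and L2 penalties are assumptions and may differ from the formula in the patent.

```latex
\min_{\mu,\, b}\;
\frac{1}{2n}\,(y - \mu 1 - Xb)'(y - \mu 1 - Xb)
\;+\;
\lambda \left[ \alpha \lVert b \rVert_1 + \frac{1 - \alpha}{2} \lVert b \rVert_2^2 \right],
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1
```

Under this parameterization α = 1 gives the Lasso penalty and α = 0 gives the ridge penalty, while λ sets the overall strength of the penalty.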
In one embodiment, as shown in FIG. 1, the present invention provides a high-throughput transcriptome sequencing data and trait association analysis method, comprising the following steps:
obtaining high-throughput transcriptome sequencing data and trait phenotype values of a subject;
obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data;
fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene by a linear regression model;
solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait.
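The text does not say how the normalized expression level in the second step is computed. As one common possibility, assumed here purely for illustration, raw per-gene read counts can be converted to log2 counts per million (CPM); the function name `log_cpm` and the pseudocount of 1 are hypothetical choices.

```python
import math

def log_cpm(counts):
    """Convert one sample's per-gene read counts to log2(CPM + 1).

    CPM scales each count by the sample's total read count so that samples
    sequenced to different depths become comparable.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return [math.log2(c / total * 1_000_000 + 1.0) for c in counts]
```

Other normalizations (for example length-aware measures such as TPM or FPKM) would slot into the same step; the later regression only needs one normalized value per gene per subject.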
In this method, the normalized expression level of each gene of the subject is obtained from the high-throughput transcriptome sequencing data and is regressed against the trait phenotype values, so the subject's phenotype information is fully used. This addresses the problem in the prior art that individual phenotype information is not fully utilized and continuously varying data are simply treated as categorical traits, which reduces gene mining efficacy and increases the false positive rate. Candidate genes for important traits can therefore be mined effectively, with improved gene mining efficacy and a lower false positive rate.
In one embodiment, the expression of the linear regression model is $y = \mu 1 + \sum_{i=1}^{m} X_i b_i + e$, wherein $y$ is the vector of trait phenotype values, $\mu 1$ is the population mean term, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual.
In one embodiment, the linear regression model is solved according to an elastic net algorithm. The elastic net is a regression algorithm that combines Lasso regression and ridge regression: it controls the influence of individual coefficients on the result by adding an L1 regularization term and an L2 regularization term to the loss function. The elastic net retains the stability of ridge regression during the iterative solution process, which further improves gene mining efficacy and reduces the false positive rate.
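A minimal sketch of how such a model can be solved and how the associated genes are read off is given below. It uses cyclic coordinate descent with soft-thresholding, a standard way to fit an elastic net, and assumes the glmnet-style objective (1/(2n))·RSS + λ[α‖b‖₁ + (1−α)/2·‖b‖₂²]; neither the solver nor this scaling is confirmed by the text. It also assumes the columns of X are already centered, so that the mean of y is the estimate of μ. All function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net(X, y, lam, alpha, n_iter=200, tol=1e-8):
    """Fit y = mu*1 + Xb + e by cyclic coordinate descent (X columns centered)."""
    n, m = len(X), len(X[0])
    mu = sum(y) / n
    b = [0.0] * m
    resid = [yi - mu for yi in y]  # y - mu*1 - Xb, starting from b = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(m)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(m):
            if col_sq[j] == 0.0:
                continue  # a constant (all-zero after centering) gene stays at 0
            # Covariance of gene j with the residual that excludes gene j's own fit.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return mu, b

def associated_genes(b):
    """Indices of genes whose estimated effect is non-zero."""
    return [j for j, bj in enumerate(b) if bj != 0.0]

# Toy data: the trait depends only on the first of two uncorrelated genes.
X = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [3.0, 7.0, 3.0, 7.0]
mu, b = elastic_net(X, y, lam=0.5, alpha=1.0)
print(associated_genes(b))
```

The L1 part of the penalty is what sets some coefficients exactly to zero, which is why "genes with non-zero effects" is a usable selection rule; the L2 part shrinks correlated genes together rather than picking one arbitrarily.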
In one embodiment, the objective function minimized by the elastic net algorithm is [formula not reproduced in this text], wherein λ and α are both tuning parameters.
In one embodiment, λ and α are determined by cross-validation, and are chosen as tuning parameter values whose cross-validation error is no greater than the minimum error plus one standard error.
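This selection criterion resembles the "one-standard-error rule". A minimal sketch follows, assuming the cross-validation errors and their standard errors have already been computed for each candidate λ (at a fixed α), and assuming that among eligible candidates the largest λ is preferred, as in glmnet's `lambda.1se`; the text does not state the tie-breaking choice, and the function name and numbers are hypothetical.

```python
def one_se_choice(candidates):
    """Pick lambda by the one-standard-error rule.

    candidates: list of (lam, mean_cv_error, standard_error) tuples.
    Returns the largest lam whose mean CV error is no greater than the
    minimum mean CV error plus that minimum's standard error.
    """
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + best[2]
    eligible = [c for c in candidates if c[1] <= threshold]
    # Prefer the strongest penalty among eligible settings (sparser model).
    return max(eligible, key=lambda c: c[0])[0]

# Hypothetical cross-validation results: (lambda, mean error, standard error).
cv_results = [(0.01, 1.00, 0.05), (0.1, 0.98, 0.04), (0.5, 1.01, 0.05), (1.0, 1.20, 0.06)]
chosen = one_se_choice(cv_results)
```

Choosing a stronger penalty than the error-minimizing one trades a statistically negligible loss in fit for fewer selected genes, which is in line with the stated goal of reducing false positives.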
In one embodiment, before fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene, the trait phenotype value and the expression level of each gene are standardized so that the mean and the variance of the expression level of each gene are 0 and 1, respectively.
In one embodiment, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the high-throughput transcriptome sequencing data and trait association analysis method described above.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (6)

1. A high-throughput transcriptome sequencing data and trait association analysis system, comprising:
a data acquisition module for acquiring high-throughput transcriptome sequencing data and trait phenotype values of a subject;
an expression level acquisition module for obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data;
a fitting module for fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene by a linear regression model;
a solving and analysis module for solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait;
wherein the expression of the linear regression model is $y = \mu 1 + \sum_{i=1}^{m} X_i b_i + e$, wherein $y$ is the vector of trait phenotype values, $\mu 1$ is the population mean term, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual;
and the solving and analysis module comprises an algorithm unit for solving the linear regression model according to an elastic net algorithm.
2. The high-throughput transcriptome sequencing data and trait association analysis system of claim 1, wherein the objective function minimized by the elastic net algorithm is [formula not reproduced in this text], wherein λ and α are both tuning parameters.
3. A high-throughput transcriptome sequencing data and trait association analysis method, comprising the following steps:
obtaining high-throughput transcriptome sequencing data and trait phenotype values of a subject;
obtaining the normalized expression level of each gene of the subject from the high-throughput transcriptome sequencing data;
fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene by a linear regression model;
solving the linear regression model and taking all genes with non-zero effects as genes associated with the trait;
wherein the expression of the linear regression model is $y = \mu 1 + \sum_{i=1}^{m} X_i b_i + e$, wherein $y$ is the vector of trait phenotype values, $\mu 1$ is the population mean term, $X_i$ is the expression level of the $i$-th gene, $b_i$ is the partial regression coefficient of the $i$-th gene's expression level on the trait phenotype value, $m$ is the number of genes, and $e$ is the residual;
and the linear regression model is solved according to an elastic net algorithm.
4. The high-throughput transcriptome sequencing data and trait association analysis method according to claim 3, wherein the objective function minimized by the elastic net algorithm is [formula not reproduced in this text], wherein λ and α are both tuning parameters.
5. The method according to claim 4, wherein, before fitting the relationship between the trait phenotype value of the subject and the normalized expression level of each gene, the trait phenotype value and the expression level of each gene are standardized so that the mean and the variance of the expression level of each gene are 0 and 1, respectively.
6. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements a high-throughput transcriptome sequencing data and trait association analysis system according to any one of claims 1 to 2.
CN202110081269.7A 2021-01-21 2021-01-21 High-throughput transcriptome sequencing data and trait association analysis system and method Active CN112837751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110081269.7A CN112837751B (en) 2021-01-21 2021-01-21 High-throughput transcriptome sequencing data and trait association analysis system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110081269.7A CN112837751B (en) 2021-01-21 2021-01-21 High-throughput transcriptome sequencing data and trait association analysis system and method

Publications (2)

Publication Number Publication Date
CN112837751A CN112837751A (en) 2021-05-25
CN112837751B true CN112837751B (en) 2024-02-09

Family

ID=75929202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110081269.7A Active CN112837751B (en) 2021-01-21 2021-01-21 High-throughput transcriptome sequencing data and trait association analysis system and method

Country Status (1)

Country Link
CN (1) CN112837751B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012009952A1 (en) * 2010-07-22 2012-01-26 深圳华大基因科技有限公司 Quality control method and apparatus for rna sequencing of gene expression
CN108004302A (en) * 2017-12-12 2018-05-08 中国农业科学院麻类研究所 A kind of association analysis method of transcript profile reference and its application
CN109182538A (en) * 2018-09-29 2019-01-11 南京农业大学 Mastadenitis of cow key SNPs site rs88640083 and 2b-RAD Genotyping and analysis method
CN110564832A (en) * 2019-09-12 2019-12-13 广东省农业科学院动物科学研究所 Genome breeding value estimation method based on high-throughput sequencing platform and application

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130289890A1 (en) * 2012-04-30 2013-10-31 International Business Machines Corporation Rank Normalization for Differential Expression Analysis of Transcriptome Sequencing Data
US10385394B2 (en) * 2013-03-15 2019-08-20 The Translational Genomics Research Institute Processes of identifying and characterizing X-linked disorders

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012009952A1 (en) * 2010-07-22 2012-01-26 深圳华大基因科技有限公司 Quality control method and apparatus for rna sequencing of gene expression
CN108004302A (en) * 2017-12-12 2018-05-08 中国农业科学院麻类研究所 A kind of association analysis method of transcript profile reference and its application
CN109182538A (en) * 2018-09-29 2019-01-11 南京农业大学 Mastadenitis of cow key SNPs site rs88640083 and 2b-RAD Genotyping and analysis method
CN110564832A (en) * 2019-09-12 2019-12-13 广东省农业科学院动物科学研究所 Genome breeding value estimation method based on high-throughput sequencing platform and application

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
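Quantitative Allelic Test-A Fast Test for Very Large Association Studies; Lee, SM, et al.; GENETIC EPIDEMIOLOGY; Vol. 37 (No. 8); pp. 831-839 *
Identification of long non-coding RNAs related to skeletal muscle differentiation based on high-throughput sequencing (in English); 杨琴, et al.; Journal of Chinese Pharmaceutical Sciences; Vol. 26 (No. 06); pp. 423-431 *
Research on bioinformatic mining of gene transcription and expression data; 郭安源; 中国科学:生命科学; Vol. 51 (No. 01); pp. 70-82 *
Research progress in genome-wide association analysis and genomic breeding of economic traits in sheep; 王俊杰, et al.; 家畜生态学报; Vol. 38 (No. 11); pp. 1-7, 20 *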

Also Published As

Publication number Publication date
CN112837751A (en) 2021-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant