CN108363902B - Accurate prediction method for pathogenic genetic variation - Google Patents

Publication number
CN108363902B
Authority
CN
China
Prior art keywords
phenotype
data
variation
phenotypes
patient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810088147.9A
Other languages
Chinese (zh)
Other versions
CN108363902A (en)
Inventor
李其刚
赵科研
马欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genomcan Inc
Original Assignee
Genomcan Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genomcan Inc
Priority to CN201810088147.9A
Publication of CN108363902A
Application granted
Publication of CN108363902B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses an accurate prediction method for pathogenic genetic variation. Known pathogenic variants are divided into two types: database variants and training-set positive variants. Part of the evidence in the ACMG guideline is obtained from the database variants. Genetic variation data and corresponding phenotype data of patients are simulated from the training-set positive variants by a random extraction method, and guideline-related features are calculated. Phenotype-related features are calculated with an ERIC-based method. These features are combined with existing features relevant to pathogenicity assessment, and a machine learning algorithm predicts variant pathogenicity by jointly considering genotype data and phenotype data. The method addresses the problem that, in real-world settings, variant pathogenicity cannot be predicted accurately because clinical phenotype data are incomplete, noisy, and imprecisely described.

Description

Accurate prediction method for pathogenic genetic variation
Technical Field
The invention relates to a prediction method, in particular to an accurate prediction method of pathogenic genetic variation.
Background
Genetic prediction for rare diseases is the process of finding, in a patient's genome, the pathogenic genetic variants that account for the patient's clinical phenotype. Whether this prediction can be made accurately and quickly affects the patient's subsequent treatment and care, and even survival. Accurately predicting pathogenic genetic variants is very difficult, however: in real-world settings, clinical phenotype data are often incomplete, noisy, and imprecisely described, so the pathogenicity of a variant cannot be predicted accurately.
Disclosure of Invention
To address the above defects in the prior art, the invention provides an accurate prediction method for pathogenic genetic variation, which solves the problem that variant pathogenicity cannot be predicted accurately in real-world settings because clinical phenotype data are incomplete, noisy, and imprecisely described.
To achieve this purpose, the invention adopts the following technical scheme:
A method for accurately predicting pathogenic genetic variation, comprising the following steps:
S1: collecting reported and confirmed pathogenic variants, and classifying the known pathogenic variants into two categories according to their time of discovery: database variants and training-set positive variants;
S2: obtaining evidence in the ACMG guideline from the database variants obtained in step S1;
S3: simulating genetic variation data and corresponding phenotype data of patients by a random extraction method, based on the training-set positive variants obtained in step S1;
S4: evaluating the simulated genetic variation data against the ACMG guideline evidence obtained in step S2 to obtain ACMG-guideline-related features, thereby extracting the guideline-related features;
S5: calculating, with an ERIC-based method, the similarity between each simulated patient's phenotype data and the set of known phenotypes of each gene to obtain phenotype-related features, thereby extracting the phenotype-related features;
S6: combining the guideline-related features from step S4 and the phenotype-related features from step S5 with existing features relevant to pathogenicity assessment, and using a machine learning algorithm to predict variant pathogenicity by jointly considering genotype data and phenotype data.
The invention has the beneficial effects that:
The guideline-based features improve the interpretability and accuracy of the prediction result; random extraction of phenotypes simulates the complexity of clinical phenotypes more realistically, improving the reliability and clinical practicability of the prediction method; and the ERIC-based phenotype similarity calculation better withstands the uncertainty caused by incomplete, imprecise, and noisy phenotypes, further improving the accuracy of the prediction method.
Further, the random extraction method for simulating the genetic variation data and corresponding phenotype data of a patient in step S3 comprises the following steps:
S3-1: randomly extracting W negative variants from population variants of individuals without rare disease, inserting 1 known pathogenic variant drawn from the training-set positive variants, and forming the simulated genetic variation data of the patient from the W negative variants and the 1 positive pathogenic variant;
S3-2: randomly extracting a phenotypes from the known phenotypes of the gene carrying the positive pathogenic variant, then randomly extracting a further b phenotypes and making them imprecise, and finally randomly extracting c unrelated noise phenotypes, so that the a + b + c phenotypes form the simulated phenotype data of the patient;
S3-3: repeating steps S3-1 and S3-2 to simulate the genetic variation data and corresponding phenotype data of all patients.
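Steps S3-1 to S3-3 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy data, and the parameter defaults are hypothetical, and "making a phenotype imprecise" is assumed here to mean replacing a term with its less specific parent term, which the text does not spell out.

```python
import random

def simulate_patient(rng, negative_pool, positive_variants, gene_phenotypes,
                     parent_of, all_phenotypes, W=100, a=3, b=1, c=1):
    """Simulate one patient: W background variants plus 1 causal variant,
    and a + b + c phenotype terms (exact, imprecise, noise)."""
    # S3-1: W negative variants plus one known pathogenic variant.
    negatives = rng.sample(negative_pool, W)
    causal_id, causal_gene = rng.choice(positive_variants)
    variants = negatives + [causal_id]
    rng.shuffle(variants)

    # S3-2: draw a exact terms and b terms to blur from the gene's known phenotypes.
    known = list(gene_phenotypes[causal_gene])
    picked = rng.sample(known, min(a + b, len(known)))
    exact, to_blur = picked[:a], picked[a:a + b]
    # Assumption: imprecision = replace the term by its parent in the ontology.
    imprecise = [parent_of.get(term, term) for term in to_blur]
    # c noise terms drawn from phenotypes not known for the causal gene.
    unrelated = [t for t in all_phenotypes if t not in known]
    noise = rng.sample(unrelated, min(c, len(unrelated)))

    return {"variants": variants, "causal": causal_id,
            "phenotypes": exact + imprecise + noise}

def simulate_cohort(rng, n_patients, **kwargs):
    """S3-3: repeat the single-patient simulation n_patients times."""
    return [simulate_patient(rng, **kwargs) for _ in range(n_patients)]

# Toy inputs (all names are made up for illustration).
negative_pool = [f"neg{i}" for i in range(50)]
positive_variants = [("pos1", "GENE_A")]
gene_phenotypes = {"GENE_A": ["seizure", "ataxia", "microcephaly", "hypotonia"]}
parent_of = {"seizure": "neuro_abnormality", "ataxia": "neuro_abnormality",
             "microcephaly": "head_abnormality", "hypotonia": "muscle_abnormality"}
all_phenotypes = ["seizure", "ataxia", "microcephaly", "hypotonia",
                  "hearing_loss", "cataract", "short_stature"]

rng = random.Random(0)
settings = dict(negative_pool=negative_pool, positive_variants=positive_variants,
                gene_phenotypes=gene_phenotypes, parent_of=parent_of,
                all_phenotypes=all_phenotypes, W=10, a=2, b=1, c=1)
patient = simulate_patient(rng, **settings)
cohort = simulate_cohort(rng, 5, **settings)
```

Passing an explicit `random.Random` instance keeps the simulation reproducible, which matters when the same simulated cohort has to be reused to compare methods.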
The beneficial effects of the above further scheme are:
Random phenotype extraction, together with the imprecision and noise processing, reproduces the reality of clinical phenotypes and improves the reliability and clinical practicability of the prediction method.
Further, in step S5, the similarity between the simulated patient's phenotype data and the set of known phenotypes of each gene is calculated by the following formula:
Figure BDA0001562973150000031 (formula image, not reproduced in the text)
where $t_1$ and $t_2$ are two different clinical phenotypes; $T_1$ is the set of phenotypes of the simulated patient; $T_2$ is the set of known phenotypes of a gene; and $\mathrm{sim}(t_1, t_2)$ is the similarity between phenotypes $t_1$ and $t_2$.
Further, the similarity $\mathrm{sim}(t_1, t_2)$ between phenotypes is calculated as:
$\mathrm{sim}(t_1, t_2) = 2\,\mathrm{IC}(t_{\mathrm{MICA}}) - \min(\mathrm{IC}(t_1), \mathrm{IC}(t_2))$
where $t_{\mathrm{MICA}}$ is the most informative common ancestor node of phenotypes $t_1$ and $t_2$; $\mathrm{IC}(t_{\mathrm{MICA}})$ is the information content of that common ancestor; and $\mathrm{IC}(t_1)$ and $\mathrm{IC}(t_2)$ are the information contents of phenotypes $t_1$ and $t_2$, respectively.
Further, the information content $\mathrm{IC}(t)$ of a phenotype $t$ is calculated as:
$\mathrm{IC}(t) = \log(N / N_t)$
where $N$ is the total number of genes and $N_t$ is the total number of genes that cause phenotype $t$.
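The pairwise similarity and information content formulas can be tried out on a toy ontology, as in the sketch below. Everything in it is hypothetical: the term names, gene names, and helper functions are made up. It also makes three assumptions the text does not state: the natural logarithm is used, a gene annotated to a term also counts toward all ancestors of that term, and a term counts as its own ancestor. The set-level formula is not implemented because it survives only as an image reference.

```python
import math

# Toy ontology (hypothetical): term -> list of parent terms.
PARENTS = {
    "root": [],
    "neuro": ["root"],
    "eye": ["root"],
    "seizure": ["neuro"],
    "ataxia": ["neuro"],
    "cataract": ["eye"],
}

# Toy direct gene-to-phenotype annotations (hypothetical gene names).
DIRECT_GENES = {
    "seizure": {"G1", "G2"},
    "ataxia": {"G2", "G3"},
    "neuro": {"G5"},
    "cataract": {"G4"},
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def genes_for(term):
    """Genes annotated to the term or to any of its descendants."""
    return {g for t, genes in DIRECT_GENES.items()
            if term in ancestors(t) for g in genes}

N = len(genes_for("root"))  # total number of genes in the toy data

def ic(term):
    """Information content IC(t) = log(N / N_t); natural log is an assumption."""
    n_t = len(genes_for(term))
    if n_t == 0:
        raise ValueError(f"no gene is annotated to {term!r}")
    return math.log(N / n_t)

def mica(t1, t2):
    """Most informative common ancestor: the shared ancestor with the highest IC."""
    return max(ancestors(t1) & ancestors(t2), key=ic)

def sim(t1, t2):
    """sim(t1, t2) = 2 * IC(t_MICA) - min(IC(t1), IC(t2)), as given in the text."""
    return 2 * ic(mica(t1, t2)) - min(ic(t1), ic(t2))

for pair in [("seizure", "seizure"), ("seizure", "ataxia"), ("seizure", "cataract")]:
    print(pair, round(sim(*pair), 3))
```

Note that the formula as written can return negative values, for example when the only shared ancestor is the root; whether such values are clipped in practice is not stated in the text.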
The beneficial effects of the above further scheme are:
The ERIC-based phenotype similarity calculation is more accurate and effectively withstands the influence of imprecise and noisy phenotypes, improving the accuracy of the prediction method.
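A short derivation from the formulas above illustrates how the measure treats an exact, an imprecise, and an unrelated phenotype. It is a sketch under assumptions the text does not state: a term counts as its own ancestor, annotations propagate to ancestor terms (so a parent $p$ of $t$ has $N_p \ge N_t$ and hence $\mathrm{IC}(p) \le \mathrm{IC}(t)$), and the ontology root is associated with all $N$ genes (so $\mathrm{IC}(\mathrm{root}) = 0$).

$$
\begin{aligned}
\text{exact match } (t_{\mathrm{MICA}} = t):\quad & \mathrm{sim}(t, t) = 2\,\mathrm{IC}(t) - \mathrm{IC}(t) = \mathrm{IC}(t) \\
\text{imprecise, parent } p \text{ of } t \ (t_{\mathrm{MICA}} = p):\quad & \mathrm{sim}(p, t) = 2\,\mathrm{IC}(p) - \min(\mathrm{IC}(p), \mathrm{IC}(t)) = \mathrm{IC}(p) \\
\text{unrelated } (t_{\mathrm{MICA}} = \mathrm{root}):\quad & \mathrm{sim}(t_1, t_2) = -\min(\mathrm{IC}(t_1), \mathrm{IC}(t_2)) \le 0
\end{aligned}
$$

Under these assumptions an imprecise term lowers the score only from $\mathrm{IC}(t)$ to $\mathrm{IC}(p)$, while a noise term that shares nothing but the root with the gene's phenotypes contributes a non-positive score.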
Further, in step S6, a GBDT (gradient boosting decision tree) model is used as the machine learning algorithm to predict variant pathogenicity by jointly considering genotype data and phenotype data.
The beneficial effects of the above further scheme are:
The GBDT model is nonlinear and, compared with a linear model, can better integrate information from multiple feature variables, improving the accuracy and practicability of the prediction method.
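To make the idea of gradient boosting concrete, the sketch below trains a tiny booster in pure Python. It is a simplified stand-in, not the model used in the patent: each round fits a depth-1 tree (a stump) to the negative gradient of the log loss, whereas practical work would normally use an established library such as scikit-learn, XGBoost, or LightGBM. The three feature columns and all values are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Find the single split (feature, threshold) minimising squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]  # (feature index, threshold, left value, right value)

def stump_predict(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value

def fit_gbdt(X, y, n_rounds=10, learning_rate=0.5):
    """Boost stumps on the log-loss gradient (label minus predicted probability)."""
    p = sum(y) / len(y)
    f0 = math.log(p / (1 - p))          # initial score: log-odds of the base rate
    scores = [f0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return f0, learning_rate, stumps

def predict_proba(model, row):
    f0, learning_rate, stumps = model
    return sigmoid(f0 + learning_rate * sum(stump_predict(s, row) for s in stumps))

# Hypothetical features: [guideline evidence score, phenotype similarity, conservation score]
X = [[2.0, 0.9, 25.0], [1.5, 0.8, 22.0], [2.5, 0.7, 30.0], [0.0, 0.9, 24.0],
     [2.0, 0.1, 26.0], [0.0, 0.2, 5.0], [0.5, 0.1, 8.0], [0.2, 0.3, 12.0]]
y = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = pathogenic, 0 = benign (toy labels)

model = fit_gbdt(X, y)
probabilities = [predict_proba(model, row) for row in X]
```

In the toy data a variant is labelled pathogenic only when both the guideline evidence and the phenotype match are strong, so no single feature threshold separates the classes; the booster has to combine splits on several features.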
Drawings
FIG. 1 is a flow chart of the method for accurately predicting pathogenic genetic variation.
FIG. 2 is a graph of the prediction results on the test-set variants (variants newly discovered in 2016-2017).
FIG. 3 is a graph of the rankings produced by different methods under different phenotype sampling patterns.
FIG. 4 is a graph of the rankings produced by different methods for the pathogenic variants in the real clinical data set EJHG2017.
Detailed Description
The following description of embodiments is provided to help those skilled in the art understand the invention. The invention is not limited to the scope of these embodiments: it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined in the appended claims, and everything that makes use of the inventive concept is protected.
In an embodiment of the invention, a method for accurately predicting pathogenic genetic variation, as shown in FIG. 1, comprises the following steps:
S1: collecting discovered and confirmed pathogenic variants from the ClinVar database and classifying them into three categories according to their time of discovery: database variants (discovered before 2013), training-set positive variants (2013 to 2015), and test-set positive variants (2016 to June 2017);
S2: deriving the ACMG guideline evidence from the database variants to obtain the judgment basis for each evidence item;
S3: simulating genetic variation data and corresponding phenotype data of 10,000 patients by a random extraction method, based on the training-set positive variants obtained in step S1;
The random extraction method comprises the following steps:
S3-1: randomly extracting W negative variants from population variants of individuals without rare disease, inserting 1 known pathogenic variant drawn from the training-set positive variants, and forming the simulated genetic variation data of the patient from the W negative variants and the 1 positive pathogenic variant;
S3-2: randomly extracting a phenotypes from the known phenotypes of the gene carrying the positive pathogenic variant, then randomly extracting a further b phenotypes and making them imprecise, and finally randomly extracting c unrelated noise phenotypes, so that the a + b + c phenotypes form the simulated phenotype data of the patient;
S3-3: repeating steps S3-1 and S3-2 to simulate the genetic variation data and corresponding phenotype data of 10,000 patients.
S4: evaluating the simulated genetic variation data against the ACMG guideline evidence obtained in step S2 to obtain ACMG-guideline-related features, thereby extracting the guideline-related features;
S5: calculating, with an ERIC-based method, the similarity between each simulated patient's phenotype data and the set of known phenotypes of each gene to obtain phenotype-related features, thereby extracting the phenotype-related features;
The similarity between the simulated patient's phenotype data and the set of known phenotypes of each gene is calculated by the following formula:
Figure BDA0001562973150000051 (formula image, not reproduced in the text)
where $t_1$ and $t_2$ are two different clinical phenotypes; $T_1$ is the set of phenotypes of the simulated patient; $T_2$ is the set of known phenotypes of a gene; and $\mathrm{sim}(t_1, t_2)$ is the similarity between phenotypes $t_1$ and $t_2$.
The similarity $\mathrm{sim}(t_1, t_2)$ between phenotypes is calculated as:
$\mathrm{sim}(t_1, t_2) = 2\,\mathrm{IC}(t_{\mathrm{MICA}}) - \min(\mathrm{IC}(t_1), \mathrm{IC}(t_2))$
where $t_{\mathrm{MICA}}$ is the most informative common ancestor node of phenotypes $t_1$ and $t_2$; $\mathrm{IC}(t_{\mathrm{MICA}})$ is the information content of that common ancestor; and $\mathrm{IC}(t_1)$ and $\mathrm{IC}(t_2)$ are the information contents of phenotypes $t_1$ and $t_2$, respectively.
The information content $\mathrm{IC}(t)$ of a phenotype $t$ is calculated as:
$\mathrm{IC}(t) = \log(N / N_t)$
where $N$ is the total number of genes and $N_t$ is the total number of genes that cause phenotype $t$.
S6: combining the guideline-related features from step S4 and the phenotype-related features from step S5 with other existing data that help predict pathogenic variants, such as CADD and PhyloP scores, as supplementary features, to obtain the features of each simulated genetic variant in every dimension, and using a GBDT model to predict variant pathogenicity by jointly considering genotype data and phenotype data; steps S3 to S6 are also applied to the test-set positive variants, and the resulting predictions are used to evaluate the performance of this prediction method against other methods.
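The time-based split in step S1 of this embodiment can be sketched as follows. The `Variant` record and its fields are hypothetical, and the test-set window is read here as January 2016 through June 2017, which is one interpretation of the date range given in step S1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    variant_id: str   # hypothetical identifier, e.g. a database accession
    gene: str
    year: int         # year the variant was first reported
    month: int = 1    # month the variant was first reported

def assign_split(v: Variant) -> str:
    """Map a variant to 'database', 'train', 'test', or 'unused' by discovery date."""
    if v.year < 2013:
        return "database"            # used to derive guideline evidence
    if v.year <= 2015:
        return "train"               # training-set positive variants
    if v.year == 2016 or (v.year == 2017 and v.month <= 6):
        return "test"                # test-set positive variants
    return "unused"

def split_variants(variants):
    """Group variants by split name, preserving input order."""
    splits = {"database": [], "train": [], "test": [], "unused": []}
    for v in variants:
        splits[assign_split(v)].append(v)
    return splits

variants = [
    Variant("v1", "GENE_A", 2010),
    Variant("v2", "GENE_B", 2014, 5),
    Variant("v3", "GENE_C", 2017, 3),
    Variant("v4", "GENE_D", 2017, 9),
]
splits = split_variants(variants)
```

Splitting by discovery date rather than at random keeps every test variant strictly newer than anything used to build the evidence base or train the model, which mimics how the method would meet newly reported variants in practice.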
Example: To demonstrate the high accuracy of the present method, its performance was compared with that of other existing methods on test data consisting of 830 pathogenic variants discovered in 2016 to 2017, as shown in FIG. 2. Many methods commonly used in the industry, such as M-CAP, CADD, and MutationTaster, predict pathogenicity from genotype data alone (Genotype Only). These methods rely mainly on the evolutionary conservation of gene sequences and on the estimated functional impact on the encoded amino acids. As FIG. 2 shows, the accuracy of this type of method is more than 20% lower than that of a method that considers both genotype and phenotype (Exomiser). The results also show that the method of the invention improves markedly, by more than 30%, on the other methods, including the method that considers both genotype and phenotype (Exomiser). Furthermore, models using only the phenotype features (Xrare_phenotype) or only the guideline evidence features (Xrare_ACMG) also perform well, which shows that the new phenotype similarity measure and the guideline-based features both improve model accuracy.
FIG. 3 shows that the new phenotype similarity measure retains a clear advantage when phenotype information is missing or imprecise and when phenotype noise is present.
To further evaluate how the predictions of this method differ from those of other methods and of expert-driven (clinical-drive) analysis, the methods were compared on real clinical histories and genetic data. 54 pathogenic sites verified by clinical experts and published in 2017 were used as the test set, and the results in FIG. 4 show that the GBDT model performs clearly better than the expert-driven (clinical-drive) method.
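Comparisons of this kind are commonly summarised by where each method ranks the known causal variant among all candidate variants of a patient. The sketch below computes such a top-k rate. It is an assumption that the figures report a rank-based metric of this form; the text does not define its accuracy measure, and the patient records, scores, and function names are invented for illustration.

```python
def rank_of_causal(scores, causal_id):
    """1-based rank of the causal variant when variants are sorted by descending score.

    Ties keep the insertion order of the scores dict (Python's sort is stable).
    """
    ordered = sorted(scores, key=lambda variant_id: scores[variant_id], reverse=True)
    return ordered.index(causal_id) + 1

def top_k_rate(patients, k):
    """Fraction of patients whose causal variant is ranked within the top k."""
    hits = sum(1 for p in patients if rank_of_causal(p["scores"], p["causal"]) <= k)
    return hits / len(patients)

# Toy predictions: per patient, a predicted pathogenicity score for every variant.
patients = [
    {"causal": "v1", "scores": {"v1": 0.9, "v2": 0.2, "v3": 0.1}},
    {"causal": "v5", "scores": {"v4": 0.8, "v5": 0.7, "v6": 0.1}},
    {"causal": "v9", "scores": {"v7": 0.6, "v8": 0.5, "v9": 0.4}},
    {"causal": "v10", "scores": {"v10": 0.95, "v11": 0.3}},
]

top1 = top_k_rate(patients, 1)
top3 = top_k_rate(patients, 3)
```

Running the same function over the scores produced by each method on the same patients gives directly comparable numbers.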
The invention has the beneficial effects that:
The guideline-based features improve the interpretability and accuracy of the prediction result; random extraction of phenotypes, together with the imprecision and noise processing, reproduces the reality of clinical phenotypes and improves the reliability and clinical practicability of the prediction method; the ERIC-based phenotype similarity calculation enables the prediction method to better withstand the uncertainty caused by incomplete, imprecise, and noisy phenotypes, improving its accuracy; and the nonlinear GBDT model further improves the accuracy and practicability of the prediction method.

Claims (6)

1. An accurate prediction method for pathogenic genetic variation, characterized by comprising the following steps:
S1: collecting reported and confirmed pathogenic variants, and classifying the known pathogenic variants into two categories according to their time of discovery: database variants and training-set positive variants;
S2: obtaining evidence in the ACMG guideline from the database variants obtained in step S1;
S3: simulating genetic variation data and corresponding phenotype data of patients by a random extraction method, based on the training-set positive variants obtained in step S1;
S4: evaluating the simulated genetic variation data against the ACMG guideline evidence obtained in step S2 to obtain ACMG-guideline-related features, thereby extracting the guideline-related features;
S5: calculating, with an ERIC-based method, the similarity between each simulated patient's phenotype data and the set of known phenotypes of each gene to obtain phenotype-related features, thereby extracting the phenotype-related features;
S6: combining the guideline-related features from step S4 and the phenotype-related features from step S5 with existing features relevant to pathogenicity assessment, and using a machine learning algorithm to predict variant pathogenicity by jointly considering genotype data and phenotype data.
2. The prediction method according to claim 1, wherein the random extraction method for simulating the genetic variation data and corresponding phenotype data of a patient in step S3 comprises the following steps:
S3-1: randomly extracting W negative variants from population variants of individuals without rare disease, inserting 1 known pathogenic variant drawn from the training-set positive variants, and forming the simulated genetic variation data of the patient from the W negative variants and the 1 positive pathogenic variant;
S3-2: randomly extracting a phenotypes from the known phenotypes of the gene carrying the positive pathogenic variant, then randomly extracting a further b phenotypes and making them imprecise, and finally randomly extracting c unrelated noise phenotypes, so that the a + b + c phenotypes form the simulated phenotype data of the patient;
S3-3: repeating steps S3-1 and S3-2 to simulate the genetic variation data and corresponding phenotype data of all patients.
3. The prediction method according to claim 1, wherein in step S5 the similarity between the simulated patient's phenotype data and the set of known phenotypes of each gene is calculated by the following formula:
Figure FDA0001562973140000021 (formula image, not reproduced in the text)
where $t_1$ and $t_2$ are two different clinical phenotypes; $T_1$ is the set of phenotypes of the simulated patient; $T_2$ is the set of known phenotypes of a gene; and $\mathrm{sim}(t_1, t_2)$ is the similarity between phenotypes $t_1$ and $t_2$.
4. The prediction method according to claim 3, wherein the similarity $\mathrm{sim}(t_1, t_2)$ between phenotypes is calculated as:
$\mathrm{sim}(t_1, t_2) = 2\,\mathrm{IC}(t_{\mathrm{MICA}}) - \min(\mathrm{IC}(t_1), \mathrm{IC}(t_2))$
where $t_{\mathrm{MICA}}$ is the most informative common ancestor node of phenotypes $t_1$ and $t_2$; $\mathrm{IC}(t_{\mathrm{MICA}})$ is the information content of that common ancestor; and $\mathrm{IC}(t_1)$ and $\mathrm{IC}(t_2)$ are the information contents of phenotypes $t_1$ and $t_2$, respectively.
5. The prediction method according to claim 4, wherein the information content $\mathrm{IC}(t)$ of a phenotype $t$ is calculated as:
$\mathrm{IC}(t) = \log(N / N_t)$
where $N$ is the total number of genes and $N_t$ is the total number of genes that cause phenotype $t$.
6. The prediction method according to claim 1, wherein in step S6 a GBDT model is used as the machine learning algorithm to predict variant pathogenicity by jointly considering genotype data and phenotype data.
CN201810088147.9A 2018-01-30 2018-01-30 Accurate prediction method for pathogenic genetic variation Active CN108363902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810088147.9A CN108363902B (en) 2018-01-30 2018-01-30 Accurate prediction method for pathogenic genetic variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810088147.9A CN108363902B (en) 2018-01-30 2018-01-30 Accurate prediction method for pathogenic genetic variation

Publications (2)

Publication Number Publication Date
CN108363902A (en) 2018-08-03
CN108363902B (en) 2022-02-25

Family

ID=63007672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810088147.9A Active CN108363902B (en) 2018-01-30 2018-01-30 Accurate prediction method for pathogenic genetic variation

Country Status (1)

Country Link
CN (1) CN108363902B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493917A (en) * 2018-09-02 2019-03-19 上海市儿童医院 A kind of evil component level calculation method of gene mutation harmfulness predicted value
CN111862091A (en) * 2020-08-05 2020-10-30 昆山杜克大学 Early syndrome discovery system based on phenotype measurement
CN112863605A (en) * 2021-02-03 2021-05-28 中国人民解放军总医院第七医学中心 Platform, method, computer device and medium for determining dysnoesia genes
CN112951324A (en) * 2021-02-05 2021-06-11 广州医科大学 Pathogenic synonymous mutation prediction method based on undersampling
CN113241118A (en) * 2021-07-12 2021-08-10 法玛门多(常州)生物科技有限公司 Method for predicting harmfulness of gene mutation
CN116343913B (en) * 2023-03-15 2023-11-14 昆明市延安医院 Analysis method for predicting potential pathogenic mechanism of single-gene genetic disease based on phenotype semantic association gene cluster regulation network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016123692A1 (en) * 2015-02-04 2016-08-11 The University Of British Columbia Methods and devices for analyzing particles
CN106980749A (en) * 2017-02-21 2017-07-25 成都奇恩生物科技有限公司 The quick assisted location method of disease
CN107169310A (en) * 2017-03-20 2017-09-15 上海基银生物科技有限公司 A kind of genetic test construction of knowledge base method and system
CN107341366A (en) * 2017-07-19 2017-11-10 西安交通大学 A kind of method that complex disease susceptibility loci is predicted using machine learning

Also Published As

Publication number Publication date
CN108363902A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN108363902B (en) Accurate prediction method for pathogenic genetic variation
CN109931678B (en) Air conditioner fault diagnosis method based on deep learning LSTM
CN110491441A (en) A kind of gene sequencing data simulation system and method for simulation crowd background information
CN107943874A (en) Knowledge mapping processing method, device, computer equipment and storage medium
CN103186716A (en) Metagenomics-based unknown pathogeny rapid identification system and analysis method
Effenberger et al. Measuring difficulty of introductory programming tasks
CN105512510B (en) A method of genetic force is assessed by genomic data
US11449781B2 (en) Plant abnormality prediction system and method
CN112466416B (en) Material data cleaning method combining nickel-based alloy priori knowledge
CN113889274B (en) Method and device for constructing risk prediction model of autism spectrum disorder
CN113673811B (en) On-line learning performance evaluation method and device based on session
CN107688727B (en) Method and device for identifying transcript subtypes in biological sequence clustering and full-length transcription group
CN101710364A (en) Method for calculating and identifying protein-RNA interaction sites
TWI542828B (en) Running simulator
CN113918471A (en) Test case processing method and device and computer readable storage medium
CN114297642A (en) Side channel attack method based on data aggregation
CN111210278A (en) Coal industry stock price prediction method based on time series
CN109101793A (en) A kind of personal identification method and system based on static text keystroke characteristic
CN107292213B (en) Handwriting quantitative inspection and identification method
CN114612453B (en) Method for detecting foundation surface defects based on deep learning and sparse representation model
CN109187772A (en) It is applied to the method for impact elasticity wave analysis based on speech recognition
CN117393171B (en) Method and system for constructing prediction model of LARS development track after rectal cancer operation
CN110021357A (en) Simulate cancer gene group sequencing data generating means
CN116312798B (en) Metagenome sequencing data species verification method and application
CN102272762B (en) Interaction force variation prediction device and interaction force variation prediction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant