CN108363902B

CN108363902B - Accurate prediction method for pathogenic genetic variation

Info

Publication number: CN108363902B
Application number: CN201810088147.9A
Authority: CN
Inventors: 李其刚; 赵科研; 马欣
Original assignee: Genomcan Inc
Current assignee: Genomcan Inc
Priority date: 2018-01-30
Filing date: 2018-01-30
Publication date: 2022-02-25
Anticipated expiration: 2038-01-30
Also published as: CN108363902A

Abstract

The invention discloses an accurate prediction method for pathogenic genetic variation. Known pathogenic variants are divided into two types: database variants and training set positive variants. Part of the evidence in the ACMG guideline is obtained from the database variants. Genetic variation data and corresponding phenotype data of patients are simulated from the training set positive variants by a random extraction method. Guideline-related features are calculated, phenotype-related features are calculated with an ERIC-based method, and both are combined with existing features relevant to pathogenicity assessment. A machine learning algorithm then predicts variant pathogenicity from genotype data and phenotype data together. The method addresses the problem that, in practice, variant pathogenicity cannot be predicted accurately because clinical phenotype data are incomplete, noisy, and imprecisely described.

Description

Accurate prediction method for pathogenic genetic variation

Technical Field

The invention relates to a prediction method, in particular to an accurate prediction method of pathogenic genetic variation.

Background

Genetic prediction for rare diseases is the process of finding, in a patient's genome, the pathogenic genetic variants that account for the patient's clinical phenotype. Whether this prediction can be made accurately and quickly affects the patient's subsequent treatment, care, and even survival. Accurately predicting pathogenic genetic variants is very difficult, however: in practice, clinical phenotype data are often incomplete, noisy, and imprecisely described, so variant pathogenicity cannot be predicted accurately.

Disclosure of Invention

To address the above shortcomings of the prior art, the invention provides an accurate prediction method for pathogenic genetic variation. It solves the problem that variant pathogenicity cannot be predicted accurately in practice because clinical phenotype data are incomplete, noisy, and imprecisely described.

In order to achieve the purpose of the invention, the invention adopts the technical scheme that:

A method for accurately predicting pathogenic genetic variation, comprising the following steps:

S1: collecting pathogenic variants that have been reported and confirmed, and classifying the known pathogenic variants into two categories according to their time of discovery: database variants and training set positive variants;

S2: obtaining evidence in the ACMG guideline from the database variants obtained in step S1;

S3: simulating genetic variation data and corresponding phenotype data of patients by a random extraction method, using the training set positive variants obtained in step S1;

S4: evaluating the simulated genetic variation data against the ACMG guideline evidence obtained in step S2 to obtain guideline-related features, thereby extracting the guideline-related features;

S5: calculating the similarity between the phenotype data of each simulated patient and the known phenotype set data of each gene by an ERIC-based calculation method to obtain phenotype-related features, thereby extracting the phenotype-related features;

S6: combining the guideline-related features obtained in step S4 and the phenotype-related features obtained in step S5 with existing features relevant to pathogenicity assessment, and using a machine learning algorithm to predict variant pathogenicity from genotype data and phenotype data together.
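Steps S2 and S4 can be illustrated with a minimal Python sketch. The text does not say which ACMG evidence items are derived from the database variants, so this sketch assumes two criteria that depend on previously established pathogenic variants: PS1 (same amino acid change as a known pathogenic variant) and PM5 (a different missense change at a residue where a pathogenic change has been seen). The `Variant` record, the function names, and the toy gene names are hypothetical.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    gene: str
    protein_pos: int  # amino-acid position in the protein
    ref_aa: str
    alt_aa: str

def build_evidence_index(database_variants):
    """Step S2 (assumed form): index known pathogenic changes by residue."""
    index = {}
    for v in database_variants:
        index.setdefault((v.gene, v.protein_pos), set()).add(v.alt_aa)
    return index

def guideline_features(variant, index):
    """Step S4 (assumed form): PS1/PM5-style binary features for one candidate."""
    known_alts = index.get((variant.gene, variant.protein_pos), set())
    ps1 = int(variant.alt_aa in known_alts)
    pm5 = int(bool(known_alts) and variant.alt_aa not in known_alts)
    return {"PS1": ps1, "PM5": pm5}

# Hypothetical database containing a single known pathogenic variant.
database = [Variant("GENE_A", 175, "R", "H")]
index = build_evidence_index(database)

same_change = guideline_features(Variant("GENE_A", 175, "R", "H"), index)
other_change = guideline_features(Variant("GENE_A", 175, "R", "C"), index)
unrelated = guideline_features(Variant("GENE_B", 10, "A", "T"), index)
```

Because the index is built only from the older database variants, a feature computed for a training or test variant never uses that variant's own label, which is presumably one reason for the time-based split in step S1.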

The invention has the beneficial effects that:

The guideline-based features improve the interpretability and accuracy of the prediction result. Random extraction of phenotypes simulates the complexity of clinical phenotypes more realistically, improving the reliability and clinical practicality of the prediction method. The ERIC-based phenotype similarity calculation better withstands the uncertainty caused by incomplete, imprecise, and noisy phenotypes, further improving the accuracy of the prediction method.

Further, the random extraction method used in step S3 to simulate the genetic variation data and corresponding phenotype data of patients comprises the following steps:

S3-1: randomly extracting W negative variants from population variants of individuals who are not rare disease patients, and inserting 1 known pathogenic variant from the training set positive variants, so that the W negative variants and the 1 positive pathogenic variant form the simulated genetic variation data of a patient;

S3-2: randomly extracting a phenotypes from the known phenotypes of the gene carrying the positive pathogenic variant, then randomly extracting b phenotypes and making them imprecise, and finally randomly extracting c unrelated noise phenotypes, so that the a + b + c phenotypes form the simulated phenotype data of the patient;

S3-3: repeating steps S3-1 to S3-2 to simulate the genetic variation data and corresponding phenotype data of all patients.
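Steps S3-1 to S3-3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the text does not define the imprecision treatment, so the sketch replaces a phenotype with a more general parent term (`parent_of`), and it draws the b imprecise phenotypes from the same gene's known phenotypes without overlapping the a exact ones. All record layouts, names, and toy data are hypothetical.

```python
import random

def simulate_patient(rng, negative_pool, positive_variants, gene_phenotypes,
                     parent_of, all_phenotypes, W, a, b, c):
    # S3-1: W negative variants plus exactly one known pathogenic variant.
    positive = rng.choice(positive_variants)
    variants = rng.sample(negative_pool, W) + [positive]
    rng.shuffle(variants)

    # S3-2: a exact phenotypes, b imprecise phenotypes, c noise phenotypes.
    known = gene_phenotypes[positive["gene"]]
    picked = rng.sample(known, a + b)
    exact, to_blur = picked[:a], picked[a:]
    # Assumed imprecision treatment: fall back to a more general parent term.
    imprecise = [parent_of.get(t, t) for t in to_blur]
    unrelated = [t for t in all_phenotypes if t not in known]
    noise = rng.sample(unrelated, c)
    return {"variants": variants, "causal": positive,
            "phenotypes": exact + imprecise + noise}

# Hypothetical toy inputs.
negative_pool = [{"id": f"neg{i}", "gene": f"G{i}"} for i in range(50)]
positive_variants = [{"id": "pos1", "gene": "GENE_A"}]
gene_phenotypes = {"GENE_A": ["seizure", "ataxia", "microcephaly", "hypotonia"]}
parent_of = {
    "seizure": "abnormal nervous system physiology",
    "ataxia": "abnormal movement",
    "microcephaly": "abnormal skull size",
    "hypotonia": "abnormal muscle tone",
}
all_phenotypes = gene_phenotypes["GENE_A"] + ["cardiomyopathy", "hearing loss", "cataract"]

rng = random.Random(0)
args = (negative_pool, positive_variants, gene_phenotypes, parent_of, all_phenotypes)
patient = simulate_patient(rng, *args, W=10, a=2, b=1, c=1)

# S3-3: repeat to build a cohort of simulated patients.
patients = [simulate_patient(rng, *args, W=10, a=2, b=1, c=1) for _ in range(5)]
```

Each simulated patient carries W + 1 candidate variants, of which exactly one is causal, and a + b + c phenotypes of mixed quality.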

The beneficial effects of the above further scheme are:

Randomly extracting phenotypes and then adding imprecision and noise reproduces the character of real clinical phenotypes, improving the reliability and clinical practicality of the prediction method.

Further, in step S5, the similarity between the simulated patient phenotype data and the known phenotype aggregate data for each gene is calculated by the following formula:

In the formula, t₁ and t₂ are two different clinical phenotypes of the simulated patient; T₁ is the set of simulated patient phenotypes; T₂ is the set of known phenotypes of a gene; sim(t₁, t₂) is the similarity between phenotypes t₁ and t₂.

Further, the similarity sim(t₁, t₂) between phenotypes is calculated by the formula:

sim(t₁, t₂) = 2·IC(t_MICA) - min(IC(t₁), IC(t₂))

where t_MICA is the most informative common ancestor node of phenotypes t₁ and t₂; IC(t_MICA) is the information content of that common ancestor; IC(t₁) and IC(t₂) are the information content of phenotypes t₁ and t₂, respectively.
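The pairwise formula can be made concrete with a short Python sketch. The toy ontology, its term names, and the information content values are hypothetical; the sketch also assumes a single root term, so that any two terms share at least one ancestor.

```python
def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the ontology."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def eric_similarity(t1, t2, parents, ic):
    """sim(t1, t2) = 2*IC(t_MICA) - min(IC(t1), IC(t2))."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    ic_mica = max((ic[t] for t in common), default=0.0)
    return 2 * ic_mica - min(ic[t1], ic[t2])

# Hypothetical ontology: child -> list of parents.
parents = {
    "root": [],
    "neuro": ["root"],
    "seizure": ["neuro"],
    "focal_seizure": ["seizure"],
    "ataxia": ["neuro"],
    "cardiac": ["root"],
}
# Hypothetical information content values (more specific terms score higher).
ic = {"root": 0.0, "neuro": 1.0, "seizure": 2.0,
      "focal_seizure": 3.0, "ataxia": 2.5, "cardiac": 1.5}

exact = eric_similarity("seizure", "seizure", parents, ic)
imprecise = eric_similarity("focal_seizure", "seizure", parents, ic)
sibling = eric_similarity("seizure", "ataxia", parents, ic)
noise = eric_similarity("seizure", "cardiac", parents, ic)
```

In this toy setup a term matched against its own parent scores the same as an exact match (2.0), a sibling term scores 0.0, and an unrelated term scores below zero (-1.5), which is consistent with the stated robustness to imprecise and noisy phenotypes.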

Further, the information content IC(t) of a phenotype t of the simulated patient is calculated by the formula:

IC(t) = log(N/N_t)

where N is the total number of genes and N_t is the total number of genes that cause phenotype t.
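As a worked illustration of this formula, the sketch below computes IC(t) from a gene-to-phenotype annotation table. The annotation data are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math

def information_content(gene_to_phenotypes):
    """IC(t) = log(N / N_t), with N genes in total and N_t genes causing t."""
    n_genes = len(gene_to_phenotypes)
    counts = {}
    for phenotypes in gene_to_phenotypes.values():
        for t in set(phenotypes):  # count each gene at most once per phenotype
            counts[t] = counts.get(t, 0) + 1
    return {t: math.log(n_genes / n_t) for t, n_t in counts.items()}

# Hypothetical annotations for four genes.
annotations = {
    "GENE_A": ["abnormality", "seizure", "ataxia"],
    "GENE_B": ["abnormality", "seizure"],
    "GENE_C": ["abnormality"],
    "GENE_D": ["abnormality"],
}
ic = information_content(annotations)
```

A phenotype caused by every gene carries no information (IC = 0), while a phenotype caused by a single gene out of four has IC = log 4, so rarer phenotypes weigh more in the similarity score.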

The beneficial effects of the above further scheme are:

The ERIC-based phenotype similarity calculation is more accurate and effectively withstands the influence of imprecise and noisy phenotypes, improving the accuracy of the prediction method.

Further, in step S6, a GBDT model in a machine learning algorithm is used to realize a prediction of mutation pathogenicity by comprehensively considering genotype data and phenotype data.

The beneficial effects of the above further scheme are:

The GBDT model is nonlinear. Compared with a linear model, it can better integrate information from multiple feature variables, improving the accuracy and practicality of the prediction method.
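The idea behind gradient boosting can be illustrated with a toy, dependency-free Python sketch that fits depth-1 trees to squared-error residuals. The text does not specify the library, loss function, tree depth, or hyperparameters, so everything here is an assumption for illustration; a practical system would more likely use an established implementation such as scikit-learn's `GradientBoostingClassifier`. The two features and the four training rows are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, thr, lmean, rmean)
    return best[1:] if best else None

def stump_predict(stump, row):
    j, thr, lmean, rmean = stump
    return lmean if row[j] <= thr else rmean

def fit_gbdt(X, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds it to the model."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        pred = [p + learning_rate * stump_predict(stump, row)
                for p, row in zip(pred, X)]
    return base, learning_rate, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)

# Hypothetical features per variant: [guideline evidence flag, phenotype similarity].
X = [[1, 2.0], [1, 0.0], [0, 2.0], [0, 0.0]]
y = [1, 0, 0, 0]  # only the variant supported by both features is pathogenic
model = fit_gbdt(X, y)
scores = [predict(model, row) for row in X]
```

In this toy data the variant supported by both the guideline feature and the phenotype feature receives the highest score, which shows how boosted trees accumulate evidence from several features through successive threshold splits.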

Drawings

FIG. 1 is a flow chart of a method for accurately predicting pathogenic genetic variation.

FIG. 2 shows the prediction results for the test set variants (variants newly discovered in 2016-2017).

FIG. 3 shows the rankings achieved by different methods under different phenotype sampling patterns.

FIG. 4 shows the rankings achieved by different methods on the pathogenic variants of the actual clinical dataset EJHG2017.

Detailed Description

The following description of embodiments is provided to help those skilled in the art understand the invention. The invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined in the appended claims, and everything created using the inventive concept is protected.

In an embodiment of the present invention, a method for accurately predicting a pathogenic genetic variation, as shown in fig. 1, includes the following steps:

S1: pathogenic variants that have been discovered and confirmed are collected from the ClinVar database and classified into three categories according to their time of discovery: database variants (discovered before 2013), training set positive variants (2013 to 2015), and test set positive variants (2016 to June 2017);

S2: ACMG guideline evidence is derived from the database variants to obtain the judgment basis for each evidence item;

S3: genetic variation data and corresponding phenotype data of 10,000 patients are simulated by a random extraction method, using the training set positive variants obtained in step S1;

The random extraction method comprises the following steps:

S3-3: steps S3-1 to S3-2 are repeated to simulate the genetic variation data and corresponding phenotype data of 10,000 patients.

the calculation formula for calculating the similarity between the simulated patient phenotype data and the known phenotype aggregate data for each gene is:

The similarity sim(t₁, t₂) between phenotypes is calculated by the formula:

sim(t₁, t₂) = 2·IC(t_MICA) - min(IC(t₁), IC(t₂))

where t_MICA is the most informative common ancestor node of phenotypes t₁ and t₂; IC(t_MICA) is the information content of that common ancestor; IC(t₁) and IC(t₂) are the information content of phenotypes t₁ and t₂, respectively.

The information content IC(t) of a phenotype t of the simulated patient is calculated by the formula:

IC(t) = log(N/N_t)

S6: the guideline-related features obtained in step S4 and the phenotype-related features obtained in step S5 are combined with other existing data that help predict pathogenic variants, such as CADD and PhyloP scores, as supplementary features, giving the feature values of each simulated genetic variant in every dimension; a GBDT model is then used to predict variant pathogenicity from genotype data and phenotype data together. Steps S3 to S6 are also applied to the test set positive variants, and the resulting pathogenicity predictions are used to evaluate this prediction method against other methods.
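The time-based partition of step S1, which produces the held-out test set used for this evaluation, could be sketched in Python as follows. The cutoff years come from the embodiment; the record layout is hypothetical, and year-level granularity is a simplification, since the embodiment's test set ends partway through 2017.

```python
def split_by_discovery_year(variants):
    """Partition known pathogenic variants by discovery year (step S1)."""
    groups = {"database": [], "train_positive": [], "test_positive": []}
    for v in variants:
        year = v["year"]
        if year < 2013:
            groups["database"].append(v)        # source of guideline evidence
        elif year <= 2015:
            groups["train_positive"].append(v)  # used to simulate training patients
        elif year <= 2017:
            groups["test_positive"].append(v)   # held out for evaluation
    return groups

# Hypothetical records.
records = [
    {"id": "v1", "year": 2010},
    {"id": "v2", "year": 2014},
    {"id": "v3", "year": 2016},
    {"id": "v4", "year": 2012},
]
groups = split_by_discovery_year(records)
```

Splitting by time rather than at random means the model is always evaluated on variants discovered after everything it was built from, which mirrors how the method would be used on newly observed variants.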

Example: To demonstrate the accuracy of the method, its performance was compared with that of existing methods on test data consisting of 830 pathogenic variants discovered from 2016 to 2017, as shown in FIG. 2. Many methods in common use predict pathogenicity from genotype information alone (Genotype Only), such as M-CAP, CADD, and MutationTaster. These methods rely mainly on the evolutionary conservation of gene sequences and on estimates of how strongly a variant affects the amino acids encoded in a protein. As FIG. 2 shows, the accuracy of this class of methods is more than 20% lower than that of a method that considers both genotype and phenotype (Exomiser). The method of the invention improves on the other methods by more than 30%, including Exomiser, which also considers both genotype and phenotype. In addition, using the phenotype features alone (Xrare_phenotype) or the guideline evidence features alone (Xrare_ACMG) also performs well, which indicates that the new phenotype similarity measure and the guideline-based features each improve model accuracy. FIG. 3 shows that the new phenotype similarity measure performs clearly better when phenotype information is missing or imprecise and when phenotype noise is present.

To further evaluate how the prediction method compares with other methods and with expert-guided (clinical-drive) analysis, the methods were compared on real clinical histories and genetic data. Fifty-four pathogenic sites verified by clinical experts and published in 2017 were used as the test set. The results in FIG. 4 show that the GBDT model performs noticeably better than the expert-guided method (clinical-drive).
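The figures report rankings, so one plausible way to score such a comparison is to rank each patient's candidate variants by predicted score and count how often the causal variant lands in the top k. The sketch below assumes that metric; the text does not define it precisely, and the function names and scores are hypothetical.

```python
def causal_rank(scores, causal_id):
    """1-based rank of the causal variant when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(causal_id) + 1

def top_k_rate(ranks, k):
    """Fraction of patients whose causal variant ranks within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Hypothetical predicted scores for three simulated patients.
cohort = [
    ({"pos": 0.9, "n1": 0.2, "n2": 0.1}, "pos"),
    ({"pos": 0.3, "n1": 0.8, "n2": 0.5}, "pos"),
    ({"pos": 0.6, "n1": 0.7, "n2": 0.1}, "pos"),
]
ranks = [causal_rank(scores, causal) for scores, causal in cohort]
top1 = top_k_rate(ranks, 1)
top3 = top_k_rate(ranks, 3)
```

A rank-based metric fits this setting because each patient has exactly one causal variant among many candidates, so what matters to a reviewing clinician is how far down the list that variant appears.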

The invention has the beneficial effects that:

The guideline-based features improve the interpretability and accuracy of the prediction result. Randomly extracting phenotypes and then adding imprecision and noise reproduces the character of real clinical phenotypes, improving the reliability and clinical practicality of the prediction method. The ERIC-based phenotype similarity calculation lets the prediction method better withstand the uncertainty caused by incomplete, imprecise, and noisy phenotypes, improving its accuracy. Adopting the nonlinear GBDT model further improves the accuracy and practicality of the prediction method.

Claims

1. An accurate prediction method of pathogenic genetic variation, which is characterized by comprising the following steps:

S5: calculating the similarity between the phenotype data of the simulated patient and the known phenotype set data of each gene by an ERIC-based calculation method to obtain phenotype-related features, thereby extracting the phenotype-related features;

2. The prediction method according to claim 1, wherein the random extraction method for simulating the genetic variation data and corresponding phenotype data of the patient in step S3 comprises the following steps:

3. The method of predicting according to claim 1, wherein the step S5 is performed by calculating the similarity between the simulated patient phenotype data and the known phenotype aggregate data for each gene according to the following formula:

4. The prediction method according to claim 3, wherein the similarity sim(t₁, t₂) between phenotypes is calculated by the formula:

sim(t₁, t₂) = 2·IC(t_MICA) - min(IC(t₁), IC(t₂))

5. The prediction method according to claim 4, wherein the information content IC(t) of a phenotype t of the simulated patient is calculated by the formula:

IC(t) = log(N/N_t)

where N is the total number of genes and N_t is the total number of genes that cause phenotype t.

6. The prediction method according to claim 1, wherein the prediction of the pathogenicity of the mutation comprehensively considering the genotype data and the phenotype data is implemented by using a GBDT model in a machine learning algorithm in step S6.