CN113241118A

CN113241118A - Method for predicting harmfulness of gene mutation

Info

Publication number: CN113241118A
Application number: CN202110782580.4A
Authority: CN
Inventors: 谢敬聃; 王腾蛟; 王志伟; 郝建龙; 喻东
Original assignee: Famundo Changzhou Biotechnology Co ltd
Current assignee: Famundo Changzhou Biotechnology Co ltd
Priority date: 2021-07-12
Filing date: 2021-07-12
Publication date: 2021-08-10

Abstract

The invention provides a method for predicting the harmfulness of gene mutations, comprising the following steps: providing a public mutation database; acquiring a plurality of mutation data sets from the public mutation database, wherein the mutation data sets are divided, in order of publication time from earliest to most recent, into a first data set, a second data set and a third data set; annotating the first data set with a plurality of features; training with the plurality of features and the second data set to obtain a training model; and obtaining mutation prediction data for the third data set from the training model, wherein, if the mutation prediction data is consistent with the mutation data in the third data set, the training model is an effective model. The method greatly improves the accuracy and efficiency of predicting the harmfulness of mutant genes.

Description

Method for predicting harmfulness of gene mutation

Technical Field

The invention relates to the field of bioinformatics, in particular to a method for predicting harmfulness of gene mutation.

Background

Current standard methods for interpreting gene sequence mutations rely on time-consuming manual integration of multiple data sources, including extensive database and literature searches, the use of computational methods, and multiple rounds of review. The prior art lacks an effective and highly accurate method for predicting the harmfulness of gene mutations. The development of next-generation sequencing technology has opened new avenues for modern genomics research; however, the cost of whole genome sequencing and the complexity of its analysis remain obstacles for researchers. With the completion of the Human Genome Project in 2003 and the successive mapping of the whole genomes of other species, it was found that the genetic difference between different populations is only 1%, manifested mainly as differences in exons. Exons, the protein-coding regions, are important functional sequences in DNA. They account for only about 1% of the human genome, but cover most of the functional variation associated with an individual's phenotype and are associated with about 85% of the gene mutations that cause disease in humans. The exome is the collective term for all exon regions of the human genome. Whole exome sequencing (also called targeted exome capture) selectively sequences the coding regions of the human genome, thereby identifying abnormal genes related to rare and common diseases.

In the past, the pathogenic genes of Mendelian diseases were mostly identified by linkage analysis combined with candidate-gene methods, which are both time-consuming and have a low success rate. In complex diseases, conventional association studies have identified a large number of common variants, but have limited ability to detect low-frequency and rare variants. Exome sequencing technology emerged against this background; it has greatly advanced research into the genetic variation underlying disease and overcome the difficulties of traditional approaches.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a method for predicting the harmfulness of gene mutations. Through sound data partitioning, feature annotation, model training and evaluation, the method improves the accuracy of predicting the harmfulness of gene mutations and yields an efficient procedure for predicting the harmfulness of mutant genes.

To achieve the above and other objects, the present invention includes the following technical solutions. The invention first provides a method for predicting the harmfulness of gene mutations, comprising the following steps: providing a public mutation database; acquiring a plurality of mutation data sets from the public mutation database, wherein the mutation data sets are divided, in order of publication time from earliest to most recent, into a first data set, a second data set and a third data set; annotating the first data set with a plurality of features; training with the plurality of features and the second data set to obtain a training model; obtaining mutation prediction data for the third data set from the training model, wherein, if the mutation prediction data is consistent with the mutation data in the third data set, the training model is an effective model; and predicting the harmfulness of the gene mutation based on the effective model.

In one embodiment, the first data set comprises positive mutation data and negative mutation data, the second data set is positive mutation data, and the third data set is positive mutation data.

In one embodiment, the publishing time of the first data set, the publishing time of the second data set and the publishing time of the third data set are continuous, and the publishing time span of the second data set and the third data set is 3-5 years.

In one embodiment, the plurality of features includes features associated with population allele frequencies, gene-phenotype similarity scores, features based on ACMG/AMP guidelines, gene-level constraint scores, existing computational pathogenicity prediction scores, features describing the functional effect of the variant, and database-related gene-level features.

In one embodiment, the gene-phenotype similarity score is computed using the ERIC method. The ERIC method uses [symbol not reproduced] to express the distance between phenotypes t₁ and t₂, while setting the distance [symbol not reproduced] to 0. The calculation of [symbol not reproduced] is carried out as follows:

calculating the information content [formula not reproduced], wherein N is the total number of genes and n is the number of genes associated with phenotype t;

using [symbol not reproduced] to represent the distance between phenotype t and an ancestor of phenotype t, wherein [symbol not reproduced] represents an ancestor of phenotype t, and the calculation formula of [symbol not reproduced] is: [formula not reproduced];

calculating, based on this formula, the distance between phenotypes t₁ and t₂ as: [formula not reproduced], or as: [formula not reproduced], wherein [symbol not reproduced] represents the most informative phenotype among the common ancestral phenotypes of t₁ and t₂;

the distance between phenotypes t₁ and t₂, [symbol not reproduced], is then expressed as: [formula not reproduced].

In an embodiment, the training with the plurality of features and the second data set to obtain the training model further includes: training the model using simulated patients and phenotypes from negative samples of the UK10K database.

In an embodiment, the step of obtaining mutation prediction data for the third data set from the training model, wherein consistency of the mutation prediction data with the mutation data in the third data set indicates that the training model is an effective model, further includes: processing the mutation data in the third data set using simulated patients and phenotypic data from negative samples of the UK10K database before the consistency comparison with the mutation prediction data.

In an embodiment, the training with the plurality of features and the second data set to obtain the training model includes: training a model using a gradient boosting algorithm to predict the harmfulness of the gene mutation.

In another aspect, the present invention also provides a computer-readable storage medium storing computer-executable instructions for performing the method described above.

Yet another aspect of the present invention provides a system, comprising: one or more processors; a memory; and one or more programs; wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising computer-executable instructions for performing the method as described above.

As described above, the method predicts the pathogenicity of a subject's gene mutation sequence using a trained machine learning model; it is highly automated, and its predictions are accurate and efficient. By combining features of multiple dimensions, such as the phenotype-phenotype similarity score and the gene-phenotype association score, with the XGBoost algorithm, the method achieves a detection accuracy of over 90 percent, which is higher than that of various other software such as demosser, mutatranfer and clinicaly _ drivers.

Drawings

FIG. 1 shows a flow chart of the method of the present invention.

FIG. 2 is a process diagram of a prediction method according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the description provided herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Please refer to fig. 1-2. The present invention provides, in a first aspect, a method of gene mutation hazard prediction, which in some embodiments is a computer-implemented method performed at an electronic device having at least one processor and memory. In some embodiments, the genetic mutation is a human genetic mutation. The method comprises the following steps of S1-S6:

S1: providing a mutation public database;

S2: acquiring a plurality of mutation data sets according to the mutation public database, wherein the mutation data sets are sequentially divided into a first data set, a second data set and a third data set from far to near according to the published time sequence;

S3: annotating the first data set with a plurality of features;

S4: training by using the plurality of features and the second data set to obtain a training model;

S5: obtaining mutation prediction data of a third data set according to the training model, wherein the mutation prediction data is consistent with mutation data in the third data set, and the training model is an effective model;

S6: predicting the harmfulness of the gene mutation based on the effective model.

In step S1, the mutation public database may be any one of ClinVar, OMIM, HGMD, CIViC, ClinGen, dbVar, DGV and DECIPHER, and more particularly may be the ClinVar mutation database. In one embodiment, 49021 known pathogenic mutation sites in the ClinVar mutation database may be used for model training. In some embodiments, the mutation can be an acquired or a deleted gene mutation.

Referring to fig. 1, in step S2, the mutation data sets may include a first data set, a second data set and a third data set. The data sets may be divided according to the time at which the mutation data were published in the mutation public database, and the first, second and third data sets may be continuous in time. The first data set contains the mutation data from the earliest period. In some embodiments, the first data set may be mutation data taken from the 2012 database, the second data set may be mutation data from 2013 to 2015, and the third data set may be mutation data from 2016 to 2017. In an embodiment, the time span of the second data set and the third data set may be 3 to 5 years, and the time span may be chosen according to the update schedule of the particular mutation public database so as to obtain the most appropriate amount of data and thereby improve the accuracy of the model. The first data set may include positive mutation data and negative mutation data, the second data set may be positive mutation data, and the third data set may be positive mutation data; the positive mutation data in the third data set may include new mutation data for model evaluation. In one embodiment, the first data set includes 41590 mutation data, the second data set includes 6576 positive mutation data, and the third data set includes 855 positive mutation data, of which 417 are new mutation data.

Referring to fig. 1, in an embodiment, the first data set and the second data set may serve as model training sets, and the third data set may serve as a model evaluation set for evaluating the accuracy of the training model obtained from the training sets. The new mutation data published in the third data set relative to the second data set can be used to evaluate the accuracy of the model of the invention. The method takes the mutation data in the first data set as the base training data, performs model training with the mutation data in the second data set, and on this basis obtains prediction data on the harmfulness of new mutant genes.
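The time-ordered division described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `MutationRecord` type, its field names and the toy records are hypothetical, and treating the first data set as "everything published up to and including 2012" is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationRecord:
    variant_id: str   # hypothetical identifier
    year: int         # year the record was published in the database
    pathogenic: bool  # True for a positive (pathogenic) record


def split_by_publication_year(records, first_end=2012, second_end=2015, third_end=2017):
    """Split records into (first, second, third) data sets by publication year."""
    # Assumption: the first set keeps both positive and negative records.
    first = [r for r in records if r.year <= first_end]
    # The second and third sets keep positive records only, as in the embodiment.
    second = [r for r in records if first_end < r.year <= second_end and r.pathogenic]
    third = [r for r in records if second_end < r.year <= third_end and r.pathogenic]
    return first, second, third


# Toy records, invented for illustration only.
records = [
    MutationRecord("v1", 2011, True),
    MutationRecord("v2", 2012, False),
    MutationRecord("v3", 2014, True),
    MutationRecord("v4", 2014, False),
    MutationRecord("v5", 2016, True),
]
first, second, third = split_by_publication_year(records)
```

Splitting by publication time rather than at random keeps the evaluation set strictly later than the training data, so the evaluation measures how well the model anticipates mutations reported after it was trained.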

Referring to fig. 1, in step S3, the plurality of features used to annotate the first data set may span several categories, such as population-frequency features based on databases such as ExAC and 1KG, features based on ACMG/AMP guidelines, features based on interactome-predicted phenotypes, features based on ERIC phenotype similarity scores, gene-level constraints, and functional consequences. In one example, the features can include 6 features associated with population allele frequency, 5 gene-phenotype similarity scores, 15 features based on ACMG/AMP guidelines, 9 gene-level constraint scores, 12 existing computational pathogenicity prediction scores, 2 features describing the functional effect of the variant, and 2 database-related gene-level features. The gene mutations in the training data are annotated with the plurality of features described in the present invention. These features are assigned to each gene mutation sequence, which is then used for model training, so that the pathogenicity of a gene mutation sequence to be tested can be predicted by the trained model.

The population allele frequency features are drawn mainly from 3 population frequency databases: the 1000 Genomes Project, ESP and ExAC. The 15 features based on the ACMG/AMP guidelines use an optimized scoring system, including: PVS1: 6, PS1: 4, PM1: 2, PM2: 2, PM4: 2, PM5: 2, PP2: 1, PP3: 1, BA1: -9, BS1: -3, BS2: 3, BP3: 1, BP4: 1, BP7: 2. The 12 existing computational methods for predicting pathogenicity include Polyphen2_HVAR, LRT, MutationTaster, GERP++_RS, phyloP, SPIDEX, CADD and DANN, with data derived from the dbNSFP and ANNOVAR databases. The 9 gene-level constraint scores are derived from the ExAC project (pDom, pRec, syn_z, mis_z and pLI) and from the CADD database (CADD_a1qt005, CADD_a1qt01, CADD_a2qt005 and CADD_a2qt01). The 2 features describing the functional effect of the variant are derived from the annotation results of the VEP tool and the prediction results of the LOFTEE algorithm, respectively. The 2 database-related gene-level features are the curated rare-disease-related gene entry information and the similarity score between genes in the STRING database. The 5 gene-phenotype similarity scores are the phenotype similarity score calculated by the ERIC algorithm, the gene-phenotype similarity score, the phenotype ranking score calculated by the ERIC algorithm, the gene-phenotype similarity ranking score, and the ERIC normalized score, as described below.
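As a minimal sketch of how these feature groups might be laid out for annotation, the snippet below records the group sizes stated in the description and encodes the ACMG/AMP criteria as weighted indicator features. The dictionary keys and the function name are hypothetical, the weights are copied exactly as printed in this text (which lists 14 criteria), and the "weight if triggered, otherwise 0" encoding is an assumption; the text does not state how the weights enter the model.

```python
# Feature group sizes as stated in the description (keys are hypothetical names).
FEATURE_GROUPS = {
    "population_allele_frequency": 6,
    "gene_phenotype_similarity": 5,
    "acmg_amp": 15,
    "gene_level_constraint": 9,
    "computational_pathogenicity": 12,
    "functional_effect": 2,
    "database_gene_level": 2,
}

# Weights copied as printed in the text; only 14 criteria are listed there.
ACMG_WEIGHTS = {
    "PVS1": 6, "PS1": 4, "PM1": 2, "PM2": 2, "PM4": 2, "PM5": 2,
    "PP2": 1, "PP3": 1, "BA1": -9, "BS1": -3, "BS2": 3, "BP3": 1,
    "BP4": 1, "BP7": 2,
}


def acmg_feature_vector(triggered):
    """Return one value per criterion: its weight if triggered, otherwise 0.

    Assumption: this indicator-times-weight encoding is illustrative only.
    """
    unknown = set(triggered) - set(ACMG_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return {name: (weight if name in triggered else 0)
            for name, weight in ACMG_WEIGHTS.items()}


total_features = sum(FEATURE_GROUPS.values())
vector = acmg_feature_vector({"PVS1", "PM2", "BS1"})
```

The group sizes sum to 51, which matches the number of features used as model input in the training step.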

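Referring to FIG. 1, in step S3, the ERIC method uses [symbol not reproduced] to express the distance between phenotypes t₁ and t₂, while setting the distance [symbol not reproduced] to 0. The calculation of [symbol not reproduced] is carried out as follows:

calculating the information content [formula not reproduced], wherein N is the total number of genes and n is the number of genes associated with phenotype t;

using [symbol not reproduced] to represent an ancestor of phenotype t, wherein the calculation formula of [symbol not reproduced] is: [formula not reproduced]; or: [formula not reproduced], wherein [symbol not reproduced] represents the most informative phenotype among the common ancestral phenotypes of t₁ and t₂;

the distance between phenotypes t₁ and t₂, [symbol not reproduced], is then expressed as: [formula not reproduced].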

By calculating the distance between different phenotypes, the correlation between phenotypes can be obtained, and from it the correlation between genes and phenotypes.
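The ERIC formulas themselves are not legible in this text, so the following is only a stand-in that illustrates the general idea of an information-content-based phenotype distance. It assumes the common definition IC(t) = -log(n/N) and substitutes a Jiang-Conrath-style distance, IC(t₁) + IC(t₂) - 2·IC(MICA), where MICA is the most informative common ancestor; this is not claimed to be the ERIC formula. The toy ontology, the gene counts and all function names are hypothetical.

```python
import math

# Hypothetical toy phenotype ontology: term -> list of parent terms.
PARENTS = {
    "root": [],
    "abnormal_heart": ["root"],
    "abnormal_eye": ["root"],
    "septal_defect": ["abnormal_heart"],
    "arrhythmia": ["abnormal_heart"],
}
# Hypothetical number of genes associated with each phenotype (n) out of N genes.
GENE_COUNTS = {"root": 100, "abnormal_heart": 20, "abnormal_eye": 10,
               "septal_defect": 5, "arrhythmia": 4}
TOTAL_GENES = 100


def information_content(term):
    """Assumed definition: IC(t) = -log(n / N)."""
    return -math.log(GENE_COUNTS[term] / TOTAL_GENES)


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def most_informative_common_ancestor(t1, t2):
    """Common ancestor of t1 and t2 with the highest information content."""
    return max(ancestors(t1) & ancestors(t2), key=information_content)


def distance(t1, t2):
    """Jiang-Conrath-style distance (stand-in); equals 0 when t1 == t2."""
    mica = most_informative_common_ancestor(t1, t2)
    return (information_content(t1) + information_content(t2)
            - 2 * information_content(mica))


d_heart = distance("septal_defect", "arrhythmia")
d_far = distance("septal_defect", "abnormal_eye")
```

In this toy example two heart phenotypes share an informative ancestor and end up closer to each other than a heart phenotype and an eye phenotype, whose only common ancestor is the uninformative root.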

Referring to fig. 1, in step S4, training is performed using the plurality of features and the second data set to obtain a training model; that is, the feature-annotated first data set is trained together with the second data set to predict whether a gene mutation is harmful.

In step S4, the gene mutations in the training data are annotated with the plurality of features described in the present invention; all of the feature information is then integrated, and a gradient boosting algorithm is used to train a model that predicts the pathogenicity of a mutation. The gradient boosting decision tree (GBDT) algorithm is an ensemble of classification and regression tree models, and its principle is as follows:

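If there are K trees, the model is

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

i.e., a set of prediction trees. A prediction tree $f(x)$ is defined as

$$f(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{d} \to \{1, 2, \dots, T\},$$

where $w$ is the vector of scores on the leaves of the tree, the function $q$ assigns each data point to a leaf, and $T$ is the number of leaves.

The objective function includes 2 parts:

$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$

where the training loss function $l(y_i, \hat{y}_i)$ is the logistic loss for logistic regression.

In addition, the regularization term $\Omega(f)$ is

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,$$

where $T$ is the number of leaves and $w_j$ is the score of the $j$-th leaf.

The model is trained using an additive strategy: the trees already learned are fixed, and one new tree is added at a time. The objective at step $t$ is:

$$\text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t).$$

Taking the first- and second-order Taylor expansion of the loss function, the objective at step $t$ becomes:

$$\text{obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where

$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}.$$

After dropping the terms that do not depend on $f_t$, the objective at step $t$ can be simplified to:

$$\text{obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T,$$

where

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i, \qquad I_j = \{\, i \mid q(x_i) = j \,\}.$$

By minimizing this expression, the optimal leaf scores $w_j^{*}$ and the optimal objective value are obtained:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.$$

Now that there is a way to score a tree, a strategy for growing a tree with a small objective value is described next. Growing a tree is a process that iteratively splits one leaf into two leaves. The following formula is used to determine whether a leaf should be split:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma.$$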
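To make the leaf-scoring and splitting rule concrete, the sketch below assumes the standard gradient boosting closed forms: the optimal leaf score -G/(H + λ) and the split gain ½[G_L²/(H_L + λ) + G_R²/(H_R + λ) - (G_L + G_R)²/(H_L + H_R + λ)] - γ, where G and H are the sums of first and second derivatives of the logistic loss over the samples in a leaf. The function names and the four toy samples are hypothetical.

```python
import math


def logistic_grad_hess(y, margin):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)


def leaf_weight(G, H, lam):
    """Optimal leaf score for gradient sum G and Hessian sum H."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return G * G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain from splitting one leaf into a left and a right leaf."""
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
                  - leaf_score(GL + GR, HL + HR, lam)) - gamma


# Toy data: two positive and two negative samples, all with raw score 0.
labels = [1, 1, 0, 0]
grads_hess = [logistic_grad_hess(y, 0.0) for y in labels]

# Candidate split: the two positives go left, the two negatives go right.
GL = sum(g for g, _ in grads_hess[:2])
HL = sum(h for _, h in grads_hess[:2])
GR = sum(g for g, _ in grads_hess[2:])
HR = sum(h for _, h in grads_hess[2:])

gain = split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0)
left_weight = leaf_weight(GL, HL, lam=1.0)
right_weight = leaf_weight(GR, HR, lam=1.0)
```

A positive gain means the split lowers the regularized objective; here the split separates the classes cleanly, so the left leaf receives a positive score and the right leaf a negative one. A larger γ raises the bar a split must clear, which is how tree growth is pruned.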

A specific implementation may be as follows: the 51 features serve as x, and the label variable y indicates whether a variant is associated with a phenotype; the prediction model is trained using the GBDT algorithm in the Python XGBoost toolkit. In model training, max_depth and min_child_weight are used to control the complexity of the model, and the subsample and lambda parameters are used to reduce overfitting; the final parameters, such as the learning rate, maximum tree depth and sampling rate, are determined by grid search. In some embodiments, 10-fold cross-validation may also be employed to prevent overfitting.
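A minimal sketch of such a training setup is given below, assuming the scikit-learn style interface of the Python XGBoost package (`XGBClassifier`) together with scikit-learn's `StratifiedKFold` and `cross_val_score`. The grid values, the number of trees, the AUC scoring choice and the function names are illustrative assumptions, not values taken from this text. The third-party imports are placed inside `grid_search` so that the grid helper can be used without those packages installed.

```python
from itertools import product

# Hypothetical search grid; the values are illustrative, not taken from the source.
PARAM_GRID = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 6],
    "min_child_weight": [1, 5],
    "subsample": [0.8, 1.0],
    "reg_lambda": [1.0, 5.0],  # the "lambda" regularization parameter
}


def iter_settings(grid):
    """Yield every combination of grid values as a keyword dictionary."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(X, y, grid=PARAM_GRID, n_splits=10):
    """Return (best_mean_auc, best_params) using k-fold cross-validation.

    X is an (n_samples, 51) feature matrix and y holds the 0/1 labels.
    Requires the third-party packages xgboost and scikit-learn.
    """
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from xgboost import XGBClassifier

    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best = None
    for params in iter_settings(grid):
        model = XGBClassifier(objective="binary:logistic", n_estimators=200, **params)
        score = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        if best is None or score > best[0]:
            best = (score, params)
    return best


settings = list(iter_settings(PARAM_GRID))
```

Calling `grid_search(X, y)` on an annotated feature matrix would evaluate each of the 32 settings with 10-fold cross-validation and return the best one; stratified folds are used here so that each fold keeps roughly the same ratio of positive to negative records.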

Referring to fig. 1, in step S4, in order to improve the accuracy of the model, the training with the plurality of features and the second data set may further include training the model using simulated patients and phenotypes from negative samples of the UK10K database. This further removes negative mutation phenotype data, making the model more accurate and reliable and improving prediction efficiency.

Referring to fig. 2, in some embodiments, the method extracts a first data set 101, a second data set 102 and a third data set 103 from the ClinVar database 100, and annotates the first data set 101 with features 200 to obtain the annotated data set 104. Model training 106 is performed on the annotated data set 104 using the second data set 102 and the phenotypes 105 of simulated negative sampling, and the model training 106 step is repeated through iteration 300 to finally generate a training model 107. The training model 107 outputs mutation prediction data 108, and model evaluation 109 is finally performed on the mutation prediction data 108 using the evaluation data set, i.e., the third data set 103, which may also be processed with the phenotypes 105 of simulated negative sampling before evaluation. If the mutation prediction data 108 is consistent with the true mutation results in the third data set 103, the training model 107 is shown to be an effective model that can be used to evaluate the harmfulness of gene mutations.

Referring to fig. 1, in step S5, the trained model may be evaluated using the third data set. That is, the trained model is used to obtain the positive genes whose mutation may give rise to the relevant phenotype, and these are compared with the known positive genes disclosed in the third data set. If the positive genes obtained by the model are consistent with the positive genes in the third data set, the method is shown to be effective. Consistency may mean that the error in the predicted mutated genes is within 10%, that is, the accuracy of the prediction method of the present invention may exceed 90%. In some embodiments, before being compared with the mutation prediction data, the mutation data in the third data set may also be processed using simulated patients and phenotypic data from negative samples of the UK10K database; using the third data set processed in this way may improve the accuracy and timeliness of the prediction method.
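The consistency check can be sketched as below. The text does not define how the 10% error is measured, so this example assumes it is the fraction of known positive genes in the third data set that the model fails to recover; the function names and the gene identifiers are hypothetical.

```python
def positive_recall(predicted_positive, known_positive):
    """Fraction of known positive genes that the model also predicts as positive."""
    known = set(known_positive)
    if not known:
        raise ValueError("known_positive must not be empty")
    return len(known & set(predicted_positive)) / len(known)


def is_effective(predicted_positive, known_positive, max_error=0.10):
    """Assumed criterion: the model is effective if at most max_error is missed."""
    return 1.0 - positive_recall(predicted_positive, known_positive) <= max_error


# Hypothetical gene identifiers for illustration only.
known = [f"GENE{i}" for i in range(10)]
predicted_good = known[:9] + ["OTHER1"]  # recovers 9 of the 10 known positives
predicted_poor = known[:8]               # recovers 8 of the 10 known positives

good_ok = is_effective(predicted_good, known)
poor_ok = is_effective(predicted_poor, known)
```

Because the third data set contains positive records only, a recall-style measure is the natural choice here; false positives would have to be assessed separately, for example against the simulated negative samples.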

Referring to FIG. 1, in step S6, predicting the harmfulness of the gene mutation based on the effective model includes predicting the possible harmfulness of a mutated gene using the trained model. In some embodiments, the invention can predict the likely disease genes for a patient with a rare disease.

The present invention also provides, in one aspect, a computer-readable storage medium storing computer-executable instructions for performing the above-described method. Yet another aspect of the present invention provides a system, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.

The storage medium may store one or more computer programs that enable a computer to execute any of the above methods. The computer program may be written in, for example, a general-purpose programming language such as Pascal, C++, Java, Python, JSON, etc., or in an application-specific language.

The system may be a computer system that may include, for example, a processor, memory, storage, and input/output devices (e.g., a monitor, keyboard, disk drive, internet connection, etc.). However, the computing system may include circuitry or other dedicated hardware for performing some or all aspects of the process. In some operating settings, the computing system may be configured as a system comprising one or more units, each of which is configured to perform some aspects of the processes in software, hardware, or some combination thereof.

Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value. The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A method for predicting the harmfulness of a gene mutation, comprising the steps of:

providing a mutation public database;

acquiring a plurality of mutation data sets according to the mutation public database, wherein the mutation data sets are sequentially divided into a first data set, a second data set and a third data set from far to near according to the published time sequence;

annotating the first data set with a plurality of features; and

training by using the plurality of features and the second data set to obtain a training model;

obtaining mutation prediction data of a third data set according to the training model, wherein the mutation prediction data is consistent with mutation data in the third data set, and the training model is an effective model;

and predicting the harmfulness of the gene mutation based on the effective model.

2. The method of claim 1, wherein: the first data set comprises positive mutation data and negative mutation data, the second data set is positive mutation data, and the third data set is positive mutation data.

3. The method of claim 1, wherein: the issuing time of the first data set, the issuing time of the second data set and the issuing time of the third data set are continuous, and the issuing time span of the second data set and the issuing time span of the third data set are 3-5 years.

4. The method of claim 1, wherein: the plurality of features includes features associated with population allelic frequencies, gene-phenotype similarity scores, features based on ACMG/AMP guidelines, constraint scores at the gene level, existing pathogenicity computer prediction scores, features functionally affecting variations, and gene level features associated with a database.

5. The method of claim 4, wherein: the gene-phenotype similarity score is computed using the ERIC method, the ERIC method comprising: using [symbol not reproduced] to express the distance between phenotypes t₁ and t₂, while setting the distance [symbol not reproduced] to 0;

the calculation of [symbol not reproduced] being carried out as follows:

calculating the information content [formula not reproduced];

using [symbol not reproduced] to represent an ancestor of phenotype t, wherein the calculation formula of [symbol not reproduced] is: [formula not reproduced]; or: [formula not reproduced], wherein [symbol not reproduced] is used;

the distance between phenotypes t₁ and t₂, [symbol not reproduced], then being expressed as: [formula not reproduced].

6. The method of claim 1, wherein: training with the plurality of features and the second data set to obtain a training model comprises: training a model using a gradient boosting algorithm to predict the harmfulness of the gene mutation.

7. The method of claim 1, wherein: training with the plurality of features and the second data set to obtain a training model further comprises: training the model using simulated patients and phenotypes from negative samples of the UK10K database.

8. The method of claim 7, wherein: obtaining mutation prediction data of the third data set according to the training model, wherein consistency of the mutation prediction data with the mutation data in the third data set indicates that the training model is an effective model, further comprises: processing the mutation data in the third data set using simulated patients and phenotypic data from negative samples of the UK10K database before the consistency comparison with the mutation prediction data.

9. A computer-readable storage medium characterized by: comprising computer-executable instructions for performing the method of any of claims 1 to 8.

10. A system, characterized by: the system comprises:

one or more processors;

a memory; and one or more programs;

wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising computer-executable instructions for performing the method of any of claims 1-8.