CN113241118A - Method for predicting harmfulness of gene mutation - Google Patents



Publication number
CN113241118A
Authority
CN
China
Prior art keywords
mutation
data set
data
phenotype
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110782580.4A
Other languages
Chinese (zh)
Inventor
谢敬聃
王腾蛟
王志伟
郝建龙
喻东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Famundo Changzhou Biotechnology Co ltd
Original Assignee
Famundo Changzhou Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Famundo Changzhou Biotechnology Co ltd filed Critical Famundo Changzhou Biotechnology Co ltd
Priority to CN202110782580.4A priority Critical patent/CN113241118A/en
Publication of CN113241118A publication Critical patent/CN113241118A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention provides a method for predicting the harmfulness of gene mutations, comprising the following steps: providing a public mutation database; acquiring a plurality of mutation data sets from the public mutation database, the mutation data sets being divided, from earliest to most recent publication time, into a first data set, a second data set and a third data set; annotating the first data set with a plurality of features; training with the plurality of features and the second data set to obtain a training model; and obtaining mutation prediction data for the third data set from the training model, where agreement between the mutation prediction data and the mutation data in the third data set indicates that the training model is an effective model. The method greatly improves the accuracy and efficiency of predicting the harmfulness of mutated genes.

Description

Method for predicting harmfulness of gene mutation
Technical Field
The invention relates to the field of bioinformatics, in particular to a method for predicting harmfulness of gene mutation.
Background
Current standard methods for interpreting gene sequence mutations rely on time-consuming manual integration of multiple data sources, including extensive database and literature searches, the use of computational methods, and multiple rounds of review. The prior art lacks an effective and highly accurate method for predicting the harmfulness of gene mutations. The development of next-generation sequencing technology has opened new ground for modern genomics research; however, the cost of whole-genome sequencing and the complexity of its analysis remain obstacles for researchers. With the completion of the Human Genome Project in 2003 and the successive mapping of the whole genomes of other species, it was found that the genetic difference between populations is only 1%, manifested mainly as differences in exons. Exons, as the protein-coding regions, are important functional sequences in DNA. They account for only about 1% of the human genome but cover most of the functional variation associated with an individual's phenotype, including about 85% of the gene mutations that cause disease in humans. The exome is the collective name for all exon regions of the human genome. Whole-exome sequencing (also called targeted exome capture) selectively sequences the coding regions of the human genome and can thereby identify abnormal genes related to rare and common diseases.
In the past, pathogenic genes of Mendelian diseases were mostly identified by linkage analysis combined with a candidate-gene approach, which is both time-consuming and has a low success rate. In complex diseases, conventional association studies have identified a large number of common variants but have limited ability to detect low-frequency and rare variants. Exome sequencing emerged to meet this need; it has greatly advanced research on the genetic variation underlying disease and overcome difficulties of the traditional approaches.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method for predicting the harmfulness of gene mutations. Through reasonable data partitioning, feature annotation, model training and evaluation, the method improves the accuracy of harmfulness prediction and yields an efficient procedure for predicting the harmfulness of mutated genes.
To achieve the above and other objects, the invention includes the following technical solutions. The invention first provides a method for predicting the harmfulness of gene mutations, comprising the following steps: providing a public mutation database; acquiring a plurality of mutation data sets from the public mutation database, the mutation data sets being divided, from earliest to most recent publication time, into a first data set, a second data set and a third data set; annotating the first data set with a plurality of features; training with the plurality of features and the second data set to obtain a training model; obtaining mutation prediction data for the third data set from the training model, where agreement between the mutation prediction data and the mutation data in the third data set indicates that the training model is an effective model; and predicting the harmfulness of the gene mutation based on the effective model.
In one embodiment, the first data set comprises positive mutation data and negative mutation data, the second data set is positive mutation data, and the third data set is positive mutation data.
In one embodiment, the publishing time of the first data set, the publishing time of the second data set and the publishing time of the third data set are continuous, and the publishing time span of the second data set and the third data set is 3-5 years.
In one embodiment, the plurality of features includes features associated with population allelic frequencies, gene-phenotype similarity scores, features based on ACMG/AMP guidelines, constraint scores at the gene level, existing pathogenicity computer prediction scores, features that functionally affect the variation, and gene level features associated with the database.
In one embodiment, the gene-phenotype similarity score is computed using the ERIC method. The ERIC method uses [formula image 001] to express the distance between phenotypes t1 and t2, with the distance [formula image 002] set to 0.
[formula image 003] is calculated as follows:
the information content [formula image 004] is calculated, where [formula image 005], N is the total number of genes and n is the number of genes associated with phenotype t;
[formula image 006] represents the distance between phenotype t and an ancestor of phenotype t, where [formula image 007] represents an ancestor of phenotype t, and [formula image 008] is calculated as: [formula image 009];
based on this formula, the distance between phenotypes t1 and t2 is calculated as: [formula image 010], or as: [formula image 011], where [formula image 012] represents the most informative phenotype among the common ancestor phenotypes of t1 and t2;
the distance [formula image 013] between phenotypes t1 and t2 is then expressed as: [formula image 014].
In an embodiment, training with the plurality of features and the second data set to obtain the training model further includes: training the model using simulated patients and phenotypes from negative samples of the UK10K database.
In an embodiment, obtaining mutation prediction data for the third data set from the training model, where agreement between the mutation prediction data and the mutation data in the third data set indicates that the training model is an effective model, further includes: processing the mutation data in the third data set with simulated patients and phenotype data from negative samples of the UK10K database before the consistency comparison with the mutation prediction data.
In an embodiment, training with the plurality of features and the second data set to obtain the training model includes: training a model with a gradient boosting algorithm to predict the harmfulness of the gene mutation.
In another aspect, the present invention also provides a computer-readable storage medium, which can be used for executing the computer-executable instructions in the method described above.
Yet another aspect of the present invention provides a system, comprising: one or more processors; a memory; and one or more programs; wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising computer-executable instructions for performing the method as described above.
As described above, the method uses a trained machine learning model to predict the pathogenicity of a subject's gene mutation sequence, and is highly automated, accurate and efficient. By combining multi-dimensional features such as phenotype-phenotype similarity scores and gene-phenotype association scores with the XGBoost algorithm, the method reaches a detection-rate accuracy above 90%, which is higher than that of other software such as demosser, mutatranfer and clinicaly _ drivers.
Drawings
FIG. 1 shows a flow chart of the method of the present invention.
FIG. 2 is a process diagram of a prediction method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the description provided herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Please refer to fig. 1-2. The present invention provides, in a first aspect, a method of gene mutation hazard prediction, which in some embodiments is a computer-implemented method performed at an electronic device having at least one processor and memory. In some embodiments, the genetic mutation is a human genetic mutation. The method comprises the following steps of S1-S6:
S1: providing a public mutation database;
S2: acquiring a plurality of mutation data sets from the public mutation database, the mutation data sets being divided, from earliest to most recent publication time, into a first data set, a second data set and a third data set;
S3: annotating the first data set with a plurality of features;
S4: training with the plurality of features and the second data set to obtain a training model;
S5: obtaining mutation prediction data for the third data set from the training model, where agreement between the mutation prediction data and the mutation data in the third data set indicates that the training model is an effective model;
S6: predicting the harmfulness of the gene mutation based on the effective model.
In step S1, the public mutation database may be any one of ClinVar, OMIM, HGMD, CIViC, endogen, dbVar, DGV and DECIPHER, and may in particular be the ClinVar mutation database. In one embodiment, 49021 known pathogenic mutation sites in the ClinVar mutation database are used for model training. In some embodiments, the mutation can be an acquired or a deleted gene mutation.
Referring to fig. 1, in step S2, the mutation data sets may include a first data set, a second data set and a third data set, divided according to the time order in which the mutation data were published in the public mutation database. The three data sets may be continuous in time, with the first data set taken from the earliest years. In some embodiments, the first data set is mutation data taken from the 2012 database, the second data set is mutation data from 2013 to 2015, and the third data set is mutation data from 2016 to 2017. In an embodiment, the time span of the second and third data sets may be 3 to 5 years; the span may be chosen according to how often the different public mutation databases are updated, so as to obtain a suitable amount of data and improve the accuracy of the model. The first data set may include positive and negative mutation data, while the second and third data sets may contain positive mutation data only. The positive mutation data in the third data set may include new mutation data for model evaluation. In one embodiment, the first data set includes 41590 mutation records, the second data set includes 6576 positive mutation records, and the third data set includes 855 positive mutation records, of which 417 are new; the three sets together total 49021 records.
Referring to fig. 1, in an embodiment, the first and second data sets serve as the model training set, and the third data set serves as the model evaluation set for assessing the accuracy of the training model obtained from the training set. The mutation data newly published in the third data set relative to the second data set can be used to evaluate the accuracy of the model. The method uses the mutation data in the first data set as basic training data, trains the model with the mutation data in the second data set, and on that basis obtains predictions of the harmfulness of new mutated genes.
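The time-ordered partition of step S2 can be sketched as follows. This is only an illustration: the `Variant` record, its field names and the helper `split_by_year` are hypothetical, and treating every record up to 2012 as the first data set is an assumption (the source only says the first set comes from the 2012 database).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    variant_id: str   # hypothetical identifier
    year: int         # year the record was published in the database
    pathogenic: bool  # True for a positive (pathogenic) record


def split_by_year(variants, first_end=2012, second_end=2015, third_end=2017):
    """Partition records into three time-ordered sets (earliest first)."""
    # First set: positive and negative records from the earliest period.
    first = [v for v in variants if v.year <= first_end]
    # Second and third sets: positive records only, from later periods.
    second = [v for v in variants
              if first_end < v.year <= second_end and v.pathogenic]
    third = [v for v in variants
             if second_end < v.year <= third_end and v.pathogenic]
    return first, second, third


records = [
    Variant("v1", 2011, True), Variant("v2", 2012, False),
    Variant("v3", 2014, True), Variant("v4", 2014, False),
    Variant("v5", 2016, True), Variant("v6", 2018, True),
]
first, second, third = split_by_year(records)
```

Splitting by publication year rather than at random keeps the evaluation set strictly newer than the training data, which is the point of the scheme described above.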
Referring to fig. 1, in step S3, the first data set may be annotated with several categories of features, such as population frequency features based on databases such as ExAC and 1KG, features based on the ACMG/AMP guidelines, features based on interactome-predicted phenotypes, phenotype similarity scores based on the ERIC method, gene-level constraints, and functional consequences. In one example, the features include 6 features associated with population allele frequency, 5 gene-phenotype similarity scores, 15 features based on the ACMG/AMP guidelines, 9 gene-level constraint scores, 12 existing computational pathogenicity prediction scores, 2 features describing the functional impact of a variant, and 2 database-related gene-level features. Gene mutations in the training data are annotated with these features. The features are assigned to each gene mutation sequence, which is then used for model training, so that the trained model can predict the pathogenicity of a gene mutation sequence under test.
The population allele frequency features come mainly from 3 population frequency databases: the 1000 Genomes Project, ESP and ExAC. The 15 features based on the ACMG/AMP guidelines use an optimized scoring system, including: PVS1: 6, PS1: 4, PM1: 2, PM2: 2, PM4: 2, PM5: 2, PP2: 1, PP3: 1, BA1: -9, BS1: -3, BS2: 3, BP3: 1, BP4: 1, BP7: 2. The 12 existing computational pathogenicity prediction methods include Polyphen2_HVAR, LRT, MutationTaster, GERP++_RS, phyloP, SPIDEX, CADD and DANN, with data taken from the dbNSFP and ANNOVAR databases. The 9 gene-level constraint scores are derived from the ExAC project (including pDom, pRec, syn_z, mis_z and pLI) and from the CADD database (including CADD_a1qt005, CADD_a1qt01, CADD_a2qt005 and CADD_a2qt01). The 2 features describing functional impact are derived from the annotation results of the VEP tool and the prediction results of the LOFTEE algorithm, respectively. The 2 database-related gene-level features are the organized rare-related gene entry information and the similarity score between genes in the STRING database. The 5 gene-phenotype similarity scores are the phenotype similarity score calculated by the ERIC algorithm, the gene-phenotype similarity score, the phenotype ranking score calculated by the ERIC algorithm, the gene-phenotype similarity ranking score, and the ERIC normalized score, as described below.
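As a quick consistency check, the group sizes listed in this example can be tallied; they add up to the 51 features that the implementation later uses as model input. The dictionary keys below are illustrative labels, not names taken from the source.

```python
# Feature groups and their sizes as listed in the example (labels are illustrative).
FEATURE_GROUPS = {
    "population_allele_frequency": 6,
    "gene_phenotype_similarity": 5,
    "acmg_amp_guideline": 15,
    "gene_level_constraint": 9,
    "pathogenicity_predictors": 12,
    "functional_impact": 2,
    "database_gene_level": 2,
}

total_features = sum(FEATURE_GROUPS.values())
print(total_features)  # 51
```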
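One way to turn the ACMG/AMP scoring system into model features is to emit, for each criterion, its weight when the criterion is triggered and 0 otherwise. The sketch below assumes that encoding, which the source does not spell out. It uses only the weights whose sign is unambiguous in the text; the remaining benign criteria (BS2, BP3, BP4, BP7) are left out because their printed signs may be extraction errors. The function name `acmg_features` is hypothetical.

```python
# Weights copied from the text; only entries with an unambiguous sign are used.
ACMG_WEIGHTS = {
    "PVS1": 6, "PS1": 4,
    "PM1": 2, "PM2": 2, "PM4": 2, "PM5": 2,
    "PP2": 1, "PP3": 1,
    "BA1": -9, "BS1": -3,
}


def acmg_features(triggered):
    """Return one feature per criterion: its weight if triggered, else 0."""
    unknown = set(triggered) - ACMG_WEIGHTS.keys()
    if unknown:
        raise ValueError(f"unknown ACMG/AMP criteria: {sorted(unknown)}")
    return {name: (weight if name in triggered else 0)
            for name, weight in ACMG_WEIGHTS.items()}


features = acmg_features({"PVS1", "PM2", "BS1"})
```

Keeping one column per criterion, rather than a single summed score, lets the tree model learn its own interactions between pathogenic and benign evidence.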
Referring to FIG. 1, in step S3, the ERIC method uses [formula image 015] to express the distance between phenotypes t1 and t2, with the distance [formula image 016] set to 0.
[formula image 003] is calculated as follows: the information content [formula image 017] is calculated, where [formula image 018], N is the total number of genes and n is the number of genes associated with phenotype t;
[formula image 019] represents the distance between phenotype t and an ancestor of phenotype t, where [formula image 020] represents an ancestor of phenotype t, and [formula image 021] is calculated as: [formula image 022];
based on this formula, the distance between phenotypes t1 and t2 is calculated as: [formula image 023], or as: [formula image 024], where [formula image 025] represents the most informative phenotype among the common ancestor phenotypes of t1 and t2;
the distance [formula image 026] between phenotypes t1 and t2 is then expressed as: [formula image 014].
The correlation between phenotypes can be obtained by calculating the distance between different phenotypes, and from it the correlation between genes and phenotypes.
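The ERIC formulas themselves are not available in this text, so the following does not reproduce them. It only illustrates the two ingredients the passage names, the information content of a phenotype and the most informative common ancestor, under the common assumption that the information content is -log(n/N), with n the number of genes associated with a phenotype (including its descendants) and N the total number of genes. The toy ontology, gene sets and function names are all hypothetical.

```python
import math

# Hypothetical toy phenotype ontology: term -> list of parent terms.
PARENTS = {
    "root": [],
    "neuro": ["root"],
    "cardiac": ["root"],
    "seizure": ["neuro"],
    "ataxia": ["neuro"],
}
# Hypothetical direct gene annotations per term.
DIRECT_GENES = {
    "root": set(), "neuro": set(),
    "cardiac": {"G4"},
    "seizure": {"G1", "G2"},
    "ataxia": {"G2", "G3"},
}
ALL_GENES = set().union(*DIRECT_GENES.values())  # N = 4 genes in this toy example


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Assumed form: -log(n / N), genes propagated up from descendant terms."""
    genes = set()
    for other, direct in DIRECT_GENES.items():
        if term in ancestors(other):
            genes |= direct
    return -math.log(len(genes) / len(ALL_GENES))


def most_informative_common_ancestor(t1, t2):
    """Common ancestor of t1 and t2 with the highest information content."""
    common = ancestors(t1) & ancestors(t2)
    return max(common, key=information_content)
```

In this toy ontology the two neurological terms share the informative ancestor `neuro`, while a neurological and a cardiac term only share the uninformative `root`, which is the intuition behind ontology-based phenotype distances.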
Referring to fig. 1, in step S4, training is performed using the plurality of features and the second data set to obtain a training model: the feature-annotated first data set is trained together with the second data set to predict whether a gene mutation is harmful.
In step S4, gene mutations in the training data are annotated with the features described above, all feature information is then integrated, and a gradient boosting algorithm is used to train a model that predicts the pathogenicity of a mutation. The Gradient Boosting Decision Tree (GBDT) algorithm is an ensemble of classification and regression trees, and works as follows:
If there are K trees, the model is
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$
i.e. a set of prediction trees. A prediction tree $f$ is defined as
$$f(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{d} \to \{1, 2, \dots, T\},$$
where $w$ is the score vector on the leaves of the tree, the function $q$ assigns each data point to a leaf, and $T$ is the number of leaves.
The objective function includes 2 parts:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$
where the training loss $\sum_{i} l(y_i, \hat{y}_i)$ is the logistic loss of logistic regression:
$$\sum_{i=1}^{n} l(y_i, \hat{y}_i) = \sum_{i=1}^{n} \left[ y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right) \right].$$
In addition, the regularization term $\Omega(f)$ is
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,$$
where $T$ is the number of leaves and $w_j$ is the score of the $j$-th leaf.
The model is trained using an additive strategy: the trees already learned are fixed, and one new tree is added at a time. The objective at step $t$ is:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}.$$
Taking the Taylor expansion of the loss function to first and second order, the objective at step $t$ becomes:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}.$$
Here,
$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right),$$
$$h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right).$$
Thus, after dropping constant terms, the objective at step $t$ can be simplified to:
$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T,$$
where
$$I_j = \{\, i \mid q(x_i) = j \,\},$$
$$G_j = \sum_{i \in I_j} g_i,$$
$$H_j = \sum_{i \in I_j} h_i.$$
By minimizing with respect to $w_j$, the optimal $w_j^{*}$ and the optimal objective value are obtained:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda},$$
$$\mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.$$
Now that there is a way to score a tree, a strategy for building a tree with a small objective value is described next. Growing a tree is a process that iteratively splits a leaf into two leaves. The following formula is used to determine whether a leaf should be split:
$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma.$$
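The leaf scoring and split test can be illustrated numerically. The sketch assumes the standard second-order gradient boosting forms, namely an optimal leaf weight of -G / (H + lambda) and a split gain equal to half the improvement in G^2 / (H + lambda) minus a per-leaf penalty gamma; the function names and the numbers are illustrative only.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf score for summed gradients G and hessians H (assumed form)."""
    return -G / (H + lam)


def leaf_quality(G, H, lam):
    """Structure-score term G^2 / (H + lambda) for a single leaf."""
    return G * G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Gain from splitting one leaf into a left and a right leaf."""
    parent = leaf_quality(G_left + G_right, H_left + H_right, lam)
    children = leaf_quality(G_left, H_left, lam) + leaf_quality(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma


# Illustrative statistics: the left child holds mostly positive samples
# (negative gradients), the right child mostly negative samples.
gain = split_gain(G_left=-4.0, H_left=3.0, G_right=2.0, H_right=1.0,
                  lam=1.0, gamma=0.5)
should_split = gain > 0
```

A split is kept only when the gain is positive, so gamma acts as the minimum improvement a new leaf has to deliver.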
A specific implementation may proceed as follows: the 51 features serve as x, and the label variable y indicates whether a variant is associated with a phenotype; the prediction model is trained using the GBDT algorithm in the Python XGBoost toolkit. During model training, max_depth and min_child_weight are used to control the complexity of the model, and the subsample and lambda parameters are used to reduce overfitting. The final parameters (learning rate, maximum tree depth, sampling rate, etc.) are determined by network learning. In some embodiments, 10-fold cross-validation may also be employed to prevent overfitting.
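The parameter selection and cross-validation mentioned here can be sketched without committing to a particular library. The source does not describe the search procedure, so a plain grid over candidate values is assumed; the candidate values and the helpers `k_fold_indices` and `candidate_params` are hypothetical. Each candidate would be scored by training on the training indices of every fold and evaluating on the held-out indices.

```python
import itertools
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


# Hypothetical candidate values for the parameters named in the text.
PARAM_GRID = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 6],
    "subsample": [0.8, 1.0],
}


def candidate_params(grid):
    """Yield every combination of the candidate parameter values."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


splits = list(k_fold_indices(25, k=10))
candidates = list(candidate_params(PARAM_GRID))
```

Every sample lands in exactly one validation fold, so each candidate setting is judged on data it was not trained on.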
Referring to fig. 1, in step S4, to improve the accuracy of the model, training with the plurality of features and the second data set may further include training the model using simulated patients and phenotypes from negative samples of the UK10K database. This further removes negative mutation phenotype data, making the model more accurate and reliable and improving prediction efficiency.
Referring to fig. 2, in some embodiments, the method extracts a first data set 101, a second data set 102 and a third data set 103 from the ClinVar database 100. The first data set 101 is annotated with features 200 to obtain the annotated data set 104. Model training 106 is performed on the annotated data set 104 using the second data set 102 and the simulated negative-sampling phenotypes 105, and the model training 106 step is iterated 300 to produce the training model 107. The training model 107 outputs mutation prediction data 108, on which model evaluation 109 is performed using the evaluation data set, i.e. the third data set 103; the third data set 103 may also be processed with the simulated negative-sampling phenotypes 105 before evaluation. If the mutation prediction data 108 are consistent with the true mutation results in the third data set 103, the training model 107 is shown to be an effective model that can be used to evaluate the harmfulness of gene mutations.
Referring to fig. 1, in step S5, the trained model may be evaluated using the third data set. Based on the trained model, the positive genes whose mutation may produce the relevant phenotype are predicted, and these predictions are compared with the known positive genes disclosed in the third data set. If the positive genes obtained by the model are consistent with those in the third data set, the method is shown to be effective. Consistency may mean that the error in the predicted mutated genes is within 10%, i.e. the accuracy of the prediction method may be above 90%. In some embodiments, the mutation data in the third data set may also be processed with simulated patients and phenotype data from negative samples of the UK10K database before being compared with the mutation prediction data; using the third data set processed in this way may improve the accuracy and timeliness of the prediction method.
Referring to FIG. 1, in step S6, predicting the harmfulness of a gene mutation based on the effective model includes predicting the possible harmfulness of a mutated gene with the trained model. In some embodiments, the invention can provide a prediction of the likely disease genes for a patient with a rare disease.
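The consistency check of step S5 can be sketched as a simple agreement rate between the model's predicted positives and the known positives of the third data set. How the source measures the error is not specified beyond the 10% figure, so the recall-style definition below, the function names and the toy identifiers are assumptions.

```python
def agreement_rate(predicted_positive, known_positive):
    """Fraction of known positive variants that the model also calls positive."""
    known = set(known_positive)
    if not known:
        raise ValueError("the evaluation set contains no known positives")
    return len(known & set(predicted_positive)) / len(known)


def is_effective(predicted_positive, known_positive, max_error=0.10):
    """Treat the model as effective when the error stays within max_error."""
    return agreement_rate(predicted_positive, known_positive) >= 1.0 - max_error


known = [f"var{i}" for i in range(10)]    # hypothetical third-set positives
predicted_good = known[:9] + ["other"]    # 9 of 10 recovered
predicted_poor = known[:8]                # 8 of 10 recovered

rate_good = agreement_rate(predicted_good, known)
```

With 9 of 10 known positives recovered the error is exactly 10% and the model passes; with 8 of 10 it does not.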
The present invention also provides, in one aspect, a computer-readable storage medium, which may be used to execute computer-executable instructions in the above-described method. Yet another aspect of the present invention provides a system, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
The storage medium may be used to assist a computer in executing one or more computer programs of any of the above methods. The computer program may be written in, for example, a general purpose programming language such as Pascal, C++, Java, Python, JSON, etc., or in some application-specific language.
The system may be a computer system that may include, for example, a processor, memory, storage, and input/output devices (e.g., a monitor, keyboard, disk drive, internet connection, etc.). However, the computing system may include circuitry or other dedicated hardware for performing some or all aspects of the process. In some operating settings, the computing system may be configured as a system comprising one or more units, each of which is configured to perform some aspects of the processes in software, hardware, or some combination thereof.
Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value. The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (10)

1. A method for predicting the harmfulness of a gene mutation, comprising the steps of:
providing a mutation public database;
acquiring a plurality of mutation data sets according to the mutation public database, wherein the mutation data sets are sequentially divided into a first data set, a second data set and a third data set from far to near according to the published time sequence;
annotating the first data set with a plurality of features; and
training by using the plurality of features and the second data set to obtain a training model;
obtaining mutation prediction data of a third data set according to the training model, wherein the mutation prediction data is consistent with mutation data in the third data set, and the training model is an effective model;
and predicting the harmfulness of the gene mutation based on the effective model.
2. The method of claim 1, wherein: the first data set comprises positive mutation data and negative mutation data, the second data set is positive mutation data, and the third data set is positive mutation data.
3. The method of claim 1, wherein: the issuing time of the first data set, the issuing time of the second data set and the issuing time of the third data set are continuous, and the issuing time span of the second data set and the issuing time span of the third data set are 3-5 years.
4. The method of claim 1, wherein: the plurality of features includes features associated with population allelic frequencies, gene-phenotype similarity scores, features based on ACMG/AMP guidelines, constraint scores at the gene level, existing pathogenicity computer prediction scores, features functionally affecting variations, and gene level features associated with a database.
5. The method of claim 4, wherein: the gene-phenotype similarity score is computed using the ERIC method; the ERIC method uses [formula image 001] to express the distance between phenotypes t1 and t2, with the distance [formula image 002] set to 0;
[formula image 003] is calculated as follows:
the information content [formula image 004] is calculated, where [formula image 005], N is the total number of genes and n is the number of genes associated with phenotype t;
[formula image 006] represents the distance between phenotype t and an ancestor of phenotype t, where [formula image 007] represents an ancestor of phenotype t, and [formula image 008] is calculated as: [formula image 009];
based on this formula, the distance between phenotypes t1 and t2 is calculated as: [formula image 010], or as: [formula image 011], where [formula image 012] represents the most informative phenotype among the common ancestor phenotypes of t1 and t2;
the distance [formula image 013] between phenotypes t1 and t2 is then expressed as: [formula image 014].
6. the method of claim 1, wherein: training with the plurality of features and the second data set to obtain a training model comprising: and (3) training a model by adopting a gradient enhancement algorithm to predict the harmfulness of the gene mutation.
7. The method of claim 1, wherein: training with the plurality of features and the second data set to obtain a training model further comprises: training the model using simulated patients and phenotypes from negative samples of the UK10K database.
8. The method of claim 7, wherein: the step of obtaining mutation prediction data for a third data set according to the training model, in which the training model is deemed a valid model if the mutation prediction data are consistent with the mutation data in the third data set, further comprises: before the consistency comparison with the mutation prediction data, also processing the mutation data in the third data set using phenotype data from simulated patients and negative samples from the UK10K database.
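The claim does not state how consistency between predicted and known mutation data is measured. One simple reading, shown here as a sketch under that assumption, is the fraction of variants whose predicted label matches the known label, with the model accepted when the fraction reaches a chosen threshold. The function names and the 0.9 threshold are hypothetical.

```python
def consistency(predicted, known):
    """Fraction of variants whose predicted label equals the known label."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known labels must have the same length")
    if not known:
        raise ValueError("at least one variant is required")
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(known)


def is_valid_model(predicted, known, min_agreement=0.9):
    """Accept the model when agreement reaches the assumed threshold."""
    return consistency(predicted, known) >= min_agreement
```

For example, predictions `[1, 0, 1, 1]` against known labels `[1, 0, 0, 1]` agree on three of four variants, which falls below the assumed 0.9 threshold.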
9. A computer-readable storage medium, characterized by comprising computer-executable instructions for performing the method of any of claims 1 to 8.
10. A system, characterized by: the system comprises:
one or more processors;
a memory; and one or more programs;
wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising computer-executable instructions for performing the method of any of claims 1-8.
CN202110782580.4A 2021-07-12 2021-07-12 Method for predicting harmfulness of gene mutation Pending CN113241118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110782580.4A CN113241118A (en) 2021-07-12 2021-07-12 Method for predicting harmfulness of gene mutation

Publications (1)

Publication Number Publication Date
CN113241118A true CN113241118A (en) 2021-08-10

Family

ID=77135343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110782580.4A Pending CN113241118A (en) 2021-07-12 2021-07-12 Method for predicting harmfulness of gene mutation

Country Status (1)

Country Link
CN (1) CN113241118A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009101625A2 (en) * 2008-02-12 2009-08-20 Ramot At Tel-Aviv University Ltd. Method for searching for homing endonucleases, their genes and their targets
CN105404793A (en) * 2015-12-07 2016-03-16 浙江大学 Method for rapidly discovering phenotype related gene based on probabilistic framework and resequencing technology
CN106980749A (en) * 2017-02-21 2017-07-25 成都奇恩生物科技有限公司 The quick assisted location method of disease
CN108363902A (en) * 2018-01-30 2018-08-03 成都奇恩生物科技有限公司 A kind of accurate prediction technique of pathogenic hereditary variation
WO2019136364A1 (en) * 2018-01-05 2019-07-11 Illumina, Inc. Process for aligning targeted nucleic acid sequencing data
CN110010196A (en) * 2019-03-19 2019-07-12 北京工业大学 A kind of gene similarity searching algorithm based on heterogeneous network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
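Li Q et al.: "Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis", Genet Med 21 *
Zhou Yuan et al.: "Identification and phenotypic analysis of a DES gene mutation characterized by cardiac involvement: a case report", Journal of Pediatric Pharmacy *
Shi Fang et al.: "Research on a random forest-based method for predicting deleterious synonymous mutations", China Master's Theses Full-text Database, Information Science and Technology *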

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210810