CN113241118A - Method for predicting harmfulness of gene mutation - Google Patents
- Publication number
- CN113241118A (application CN202110782580.4A)
- Authority
- CN
- China
- Prior art keywords
- mutation
- data set
- data
- phenotype
- gene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention provides a method for predicting the harmfulness of a gene mutation, comprising the following steps: providing a mutation public database; acquiring a plurality of mutation data sets according to the mutation public database, wherein the mutation data sets are divided, in order of publication time from earliest to latest, into a first data set, a second data set and a third data set; annotating the first data set with a plurality of features; training by using the plurality of features and the second data set to obtain a training model; and obtaining mutation prediction data of the third data set according to the training model, wherein, if the mutation prediction data is consistent with the mutation data in the third data set, the training model is an effective model. The method improves the accuracy and efficiency of predicting the harmfulness of mutated genes.
Description
Technical Field
The invention relates to the field of bioinformatics, in particular to a method for predicting harmfulness of gene mutation.
Background
Current standard methods for interpreting gene sequence mutations rely on time-consuming manual integration of multiple data sources, including extensive database and literature searches, the use of computational methods, and multiple rounds of review. The prior art lacks an effective and highly accurate method for predicting the harmfulness of gene mutations. The development of next-generation sequencing technology has opened new ground for modern genomics research; however, the cost of whole-genome sequencing and the complexity of its analysis remain obstacles for researchers. With the completion of the Human Genome Project in 2003 and the successive mapping of the whole genomes of other species, it was found that the genetic difference between different populations is only about 1%, manifested mainly as differences in exons. Exons, as the protein-coding regions, are important functional sequences in DNA. They account for only about 1% of the human genome but cover most of the functional variation associated with an individual's phenotype, and are associated with about 85% of the disease-causing gene mutations in humans. The whole exome is the collective name for all exon regions of the human genome, and whole exome sequencing (also called targeted exome capture) selectively sequences the coding regions of the human genome, thereby finding abnormal genes related to rare and common diseases.
In the past, the pathogenic genes of Mendelian diseases were mostly identified by linkage analysis combined with a candidate-gene approach, which is both time-consuming and has a low success rate. In complex diseases, conventional association studies have identified a large number of common variants but have limited ability to detect low-frequency and rare variants. Exome sequencing technology emerged against this background; it has greatly advanced research into the genetic variation underlying disease and overcome these difficulties of traditional approaches.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method for predicting the harmfulness of gene mutations. Through reasonable data partitioning, feature annotation, model training and evaluation, the method improves prediction accuracy and yields an efficient procedure for predicting the harmfulness of mutated genes.
To achieve the above and other objects, the present invention includes the following technical solutions. The invention first provides a method for predicting the harmfulness of a gene mutation, comprising the following steps: providing a mutation public database; acquiring a plurality of mutation data sets according to the mutation public database, wherein the mutation data sets are divided, in order of publication time from earliest to latest, into a first data set, a second data set and a third data set; annotating the first data set with a plurality of features; training by using the plurality of features and the second data set to obtain a training model; obtaining mutation prediction data of the third data set according to the training model, wherein, if the mutation prediction data is consistent with the mutation data in the third data set, the training model is an effective model; and predicting the harmfulness of the gene mutation based on the effective model.
In one embodiment, the first data set comprises positive mutation data and negative mutation data, the second data set is positive mutation data, and the third data set is positive mutation data.
In one embodiment, the publication periods of the first data set, the second data set and the third data set are consecutive, and the publication time span of the second data set and the third data set is 3-5 years.
In one embodiment, the plurality of features includes features associated with population allele frequencies, gene-phenotype similarity scores, features based on the ACMG/AMP guidelines, gene-level constraint scores, existing in silico pathogenicity prediction scores, features describing the functional effect of the variation, and database-related gene-level features.
In one embodiment, the gene-phenotype similarity score is calculated using the ERIC method. The ERIC method uses [symbol omitted in source] to express the distance between phenotypes $t_1$ and $t_2$, while setting the distance [symbol omitted in source] to 0. The distance is calculated as follows:
calculating the information content [formula omitted in source], wherein $N$ is the total number of genes and $n$ is the number of genes associated with phenotype $t$;
using [symbol omitted in source] to represent the distance between phenotype $t$ and an ancestor of phenotype $t$, wherein [symbol omitted in source] represents an ancestor of phenotype $t$, and the calculation formula is:
[formula omitted in source]
and calculating, on that basis, the distance between phenotypes $t_1$ and $t_2$ as:
[formula omitted in source]
or:
[formula omitted in source]
wherein [symbol omitted in source] represents the most informative phenotype among the common ancestral phenotypes of $t_1$ and $t_2$.
In an embodiment, the training with the plurality of features and the second data set to obtain the training model further includes: training the model using simulated patients and phenotypes from negative samples of the UK10K database.
In an embodiment, obtaining the mutation prediction data of the third data set according to the training model further includes: before the consistency comparison with the mutation prediction data, processing the mutation data in the third data set using simulated patients and phenotype data from negative samples of the UK10K database.
In an embodiment, the training with the plurality of features and the second data set to obtain the training model includes: training the model with a gradient boosting algorithm to predict the harmfulness of the gene mutation.
In another aspect, the present invention also provides a computer-readable storage medium storing computer-executable instructions for performing the method described above.
Yet another aspect of the present invention provides a system, comprising: one or more processors; a memory; and one or more programs; wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising computer-executable instructions for performing the method as described above.
As described above, the method predicts the pathogenicity of a subject's gene mutation sequence using a trained machine learning model, and is highly automated, accurate and efficient. By combining features of multiple dimensions, such as the phenotype-phenotype similarity score and the gene-phenotype association score, with the XGBoost algorithm, the method achieves a detection rate above 90%, which is higher than that of other software such as demosser, mutatranfer and clinicaly _ drivers.
Drawings
FIG. 1 shows a flow chart of the method of the present invention.
FIG. 2 is a process diagram of a prediction method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the description provided herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Refer to FIGS. 1-2. The present invention provides, in a first aspect, a method for predicting the harmfulness of a gene mutation, which in some embodiments is a computer-implemented method performed at an electronic device having at least one processor and memory. In some embodiments, the gene mutation is a human gene mutation. The method comprises the following steps S1-S6:
S1: providing a mutation public database;
S2: acquiring a plurality of mutation data sets according to the mutation public database, wherein the mutation data sets are divided, in order of publication time from earliest to latest, into a first data set, a second data set and a third data set;
S3: annotating the first data set with a plurality of features;
S4: training by using the plurality of features and the second data set to obtain a training model;
S5: obtaining mutation prediction data of the third data set according to the training model, wherein, if the mutation prediction data is consistent with the mutation data in the third data set, the training model is an effective model;
S6: predicting the harmfulness of the gene mutation based on the effective model.
In step S1, the mutation public database may be any one of ClinVar, OMIM, HGMD, CIViC, endogen, dbVar, DGV and DECIPHER, and in particular may be the ClinVar mutation database. In one embodiment, 49021 known pathogenic mutation sites in the ClinVar mutation database are used for model training. In some embodiments, the mutation may be an acquired or a deleted gene mutation.
Referring to fig. 1, in step S2, the mutation data sets may include a first data set, a second data set and a third data set, divided according to the publication time of the mutation data in the mutation public database. The three data sets may be consecutive in time, with the first data set containing the earliest mutation data. In some embodiments, the first data set is mutation data taken from the 2012 database, the second data set is mutation data from 2013 to 2015, and the third data set is mutation data from 2016 to 2017. In an embodiment, the time span of the second data set and the third data set may be 3 to 5 years; the span may be chosen according to how often the different mutation public databases are updated, so as to obtain the most appropriate amount of data and improve the accuracy of the model. The first data set may include positive mutation data and negative mutation data, while the second data set and the third data set may be positive mutation data; the positive mutation data in the third data set may include new mutation data for model evaluation. In one embodiment, the first data set includes 41590 mutation records, the second data set includes 6576 positive mutation records, and the third data set includes 855 positive mutation records, of which 417 are new mutation data.
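The time-ordered partition of step S2 can be sketched as follows. This is a minimal illustration, assuming each record carries a publication year and that the first data set covers everything published up to and including 2012; the record layout, the function names and the example records are hypothetical and not taken from the patent.

```python
from collections import defaultdict

# Year boundaries taken from the embodiment described above.
# Treating the first data set as "2012 and earlier" is an assumption.
FIRST_SET_LAST_YEAR = 2012
SECOND_SET_LAST_YEAR = 2015
THIRD_SET_LAST_YEAR = 2017


def assign_data_set(year):
    """Return the name of the data set a record published in `year` belongs to."""
    if year <= FIRST_SET_LAST_YEAR:
        return "first"
    if year <= SECOND_SET_LAST_YEAR:
        return "second"
    if year <= THIRD_SET_LAST_YEAR:
        return "third"
    return None  # published after the evaluation window, so not used


def split_by_publication_year(records):
    """Group (variant_id, year) pairs into the three time-ordered data sets."""
    data_sets = defaultdict(list)
    for variant_id, year in records:
        name = assign_data_set(year)
        if name is not None:
            data_sets[name].append(variant_id)
    return dict(data_sets)


# Hypothetical records: (variant identifier, year of publication).
example_records = [
    ("var_a", 2010),
    ("var_b", 2012),
    ("var_c", 2014),
    ("var_d", 2017),
    ("var_e", 2019),
]
example_split = split_by_publication_year(example_records)
```

Splitting by publication time rather than at random means the evaluation set contains only mutations published after everything the model was trained on, which is what allows the newly published mutations to serve as a test of generalization.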
Referring to fig. 1, in an embodiment, the first data set and the second data set serve as the model training set, and the third data set serves as the model evaluation set for evaluating the accuracy of the training model obtained from the training set. The mutation data newly published in the third data set relative to the second data set can be used to evaluate the accuracy of the model. The method uses the mutation data in the first data set as base training data, performs model training with the mutation data in the second data set, and on that basis obtains predictions of the harmfulness of novel mutated genes.
Referring to fig. 1, in step S3, the plurality of features used to annotate the first data set may span several categories, such as population-frequency features based on databases such as ExAC and 1KG, features based on the ACMG/AMP guidelines, features based on interactome-predicted phenotypes, phenotype similarity scores based on the ERIC method, gene-level constraints, and functional consequences. In one example, the features include 6 features associated with population allele frequency, 5 gene-phenotype similarity scores, 15 features based on the ACMG/AMP guidelines, 9 gene-level constraint scores, 12 existing in silico pathogenicity prediction scores, 2 features describing the functional effect of the variation, and 2 database-related gene-level features. Each gene mutation in the training data is annotated with these features, which are then used for model training, so that the pathogenicity of a gene mutation sequence under test can be predicted by the trained model.
The population allele frequency features come mainly from three population frequency databases: the 1000 Genomes Project, ESP and ExAC. The 15 features based on the ACMG/AMP guidelines use an optimized scoring system, including: PVS1: 6, PS1: 4, PM1: 2, PM2: 2, PM4: 2, PM5: 2, PP2: 1, PP3: 1, BA1: -9, BS1: -3, BS2: 3, BP3: 1, BP4: 1, BP7: 2. The 12 existing in silico pathogenicity prediction methods include Polyphen2_HVAR, LRT, MutationTaster, GERP++_RS, phyloP, SPIDEX, CADD and DANN, with data derived from the dbNSFP and ANNOVAR databases. The 9 gene-level constraint scores are derived from the ExAC project (pDom, pRec, syn_z, mis_z and pLI) and from the CADD database (CADD_a1qt005, CADD_a1qt01, CADD_a2qt005 and CADD_a2qt01). The 2 features describing the functional effect of the variation are derived from the annotation results of the VEP tool and the prediction results of the LOFTEE algorithm, respectively. The 2 database-related gene-level features are the organized rare-related gene entry information and the similarity score between genes in the STRING database. The 5 gene-phenotype similarity scores are the phenotype similarity score calculated by the ERIC algorithm, the gene-phenotype similarity score, the phenotype ranking score calculated by the ERIC algorithm, the gene-phenotype similarity ranking score, and the ERIC normalized score, as described below.
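As a quick consistency check on the feature categories, the group sizes given in the example can be laid out as a feature schema. This is only a sketch: the group keys and the generated column names are hypothetical labels, and only the counts come from the text.

```python
# Number of features per category, as stated in the example above.
# The dictionary keys are illustrative labels, not names used by the patent.
FEATURE_GROUP_SIZES = {
    "population_allele_frequency": 6,
    "gene_phenotype_similarity": 5,
    "acmg_amp_guideline": 15,
    "gene_level_constraint": 9,
    "in_silico_pathogenicity_score": 12,
    "functional_effect": 2,
    "database_gene_level": 2,
}


def total_feature_count(group_sizes):
    """Total length of the feature vector assigned to each mutation."""
    return sum(group_sizes.values())


def feature_column_names(group_sizes):
    """Generate placeholder column names, one per feature, grouped by category."""
    names = []
    for group, size in group_sizes.items():
        names.extend(f"{group}_{index}" for index in range(1, size + 1))
    return names


feature_total = total_feature_count(FEATURE_GROUP_SIZES)
feature_columns = feature_column_names(FEATURE_GROUP_SIZES)
```

The seven categories add up to 51 features per mutation, which is the width of the input vector x used for model training.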
Referring to FIG. 1, in step S3, the ERIC method uses [symbol omitted in source] to express the distance between phenotypes $t_1$ and $t_2$, while setting the distance [symbol omitted in source] to 0. The distance is calculated as follows:
calculating the information content [formula omitted in source], wherein $N$ is the total number of genes and $n$ is the number of genes associated with phenotype $t$;
using [symbol omitted in source] to represent the distance between phenotype $t$ and an ancestor of phenotype $t$, wherein [symbol omitted in source] represents an ancestor of phenotype $t$, and the calculation formula is:
[formula omitted in source]
and calculating, on that basis, the distance between phenotypes $t_1$ and $t_2$ as:
[formula omitted in source]
or:
[formula omitted in source]
wherein [symbol omitted in source] represents the most informative phenotype among the common ancestral phenotypes of $t_1$ and $t_2$.
The correlation between phenotypes can be obtained by calculating the distance between different phenotypes, from which the correlation between genes and phenotypes can in turn be obtained.
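Because the ERIC formulas themselves are not reproduced in the text, the following is only a rough sketch under stated assumptions. Information content is taken to be the usual -log(n/N), and the distance is a stand-in (the information content lost on the way from each phenotype to their most informative common ancestor), not the ERIC formula. The toy ontology, the gene counts and all function names are hypothetical.

```python
import math

# Hypothetical toy phenotype ontology: term -> list of parent terms.
PARENTS = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"], "D": ["root"]}
# Hypothetical number of genes associated with each phenotype term (n).
GENE_COUNTS = {"root": 100, "A": 20, "B": 5, "C": 4, "D": 10}
TOTAL_GENES = 100  # N, the total number of genes


def information_content(n_associated, n_total):
    """Assumed definition: IC(t) = -log(n / N)."""
    return -math.log(n_associated / n_total)


def term_ic(term):
    return information_content(GENE_COUNTS[term], TOTAL_GENES)


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_informative_common_ancestor(t1, t2):
    """Common ancestor of t1 and t2 with the highest information content."""
    common = ancestors(t1) & ancestors(t2)
    return max(common, key=term_ic)


def assumed_distance(t1, t2):
    """Stand-in distance (NOT the ERIC formula): IC lost from each term to the MICA."""
    mica = most_informative_common_ancestor(t1, t2)
    return (term_ic(t1) - term_ic(mica)) + (term_ic(t2) - term_ic(mica))
```

With this stand-in, a phenotype is at distance 0 from itself, and two sibling phenotypes are closer the more specific their shared ancestor is, which matches the qualitative behaviour the description asks of the distance.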
Referring to fig. 1, in step S4, training is performed using the plurality of features and the second data set to obtain a training model: the feature-annotated first data set is trained together with the second data set to predict whether a gene mutation is harmful.
In step S4, the gene mutations in the training data are annotated with the plurality of features as described above, all of the feature information is then integrated, and a gradient boosting algorithm is used to train a model to predict the pathogenicity of a mutation. The gradient boosting decision tree (GBDT) algorithm builds an ensemble of classification and regression trees; the principle is as follows:
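If there are $K$ trees, the model is
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$
i.e. a set of prediction trees. A prediction tree $f_t$ is defined as
$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{d} \to \{1, 2, \ldots, T\},$$
where $w$ is the score vector on the leaves of the tree, the function $q$ assigns each data point to a leaf, and $T$ is the number of leaves.
The objective function has two parts:
$$\text{obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$
where the training loss $l$ is the logistic loss of logistic regression:
$$l(y_i, \hat{y}_i) = y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right).$$
The regularization term $\Omega$ is
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,$$
where $T$ is the number of leaves and $w_j$ is the score of the $j$-th leaf.
The model is trained with an additive strategy: the trees already learned are fixed and one new tree is added at a time. The objective at step $t$ is:
$$\text{obj}^{(t)} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}.$$
Taking the first- and second-order terms of the Taylor expansion of the loss function, the objective at step $t$ becomes:
$$\text{obj}^{(t)} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant},$$
where $g_i$ and $h_i$ are the first and second derivatives of $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ with respect to $\hat{y}_i^{(t-1)}$.
Thus, the objective at step $t$ can be simplified to:
$$\sum_{i} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t).$$
Now that there is a way to score a tree, the strategy for growing a tree that reduces the objective is described next. Growing a tree is a process that iteratively splits one leaf into two leaves. The following formula is used to determine whether a leaf should be split:
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$$
where $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of $g_i$ and $h_i$ over the data points assigned to the left and right leaves, respectively.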
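The scoring and splitting rule can be illustrated numerically. The following is a minimal sketch, assuming the standard second-order gradient boosting formulation: for the logistic loss the gradient is p - y and the hessian is p(1 - p), where p is the predicted probability; the optimal leaf weight is -G/(H + lambda); and the split gain compares the scores of the two child leaves with the score of the unsplit parent, where G and H are the sums of gradients and hessians in a leaf. The function names and example numbers are illustrative and not taken from the patent.

```python
import math


def sigmoid(margin):
    """Map a raw model score (margin) to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def logistic_grad_hess(label, margin):
    """First and second derivative of the logistic loss w.r.t. the margin."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)


def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal score of a leaf with gradient sum G and hessian sum H."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return grad_sum * grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Gain of splitting a leaf in two; a split is worthwhile only if this is positive."""
    children = leaf_score(g_left, h_left, reg_lambda) + leaf_score(g_right, h_right, reg_lambda)
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma


# Illustrative numbers: the two children have opposite gradient sums.
example_gain = split_gain(-2.0, 1.0, 2.0, 1.0, reg_lambda=1.0, gamma=0.0)
```

In this example the parent leaf has a gradient sum of zero and so cannot improve on its own, while each child can; the gain is therefore positive. Raising `gamma` makes the same split less attractive, which is how the penalty on the number of leaves limits tree growth.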
A specific implementation may be as follows: the 51 features are taken as x, and the label variable y indicates whether a variant is associated with a phenotype; the prediction model is trained using the GBDT algorithm in the Python XGBoost package. During model training, max_depth and min_child_weight are used to control model complexity, and subsample and lambda are used to reduce overfitting; finally, the parameters (learning rate, maximum tree depth, subsampling rate, etc.) are determined by grid search. In some embodiments, 10-fold cross-validation may also be used to guard against overfitting.
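The parameter search and cross-validation can be sketched without committing to any particular training code. The sketch below, in plain Python, enumerates candidate settings over the XGBoost parameters named above and builds 10 train/validation partitions; the candidate values, the choice of an exhaustive grid and the helper names are assumptions, and each resulting setting would still have to be passed to the XGBoost training routine and scored on the held-out fold.

```python
import itertools

# Candidate values are hypothetical; the parameter names are XGBoost's native ones
# ("eta" is the learning rate, "lambda" the L2 regularization weight).
PARAM_GRID = {
    "eta": [0.05, 0.1, 0.3],
    "max_depth": [4, 6, 8],
    "min_child_weight": [1, 5],
    "subsample": [0.7, 1.0],
    "lambda": [1.0, 10.0],
}


def iter_param_settings(grid):
    """Yield one parameter dictionary per combination of candidate values."""
    keys = list(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        setting = dict(zip(keys, values))
        setting["objective"] = "binary:logistic"  # logistic loss for a binary label
        yield setting


def k_fold_indices(n_samples, n_folds=10):
    """Yield (training_indices, validation_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    folds = [indices[start::n_folds] for start in range(n_folds)]
    for held_out in range(n_folds):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield training, validation
```

Each sample lands in exactly one validation fold, so every setting is scored on data it was not trained on; the setting with the best average validation score would then be used for the final model.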
Referring to fig. 1, in step S4, in order to improve the accuracy of the model, the training using the plurality of features and the second data set may further include training the model using simulated patients and phenotypes from negative samples of the UK10K database. This further removes negative mutation phenotype data, making the model more accurate and reliable and improving prediction efficiency.
Referring to fig. 2, in some embodiments, the method extracts a first data set 101, a second data set 102 and a third data set 103 from the ClinVar database 100, and annotates the first data set 101 with features 200 to obtain an annotated data set 104. Model training 106 is performed on the annotated data set 104 using the second data set 102 and the phenotypes 105 of simulated negative samples, and the model training 106 step is iterated 300 to finally generate a training model 107. The training model 107 outputs mutation prediction data 108, on which model evaluation 109 is performed using the evaluation data set, i.e. the third data set 103; the third data set 103 may also be processed with the phenotypes 105 of simulated negative samples before evaluation. If the mutation prediction data 108 is consistent with the true mutation results in the third data set 103, the training model 107 is an effective model that can be used to evaluate the harmfulness of gene mutations.
Referring to fig. 1, in step S5, the trained model may be evaluated using the third data set. That is, the trained model is used to predict the positive genes whose mutation may give rise to the relevant phenotype, and these predicted positive genes are compared with the known positive genes disclosed in the third data set. If the two are consistent, the method is shown to be effective. Consistency may mean that the error of the predicted mutated genes is within 10%, i.e. the accuracy of the prediction method of the present invention may be above 90%. In some embodiments, before being compared with the mutation prediction data, the mutation data in the third data set may also be processed using simulated patients and phenotype data from negative samples of the UK10K database; using the third data set processed in this way may improve the accuracy and timeliness of the prediction method.
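The consistency check of step S5 can be sketched as follows. The text does not define the error measure precisely, so this example assumes one plausible reading: the error is the fraction of known positive genes in the third data set that the model fails to report. The function names and gene identifiers are hypothetical.

```python
def agreement_rate(predicted_positive, known_positive):
    """Fraction of the known positive genes that the model also reports as positive."""
    known = set(known_positive)
    if not known:
        raise ValueError("the evaluation set must contain at least one known positive gene")
    return len(known & set(predicted_positive)) / len(known)


def is_effective_model(predicted_positive, known_positive, max_error=0.10):
    """Assumed criterion: the model is effective if its error is within max_error."""
    error = 1.0 - agreement_rate(predicted_positive, known_positive)
    return error <= max_error


# Hypothetical evaluation set of ten known positive genes.
known_genes = [f"gene_{i}" for i in range(1, 11)]
# The model recovers nine of them and reports one gene that is not in the set.
predicted_genes = known_genes[:9] + ["gene_other"]
example_is_effective = is_effective_model(predicted_genes, known_genes)
```

Under this reading, recovering nine of ten known positive genes gives a 10% error and the model passes; recovering eight of ten would not.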
Referring to FIG. 1, in step S6, predicting the harmfulness of the gene mutation based on the effective model includes predicting the likely harmfulness of a given mutated gene using the trained model. In some embodiments, the invention can provide a prediction of the likely disease-causing genes for a patient with a rare disease.
In another aspect, the present invention also provides a computer-readable storage medium storing computer-executable instructions for performing the above-described method. Yet another aspect of the present invention provides a system, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
The storage medium may store one or more computer programs that cause a computer to execute any of the above methods. The computer program may be written in, for example, a general-purpose programming language such as Pascal, C++, Java, Python, JSON, etc., or in an application-specific language.
The system may be a computer system that may include, for example, a processor, memory, storage, and input/output devices (e.g., a monitor, keyboard, disk drive, internet connection, etc.). However, the computing system may include circuitry or other dedicated hardware for performing some or all aspects of the process. In some operating settings, the computing system may be configured as a system comprising one or more units, each of which is configured to perform some aspects of the processes in software, hardware, or some combination thereof.
Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value. The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.
Claims (10)
1. A method for predicting the harmfulness of a gene mutation, comprising the steps of:
providing a mutation public database;
acquiring a plurality of mutation data sets according to the mutation public database, wherein the mutation data sets are divided, in order of publication time from earliest to latest, into a first data set, a second data set and a third data set;
annotating the first data set with a plurality of features;
training by using the plurality of features and the second data set to obtain a training model;
obtaining mutation prediction data of the third data set according to the training model, wherein, if the mutation prediction data is consistent with the mutation data in the third data set, the training model is an effective model; and
predicting the harmfulness of the gene mutation based on the effective model.
2. The method of claim 1, wherein: the first data set comprises positive mutation data and negative mutation data, the second data set is positive mutation data, and the third data set is positive mutation data.
3. The method of claim 1, wherein: the publication periods of the first data set, the second data set and the third data set are consecutive, and the publication time span of the second data set and the third data set is 3-5 years.
4. The method of claim 1, wherein: the plurality of features includes features associated with population allele frequencies, gene-phenotype similarity scores, features based on the ACMG/AMP guidelines, gene-level constraint scores, existing in silico pathogenicity prediction scores, features describing the functional effect of the variation, and database-related gene-level features.
5. The method of claim 4, wherein: the gene-phenotype similarity score is calculated using the ERIC method, the ERIC method using [symbol omitted in source] to express the distance between phenotypes $t_1$ and $t_2$, while setting the distance [symbol omitted in source] to 0; the distance is calculated as follows:
calculating the information content [formula omitted in source], wherein $N$ is the total number of genes and $n$ is the number of genes associated with phenotype $t$;
using [symbol omitted in source] to represent the distance between phenotype $t$ and an ancestor of phenotype $t$, wherein [symbol omitted in source] represents an ancestor of phenotype $t$, and the calculation formula is:
[formula omitted in source]
and calculating, on that basis, the distance between phenotypes $t_1$ and $t_2$ as:
[formula omitted in source]
or:
[formula omitted in source]
wherein [symbol omitted in source] represents the most informative phenotype among the common ancestral phenotypes of $t_1$ and $t_2$.
6. The method of claim 1, wherein: training with the plurality of features and the second data set to obtain a training model comprises: training the model with a gradient boosting algorithm to predict the harmfulness of the gene mutation.
7. The method of claim 1, wherein: training with the plurality of features and the second data set to obtain a training model further comprises: training the model using simulated patients and phenotypes from negative samples of the UK10K database.
8. The method of claim 7, wherein: obtaining the mutation prediction data of the third data set according to the training model further comprises: before the consistency comparison with the mutation prediction data, processing the mutation data in the third data set using simulated patients and phenotype data from negative samples of the UK10K database.
9. A computer-readable storage medium characterized by: comprising computer-executable instructions for performing the method of any of claims 1 to 8.
10. A system, characterized by: the system comprises:
one or more processors;
a memory; and one or more programs;
wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising computer-executable instructions for performing the method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110782580.4A CN113241118A (en) | 2021-07-12 | 2021-07-12 | Method for predicting harmfulness of gene mutation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113241118A true CN113241118A (en) | 2021-08-10 |
Family
ID=77135343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110782580.4A Pending CN113241118A (en) | 2021-07-12 | 2021-07-12 | Method for predicting harmfulness of gene mutation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113241118A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009101625A2 (en) * | 2008-02-12 | 2009-08-20 | Ramot At Tel-Aviv University Ltd. | Method for searching for homing endonucleases, their genes and their targets |
CN105404793A (en) * | 2015-12-07 | 2016-03-16 | 浙江大学 | Method for rapidly discovering phenotype related gene based on probabilistic framework and resequencing technology |
CN106980749A (en) * | 2017-02-21 | 2017-07-25 | 成都奇恩生物科技有限公司 | The quick assisted location method of disease |
CN108363902A (en) * | 2018-01-30 | 2018-08-03 | 成都奇恩生物科技有限公司 | A kind of accurate prediction technique of pathogenic hereditary variation |
WO2019136364A1 (en) * | 2018-01-05 | 2019-07-11 | Illumina, Inc. | Process for aligning targeted nucleic acid sequencing data |
CN110010196A (en) * | 2019-03-19 | 2019-07-12 | 北京工业大学 | A kind of gene similarity searching algorithm based on heterogeneous network |
Non-Patent Citations (3)
Title |
---|
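Li Q et al.: "Xrare: a machine learning method jointly modeling phenotypes and genetic evidence for rare disease diagnosis", Genet Med 21 * |
Zhou Yuan et al.: "Identification and phenotypic analysis of a DES gene mutation characterized by cardiac involvement: a case report", Journal of Pediatric Pharmacy * |
Shi Fang et al.: "Research on a random-forest-based method for predicting deleterious synonymous mutations", China Master's Theses Full-text Database, Information Science and Technology * |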
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Nguyen et al. | A comprehensive survey of regulatory network inference methods using single cell RNA sequencing data | |
Kelley | Cross-species regulatory sequence activity prediction | |
Friedman et al. | Data analysis with Bayesian networks: A bootstrap approach | |
Pavlidis et al. | A survey of methods and tools to detect recent and strong positive selection | |
Günther et al. | Robust identification of local adaptation from allele frequencies | |
Fu et al. | A gene prioritization method based on a swine multi-omics knowledgebase and a deep learning model | |
CN109411023B (en) | Method for mining inter-gene interaction relation based on Bayesian network inference | |
CN113519028A (en) | Methods and compositions for estimating or predicting genotypes and phenotypes | |
CN109727637B (en) | Method for identifying key proteins based on mixed frog-leaping algorithm | |
US20150025861A1 (en) | Genetic screening computing systems and methods | |
Zhao et al. | Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer | |
D’Agaro | Artificial intelligence used in genome analysis studies | |
Teixeira et al. | Learning influential genes on cancer gene expression data with stacked denoising autoencoders | |
Zhan et al. | ProbPFP: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function | |
Zhao et al. | MFCNV: a new method to detect copy number variations from next-generation sequencing data | |
Zhang et al. | Cancer survival prognosis with deep Bayesian perturbation cox network | |
Huang et al. | CNV-MEANN: a neural network and mind evolutionary algorithm-based detection of copy number variations from next-generation sequencing data | |
Quan et al. | Developing parallel ant colonies filtered by deep learned constrains for predicting RNA secondary structure with pseudo-knots | |
Sheehan et al. | Improved maximum likelihood reconstruction of complex multi-generational pedigrees | |
CN113241118A (en) | Method for predicting harmfulness of gene mutation | |
Passafaro et al. | Would large dataset sample size unveil the potential of deep neural networks for improved genome-enabled prediction of complex traits? The case for body weight in broilers | |
Cooke et al. | Fine-tuning of approximate Bayesian computation for human population genomics | |
Zhang et al. | Inferring historical introgression with deep learning | |
Niu et al. | CircRNA identification and feature interpretability analysis | |
CN116959561B (en) | Gene interaction prediction method and device based on neural network model |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210810 |