CN107341366A

CN107341366A - A method for predicting complex disease susceptibility loci using machine learning

Info

Publication number: CN107341366A
Application number: CN201710592222.0A
Authority: CN
Inventors: 董珊珊; 杨铁林; 姚石; 陈霄; 陈一霄; 郭燕; 张钰洁
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2017-07-19
Filing date: 2017-07-19
Publication date: 2017-11-10

Abstract

The invention discloses a method for predicting complex disease susceptibility loci using machine learning, comprising the following steps: (1) collecting known complex disease susceptibility loci as the positive set for a machine learning model, inferring loci unrelated to the complex disease from the positive set as the negative set, and annotating both sets with epigenetic regulatory elements; (2) building an epigenetic regulation model of the complex disease using machine learning; (3) using the established model to predict all loci across the whole genome, and taking the final prediction result as the potential susceptibility loci of the complex disease. The method combines epigenetic information with DNA information, extracts epigenetic regulatory element features by machine learning, and then predicts complex disease susceptibility loci genome-wide. It can significantly increase the heritability explained by the identified susceptibility loci and provides potential targets for subsequent drug design and disease detection.

Description

A method for predicting complex disease susceptibility loci using machine learning

Technical field

The present invention relates to the technical field of complex disease susceptibility locus prediction, and in particular to a screening method that predicts complex disease susceptibility loci using machine learning.

Background

In recent years, genome-wide association studies (GWAS) have become the most popular and effective approach for identifying complex disease susceptibility loci (single nucleotide polymorphisms, SNPs). Using this approach, more than two thousand papers have been published in high-profile international journals, and nearly ten thousand complex disease susceptibility loci have been successfully identified. Although GWAS have been fruitful, they have fallen far short of what scientists expected, namely finding most disease susceptibility loci. For a given complex disease, the reported susceptibility loci cumulatively explain less than 15% of the genetic variation of the disease; a large number of unknown genetic factors, the so-called "missing heritability", remain to be uncovered. This is a common problem across all genetic studies of complex diseases and reflects insufficient use and mining of existing data resources. To find the unknown genetic pathogenic factors, practical new methods and tools are urgently needed to mine human genome data deeply and systematically. The results would help reveal the pathogenesis of complex diseases and support the design and development of targeted drugs, early clinical screening, and individualized prevention and treatment.

The genome carries two classes of genetic information: DNA sequence information and epigenetic information. Findings from epigenetics research have already been applied to the study and treatment of some diseases. It is therefore highly desirable to include epigenetic information when predicting disease susceptibility loci. Existing methods that predict complex disease susceptibility loci from features of genomic epigenetic regulatory elements are varied, and most of them predict genetic variants in exonic regions or at specific loci. However, polymorphisms in non-coding regions can also affect the expression of downstream genes, and studying them can help reveal the pathogenesis of complex diseases. It is therefore necessary to screen loci across the whole genome to find those associated with complex diseases. Several databases now provide genome-wide epigenetic information, but hundreds of millions of genetic markers and multi-dimensional element information pose a huge challenge to the prediction of genetic loci. Machine learning is an interdisciplinary field that has risen over the past twenty-odd years, and research at the intersection of biology and machine learning is increasingly active as a way to use biological data fully and effectively. Predicting complex disease susceptibility loci genome-wide by machine learning, based on features of genomic epigenetic regulatory elements, is therefore very much needed.

Summary of the invention

To overcome the shortcomings of the prior art, the object of the present invention is to provide a method that uses machine learning, combined with features of epigenetic regulatory elements, to predict susceptibility genetic markers of complex diseases. The method combines epigenetic information with genomic DNA information, extracts epigenetic regulatory element features by machine learning, and then predicts complex disease susceptibility loci genome-wide. It can significantly increase the heritability explained and provides potential targets for subsequent drug design and disease detection.

To achieve the above object, the technical solution of the present invention is as follows:

A method for predicting complex disease susceptibility loci using machine learning, comprising the following steps:

P1: Collect known complex disease susceptibility loci as the positive set for the machine learning model, infer loci unrelated to the complex disease from the positive set as the negative set, and annotate both sets with epigenetic regulatory elements;

P2: Build an epigenetic regulation model of the complex disease using machine learning;

P3: Using the established model, predict all loci across the whole genome, and take the final prediction result as the potential susceptibility loci of the complex disease.

Step P1 specifically comprises:

P11: Collect the known susceptibility SNPs of a given disease from the public databases GWAS Catalog and PheGenI and from the relevant literature in PubMed, and use the genotype data published by the 1000 Genomes Project to compute the SNPs in high linkage disequilibrium with the known susceptibility loci; together these form the positive set;

P12: For the negative set, screen the whole genome for SNPs that meet the following conditions: a. located within a certain distance of a SNP in the positive set; b. differing in minor allele frequency from the corresponding SNP in the positive set by less than 0.05; c. independent of all SNPs in the positive set (r² < 0.1);

P13: Obtain all genomic epigenetic regulatory element information from the UCSC and Roadmap databases, including transcription factor binding sites, histone modification sites and chromatin segmentation states; obtain expression quantitative trait locus (eQTL) information for the relevant tissues from the GTEx database; obtain sequence conservation features from the ANNOVAR database. Each type of regulatory element is saved as a separate text file;

P14: Using the epigenetic regulatory element information obtained, annotate the SNPs in the positive and negative sets according to their physical positions in the genome. The matching rule is: if the position of a SNP overlaps the genomic interval of a regulatory element, the SNP is considered to be annotated by that regulatory element.
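The overlap rule in P14 can be illustrated with a minimal sketch. It assumes each regulatory element is stored as a list of half-open intervals per chromosome (BED-style); the function names, the element names and the toy coordinates are hypothetical and are not taken from the patent.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed half-open [start, end), BED-style


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge overlapping ones so binary search can be used."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps(position: int, merged: List[Interval]) -> bool:
    """Return True if the position falls inside any of the merged intervals."""
    # Index of the last interval whose start is <= position.
    idx = bisect_right(merged, (position, float("inf"))) - 1
    return idx >= 0 and merged[idx][0] <= position < merged[idx][1]


def annotate(
    snps: Dict[str, Tuple[str, int]],
    elements: Dict[str, Dict[str, List[Interval]]],
) -> Dict[str, Dict[str, int]]:
    """Build a binary SNP-by-element matrix: 1 if the SNP overlaps the element."""
    prepared = {
        name: {chrom: merge_intervals(ivs) for chrom, ivs in by_chrom.items()}
        for name, by_chrom in elements.items()
    }
    return {
        snp_id: {
            name: int(overlaps(pos, by_chrom.get(chrom, [])))
            for name, by_chrom in prepared.items()
        }
        for snp_id, (chrom, pos) in snps.items()
    }


# Hypothetical toy data; identifiers and coordinates are illustrative only.
snps = {"rs_a": ("chr1", 150), "rs_b": ("chr1", 900), "rs_c": ("chr2", 40)}
elements = {
    "H3K27ac_example": {"chr1": [(100, 200), (180, 260)]},
    "TFBS_example": {"chr1": [(850, 901)], "chr2": [(0, 40)]},
}
matrix = annotate(snps, elements)
for snp_id, row in matrix.items():
    print(snp_id, row)
```

Each row of the resulting matrix is the feature vector of one SNP; in this sketch one column corresponds to one element file from P13.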

Step P2 specifically comprises:

P21: For the annotation results above, compute the correlations between regulatory elements using the corrplot package in R, and randomly remove one of each group of highly correlated regulatory elements. Then randomly split the annotation results into a training set and a test set, with the training set accounting for 80% of the total and the test set for 20%; 5-fold cross-validation is carried out in this step;

P22: Build models on the training-set annotation matrix obtained in P21 using different machine learning algorithms, and assess the reliability of the models on the test set. The evaluation indices are sensitivity, specificity, precision, accuracy and the F1 score, calculated as follows:

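Sensitivity = TP / P = TP / (TP + FN)

Specificity = TN / N = TN / (TN + FP)

Precision = TP / P′ = TP / (TP + FP)

Accuracy = (TP + TN) / (P + N)

F1 = 2 × TP / (2 × TP + FP + FN)

where TP is the number of true positives, FN the number of false negatives, TN the number of true negatives and FP the number of false positives; P = TP + FN is the number of actual positives, N = TN + FP the number of actual negatives, and P′ = TP + FP the number of predicted positives;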

P23: According to the model evaluation indices described in P22, optimize the model by feature selection over the regulatory elements. The specific steps are as follows: obtain from the model a ranking of the regulatory elements by their importance to the model; build multiple feature subsets according to element importance, with the number of features in the subsets increasing from 1 to the maximum; determine the optimal subset for the model according to the model evaluation indices, and use it to predict new complex disease susceptibility loci.
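The evaluation indices of P22 can be computed directly from the four confusion-matrix counts. The following is a minimal sketch that assumes binary labels with 1 for the positive set and 0 for the negative set; the function names and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FN, TN and FP for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            counts["TP" if pred == 1 else "FN"] += 1
        else:
            counts["FP" if pred == 1 else "TN"] += 1
    return counts


def evaluation_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Apply the P22 formulas; a zero denominator yields 0.0 (a choice of this sketch)."""

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical test-set labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = evaluation_metrics(counts["TP"], counts["FN"], counts["TN"], counts["FP"])
print(counts)
print(metrics)
```

Because the negative set is typically much larger than the positive set, accuracy alone can look high even for a poor model, which is one reason to report sensitivity, precision and F1 alongside it.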

Step P3 specifically comprises:

P31: Obtain the optimal feature subset of the machine learning model from step P2, and annotate all loci across the whole genome with the regulatory elements contained in that subset;

P32: Using the optimal model established, predict all loci across the whole genome, and take the loci whose regulatory element annotations resemble those of the positive set as the potential susceptibility loci of the complex disease.

The screening method of the present invention, which predicts complex disease susceptibility loci by machine learning based on features of genomic epigenetic regulatory elements, is applicable to a variety of complex diseases, for example cancers, endocrine system diseases, cardiovascular diseases, metabolic diseases and immune diseases. The present invention proposes a method that uses machine learning, combined with features of epigenetic regulatory elements, to predict susceptibility genetic markers of complex diseases. It combines epigenetic information with DNA information, extracts epigenetic regulatory element features by machine learning, and then predicts complex disease susceptibility loci genome-wide. It can significantly increase the heritability explained and provides potential targets for subsequent drug design and disease detection.

Brief description of the drawings

Fig. 1 is a flow chart of the screening method provided by the invention for predicting complex disease susceptibility loci using machine learning.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawing.

Example: Taking the complex disease type II diabetes as an example, the method of the present invention is used to predict type II diabetes susceptibility loci, as described in detail below.

As shown in Fig. 1, the present invention provides a screening method that predicts complex disease susceptibility loci by machine learning based on features of genomic epigenetic regulatory elements, comprising the following steps P1-P3.

P1: Collect known type II diabetes susceptibility loci as the positive set for the machine learning model, and annotate them with epigenetic regulatory elements.

Specifically: The known susceptibility SNPs for type II diabetes, 65 in total, were collected from the public databases GWAS Catalog and PheGenI and from the relevant literature in PubMed to form the positive set. The genotype data published by the 1000 Genomes Project were then used to compute the SNPs in high linkage disequilibrium with these 65 susceptibility loci as a supplement to the positive set, giving 1769 SNPs in total. SNPs that simultaneously meet the conditions described in P12 were screened as the negative set. Epigenetic regulatory element information related to type II diabetes was obtained from the UCSC, Roadmap, GTEx and ANNOVAR databases. After highly correlated elements were removed, this comprised 33 types of DNase I hypersensitive sites, 202 types of transcription factor binding sites, 315 types of histone modification sites, 639 types of chromatin segmentation states, 17 types of expression quantitative trait locus (eQTL) information and 1 sequence conservation feature. This epigenetic regulatory element information was used to annotate the SNPs in the positive and negative sets.
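The three negative-set conditions of P12 can be sketched as a simple filter. The sketch below assumes that minor allele frequencies and pairwise r² values have already been computed (for example from 1000 Genomes Project genotypes) and are passed in as plain Python objects. The `Snp` class, the `max_distance` value and the treatment of missing r² pairs as 0 are assumptions of this sketch; the patent only says "a certain distance".

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    maf: float  # minor allele frequency


def select_negatives(
    positives: List[Snp],
    candidates: List[Snp],
    r2: Dict[FrozenSet[str], float],
    max_distance: int,           # hypothetical value; not specified in the source
    max_maf_diff: float = 0.05,  # condition b
    max_r2: float = 0.1,         # condition c
) -> List[Snp]:
    """Keep candidates that satisfy conditions a, b and c of P12."""
    positive_ids = {p.snp_id for p in positives}
    selected = []
    for cand in candidates:
        if cand.snp_id in positive_ids:
            continue
        # Condition c: independent of ALL positive SNPs.
        # Pairs missing from the lookup are treated as r2 = 0 (assumption).
        if any(r2.get(frozenset((cand.snp_id, p.snp_id)), 0.0) >= max_r2
               for p in positives):
            continue
        # Conditions a and b: close to a positive SNP with a similar MAF.
        matched = any(
            p.chrom == cand.chrom
            and abs(p.pos - cand.pos) <= max_distance
            and abs(p.maf - cand.maf) < max_maf_diff
            for p in positives
        )
        if matched:
            selected.append(cand)
    return selected


# Hypothetical toy data, for illustration only.
positives = [Snp("pos1", "chr1", 1000, 0.30)]
candidates = [
    Snp("c1", "chr1", 1500, 0.32),    # passes all three conditions
    Snp("c2", "chr1", 1200, 0.31),    # fails c: in LD with pos1
    Snp("c3", "chr1", 900000, 0.30),  # fails a: too far away
    Snp("c4", "chr1", 1100, 0.45),    # fails b: MAF differs by 0.15
    Snp("c5", "chr2", 1000, 0.30),    # fails a: different chromosome
]
r2 = {frozenset(("c1", "pos1")): 0.02, frozenset(("c2", "pos1")): 0.8}
negatives = select_negatives(positives, candidates, r2, max_distance=10_000)
print([s.snp_id for s in negatives])
```

Matching negatives to positives on distance and allele frequency keeps the two sets comparable, so that the model is more likely to learn differences in regulatory annotation than differences in genomic context.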

P2: Build an epigenetic regulation model of type II diabetes using machine learning.

Specifically: The annotation results were randomly split into a training set and a test set, with the training set accounting for 80% of the total and the test set for 20%; 5-fold cross-validation was carried out in this step, and the model was optimized using multiple model types and feature selection over the regulatory elements. Among the type II diabetes prediction models, the random forest model performed best when using 60 regulatory elements, with a sensitivity of 0.9736, a specificity of 0.9852 and an F1 score of 0.9213.
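The feature selection described in P23 (nested subsets built from the importance ranking, with the best subset chosen by an evaluation index) can be sketched as follows. The scoring function is passed in as a callable; in practice it might train a model on the given features and return, for example, a cross-validated F1 score, but that choice, the function names and the toy importances below are assumptions of this sketch rather than details from the patent.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rank_features(importance: Dict[str, float]) -> List[str]:
    """Order features from most to least important (ties broken by name)."""
    return sorted(importance, key=lambda name: (-importance[name], name))


def best_subset(
    ranked: Sequence[str],
    score_subset: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Evaluate the top-1, top-2, ..., top-n subsets and return the best one."""
    best: List[str] = []
    best_score = float("-inf")
    for size in range(1, len(ranked) + 1):
        subset = list(ranked[:size])
        score = score_subset(subset)
        # Strict comparison: on ties the smaller subset is kept.
        if score > best_score:
            best, best_score = subset, score
    return best, best_score


# Hypothetical importances and a toy scorer standing in for model evaluation.
importance = {"elem_a": 0.50, "elem_b": 0.30, "elem_c": 0.15, "elem_d": 0.05}
informative = {"elem_a", "elem_b"}


def toy_score(subset: List[str]) -> float:
    """Reward informative elements and penalize uninformative ones."""
    hits = sum(1 for name in subset if name in informative)
    noise = len(subset) - hits
    return 0.4 * hits - 0.05 * noise


ranked = rank_features(importance)
subset, score = best_subset(ranked, toy_score)
print(ranked)
print(subset, score)
```

With a real model the scorer would be far more expensive, so the number of subset sizes evaluated (here every size from 1 to the maximum) is the main cost driver of this step.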

P3: Using the established model, predict all loci across the whole genome, and take the final prediction result as the potential susceptibility loci of type II diabetes.

Specifically: All loci across the whole genome were annotated and predicted using the 60 regulatory elements of the optimal model from P2, and the loci whose regulatory element annotations resemble those of the positive set were taken as the potential susceptibility loci of type II diabetes.

Experimental results: For type II diabetes, the present invention predicted 15204 potential susceptibility loci in total. A heritability analysis based on the type II diabetes data in the dbGaP database (phs000867.v1.p1) found that the predicted loci significantly increase the heritability explained (P<0.05). Pathway analysis of the genes affected by the newly predicted susceptibility loci found that the genes are significantly enriched in pathways such as type I diabetes mellitus, antigen processing and presentation, graft-versus-host disease, allograft rejection and cytokine-cytokine receptor interaction. These pathways have been reported to be associated with the development of type II diabetes. This indicates that the screening method for predicting complex disease susceptibility loci by machine learning, based on features of genomic epigenetic regulatory elements, is feasible.

The specific embodiments described above further explain the object, technical solution and beneficial effects of the present invention in detail. The protection scope of the present invention is not, however, limited to them. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A screening method for predicting complex disease susceptibility loci using machine learning, characterized by comprising the following steps:

P1: Collect known complex disease susceptibility loci as the positive set for the machine learning model, infer loci unrelated to the complex disease from the positive set as the negative set, and annotate both sets with epigenetic regulatory elements;

P2: Build an epigenetic regulation model of the complex disease using machine learning;

P3: Using the established model, predict all loci across the whole genome, and take the final prediction result as the potential susceptibility loci of the complex disease.

2. The screening method for predicting complex disease susceptibility loci using machine learning according to claim 1, characterized in that step P1 specifically comprises the following steps:

P11: Collect the known susceptibility SNPs of a given complex disease from the public databases GWAS Catalog and PheGenI and from the relevant literature in PubMed, and use the genotype data published by the 1000 Genomes Project to compute the SNPs in high linkage disequilibrium with the known susceptibility loci; together these form the positive set;

P12: For the negative set, screen the whole genome for SNPs that meet the following conditions: a. located within a certain distance of a SNP in the positive set; b. differing in minor allele frequency from the corresponding SNP in the positive set by less than 0.05; c. independent of all SNPs in the positive set (r² < 0.1); after selection, the ratio of the positive set to the negative set is 1:20;

P13: Obtain all genomic epigenetic regulatory element information from the UCSC and Roadmap databases, including transcription factor binding sites, histone modification sites and chromatin segmentation states; obtain expression quantitative trait locus (eQTL) information for the relevant tissues from the GTEx database; obtain sequence conservation features from the ANNOVAR database; each type of regulatory element is saved as a separate text file;

P14: Using the epigenetic regulatory element information obtained, annotate the SNPs in the positive and negative sets according to their physical positions in the genome, the matching rule being: if the position of a SNP overlaps the genomic interval of a regulatory element, the SNP is considered to be annotated by that regulatory element.

3. The screening method for predicting complex disease susceptibility loci using machine learning according to claim 1, characterized in that step P2 specifically comprises the following steps:

P21: For the annotation results above, compute the correlations between regulatory elements using the corrplot package in R and randomly remove one of each group of highly correlated regulatory elements; then randomly split the annotation results into a training set and a test set, with the training set accounting for 80% of the total and the test set for 20%; 5-fold cross-validation is carried out in this step;

P22: Build models on the training-set annotation matrix obtained in P21 using different machine learning algorithms, the machine learning methods including but not limited to random forests, decision trees and support vector machines; and assess the reliability of the models on the test set, the evaluation indices being sensitivity, specificity, precision, accuracy and the F1 score, calculated as follows:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Precision = TP / (TP + FP)

Accuracy = (TP + TN) / (TP + FN + FP + TN)

F1 = 2 × TP / (2 × TP + FP + FN)

P23: According to the model evaluation indices described in P22, optimize the model by feature selection over the regulatory elements, the specific steps being as follows: obtain from the model a ranking of the regulatory elements by their importance to the model; build multiple feature subsets according to element importance, with the number of features in the subsets gradually decreasing from the maximum to 1; determine the optimal subset for the model according to the model evaluation indices, and use it to predict new complex disease susceptibility loci.

4. The screening method for predicting complex disease susceptibility loci using machine learning according to claim 1, characterized in that step P3 specifically comprises the following:

P31: Obtain the optimal feature subset of the machine learning model from step P2, and annotate all loci across the whole genome with the regulatory elements contained in that subset;

P32: Using the optimal model established, predict all loci across the whole genome, and take the loci whose regulatory element annotations resemble those of the positive set as the potential susceptibility loci of the complex disease.