CN107301330A

CN107301330A - A method for mining methylation patterns from whole-genome data

Info

Publication number: CN107301330A
Application number: CN201710409105.6A
Authority: CN
Inventors: 杨利英; 杨胜楠
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2017-06-02
Filing date: 2017-06-02
Publication date: 2017-10-27

Abstract

The invention belongs to the technical field of bioinformatics data processing and discloses a method for mining methylation patterns from whole-genome data, comprising: applying the Significance Analysis of Microarrays (SAM) method to each of several sample data sets to screen out the differentially methylated sites across the whole genome; intersecting the differentially methylated sites of the sample sets to obtain a common set of differential sites; calculating the Pearson correlation coefficient between the methylation level of each differentially methylated site and the expression of the corresponding gene to identify methylation regulatory sites; and iteratively applying affinity propagation (AP) clustering to the differential site set to obtain methylation clusters, performing pattern analysis on each methylation cluster, and verifying the results by gene annotation and KEGG enrichment analysis. The invention provides a reference for the development of demethylating drugs. Different types of disease do share common methylation patterns, so studying the relationship between methylation patterns and disease from a whole-genome perspective has practical and clinical significance.

Description

A method for mining methylation patterns from whole-genome data

Technical field

The invention belongs to the technical field of bioinformatics data processing, and more particularly relates to a method for mining methylation patterns from whole-genome data.

Background technology

With the continuing progress of high-throughput sequencing and biochip technologies, massive amounts of genetic data can be obtained efficiently. These data contain many complex biological phenomena, making it possible to explore the genetic and epigenetic basis of disease comprehensively and providing new directions and ideas for modern life science research. Massive data, however, cannot directly reveal biological phenomena or reflect biological laws; complex statistical methods and other techniques are needed to analyze and explore the biological phenomena the data contain. The discipline of bioinformatics arose from this need. Bioinformatics is a new discipline that combines life science and computer science. It studies the collection, processing, storage, dissemination, analysis and interpretation of biological information, and reveals the biological secrets contained in complex biological data by drawing on biology, computer science and information technology together.

The human genome carries two kinds of information, genetic information and epigenetic information, which gave rise to genetics and epigenetics. Genetics studies heredity and variation in organisms, including the structure, function, variation and expression of genes, that is, the heritable information produced by changes in the DNA sequence. Epigenetics studies heritable changes in gene expression that occur without any change in the nucleotide sequence. Genetic and epigenetic information are relative concepts; they depend on each other and together constitute human hereditary information. DNA methylation is a vital process in embryogenesis and development and one of the most common epigenetic modifications. As an important component of epigenetic modification, DNA methylation has therefore become a research focus and has achieved significant results in the early detection, prevention, treatment and prognosis of disease.

DNA methylation is the chemical modification in which, under catalysis by DNA methyltransferase (DNMT) and with S-adenosylmethionine as the methyl donor, a methyl group is added to the 5-carbon of the cytosine in a CpG dinucleotide. DNA methylation can inactivate some genes and change the DNA conformation of some regions, thereby affecting DNA-protein interactions and controlling gene expression. It may also change the chromatin structure of the corresponding genomic regions, so that the DNA loses nuclease and restriction endonuclease cleavage sites and DNase-sensitive sites, and the chromatin becomes highly coiled and condensed and loses transcriptional activity. Analysis of the relationship between methylation level and gene expression shows that the two are negatively correlated: hypomethylation promotes gene expression, whereas hypermethylation suppresses it. Numerous studies also show that, compared with normal cells, the overall methylation level of the genome in diseased cells is lower while local promoter regions are abnormally hypermethylated, which provides a theoretical basis for detecting the onset of disease from methylation levels. In addition, some genes may undergo tumor-specific methylation changes in cancer cells or tissues. Based on this property, DNA methylation can serve as a biomarker for early diagnosis, and molecular markers can further determine the subtype of a disease, which is extremely important for treatment. Furthermore, because epigenetic changes are reversible, DNA methylation can serve clinically as a new therapeutic target: studies have shown that treating cells in in vitro culture with demethylating drugs can reactivate genes silenced by changes in DNA methylation.

The limitations of sequencing and microarray technologies, the non-Gaussian distribution and high heterogeneity of DNA methylation data, their uneven distribution across the genome, and the differing dimensions of different omics data all pose great challenges for methylation data analysis. DNA methylation data come mainly from chips and sequencing. Chips can provide whole-genome methylation data for many samples, allowing the role of DNA methylation in complex disease to be studied statistically, but their genomic coverage is low and they are less accurate than sequencing. Sequencing is costly and time-consuming and yields few samples, so although its coverage is high and its results are accurate, it has certain limitations for cancer research. Conventional difference analysis methods such as the t-test and ANOVA impose requirements on the data distribution and are not well suited to DNA methylation data, so new statistical methods or measures are needed to identify DNA methylation patterns. DNA methylation and gene expression data differ in dimension, and one gene contains multiple methylation sites, so integrating the two is another major challenge for researchers. For these reasons, although DNA methylation patterns have been studied extensively, most studies are based on a single disease, or on the methylation of individual genes or small regions; analyses of DNA methylation patterns across the whole genome in multiple diseases are rare. As a result, the DNA methylation patterns of multiple diseases remain unclear, and very few methylation regulatory sites have been identified so far.

In summary, the problems of the prior art are as follows. Traditional statistical methods place strict requirements on the data distribution, that is, they require the distribution of the data to be known, whereas the distribution of real methylation data is not well defined, so traditional statistical methods are limited. Different omics data have different dimensions, so data integration is also a challenge for current research.

Summary of the invention

To address the problems of the prior art, the invention provides a method for mining methylation patterns from whole-genome data.

The invention is realized as follows. A method for mining methylation patterns from whole-genome data comprises: applying the Significance Analysis of Microarrays (SAM) method to each of several sample data sets to screen out the differentially methylated sites across the whole genome; intersecting the differentially methylated sites of the sample sets to obtain a common set of differential sites; calculating the Pearson correlation coefficient between the methylation level of each differentially methylated site and the expression of the corresponding gene to identify methylation regulatory sites; and iteratively applying affinity propagation (AP) clustering to the differential site set to obtain methylation clusters, performing pattern analysis on each methylation cluster, and verifying the results by gene annotation and KEGG enrichment analysis.

Further, the method for mining methylation patterns from whole-genome data comprises the following steps:

Step 1: preprocess the methylation levels and gene expression levels of the sample data of multiple diseases; preprocessing is divided into methylation data preprocessing and gene expression data preprocessing;

Step 2: screen differentially methylated sites with the Significance Analysis of Microarrays (SAM) method. For the preprocessed CpG site methylation data of each disease, run the SAM algorithm in unpaired mode to screen differentially methylated sites; for the normal and diseased samples of each disease, carry out 100 repeated runs to tune the SAM threshold, observe the false discovery rate (FDR) corresponding to each threshold, and choose the value at which the FDR is 0 as the threshold Δ;

Step 3: intersect the differentially methylated sites screened for each disease to obtain the differentially methylated site set, and analyze the distribution of this set across the regions of the gene;

Step 4: apply AP clustering to the differentially methylated site set to obtain methylation clusters;

Step 5: take the gene expression levels corresponding to the differentially methylated site set, calculate the Pearson correlation coefficient between methylation level and gene expression, set a threshold on the magnitude of the coefficient, and identify methylation regulatory sites;

Step 6: from the methylation clusters and methylation regulatory sites obtained, derive the methylation patterns across the whole genome in multiple diseases.

Further, Step 1 specifically comprises:

1) Methylation data preprocessing: the data are the Beta values of each sample mapped onto the genome; remove sites whose gene name is empty and sites in which 80% or more of the values are 0;

2) Gene expression data preprocessing: remove genes in which 80% or more of the values are 0, fill in missing values, then standardize and log-transform the data;

3) Partitioning sites by gene structure: according to gene structure, the methylation sites of the whole genome are divided into three regions: the promoter region, the gene body region and the 3'UTR; the promoter region is further divided into four subregions: TSS1500, TSS200, 1st Exon and 5'UTR.
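The filtering rules in 1) and 2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the dictionary layout, the use of `None` for missing values, mean imputation and the `log2(x + 1)` transform are all assumptions, since the text does not specify how missing values are filled or how standardization is done.

```python
import math

def zero_fraction(values):
    """Fraction of entries that are exactly 0."""
    return sum(1 for v in values if v == 0) / len(values)

def filter_methylation(beta_rows, gene_names, max_zero_fraction=0.8):
    """Keep CpG sites that have a gene name and fewer than 80% zero Beta values.

    beta_rows: dict site_id -> list of Beta values (one per sample)
    gene_names: dict site_id -> gene symbol ('' or missing means unannotated)
    """
    kept = {}
    for site, values in beta_rows.items():
        if not gene_names.get(site):
            continue  # drop sites whose gene name is empty
        if zero_fraction(values) >= max_zero_fraction:
            continue  # drop sites in which 80% or more of the values are 0
        kept[site] = values
    return kept

def preprocess_expression(expr_rows, max_zero_fraction=0.8):
    """Drop mostly-zero genes, fill missing values, then log-transform.

    Assumptions: missing values are None and are replaced by the gene mean;
    the log transform is log2(x + 1). Standardization is left out because
    the source text does not define it.
    """
    out = {}
    for gene, values in expr_rows.items():
        observed = [v for v in values if v is not None]
        if not observed or zero_fraction(observed) >= max_zero_fraction:
            continue
        mean = sum(observed) / len(observed)
        filled = [mean if v is None else v for v in values]
        out[gene] = [math.log2(v + 1.0) for v in filled]
    return out
```

Both functions return new dictionaries, so the raw data stay available for comparison with the filtered result.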

Further, Step 4 specifically comprises:

1) Extract the methylation levels of the differentially methylated site set in the diseased samples of each disease, giving a matrix whose rows are methylation sites and whose columns are the samples of the data sets; this matrix is the data set to be clustered;

2) Calculate the similarity matrix of the methylation data, using the Pearson correlation coefficient as the similarity measure; the resulting similarity matrix is symmetric;

3) Use the similarity matrix as the input to AP clustering and cluster the differential methylation data iteratively; each iteration generates a certain number of clusters.
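The similarity matrix in 2) can be sketched in plain Python as below. The helper names are hypothetical, and treating a constant row as uncorrelated with everything is an assumption made only to avoid division by zero.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant row is treated as uncorrelated
    return cov / math.sqrt(vx * vy)

def similarity_matrix(rows):
    """Symmetric site-by-site similarity matrix (rows = sites, columns = samples)."""
    n = len(rows)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # self-similarity; AP later replaces the diagonal with a preference
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = pearson(rows[i], rows[j])
    return sim
```

Only the upper triangle is computed and then mirrored, which is why the result is symmetric by construction.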

Further, the iteration in 3) specifically comprises:

Clustering terminates when the number of iterations is greater than or equal to 10 or the number of clusters is less than or equal to 10;

When the number of iterations is less than 10 and the current number of clusters is greater than 10, the methylation levels of the methylation sites in each current cluster are averaged per sample to obtain a new methylation site that serves as the representative point of the cluster. During AP clustering, two kinds of message are passed between the nodes and continuously updated: the responsibility r (attraction) and the availability a (membership). The responsibility and availability of each sample point are updated over many iterations until several high-quality cluster centers emerge and the other sample points are assigned to the corresponding clusters. In the first iteration, the update formula for r is as follows: [formula not reproduced in the source text]

In the iterations after the first, r is updated according to the value of the message variable a. The update of a collects, for each candidate cluster center, the support of all sample points; its update formula is as follows: [formula not reproduced in the source text]

The data matrix formed by the new representative points of all clusters is used as the new methylation data for the next iteration, and its similarity matrix is calculated as the input to the next iteration; clustering continues until the termination condition is reached.
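The update formulas themselves do not appear in this text. For reference, the standard affinity propagation updates are given below; they fit the description above, but that the patent uses exactly these forms is an assumption. Here s(i,k) is the similarity between points i and k, r(i,k) is the responsibility and a(i,k) is the availability.

```latex
% First iteration (all availabilities a(i,k) = 0), so only similarities matter:
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} s(i,k')

% Later iterations, where the responsibilities depend on the availabilities:
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \{ a(i,k') + s(i,k') \}

% Availability: support collected from all other points for candidate center k:
a(i,k) \leftarrow \min\Big\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \Big\}, \quad i \neq k

a(k,k) \leftarrow \sum_{i' \neq k} \max\{0,\, r(i',k)\}
```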

Further, in Step 5, an absolute value of 0.3 for the Pearson correlation coefficient between methylation level and gene expression is used as the threshold.
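A minimal sketch of this thresholding step follows. The function and argument names are hypothetical, and the sketch assumes that each site maps to a single gene and that methylation and expression values are listed in the same sample order.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def regulatory_sites(methylation, expression, site_gene, threshold=0.3):
    """Return {site: r} for sites whose |r| with their gene's expression exceeds the threshold.

    methylation: dict site_id -> list of methylation levels per sample
    expression:  dict gene -> list of expression levels for the same samples
    site_gene:   dict site_id -> gene symbol
    """
    hits = {}
    for site, beta in methylation.items():
        gene = site_gene.get(site)
        if gene not in expression:
            continue  # the gene was removed during expression preprocessing
        r = pearson(beta, expression[gene])
        if abs(r) > threshold:
            hits[site] = r
    return hits
```

The sign of each returned coefficient is kept, so negatively correlated sites (hypermethylation with low expression) can be told apart from positively correlated ones.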

The advantages and positive effects of the invention are as follows. By using the SAM difference analysis method, the invention removes the requirement that traditional difference analysis methods place on the data distribution; a comparison of the method of the invention with the t-test shows little difference, which demonstrates the validity of the SAM method. The AP clustering method used by the invention also avoids the drawback of traditional clustering methods that the number of clusters must be set in advance, which both improves clustering efficiency and reduces the FDR (false discovery rate). The invention considers the methylation data of multiple diseases, extending previous methods from a single disease to multiple disease types and from a single gene or region to the whole genome. Combined with gene expression data, it summarizes the DNA methylation patterns of disease, compares the similarity and specificity of methylation patterns across disease types, and reveals the important role of methylation patterns in disease development, providing a theoretical basis and reference for the clinical application of methylation.

The invention uses the Pearson correlation coefficient between methylation and gene expression, sets a threshold, screens strongly correlated sites, and identifies methylation regulatory sites. These sites are associated with multiple diseases; they are not limited to one disease but are shared by multiple disease types.

The invention can be used to explain the pathogenesis of complex diseases and to predict disease risk, and it provides a reference for the development of demethylating drugs. Different types of disease do share common methylation patterns, so studying the relationship between methylation patterns and disease from a whole-genome perspective has practical and clinical significance.

Brief description of the drawings

Fig. 1 is a flow chart of the method for mining methylation patterns from whole-genome data provided by an embodiment of the invention.

Fig. 2 is a schematic diagram of the implementation process of the method for mining methylation patterns from whole-genome data provided by an embodiment of the invention.

Fig. 3 is a schematic diagram of the experimental results on real data provided by an embodiment of the invention.

In the figure: (a) distribution of methylation levels across regions in tumor cells; (b) distribution of methylation levels across regions in normal cells.

Embodiment

To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.

The application principle of the invention is explained in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the method for mining methylation patterns from whole-genome data provided by an embodiment of the invention comprises the following steps:

S101: apply the Significance Analysis of Microarrays (SAM) method to each of several sample data sets to screen out the differentially methylated sites across the whole genome; intersect the differentially methylated sites of the sample sets to obtain a common set of differential sites;

S102: calculate the Pearson correlation coefficient between the methylation level of each differentially methylated site and the expression of the corresponding gene to identify methylation regulatory sites;

S103: iteratively apply AP clustering to the differential site set to obtain methylation clusters, perform pattern analysis on each methylation cluster, and verify the results by gene annotation and KEGG enrichment analysis.

The application principle of the invention is further described below with reference to the accompanying drawings.

As shown in Fig. 2, the method for mining methylation patterns from whole-genome data provided by an embodiment of the invention comprises the following steps:

Step 1, data preprocessing:

Methylation data preprocessing: remove sites whose gene name (gene symbol) is empty and sites in which 80% or more of the values are 0.

Gene expression data preprocessing: first remove genes in which 80% or more of the values are 0, then fill in missing values, and finally standardize and log-transform the data.

Partitioning sites by gene structure: according to gene structure, the methylation sites of the whole genome are divided into three regions: the promoter region, the gene body region and the 3'UTR; the promoter region can be further divided into four subregions: TSS1500, TSS200, 1st Exon and 5'UTR.

Step 2, screening the differentially methylated sites of each single disease with the SAM method:

For the preprocessed CpG site methylation data of each disease, the SAM algorithm is run in unpaired mode to screen differentially methylated sites. For the normal and diseased samples of each disease, 100 repeated runs are carried out to tune the SAM threshold, the FDR corresponding to each threshold is observed, and the value at which the FDR is 0 is finally chosen as the threshold (Δ). The thresholds used in the embodiment of the invention are: BLCA Δ=4.51; BRCA Δ=4.94; COAD Δ=4.62; LUAD Δ=4.90; LUSC Δ=4.69; UCEC Δ=5.03.
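The idea of tuning a threshold against a permutation-based FDR can be sketched as follows. This is a simplified illustration, not the SAM procedure itself: SAM's Δ is defined on the gap between ordered observed and expected statistics, whereas this sketch thresholds |d| directly. The fudge constant `s0=0.1`, the function names and the 0/1 label encoding are assumptions.

```python
import math
import random

def d_statistic(tumor, normal, s0=0.1):
    """SAM-style relative difference d = (mean_t - mean_n) / (s + s0) for one site."""
    n1, n2 = len(tumor), len(normal)
    m1, m2 = sum(tumor) / n1, sum(normal) / n2
    ss = sum((v - m1) ** 2 for v in tumor) + sum((v - m2) ** 2 for v in normal)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))  # pooled standard error
    return (m1 - m2) / (s + s0)

def estimate_fdr(data, labels, threshold, n_perm=100, seed=0):
    """Permutation estimate of the FDR among sites with |d| >= threshold.

    data: list of rows (one per CpG site, one value per sample)
    labels: list of 0/1 per sample (1 = diseased, 0 = normal), unpaired design
    Returns (estimated FDR, number of sites called significant).
    """
    def all_d(lab):
        out = []
        for row in data:
            t = [v for v, g in zip(row, lab) if g == 1]
            n = [v for v, g in zip(row, lab) if g == 0]
            out.append(d_statistic(t, n))
        return out

    called = sum(1 for d in all_d(labels) if abs(d) >= threshold)
    if called == 0:
        return 0.0, 0
    rng = random.Random(seed)
    false_calls = 0
    for _ in range(n_perm):  # 100 label permutations by default
        shuffled = labels[:]
        rng.shuffle(shuffled)
        false_calls += sum(1 for d in all_d(shuffled) if abs(d) >= threshold)
    return min(1.0, false_calls / n_perm / called), called
```

Scanning a grid of thresholds with `estimate_fdr` and keeping the smallest one whose estimated FDR is 0 mirrors the selection rule described above. For real analyses an established SAM implementation would be the safer choice.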

Step 3: intersect the differentially methylated sites of each disease screened in Step 2 to obtain the differentially methylated site set, and analyze the distribution of this set across the gene regions, as shown in Figure 2.
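Taking the intersection and tallying sites per region is simple set arithmetic, sketched below. The function names and the site-to-region dictionary are hypothetical; the region labels are the ones defined in Step 1.

```python
from collections import Counter

def common_sites(sites_by_disease):
    """Intersect the differential CpG site sets of all diseases.

    sites_by_disease: dict disease -> iterable of site ids
    """
    sets = [set(sites) for sites in sites_by_disease.values()]
    return set.intersection(*sets) if sets else set()

def region_distribution(sites, site_region):
    """Count how many of the given sites fall in each gene region.

    site_region: dict site_id -> region label such as 'TSS1500' or '3UTR'
    """
    return Counter(site_region[s] for s in sites if s in site_region)
```

For example, `common_sites({"BLCA": ["cg1", "cg2"], "BRCA": ["cg2", "cg3"]})` returns `{"cg2"}`.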

Step 4: apply AP clustering to the differentially methylated site set obtained in Step 3, obtaining 9 methylation clusters. The detailed AP clustering process is as follows:

First, extract the methylation levels of the differentially methylated site set in the diseased samples of each disease, giving a matrix whose rows are methylation sites and whose columns are the samples of the data sets; this matrix is the data set to be clustered;

Second, calculate the similarity matrix of the methylation data. The invention uses the Pearson correlation coefficient, so the resulting similarity matrix is symmetric;

Finally, use the similarity matrix as the input to AP clustering and cluster the differential methylation data iteratively; each iteration generates a certain number of clusters.

The detailed iteration process is:

(1) Set the termination condition: clustering terminates when the number of iterations is greater than or equal to 10 or the number of clusters is less than or equal to 10;

(2) When the number of iterations is less than 10 and the current number of clusters is greater than 10, the methylation levels of the methylation sites in each current cluster are averaged per sample to obtain a new methylation site that serves as the representative point of the cluster. During AP clustering, two kinds of message are passed between the nodes and continuously updated, namely the responsibility r and the availability a. The algorithm updates the responsibility and availability of each sample point over many iterations until several high-quality cluster centers emerge and the other sample points are assigned to the corresponding clusters. In the first iteration, the update formula for r is as follows: [formula not reproduced in the source text]

The first iteration is purely data-driven, because only the similarity between sample points needs to be considered, not the support of other sample points for the current candidate point. In later iterations, however, r must be updated according to the value of the message variable a. The update of a collects, for each candidate cluster center, the support of all sample points; its update formula is as follows: [formula not reproduced in the source text]

(3) The data matrix formed by the new representative points of all clusters is used as the new methylation data for the next iteration, and its similarity matrix is calculated as the input to the next iteration; clustering continues until the termination condition is reached.
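The iterative scheme in (1) to (3) can be sketched in plain Python as below. It is an illustration under stated assumptions, not the patent's code: the AP core uses the standard responsibility and availability updates with damping 0.5, the preference is set to the median similarity, and each representative point is the mean of all original sites in its cluster. All function names are hypothetical. The cubic-time loops are only suitable for small inputs; an optimized library routine (for example scikit-learn's `AffinityPropagation` with `affinity="precomputed"`) would be the practical choice for thousands of sites.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def affinity_propagation(sim, damping=0.5, max_iter=200):
    """Standard AP on a similarity matrix; returns the exemplar index of each point."""
    n = len(sim)
    if n < 2:
        return [0] * n
    off = sorted(sim[i][j] for i in range(n) for j in range(n) if i != j)
    pref = off[len(off) // 2]  # assumption: preference = median similarity
    s = [[pref if i == k else sim[i][k] for k in range(n)] for i in range(n)]
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    for _ in range(max_iter):
        for i in range(n):
            # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
            vals = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                best = max(v for kk, v in enumerate(vals) if kk != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best)
        for k in range(n):
            # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) for i' not in {i, k})
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        exemplars = [max(range(n), key=lambda k: r[k][k] + a[k][k])]
    return [i if i in exemplars else max(exemplars, key=lambda k: sim[i][k])
            for i in range(n)]

def iterative_ap(rows, max_rounds=10, max_clusters=10):
    """Re-cluster cluster means until at most max_clusters remain or max_rounds is hit."""
    groups = [[i] for i in range(len(rows))]  # original site indices behind each row
    current = rows
    for _ in range(max_rounds):
        n = len(current)
        sim = [[pearson(current[i], current[j]) for j in range(n)] for i in range(n)]
        clusters = {}
        for idx, ex in enumerate(affinity_propagation(sim)):
            clusters.setdefault(ex, []).append(idx)
        groups = [[site for idx in members for site in groups[idx]]
                  for members in clusters.values()]
        # representative point: per-sample mean over all original sites in the cluster
        current = [[sum(rows[site][c] for site in g) / len(g) for c in range(len(rows[0]))]
                   for g in groups]
        if len(groups) <= max_clusters:
            break
    return groups
```

`iterative_ap` returns a list of clusters, each a list of row indices into the original matrix, so every site belongs to exactly one cluster.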

Step 5: take the gene expression levels corresponding to the differentially methylated site set from Step 3 and calculate the Pearson correlation coefficient between methylation level and gene expression. A coefficient whose absolute value exceeds 0.3 is considered a strong correlation, and the corresponding site is identified as a methylation regulatory site.

Step 6: from the methylation clusters obtained in Step 4 and the methylation regulatory sites obtained in Step 5, summarize the methylation patterns across the whole genome in multiple diseases.

The application effect of the invention is explained in detail below with reference to experiments.

1. Mining whole-genome methylation patterns from real case data.

The whole-genome DNA methylation data sets and gene expression data sets used in the experiment all come from the Pan-Cancer Initiative database of The Cancer Genome Atlas (TCGA) (https://www.synapse.org/#!Synapse:syn300013/wiki/70804), which provides data sets for six diseases: bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), lung squamous cell carcinoma (LUSC), uterine corpus endometrial carcinoma (UCEC) and lung adenocarcinoma (LUAD). All data are Level 3 data from Illumina platforms. The methylation data were generated on the Illumina microarray platform (Illumina Infinium HumanMethylation 450K Array), that is, they are the Beta values of each sample mapped onto the genome; the gene expression data are IlluminaHiSeqRNASeqV2 data.

Table 1 lists the raw DNA methylation data used in this experiment, comprising 396064 CpG sites. A gene may carry multiple sites; each sample has a methylation level for each of the 396064 CpG sites, a continuous value between 0 and 1. The diseased and normal samples of all six disease types are unbalanced. Because balancing the samples would discard a large number of samples, the error caused by the unmatched samples is ignored.

Table 1

No.	Disease type	CpG sites	Diseased samples	Normal samples
1	BLCA	396064	126	18
2	BRCA	396064	578	96
3	COAD	396064	255	38
4	LUAD	396064	306	32
5	LUSC	396064	225	42
6	UCEC	396064	383	42

2. The specific implementation steps of the experiment are as follows:

The data in Table 1 are preprocessed. The invention uses the gene information in the FEM package provided on Bioconductor, then removes sites whose gene name (gene symbol) is empty and sites in which 80% or more of the values are 0, finally obtaining 248592 CpG sites. The following steps use the methylation values of these 248592 sites.

For the preprocessed methylation data of the 248592 CpG sites in each of the six cancers, the SAM algorithm is run in unpaired mode to screen differentially methylated sites. For the normal and diseased samples of each cancer, 100 repeated runs are carried out to tune the SAM threshold, the FDR corresponding to each threshold is observed, and the value at which the FDR is 0 is finally chosen as the threshold (Δ). The thresholds for the cancers are: BLCA Δ=4.51; BRCA Δ=4.94; COAD Δ=4.62; LUAD Δ=4.90; LUSC Δ=4.69; UCEC Δ=5.03. Table 2 gives the differential methylation results for the six diseases.

Table 2

3. To analyze the methylation patterns across the whole genome in multiple disease types, the invention uses the intersection of the differential methylation results of the six diseases. The intersection yields 2184 differential CpG sites and 2728 genes, of which 1489 CpG sites and 1591 genes are hypermethylated (up), and 692 CpG sites and 611 genes are hypomethylated (low). Because the hypermethylated sites are fewer than the corresponding genes, it is inferred that some sites lie on multiple genes, for example at gene junctions. Overall, the number of differentially methylated sites is smaller than the number of genes, which is taken as further indication that one gene corresponds to multiple sites and that the methylation levels of its different sites differ considerably; it is therefore inferred that the sites whose methylation levels differ greatly lie not in the gene body but at gene boundaries, that is, in the promoter region. Accordingly, the subsequent experiments use only the 2184 CpG sites and 2728 genes obtained from the intersection and map them to the six regions defined earlier for analysis; their distribution across the regions is shown in Table 3. The table and Fig. 3 show that, in tumor genes, the 1st Exon is the region with the greatest methylation difference, followed by the 3'UTR, gene body and TSS1500, which also show large methylation differences. It can be inferred that DNA methylation in these regions participates in some basic functions of the human body; if the methylation level of these regions changes substantially, the related functions are easily disrupted, leading to cancer. This phenomenon reflects the similarity between cancers.

Table 3

4. AP clustering is applied to the 2184 differentially methylated sites, and their 2728 genes, obtained from the SAM difference analysis and intersection described above. First, the methylation levels of the 2184 differentially methylated sites in the diseased samples of each cancer are extracted, giving a matrix of 2184 rows and 1874 columns, which is the data set to be clustered. Second, the similarity matrix of the methylation data is calculated; the invention uses the Pearson correlation coefficient, so the resulting similarity matrix is symmetric. The similarity matrix is used as the input to AP clustering, and the differential methylation data are clustered iteratively, each iteration generating a certain number of clusters. The specific clustering process is as follows. The termination condition is set first: clustering terminates when the number of iterations is greater than or equal to 10 or the number of clusters is less than or equal to 10. While the number of iterations is less than 10 and the current number of clusters is greater than 10, the methylation levels of the methylation sites in each current cluster are averaged per sample to obtain a new methylation site that serves as the representative point of the cluster. The data matrix formed by the new representative points of all clusters is then used as the new methylation data for the next iteration, and its similarity matrix is calculated as the input to the next iteration; clustering continues until the termination condition is reached. In the invention, two iterations had been carried out when clustering terminated, and 9 methylation clusters were finally generated. The representative point of each methylation cluster is the average methylation level of all methylation sites in the cluster; the results are shown in Table 4. As Table 4 shows, the 2184 CpG sites are divided among 9 different methylation clusters with no overlap. Looking at the gene counts, the total number of genes over the 9 clusters is 1406, whereas the 2184 CpG sites correspond to 1239 genes in total. It can therefore be inferred that some genes are assigned to more than one methylation cluster.

Table 4

5. Identifying methylation regulatory sites: the Pearson correlation coefficient between the methylation level of the CpG sites in each of the 9 methylation clusters and the expression level of the corresponding gene is calculated. In the experiment the 9 methylation clusters contain 2184 CpG sites in total, and the raw gene expression data are obtained from the TCGA database. Because the earlier data preprocessing removed the expression values of some genes, 1721 of the 2184 differential CpG sites still have corresponding gene expression data. The Pearson correlation coefficients between methylation level and gene expression level at these 1721 sites show that, in general, the absolute value of the correlation coefficient is below 0.1 for most CpG sites, and for more than 200 CpG sites the coefficient is close to zero, so they are considered uncorrelated. Only 8 CpG sites have a correlation coefficient whose absolute value exceeds 0.3. They are located on different chromosomes and are concentrated in methylation clusters 3, 4 and 5. Among them, the site cg19883813 has a Pearson correlation coefficient of -0.63, a strong negative correlation. It can thus be inferred that the abnormal expression of these 8 genes may be caused by abnormally high or low methylation at the corresponding sites. Table 5 gives the details of the 8 methylation regulatory sites.

Table 5
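The screening in step 5 amounts to computing one Pearson coefficient per site-gene pair and keeping the pairs whose absolute value exceeds 0.3. A minimal sketch follows; the dictionary layout, the function names and the handling of constant profiles are assumptions made for illustration and are not specified in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def regulatory_sites(methylation, expression, site_to_gene, threshold=0.3):
    """Return {site: r} for sites with |r| above the threshold.

    methylation:  {site id: methylation levels across samples}
    expression:   {gene name: expression levels across the same samples}
    site_to_gene: {site id: gene name}
    Sites whose gene has no expression data are skipped, mirroring the
    reduction from 2184 to 1721 sites described above.
    """
    hits = {}
    for site, gene in site_to_gene.items():
        if site not in methylation or gene not in expression:
            continue
        r = pearson(methylation[site], expression[gene])
        if abs(r) > threshold:
            hits[site] = r
    return hits
```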

6. Each methylation cluster is annotated with the DAVID software using databases such as GO, and pathway enrichment analysis is carried out with the R package GOstats in combination with the KEGG database. The pathway enrichment results (shown in Table 6) indicate that the 3rd methylation cluster does not participate in any pathway, i.e. biological process, which suggests that this cluster may have little relevance to any of the cancers; this agrees with the DAVID gene annotation result. Reading down the columns of Table 6, all OR values are greater than 1, from which it is inferred that these genes are risk factors for disease and are closely related to tumors. The 9 methylation clusters are significantly enriched in 23 biological pathways, which shows that abnormal DNA methylation levels affect multiple different cancer-related pathways and play a key role in pathways related to multiple tumor types.

Table 6
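For readers unfamiliar with the quantities reported in this kind of enrichment table, the sketch below shows a hypergeometric over-representation test and a sample odds ratio computed from the same 2x2 table. It is an illustration in Python rather than the R workflow used in the experiment; the function names are invented here, and the exact odds-ratio estimator reported by the R package may differ from the simple ratio shown.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability of seeing at least k pathway genes in a
    cluster of n genes, when K of the N background genes are in the pathway."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def odds_ratio(k, n, K, N):
    """Sample odds ratio of the 2x2 table (cluster membership x pathway
    membership). A value above 1 means the pathway is over-represented."""
    a = k               # in cluster, in pathway
    b = n - k           # in cluster, not in pathway
    c = K - k           # outside cluster, in pathway
    d = N - K - n + k   # outside cluster, not in pathway
    if b == 0 or c == 0:
        return float("inf")
    return (a * d) / (b * c)
```

For example, with a background of 20 genes of which 5 belong to a pathway, a cluster of 4 genes containing 3 pathway genes gives `odds_ratio(3, 4, 5, 20) == 21.0`.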

As Table 6 shows, the main biological processes in which the genes of the methylation clusters take part are: receptor-ligand interaction in neural tissue; the onset of diseases such as arrhythmogenic right ventricular cardiomyopathy (ARVC), hypertrophic cardiomyopathy (HCM), dilated cardiomyopathy, maturity onset diabetes of the young and type II diabetes; significant enrichment in signaling pathways such as the calcium signaling pathway, the chemokine signaling pathway, the Notch signaling pathway and the insulin signaling pathway; and participation in related biological processes such as olfactory transduction, cell adhesion molecules (CAMs), adherens junctions, gastric acid secretion and amino acid metabolism.

The enrichment results show that these genes not only play an important role in cancer; their abnormal expression may also lead to other diseases. This also indicates that cancers share some related genes with one another and with various other diseases. Karnovsky et al. examined the pathways associated with multiple cancer types by analyzing the specific expression of DNA methylation and showed that similar pathways exist across cancers. That conclusion is similar to the one reached by the present invention, which demonstrates the effectiveness of the present invention.

The present invention integrates methylation and gene expression data from multiple diseases, compares the similarity and specificity of methylation patterns across disease types, and reveals the important role that methylation patterns play in disease development, thereby providing a theoretical basis and reference for the clinical application of methylation. Addressing the limitations of the research described above, the present invention analyzes DNA methylation data and gene expression patterns across the whole genome from the perspective of multiple diseases, summarizes the patterns of association between DNA methylation and gene expression, and identifies the similarity and specificity of DNA methylation between disease types. It aims to allow a treatment for one disease to be carried over to the treatment of other similar diseases, opening a new route for the diagnosis, treatment and prognosis of disease.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for mining methylation patterns using whole-genome data, characterized in that the method comprises: applying the significance analysis of microarrays (SAM) method to multiple sample data sets to screen out, for each data set, the differential methylation sites across the whole genome; taking the intersection of the differential methylation sites of the multiple sample sets to obtain a common set of differential sites; calculating the Pearson correlation coefficient between the methylation level of each differential methylation site and the expression of the corresponding gene to identify methylation regulatory sites; and iteratively performing AP clustering on the set of differential sites to obtain methylation clusters, performing pattern analysis on each methylation cluster, and verifying the results by gene annotation and KEGG enrichment analysis.

2. The method for mining methylation patterns using whole-genome data according to claim 1, characterized in that the method comprises the following steps:

Step one, preprocess the methylation levels and gene expression levels of the sample data for multiple diseases; the preprocessing is divided into methylation data preprocessing and gene expression data preprocessing;

Step two, screen differential methylation sites with the significance analysis of microarrays (SAM) method: for the preprocessed CpG-site methylation data of each disease, the unpaired SAM algorithm is applied separately to screen differential methylation sites; for each disease, 100 repeated tests on the normal and diseased samples are run to tune the SAM threshold, the false discovery rate (FDR) corresponding to each threshold is observed, and the value at which the FDR is 0 is chosen as the threshold Δ;

Step three, take the intersection of the differential methylation sites screened for each disease to obtain the differential methylation site set, and analyze how the sites in this set are distributed across the regions of the gene;

Step five, extract the gene expression levels corresponding to the differential methylation site set, calculate the Pearson correlation coefficients between the methylation levels and the gene expression levels, set a threshold according to the magnitude of the coefficients, and identify the methylation regulatory sites;

Step six, obtain the methylation patterns on the whole genome of multiple diseases from the resulting methylation clusters and methylation regulatory sites.

3. The method for mining methylation patterns using whole-genome data according to claim 2, characterized in that step one specifically comprises:

1) methylation data preprocessing: the data are the Beta values of each sample mapped onto the genome; sites whose gene name is empty and sites in which zeros make up more than 80% of the values are removed;

2) gene expression data preprocessing: genes in which zeros make up more than 80% of the values are removed, missing values are filled in, and the data are standardized and normalized and then log-transformed;

3) partitioning sites by gene structure: the methylation sites of the whole genome are divided by gene structure into three regions, namely the promoter region, the gene body region and the 3'UTR; the promoter region is further divided into four sub-regions: TSS1500, TSS200, the first exon and the 5'UTR.
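The filtering rule in 1) and the region partition in 3) can be illustrated with the following sketch. It is only an example under stated assumptions: the dictionary layout and function names are invented here, the annotation labels (`1stExon`, `Body` and so on) are assumed spellings, and "more than 80%" is read as strictly greater than 0.8.

```python
# Assumed annotation labels for the four promoter sub-regions.
PROMOTER_SUBREGIONS = {"TSS1500", "TSS200", "1stExon", "5'UTR"}


def region_of(label):
    """Map an annotation label to one of the three top-level regions."""
    if label in PROMOTER_SUBREGIONS:
        return "promoter"
    if label == "Body":
        return "gene body"
    if label == "3'UTR":
        return "3'UTR"
    return "unknown"  # assumption: anything else is left unclassified


def zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


def filter_sites(beta, site_to_gene, max_zero_fraction=0.8):
    """Keep sites that have a non-empty gene name and whose zero fraction
    does not exceed max_zero_fraction.

    beta:         {site id: Beta values across samples}
    site_to_gene: {site id: gene name, or "" when unannotated}
    """
    kept = {}
    for site, values in beta.items():
        if not site_to_gene.get(site, ""):
            continue  # no gene name
        if zero_fraction(values) > max_zero_fraction:
            continue  # too many zeros
        kept[site] = values
    return kept
```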

4. The method for mining methylation patterns using whole-genome data according to claim 2, characterized in that step four specifically comprises:

1) methylation level that the ill sample of corresponding every kind of disease is gathered in differential methylation site is taken out, a behavior is obtained Methylation sites, are classified as the matrix of data set sample, that is, the data set clustered；

2) similar matrix for the data that methylate is calculated, similarity measurement uses Pearson correlation coefficients, and obtained similar matrix is Symmetrical matrix；

3) use the similarity matrix as the input to AP clustering and iteratively perform AP clustering on the differential methylation data, each iteration producing a certain number of clusters.

5. The method for mining methylation patterns using whole-genome data according to claim 4, characterized in that the iteration in 3) specifically comprises:

When the number of iterations is less than 10 and the current number of clusters is less than 10, the methylation levels of the methylation sites in each current cluster are averaged sample by sample, and the resulting new methylation site serves as the representative point of that cluster. During AP clustering, two kinds of messages are passed between the nodes and continually updated: the responsibility r and the availability a. The responsibility and availability of every sample point are updated over many iterations until a number of high-quality cluster centers emerge and the remaining sample points are assigned to the corresponding clusters. In the first iteration, the update formula for r is as follows:

$$r(i,k) \leftarrow s(i,k) - \max_{k'\ \text{s.t.}\ k' \neq k} \{\, a(i,k') + s(i,k') \,\};$$

In the iterations after the first, r is updated using the current values of the message variable a. The update of a, in turn, gathers the support of all sample points for each candidate cluster center; its update formula is as follows:

$$a(i,k) \leftarrow \min\Big\{ 0,\ r(k,k) + \sum_{i'\ \text{s.t.}\ i' \notin \{i,k\}} \max\{ 0,\ r(i',k) \} \Big\};$$

The data matrix formed by the new representative points of all clusters is taken as the new methylation data for the next iteration, and its similarity matrix is computed as the input of that iteration; clustering continues until the set termination condition is reached.
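As an illustration of the two update formulas above, the sketch below performs one responsibility update and one availability update on a small similarity matrix. It is not a complete AP implementation: the function names are invented for this example, the self-availability rule a(k,k) is taken from the standard formulation of affinity propagation and is not one of the formulas given above, and the damping of messages used in practical implementations is omitted. At least two points are assumed.

```python
def update_responsibility(S, A):
    """r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            best_other = max(A[i][kp] + S[i][kp] for kp in range(n) if kp != k)
            R[i][k] = S[i][k] - best_other
    return R


def update_availability(R):
    """a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
    for i != k. For i == k the standard self-availability is used
    (an assumption, not stated in the text):
    a(k,k) <- sum over i' != k of max(0, r(i',k))."""
    n = len(R)
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            if i == k:
                A[k][k] = sum(max(0.0, R[ip][k]) for ip in range(n) if ip != k)
            else:
                support = sum(
                    max(0.0, R[ip][k]) for ip in range(n) if ip not in (i, k)
                )
                A[i][k] = min(0.0, R[k][k] + support)
    return A
```

In the first pass all availabilities are zero, so calling `update_responsibility(S, A)` with a zero matrix `A` reduces to comparing each similarity with the best competing similarity in the same row.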

6. The method for mining methylation patterns using whole-genome data according to claim 2, characterized in that in step five an absolute value of 0.3 for the Pearson correlation coefficient between methylation level and gene expression is used as the threshold.