CN107301330A - A method for mining methylation patterns using whole-genome data - Google Patents
- Publication number
- CN107301330A (application CN201710409105.6A)
- Authority
- CN
- China
- Prior art keywords
- methylation
- site
- data
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention belongs to the technical field of bioinformatics data processing and discloses a method for mining methylation patterns from whole-genome data, comprising: applying the Significance Analysis of Microarrays (SAM) method to multiple sample datasets to screen out the differentially methylated sites across the whole genome in each dataset; intersecting the differentially methylated sites of the sample sets to obtain a common set of differential sites; computing the Pearson correlation coefficient between the methylation level of each differentially methylated site and the expression of the corresponding gene to identify methylation regulatory sites; and iteratively applying affinity propagation (AP) clustering to the differential site set to obtain methylation clusters, analyzing the pattern of each methylation cluster, and validating the results through gene annotation and KEGG enrichment analysis. The invention provides a reference for the development of demethylating drugs. Different types of disease do share common features in their methylation patterns, so studying the relationship between methylation patterns and disease from a whole-genome perspective has practical and clinical significance.
Description
Technical field
The invention belongs to the technical field of bioinformatics data processing, and in particular relates to a method for mining methylation patterns using whole-genome data.
Background art
With the continuous progress of high-throughput sequencing and biochip technologies, massive amounts of genetic data can be obtained efficiently. These data contain many complex biological phenomena, make it possible to explore the genetic and epigenetic basis of disease comprehensively, and offer new directions and ideas for modern life science research. However, massive data cannot directly reveal biological phenomena or reflect biological laws; complex statistical methods and other techniques are needed to analyze and explore the biological phenomena the data contain. The discipline of bioinformatics arose from this need. Bioinformatics is a new discipline that combines the life sciences with computer science. It studies the collection, processing, storage, dissemination, analysis, and interpretation of biological information, and it reveals the biological secrets hidden in complex biological data through the combined use of biology, computer science, and information technology.
The human genome contains two kinds of information, genetic and epigenetic, which gave rise to genetics and epigenetics. Genetics studies heredity and variation in organisms, including the variation of gene structure, function, and expression, that is, genetic information produced by changes in the DNA sequence. Epigenetics studies heritable changes in gene expression that occur without any change in the nucleotide sequence. Genetic and epigenetic information are relative concepts that depend on each other and together constitute human hereditary information. DNA methylation is a vital process in embryogenesis and development and one of the most common epigenetic modifications. As an important component of epigenetic modification, DNA methylation has therefore become a research focus, and it has shown significant results in the early detection, prevention, treatment, and prognosis of disease.
DNA methylation is the chemical modification in which, catalyzed by DNA methyltransferase (DNMT) with S-adenosylmethionine as the methyl donor, a methyl group is added to the carbon-5 atom of the cytosine in a CpG dinucleotide. DNA methylation can inactivate certain genes and change the DNA conformation in certain regions, thereby affecting DNA-protein interactions and controlling gene expression. It may also change the chromatin structure of the corresponding genomic regions, causing the DNA to lose nuclease and restriction endonuclease cleavage sites and DNase-sensitive sites, so that the chromatin becomes highly coiled and condensed and loses transcriptional activity. Analyses of the relationship between methylation level and gene expression show that the two are negatively correlated: hypomethylation promotes gene expression, while hypermethylation suppresses it. Numerous studies also show that, compared with normal cells, the overall methylation level of the genome in diseased cells is lower while local promoter regions are abnormally hypermethylated, which provides a theoretical basis for using methylation levels to detect the onset of disease. Some genes may also undergo tumor-specific methylation changes in cancer cells or tissues. Based on this property, DNA methylation can serve as a biomarker for early diagnosis, and such molecular markers can further determine the disease subtype, which is important for treatment. Furthermore, because epigenetic changes are reversible, DNA methylation can be used clinically as a new therapeutic target; studies have shown that treating cells cultured in vitro with demethylating drugs can reactivate genes silenced by changes in DNA methylation.
Analyzing methylation data is challenging for several reasons: the limitations of sequencing and microarray technologies; the non-Gaussian statistical distribution and high heterogeneity of DNA methylation data; the uneven distribution of DNA methylation data across the genome; and the differing dimensions of different omics data. DNA methylation data come mainly from chips and sequencing. Chips can provide whole-genome methylation data for many samples, so the role of DNA methylation in complex disease can be studied statistically, but their genome coverage is relatively low and they are less precise than sequencing. Sequencing is costly and time-consuming and yields few samples; although its coverage is high and its results are accurate, this limits its use in cancer research. Common differential analysis methods such as the t-test and ANOVA place requirements on the data distribution and are not well suited to DNA methylation data, so identifying DNA methylation patterns requires new statistical methods or measures. DNA methylation and gene expression also differ in dimension, and a single gene contains multiple methylation sites, so integrating the two is another major challenge for researchers.
For these reasons, although there are many studies of DNA methylation patterns, most are based on a single disease, a single gene, or a small region. Few analyze DNA methylation patterns across the whole genome in multiple diseases. As a result, the DNA methylation patterns of multiple diseases remain unclear, and very few methylation regulatory sites have been found.
In summary, the problems with the prior art are as follows. Traditional statistical methods place strict requirements on the data distribution, that is, they require the distribution to be known, whereas the distribution of real methylation data is not well defined, so traditional statistical methods are limited. In addition, different omics data differ in dimension, so data integration remains an open challenge.
Summary of the invention
To address the problems in the prior art, the invention provides a method for mining methylation patterns using whole-genome data.
The invention is implemented as follows. A method for mining methylation patterns using whole-genome data comprises: applying the Significance Analysis of Microarrays (SAM) method to multiple sample datasets to screen out the differentially methylated sites across the whole genome in each dataset; intersecting the differentially methylated sites of the sample sets to obtain a common set of differential sites; computing the Pearson correlation coefficient between the methylation level of each differentially methylated site and the expression of the corresponding gene to identify methylation regulatory sites; and iteratively applying AP clustering to the differential site set to obtain methylation clusters, analyzing the pattern of each methylation cluster, and validating the results through gene annotation and KEGG enrichment analysis.
Further, the method comprises the following steps:
Step 1: preprocess the methylation levels and gene expression levels of the sample data for multiple diseases. Preprocessing is divided into methylation data preprocessing and gene expression data preprocessing.
Step 2: screen differentially methylated sites with the SAM method. For the preprocessed CpG-site methylation data of each disease, apply the unpaired SAM algorithm to screen differentially methylated sites. For each disease, run 100 repeated experiments on the normal and diseased samples to tune the SAM threshold, observe the false discovery rate (FDR) corresponding to each threshold, and choose the value at which the FDR is 0 as the threshold Δ.
Step 3: intersect the differentially methylated sites screened for each disease to obtain the set of differentially methylated sites, and analyze how this set is distributed across the regions of the gene.
Step 4: apply AP clustering to the resulting set of differentially methylated sites to obtain methylation clusters.
Step 5: take the gene expression levels corresponding to the set of differentially methylated sites, compute the Pearson correlation coefficient between the two, set a threshold on the magnitude of the coefficient, and identify methylation regulatory sites.
Step 6: from the resulting methylation clusters and methylation regulatory sites, obtain the methylation patterns on the whole genome across multiple diseases.
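The intersection in step 3 can be sketched with plain Python sets. This is a minimal illustration only: the function name `common_differential_sites`, the dictionary layout, and the probe IDs are assumptions made for the example and do not come from the patent.

```python
from functools import reduce


def common_differential_sites(sites_by_disease):
    """Return the CpG site IDs that are differentially methylated in every disease."""
    site_sets = [set(sites) for sites in sites_by_disease.values()]
    if not site_sets:
        return set()
    return reduce(set.intersection, site_sets)


# Hypothetical per-disease output of the SAM screening step.
example = {
    "BLCA": ["cg01", "cg02", "cg03"],
    "BRCA": ["cg02", "cg03", "cg04"],
    "COAD": ["cg03", "cg02", "cg05"],
}
common = common_differential_sites(example)
print(sorted(common))
```

Only sites reported for all diseases survive, which is why the common set is much smaller than any single-disease list.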
Further, step 1 specifically comprises:
1) Methylation data preprocessing: the data are produced by mapping the Beta values of each sample onto the genome. Remove sites whose gene name is empty and sites in which 80% or more of the values are 0.
2) Gene expression data preprocessing: remove genes in which 80% or more of the values are 0, fill in missing values, standardize, and then log-transform for normalization.
3) Partitioning sites by gene structure: the methylation sites of the whole genome are divided according to gene structure into three regions: the promoter region, the gene body region, and the 3'UTR. The promoter region is further divided into four subregions: TSS1500, TSS200, first exon, and 5'UTR.
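The site filter in item 1) can be sketched as follows. The data layout (dictionaries keyed by probe ID), the function name `filter_sites`, and the example values are assumptions for illustration; the 80% cutoff is the one stated above and is read here as "80% or more".

```python
def filter_sites(beta, gene_symbol, max_zero_fraction=0.8):
    """Keep sites that have a gene symbol and fewer than 80% zero Beta values.

    beta: dict mapping site ID -> list of Beta values (one per sample).
    gene_symbol: dict mapping site ID -> gene symbol ("" if unannotated).
    """
    kept = {}
    for site, values in beta.items():
        if not values or not gene_symbol.get(site):
            continue  # drop empty rows and sites whose gene name is empty
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction >= max_zero_fraction:
            continue  # drop sites where 80% or more of the values are 0
        kept[site] = values
    return kept


# Hypothetical Beta values for three probes across five samples.
beta = {
    "cgA": [0.10, 0.20, 0.00, 0.40, 0.50],
    "cgB": [0.30, 0.40, 0.50, 0.60, 0.70],
    "cgC": [0.00, 0.00, 0.00, 0.00, 0.30],
}
gene_symbol = {"cgA": "GENE1", "cgB": "", "cgC": "GENE2"}
kept = filter_sites(beta, gene_symbol)
print(sorted(kept))
```

In this example `cgB` is dropped for its empty gene name and `cgC` because four of its five values are 0.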
Further, step 4 specifically comprises:
1) Extract the methylation levels of the diseased samples of each disease for the set of differentially methylated sites, giving a matrix whose rows are methylation sites and whose columns are the samples of the datasets; this is the dataset to be clustered.
2) Compute the similarity matrix of the methylation data, using the Pearson correlation coefficient as the similarity measure; the resulting similarity matrix is symmetric.
3) Use the similarity matrix as the input to AP clustering and apply AP clustering to the differential methylation data iteratively; each iteration generates a certain number of clusters.
Further, the iteration in 3) specifically comprises:
Clustering terminates when the number of iterations is greater than or equal to 10 or the number of clusters is less than or equal to 10.
When the number of iterations is less than 10 and the current number of clusters is greater than 10, the methylation levels of the methylation sites in each current cluster are averaged sample by sample to obtain a new methylation site that serves as the representative point of that cluster. During AP clustering, two kinds of messages are passed between nodes and continually updated: the responsibility r and the availability a. The responsibility and availability of each sample point are updated over many iterations until a set of high-quality cluster centers emerges and the remaining sample points are assigned to the corresponding clusters. In the first iteration, the update formula for r is as follows:
In the iterations after the first, r is updated according to the value of the message variable a. The update of a collects the support of all sample points for each candidate cluster center; its update formula is as follows:
The data matrix formed by the new representative methylation sites of all clusters is used as the new methylation data for the next iteration, and its similarity matrix is computed as the input of the next iteration. Clustering continues until the termination condition is reached.
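The two update formulas are not legible in this text. For orientation only, the standard affinity propagation updates (Frey and Dueck's formulation), which match the verbal description above, are given below; that the patent uses exactly these forms is an assumption. Here s(i,k) is the similarity between points i and k, r(i,k) the responsibility, and a(i,k) the availability.

```latex
% First iteration: all availabilities are zero, so r depends on similarities only
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} s(i,k')

% Later iterations: r takes the current availabilities into account
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \bigl\{ a(i,k') + s(i,k') \bigr\}

% Availability: support collected from all other points for candidate center k
a(i,k) \leftarrow \min\Bigl\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \Bigr\}, \quad i \neq k

a(k,k) \leftarrow \sum_{i' \neq k} \max\{0,\, r(i',k)\}
```

The first line is the special case of the second with a(i,k') = 0, which is consistent with the first iteration being driven by the similarities alone.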
Further, in step 5, the threshold on the Pearson correlation coefficient between methylation level and gene expression is an absolute value of 0.3.
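A minimal sketch of this thresholding step, assuming methylation and expression values are stored per site and per gene over the same ordered samples. The names `pearson`, `regulatory_sites`, and `site_to_gene` and the example data are illustrative assumptions; only the 0.3 cutoff comes from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def regulatory_sites(methylation, expression, site_to_gene, threshold=0.3):
    """Return {site: r} for sites whose |r| with their gene's expression exceeds the threshold."""
    hits = {}
    for site, gene in site_to_gene.items():
        if site not in methylation or gene not in expression:
            continue  # the gene may have been removed during preprocessing
        r = pearson(methylation[site], expression[gene])
        if abs(r) > threshold:
            hits[site] = r
    return hits


# Hypothetical values for two sites and their genes over four samples.
methylation = {"cg1": [0.1, 0.2, 0.3, 0.4], "cg2": [0.5, 0.1, 0.5, 0.1]}
expression = {"G1": [4.0, 3.0, 2.0, 1.0], "G2": [1.0, 1.0, 2.0, 2.0]}
hits = regulatory_sites(methylation, expression, {"cg1": "G1", "cg2": "G2"})
print(hits)
```

Using the absolute value keeps both strongly negative and strongly positive associations, so a site such as `cg1`, whose methylation falls as expression rises, is retained.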
The advantages and positive effects of the invention are as follows. The SAM differential analysis method removes the data-distribution requirements of traditional differential analysis methods; a comparison of the method of the invention with the t-test shows little difference in results, which demonstrates the validity of the SAM method. The AP clustering method avoids the drawback of traditional clustering methods that the number of clusters must be set in advance; it improves clustering efficiency and also lowers the FDR (false discovery rate). The invention considers methylation data from multiple diseases, extending earlier approaches from a single disease to multiple disease types and from a single gene or region to the whole genome. Combined with gene expression data, it summarizes the DNA methylation patterns of disease, compares the similarity and specificity of methylation patterns across disease types, and reveals the important role of methylation patterns in disease development, providing a theoretical basis and reference for the clinical application of methylation.
The invention uses the Pearson correlation coefficient between methylation and gene expression, sets a threshold, screens strongly correlated sites, and identifies methylation regulatory sites. These sites are associated with multiple diseases; they are not limited to one disease but are shared across disease types.
The invention can be used to elucidate the pathogenesis of complex diseases, to predict disease risk, and to provide a reference for the development of demethylating drugs. Different types of disease do share common features in their methylation patterns, so studying the relationship between methylation patterns and disease from a whole-genome perspective has practical and clinical significance.
Brief description of the drawings
Fig. 1 is a flowchart of the method for mining methylation patterns using whole-genome data provided by an embodiment of the invention.
Fig. 2 is a schematic diagram of the implementation flow of the method for mining methylation patterns using whole-genome data provided by an embodiment of the invention.
Fig. 3 is a schematic diagram of experimental results on real data provided by an embodiment of the invention;
in the figure: (a) distribution of methylation levels across regions in tumor cells; (b) distribution of methylation levels across regions in normal cells.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The application principle of the invention is explained in detail below with reference to the drawings.
As shown in Fig. 1, the method for mining methylation patterns using whole-genome data provided by an embodiment of the invention comprises the following steps:
S101: apply the SAM method to multiple sample datasets to screen out the differentially methylated sites across the whole genome in each dataset; intersect the differentially methylated sites of the sample sets to obtain a common set of differential sites;
S102: compute the Pearson correlation coefficient between the methylation level of each differentially methylated site and the expression of the corresponding gene to identify methylation regulatory sites;
S103: iteratively apply AP clustering to the differential site set to obtain methylation clusters, analyze the pattern of each methylation cluster, and validate the results through gene annotation and KEGG enrichment analysis.
The application principle of the invention is further described below with reference to the drawings.
As shown in Fig. 2, the method for mining methylation patterns using whole-genome data provided by an embodiment of the invention comprises the following steps:
Step 1, data preprocessing:
Methylation data preprocessing: remove sites whose gene name (gene symbol) is empty and sites in which 80% or more of the values are 0.
Gene expression data preprocessing: first remove genes in which 80% or more of the values are 0, then fill in missing values, standardize, and log-transform for normalization.
Partitioning sites by gene structure: the methylation sites of the whole genome are divided according to gene structure into three regions: the promoter region, the gene body region, and the 3'UTR. The promoter region can be further divided into four subregions: TSS1500, TSS200, first exon (1stExon), and 5'UTR.
Step 2, screen the differentially methylated sites of each single disease with the SAM method:
For the preprocessed CpG-site methylation data of each disease, the unpaired SAM algorithm is applied to screen differentially methylated sites. For each disease, 100 repeated experiments are run on the normal and diseased samples to tune the SAM threshold, the FDR corresponding to each threshold is observed, and the value at which the FDR is 0 is chosen as the threshold (Δ). The thresholds used in this embodiment are: BLCA Δ=4.51; BRCA Δ=4.94; COAD Δ=4.62; LUAD Δ=4.90; LUSC Δ=4.69; UCEC Δ=5.03.
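The core of an unpaired SAM-style screen can be sketched as below. This is a simplified illustration, not the full SAM procedure: it computes the relative difference d = (mean of diseased minus mean of normal) / (s + s0) per site, estimates the expected ordered d values from label permutations, and flags sites whose ordered d departs from expectation by more than Δ. The fixed `s0`, the per-rank decision rule (SAM proper searches for cut points and also estimates the FDR), the function names, and the data are all assumptions.

```python
import math
import random


def sam_d(values, labels, s0=0.1):
    """Relative difference d for one site; labels are 1 (diseased) or 0 (normal)."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    n1, n0 = len(g1), len(g0)
    m1, m0 = sum(g1) / n1, sum(g0) / n0
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m0) ** 2 for v in g0)
    s = math.sqrt((1 / n1 + 1 / n0) * ss / (n1 + n0 - 2))  # pooled standard error
    return (m1 - m0) / (s + s0)


def sam_call(data, labels, delta, n_perm=100, s0=0.1, seed=0):
    """Return sites whose ordered d differs from its permutation expectation by more than delta."""
    rng = random.Random(seed)
    sites = list(data)
    observed = sorted((sam_d(data[s], labels, s0), s) for s in sites)
    expected = [0.0] * len(sites)
    for _ in range(n_perm):  # 100 repetitions, as in the text
        perm = labels[:]
        rng.shuffle(perm)
        d_perm = sorted(sam_d(data[s], perm, s0) for s in sites)
        expected = [e + d / n_perm for e, d in zip(expected, d_perm)]
    return [s for (d, s), e in zip(observed, expected) if abs(d - e) > delta]


# Hypothetical Beta values: five diseased samples followed by five normal samples.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
data = {
    "cgD": [0.90, 0.85, 0.95, 0.90, 0.88, 0.10, 0.15, 0.12, 0.08, 0.11],
    "cgN1": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.51, 0.49, 0.52, 0.48],
    "cgN2": [0.30, 0.31, 0.29, 0.30, 0.32, 0.31, 0.30, 0.29, 0.31, 0.30],
    "cgN3": [0.70, 0.69, 0.71, 0.70, 0.72, 0.71, 0.69, 0.70, 0.70, 0.71],
}
called = sam_call(data, labels, delta=3.0)
print(called)
```

Because the null distribution comes from permuting the sample labels, the statistic does not rely on the data being normally distributed, which is the property the method depends on.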
Step 3, intersect the differentially methylated sites of each disease screened in step 2 to obtain the set of differentially methylated sites, and analyze how this set is distributed across the gene regions, as shown in Fig. 2.
Step 4, apply AP clustering to the set of differentially methylated sites obtained in step 3, obtaining 9 methylation clusters. The AP clustering process is as follows:
First, extract the methylation levels of the diseased samples of each disease for the set of differentially methylated sites, giving a matrix whose rows are methylation sites and whose columns are the samples of the datasets; this is the dataset to be clustered.
Second, compute the similarity matrix of the methylation data. The invention uses the Pearson correlation coefficient, so the resulting similarity matrix is symmetric.
Finally, use the similarity matrix as the input to AP clustering and apply AP clustering to the differential methylation data iteratively; each iteration generates a certain number of clusters.
The iteration proceeds as follows:
(1) Set the termination condition: clustering terminates when the number of iterations is greater than or equal to 10 or the number of clusters is less than or equal to 10.
(2) When the number of iterations is less than 10 and the current number of clusters is greater than 10, the methylation levels of the methylation sites in each current cluster are averaged sample by sample to obtain a new methylation site that serves as the representative point of that cluster. During AP clustering, two kinds of messages are passed between nodes and continually updated: the responsibility r and the availability a. The algorithm updates the responsibility and availability of each sample point over many iterations until a set of high-quality cluster centers emerges and the remaining sample points are assigned to the corresponding clusters. In the first iteration, the update formula for r is as follows:
The first iteration is purely data-driven: it only needs to consider the similarity between sample points, not the support of other sample points for the current candidate point. In later iterations, r must be updated according to the value of the message variable a. The update of a collects the support of all sample points for each candidate cluster center; its update formula is as follows:
(3) The data matrix formed by the new representative methylation sites of all clusters is used as the new methylation data for the next iteration, and its similarity matrix is computed as the input of the next iteration. Clustering continues until the termination condition is reached.
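The message passing described in step 4 can be sketched in pure Python as below. This follows the standard affinity propagation updates with damping; the damping factor, the iteration count, the preference value placed on the diagonal, and the toy similarity (negative squared distance between 1-D points rather than Pearson correlation) are assumptions for illustration and are not specified in the text.

```python
def affinity_propagation(S, damping=0.5, n_iter=200):
    """Cluster points from a similarity matrix S (list of lists, at least 2 points).

    S[k][k] holds the preference of point k to become a cluster center.
    Returns, for each point, the index of its cluster center.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(n_iter):
        # r(i,k) <- s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            total = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(total[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i,k) <- min(0, r(k,k) + sum of positive r(i',k) over i' not in {i, k})
        # a(k,k) <- sum of positive r(i',k) over i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            pos_sum = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    new = pos_sum
                else:
                    new = min(0.0, R[k][k] + pos_sum - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    centers = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not centers:
        return list(range(n))  # no center emerged: leave every point on its own
    return [i if i in centers else max(centers, key=lambda k: S[i][k])
            for i in range(n)]


# Toy example: two well-separated groups of 1-D points.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
S = [[-(a - b) ** 2 for b in points] for a in points]
for k in range(len(points)):
    S[k][k] = -10.0  # hypothetical preference shared by all points
labels = affinity_propagation(S)
print(labels)
```

The number of clusters is not passed in; it emerges from the preference values and the similarities, which is the property the text contrasts with clustering methods that need the cluster count up front.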
Step 5, take the gene expression levels corresponding to the set of differentially methylated sites from step 3 and compute the Pearson correlation coefficient between the two. A coefficient with absolute value greater than 0.3 is regarded as a strong correlation, which identifies the methylation regulatory sites.
Step 6, from the methylation clusters obtained in step 4 and the methylation regulatory sites obtained in step 5, summarize the methylation patterns on the whole genome across multiple diseases.
The application effect of the invention is explained in detail below with reference to experiments.
1. Mining whole-genome methylation patterns from real case data.
The whole-genome DNA methylation datasets and gene expression datasets used in the experiments are the datasets for six diseases provided in the Pan-Cancer Initiative database of The Cancer Genome Atlas (TCGA) (https://www.synapse.org/#!Synapse:syn300013/wiki/70804). They are: bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), lung squamous cell carcinoma (LUSC), uterine corpus endometrial carcinoma (UCEC), and lung adenocarcinoma (LUAD). All data are Level 3 data from Illumina platforms. The methylation data were generated on the Illumina microarray platform (Illumina Infinium HumanMethylation 450K Array), that is, by mapping the Beta values of each sample onto the genome; the gene expression data are IlluminaHiSeqRNASeqV2 data.
Table 1 lists the raw DNA methylation data used in the experiment, which contain 396064 CpG sites. Each gene may carry multiple sites. Each sample therefore has methylation levels for 396064 CpG sites, each a continuous value between 0 and 1. The diseased and normal samples of all six disease types are unbalanced. Because balancing the samples would discard a large number of them, the error caused by the unmatched samples is ignored.
Table 1
No. | Disease type | Number of CpG sites | Diseased samples | Normal samples |
1 | BLCA | 396064 | 126 | 18 |
2 | BRCA | 396064 | 578 | 96 |
3 | COAD | 396064 | 255 | 38 |
4 | LUAD | 396064 | 306 | 32 |
5 | LUSC | 396064 | 225 | 42 |
6 | UCEC | 396064 | 383 | 42 |
2. The experiment was carried out as follows:
The data in Table 1 were preprocessed. Using the gene information provided in the FEM package on Bioconductor, sites whose gene name (gene symbol) is empty and sites in which 80% or more of the values are 0 were removed, leaving 248592 CpG sites. The following steps use the methylation values of these 248592 sites.
For the preprocessed methylation data of the 248592 CpG sites in the six cancers, the unpaired SAM algorithm was applied to screen differentially methylated sites. For each cancer, 100 repeated experiments were run on the normal and diseased samples to tune the SAM threshold, the FDR corresponding to each threshold was observed, and the value at which the FDR is 0 was chosen as the threshold (Δ). The thresholds for the cancers are: BLCA Δ=4.51; BRCA Δ=4.94; COAD Δ=4.62; LUAD Δ=4.90; LUSC Δ=4.69; UCEC Δ=5.03. Table 2 gives the differential methylation results for the six diseases.
Table 2
3. To analyze methylation patterns on the whole genome across disease types, the invention uses the intersection of the differential methylation results of the six diseases. The intersection contains 2184 differential CpG sites and 2728 genes, of which 1489 CpG sites and 1591 genes are hypermethylated (up) and 692 CpG sites and 611 genes are hypomethylated (low). Because the number of hypermethylated sites is smaller than the number of genes, it is inferred that some sites lie on multiple genes, for example at gene junctions. Overall, the number of differentially methylated sites is smaller than the number of genes, which further indicates that a single gene corresponds to multiple sites and that methylation levels differ considerably between its sites. It is therefore inferred that the sites with large methylation differences lie not in the gene body region but at gene boundaries, that is, in the promoter region. The following experiments therefore use only the 2184 CpG sites and 2728 genes obtained from the intersection, mapped onto the six regions defined earlier; their distribution across regions is shown in Table 3. Table 3 and Fig. 3 show that, in tumor genes, the first exon is the region with the largest methylation difference, followed by the 3'UTR, gene body, and TSS1500, which also show large differences. It can be inferred that DNA methylation in these regions participates in some basic functions of the human body; a large change in the methylation level of these regions can easily disrupt the related functions and cause cancer. This phenomenon reflects the similarity between cancers.
Table 3
4. AP clustering was applied to the 2184 differentially methylated sites and their 2728 genes obtained from the SAM differential analysis and intersection described above. First, the methylation levels of the diseased samples of each cancer were extracted for the 2184 differentially methylated sites, giving a matrix of 2184 rows and 1874 columns, which is the dataset to be clustered. Second, the similarity matrix of the methylation data was computed; the invention uses the Pearson correlation coefficient, so the resulting similarity matrix is symmetric. The similarity matrix was used as the input to AP clustering, and AP clustering was applied to the differential methylation data iteratively, each iteration generating a certain number of clusters. The clustering process is as follows. First, the termination condition is set: clustering terminates when the number of iterations is greater than or equal to 10 or the number of clusters is less than or equal to 10. When the number of iterations is less than 10 and the current number of clusters is greater than 10, the methylation levels of the methylation sites in each current cluster are averaged sample by sample to obtain a new methylation site that serves as the representative point of that cluster. The data matrix formed by the new representative methylation sites of all clusters is then used as the new methylation data for the next iteration, and its similarity matrix is computed as the input of the next iteration. Clustering continues until the termination condition is reached. In the invention, two iterations had been carried out when the iteration terminated, finally generating 9 methylation clusters. The representative point of each methylation cluster is the mean methylation level of all methylation sites in the cluster; the results are shown in Table 4. Table 4 shows that the 2184 CpG sites are divided among 9 different methylation clusters without overlap. The gene counts show that the 9 clusters contain 1406 genes in total, whereas the 2184 CpG sites correspond to 1239 genes altogether. It can therefore be inferred that some genes are assigned to more than one methylation cluster.
Table 4
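The outer loop of step 4 (cluster, average each cluster into a representative point, re-cluster until a stopping condition holds) can be sketched as follows. This is a minimal illustration rather than the patented implementation: `iterate_clustering` and `cluster_fn` are hypothetical names, `cluster_fn` stands in for AP clustering on a Pearson similarity matrix, and the representative point is assumed to be the unweighted column-wise mean of the rows currently in a cluster.

```python
from statistics import mean


def iterate_clustering(rows, cluster_fn, max_iter=10, max_clusters=10):
    """Repeatedly cluster rows and replace each cluster by its mean row.

    rows: list of equal-length lists (one row per methylation site,
          one column per sample).
    cluster_fn: hypothetical callable mapping the current rows to one
          cluster label per row (AP clustering in the description).
    Returns (members, representatives, iterations_run).
    """
    # members[j] lists the original site indices behind current row j.
    members = [[i] for i in range(len(rows))]
    for iteration in range(1, max_iter + 1):
        labels = cluster_fn(rows)
        groups = {}
        for j, label in enumerate(labels):
            groups.setdefault(label, []).append(j)
        # Representative point: column-wise mean of the rows in each cluster.
        rows = [[mean(col) for col in zip(*(rows[j] for j in js))]
                for js in groups.values()]
        members = [sorted(i for j in js for i in members[j])
                   for js in groups.values()]
        # Stop once few enough clusters remain or the iteration cap is hit.
        if len(rows) <= max_clusters or iteration >= max_iter:
            break
    return members, rows, iteration
```

Calling it with a toy `cluster_fn` that pairs adjacent rows shows the bookkeeping: the returned `members` lists always refer to the original site indices, so the final clusters can be mapped back to CpG sites.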
5. Identifying methylation regulatory sites: for each of the 9 methylation clusters, the Pearson correlation coefficient between the methylation level of each CpG site and the expression level of the corresponding gene is calculated. In the experiment, the 9 methylation clusters contain 2184 CpG sites in total, and the raw gene expression data were obtained from the TCGA database. Because the earlier data preprocessing removed the expression values of some genes, 1721 of the 2184 differential CpG sites remain with corresponding gene expression data. Examining the Pearson correlation coefficients between the methylation levels of these 1721 sites and the gene expression levels shows that, in general, for most CpG sites the absolute value of the correlation coefficient between methylation and gene expression is below 0.1; in addition, the correlation coefficients of more than 200 CpG sites are close to zero, and these sites are considered uncorrelated. Only 8 CpG sites have a correlation coefficient whose absolute value exceeds 0.3. They are located on different chromosomes and are concentrated in methylation clusters 3, 4 and 5. Among them, site cg19883813 has a Pearson correlation coefficient of -0.63, a strong negative correlation. It can thus be inferred that the abnormal expression of these 8 genes is probably caused by abnormally high or low methylation levels at the corresponding sites. Table 5 gives the details of the 8 methylation regulatory sites.
Table 5
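The screening in step 5 amounts to computing a Pearson correlation per site-gene pair and keeping the pairs whose absolute correlation exceeds 0.3. The sketch below assumes the data are held in plain dictionaries; `pearson`, `regulatory_sites` and the argument names are illustrative, and treating a constant vector as uncorrelated is an assumption rather than something the text specifies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return sxy / sqrt(sxx * syy)


def regulatory_sites(methylation, expression, site_gene, threshold=0.3):
    """Return {site: r} for sites with |r| above the threshold.

    methylation: {site_id: [level per sample]}
    expression:  {gene_id: [expression per sample]}
    site_gene:   {site_id: gene_id}; sites whose gene has no expression
                 data are skipped, as after preprocessing in the text.
    """
    hits = {}
    for site, gene in site_gene.items():
        if site in methylation and gene in expression:
            r = pearson(methylation[site], expression[gene])
            if abs(r) > threshold:
                hits[site] = r
    return hits
```

Both vectors of a pair must cover the same samples in the same order; aligning methylation and expression samples is left to the caller.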
6. Each methylation cluster is annotated using the DAVID software with databases such as GO, and pathway enrichment analysis is carried out with the R package GOstats in combination with the KEGG database. The pathway enrichment results (shown in Table 6) indicate that the third methylation cluster does not participate in any pathway, that is, any biological process, which suggests that this cluster may have little association with the cancers; this agrees with the DAVID gene annotation results. Reading down Table 6, all the OR values are greater than 1, from which it is inferred that these genes are risk factors for disease and are closely related to tumors. The 9 methylation clusters are significantly enriched in 23 biological pathways, which shows that abnormal DNA methylation levels affect multiple different cancer-related pathways and play a key role in the related pathways of multiple tumor types.
Table 6
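For readers unfamiliar with the OR values reported by pathway enrichment tools, the sketch below shows a generic 2x2 odds ratio together with a one-sided hypergeometric p-value for over-representation. It is offered only as an illustration of the statistic; it is not claimed to reproduce the exact GOstats computation, and the function and argument names are hypothetical.

```python
from math import comb


def enrichment(k, n, K, N):
    """Odds ratio and hypergeometric p-value for pathway over-representation.

    k: genes of the cluster that belong to the pathway
    n: genes in the cluster
    K: genes of the gene universe that belong to the pathway
    N: genes in the gene universe
    """
    # 2x2 table: cluster vs. rest, in pathway vs. not in pathway.
    a, b = k, n - k
    c, d = K - k, N - K - n + k
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    # P(X >= k) when n genes are drawn without replacement from N.
    p_value = sum(comb(K, i) * comb(N - K, n - i)
                  for i in range(k, min(n, K) + 1)) / comb(N, n)
    return odds_ratio, p_value
```

An odds ratio above 1 means the pathway is more common among the cluster's genes than among the remaining genes, which is the sense in which the text reads the OR column.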
As shown in Table 6, the main biological processes in which the genes of the methylation clusters participate are: promoting receptor-ligand interaction in nervous tissue; inducing diseases such as arrhythmogenic right ventricular cardiomyopathy (ARVC), hypertrophic cardiomyopathy (HCM), dilated cardiomyopathy, maturity onset diabetes of the young and type II diabetes; significant enrichment in signaling pathways such as the calcium signaling pathway, the chemokine signaling pathway, the Notch signaling pathway and the insulin signaling pathway; and participation in related biological processes such as olfactory transduction, cell adhesion molecules (CAMs), adherens junctions, gastric acid secretion and amino acid metabolism. The enrichment results show that these genes not only play an important role in cancer, but that their abnormal expression may also cause other diseases; this also indicates that cancers and various other diseases share some related genes. Karnovsky et al. explored the related pathways of multiple cancer types by analyzing the specific expression of DNA methylation and showed that similar pathways exist among cancers, a conclusion similar to that of the present invention, which supports the validity of the present invention.
The present invention integrates methylation and gene expression data of multiple diseases, compares the similarity and specificity of methylation patterns across disease types, and reveals the important role of methylation patterns in disease development, providing a theoretical basis and reference for the clinical application of methylation. Addressing the limitations of the above research, the present invention analyzes DNA methylation data and gene expression patterns genome-wide from the perspective of multiple diseases, summarizes the association patterns between DNA methylation and gene expression, and finds the similarity and specificity of DNA methylation among disease types, with the aim of transferring treatment methods for a single disease to other similar diseases and exploring a new approach to the diagnosis, treatment and prognosis of disease.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. A method for mining methylation patterns using genome-wide data, characterized in that the method comprises: applying the significance analysis of microarrays (SAM) method to multiple sample data sets to screen out the differential methylation sites on the whole genome of each; taking the intersection of the differential methylation sites of the sample sets to obtain a common set of differential sites; calculating the Pearson correlation coefficient between the methylation level of each differential methylation site and the expression of the corresponding gene to identify methylation regulatory sites; and iteratively performing AP clustering on the set of differential sites to obtain methylation clusters, performing pattern analysis on each methylation cluster, and verifying the results by gene annotation and KEGG enrichment analysis.
2. The method for mining methylation patterns using genome-wide data according to claim 1, characterized in that the method comprises the following steps:
Step 1: preprocessing the methylation levels and gene expression levels of sample data for multiple diseases, the preprocessing being divided into methylation data preprocessing and gene expression data preprocessing;
Step 2: screening differential methylation sites with the significance analysis of microarrays (SAM) method: for the preprocessed CpG site methylation data of each disease, the unpaired SAM algorithm is used to screen differential methylation sites; for the normal and diseased samples of each disease, 100 repeated tests are run to tune the SAM threshold, the false discovery rate (FDR) corresponding to each threshold is observed, and the value at which the FDR is 0 is chosen as the threshold Δ;
Step 3: taking the intersection of the screened differential methylation sites of the diseases to obtain the set of differential methylation sites, and analyzing the distribution of this set across the regions of the gene;
Step 4: performing AP clustering on the resulting set of differential methylation sites to obtain methylation clusters;
Step 5: extracting the gene expression levels corresponding to the set of differential methylation sites, calculating the Pearson correlation coefficients between them, setting a threshold according to the magnitude of the coefficients, and identifying methylation regulatory sites;
Step 6: obtaining the genome-wide methylation patterns of the multiple diseases from the resulting methylation clusters and methylation regulatory sites.
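Two of the bookkeeping operations in Steps 2 and 3 above are simple enough to sketch directly: choosing a threshold Δ at which the FDR is 0, and intersecting the per-disease site lists. The function names and data layout are hypothetical, and picking the smallest such Δ when several give an FDR of 0 is an assumption that the claim does not state.

```python
def pick_delta(fdr_by_delta):
    """Return the smallest SAM threshold whose observed FDR is 0.

    fdr_by_delta: {delta: FDR observed at that delta}.
    Choosing the smallest qualifying delta is an assumption.
    """
    zero_fdr = [delta for delta, fdr in fdr_by_delta.items() if fdr == 0]
    return min(zero_fdr) if zero_fdr else None


def common_differential_sites(sites_by_disease):
    """Intersect the differential methylation sites of all diseases.

    sites_by_disease: {disease name: iterable of site identifiers}.
    """
    site_sets = [set(sites) for sites in sites_by_disease.values()]
    return set.intersection(*site_sets) if site_sets else set()
```

Only sites called differential in every disease survive the intersection, which is why the common set is much smaller than any single disease's list.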
3. The method for mining methylation patterns using genome-wide data according to claim 2, characterized in that Step 1 specifically comprises:
1) methylation data preprocessing: the data are produced by mapping the Beta values of each sample onto the genome; sites with an empty gene name and sites in which 80% or more of the values are 0 are removed;
2) gene expression data preprocessing: genes in which 80% or more of the values are 0 are removed, missing data are filled in, and the data are standardized and normalized and then log-transformed;
3) partitioning sites by gene structure: the methylation sites of the whole genome are divided according to gene structure into three regions, namely the promoter region, the gene body region and the 3'UTR; the promoter region is further divided into four subregions, namely TSS1500, TSS200, the first exon and the 5'UTR.
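The filtering rules in items 1) and 2) can be illustrated with a short sketch. It assumes the data are held in dictionaries keyed by site or gene, represents missing values as `None`, fills them with the gene's mean, and uses `log2(x + 1)` as the log transform; these choices, the function names, and the omission of the standardization step are assumptions made for illustration only.

```python
from math import log2


def filter_sites(beta, gene_of, max_zero_frac=0.8):
    """Drop sites without a gene name or with >= 80% zero Beta values."""
    kept = {}
    for site, values in beta.items():
        if not values or not gene_of.get(site):
            continue  # no data or empty gene name
        if sum(v == 0 for v in values) / len(values) >= max_zero_frac:
            continue  # too many zeros
        kept[site] = values
    return kept


def preprocess_expression(expr, max_zero_frac=0.8):
    """Drop mostly-zero genes, fill missing values, then log-transform."""
    out = {}
    for gene, values in expr.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue
        if sum(v == 0 for v in values) / len(values) >= max_zero_frac:
            continue
        fill = sum(observed) / len(observed)  # assumption: mean imputation
        out[gene] = [log2((fill if v is None else v) + 1) for v in values]
    return out
```

The 80% cut-off is applied per row, so a site or gene is judged only on its own values across samples.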
4. The method for mining methylation patterns using genome-wide data according to claim 2, characterized in that Step 4 specifically comprises:
1) extracting the methylation levels of the diseased samples of each disease for the set of differential methylation sites to obtain a matrix whose rows are methylation sites and whose columns are the samples of the data sets, which is the data set to be clustered;
2) computing the similarity matrix of the methylation data, the similarity measure being the Pearson correlation coefficient and the resulting similarity matrix being symmetric;
3) using the similarity matrix as the input of AP clustering and iteratively performing AP clustering of the differential methylation data, each iteration generating a certain number of clusters.
5. The method for mining methylation patterns using genome-wide data according to claim 4, characterized in that the iteration in 3) specifically comprises:
setting clustering to terminate when the number of iterations is greater than or equal to 10 or the number of clusters is less than or equal to 10;
when the number of iterations is less than 10 and the current number of clusters is greater than 10, averaging the methylation levels of the samples over the methylation sites in each current cluster to obtain a new methylation site as the representative point of the cluster; during AP clustering, two kinds of messages, the responsibility r and the availability a, are passed between nodes and continuously updated, the responsibility and availability of each sample point being updated over successive iterations until a number of high-quality cluster centers emerge and the other sample points are assigned to the corresponding clusters; in the first iteration, the update formula for the variable r is as follows:
$$r(i,k) \leftarrow s(i,k) - \max_{k'\ \mathrm{s.t.}\ k' \neq k} \{\, a(i,k') + s(i,k') \,\};$$
In the iterations after the first, r is updated according to the value of the message variable a; the update of the variable a collects the support of all sample points for each candidate cluster center, and its update formula is as follows:
$$a(i,k) \leftarrow \min\Big\{ 0,\ r(k,k) + \sum_{i'\ \mathrm{s.t.}\ i' \notin \{i,k\}} \max\{ 0,\ r(i',k) \} \Big\};$$
The data matrix formed by the new representative methylation sites of all clusters is taken as the new methylation data for the next iteration, its similarity matrix is computed as the input of the next iteration, and clustering continues until the set iteration termination condition is reached.
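The two message updates in claim 5 can be written out directly. The sketch below follows the formulas for r(i,k) and a(i,k) and, as assumptions taken from the standard affinity propagation algorithm rather than from the claim, adds the self-availability rule a(k,k), a damping factor, and exemplar selection by r(k,k) + a(k,k) > 0. The diagonal of the similarity matrix `S` is expected to hold the preference values; all function names are illustrative.

```python
def update_responsibility(S, A):
    """r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))."""
    n = len(S)
    return [[S[i][k] - max(A[i][j] + S[i][j] for j in range(n) if j != k)
             for k in range(n)] for i in range(n)]


def update_availability(R):
    """a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))."""
    n = len(R)
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        pos = [max(0.0, R[i][k]) for i in range(n)]
        total = sum(pos) - pos[k]  # sum over i' != k
        for i in range(n):
            if i == k:
                A[i][k] = total  # self-availability: standard AP rule (assumption)
            else:
                A[i][k] = min(0.0, R[k][k] + total - pos[i])
    return A


def affinity_propagation(S, damping=0.5, sweeps=200):
    """Run damped message passing and return an exemplar index per point."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]

    def mix(old, new):
        # Damped update: keep a fraction of the previous messages.
        return [[damping * o + (1 - damping) * v for o, v in zip(ro, rn)]
                for ro, rn in zip(old, new)]

    for _ in range(sweeps):
        R = mix(R, update_responsibility(S, A))
        A = mix(A, update_availability(R))
    score = [R[k][k] + A[k][k] for k in range(n)]
    # Fall back to the best-scoring point if no exemplar emerges.
    exemplars = ([k for k in range(n) if score[k] > 0]
                 or [max(range(n), key=score.__getitem__)])
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]
```

The input needs at least two points. This pure-Python version costs O(n³) per sweep and is meant for reading alongside the formulas; for a matrix with thousands of sites a vectorized implementation would be needed.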
6. The method for mining methylation patterns using genome-wide data according to claim 2, characterized in that in Step 5 an absolute value of 0.3 for the Pearson correlation coefficient between methylation level and gene expression is used as the threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710409105.6A CN107301330A (en) | 2017-06-02 | 2017-06-02 | A kind of method of utilization full-length genome data mining methylation patterns |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107301330A true CN107301330A (en) | 2017-10-27 |
Family
ID=60134638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710409105.6A Pending CN107301330A (en) | 2017-06-02 | 2017-06-02 | A kind of method of utilization full-length genome data mining methylation patterns |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107301330A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1495261A (en) * | 2002-07-23 | 2004-05-12 | 奈良先端科学技术大学院大学 | Joint application of caffeine biosynthetic system genome |
US20070243161A1 (en) * | 2006-02-28 | 2007-10-18 | Sven Olek | Epigenetic modification of the loci for CAMTA1 and/or FOXP3 as a marker for cancer treatment |
CN103122390A (en) * | 2013-03-07 | 2013-05-29 | 上海市疾病预防控制中心 | FRAT1 gene serving as serum marker for thyroid cancer detection and application thereof |
Non-Patent Citations (4)
Title |
---|
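Xiaofei Yang et al.: "Comparative pan-cancer DNA methylation analysis reveals cancer common and specific patterns", 《Briefings in Bioinformatics》 * |
Liu Zhimin: "Screening breast cancer pathogenic genes with a distillation algorithm based on AP clustering", 《China Master's Theses Full-text Database, Medicine and Health Sciences》 * |
Cui Bin et al.: 《Application of the R Language in the Biomedical Field》, 31 October 2016 * |
Lin Hao et al.: 《Concise Bioinformatics》, 30 November 2014, Chengdu: University of Electronic Science and Technology of China Press * |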
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108410980A (en) * | 2018-01-22 | 2018-08-17 | 深圳华大基因股份有限公司 | Screen method, kit and the application of the target area for the PCR detections that methylate |
CN108319984A (en) * | 2018-02-06 | 2018-07-24 | 北京林业大学 | The construction method and prediction technique of xylophyta leaf morphology feature and photosynthesis characteristics prediction model based on DNA methylation level |
CN112470229A (en) * | 2018-02-27 | 2021-03-09 | 基因组学公开有限公司 | Computer-implemented method of analyzing genetic data about an organism |
CN109411019B (en) * | 2018-12-12 | 2020-05-05 | 中国人民解放军军事科学院军事医学研究院 | Medicine prediction method, device, server and storage medium |
CN109411019A (en) * | 2018-12-12 | 2019-03-01 | 中国人民解放军军事科学院军事医学研究院 | A kind of drug prediction technique, device, server and storage medium |
CN109859796B (en) * | 2019-01-04 | 2023-04-25 | 浙江大学 | Dimension reduction analysis method for DNA methylation spectrum of gastric cancer |
CN109859796A (en) * | 2019-01-04 | 2019-06-07 | 王俊 | A kind of Dimension Reduction Analysis method that the DNA methylation about gastric cancer is composed |
CN109830264A (en) * | 2019-03-15 | 2019-05-31 | 杭州慕谷科技有限公司 | The method that tumor patient is classified based on methylation sites |
CN111091867A (en) * | 2019-12-18 | 2020-05-01 | 中国科学院大学 | Gene variation site screening method and system |
CN113889184A (en) * | 2021-09-27 | 2022-01-04 | 中国矿业大学 | M fused with genome characteristics6A methylation local functional spectrum decomposition method |
CN113889184B (en) * | 2021-09-27 | 2023-08-11 | 中国矿业大学 | M fusing genome features 6 A methylation local functional spectrum decomposition method |
CN114287903A (en) * | 2021-12-31 | 2022-04-08 | 佳禾智能科技股份有限公司 | Heart rate detection method and device based on piezoelectric sensor and storage medium |
CN114373502A (en) * | 2022-01-07 | 2022-04-19 | 吉林大学第一医院 | Tumor data analysis system based on methylation |
CN114373502B (en) * | 2022-01-07 | 2022-12-06 | 吉林大学第一医院 | Tumor data analysis system based on methylation |
WO2023236347A1 (en) * | 2022-06-10 | 2023-12-14 | 江苏品生医疗科技集团有限公司 | Protein data processing method and apparatus, electronic device and storage medium |
CN116312794A (en) * | 2023-01-09 | 2023-06-23 | 哈尔滨医科大学 | Methylation sample clustering method fused with single cell analysis method |
CN116312794B (en) * | 2023-01-09 | 2023-11-14 | 哈尔滨医科大学 | Methylation sample clustering method fused with single cell analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171027 |